PyRank

Model Evaluation Python Packages

Python packages with the GitHub topic model-evaluation. Sorted by relevance, with stars and monthly downloads.
debiai
debiai-gui

DebiAI easy start module, the standalone version of DebiAI

3K 30 5
Basaltlabs-app
gauntlet-cli

Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.

3K 6 0
studio-11-co
falsify

A single-file Python CLI that pre-registers AI/ML accuracy claims with SHA-256. Lock the threshold before the data, or it didn't happen.

2K - -
metno
pyaerocom

pyaerocom model evaluation software

1K 31 15
star-whale
starwhale-bootstrap

an MLOps/LLMOps platform

1K 238 39
Khanz9664
trustlens

Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.

1K 12 12
ankurpand3y
judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

1K 7 2
star-whale
starwhale

an MLOps/LLMOps platform

1K 238 39
evaluation-context-protocol
ecp-sdk

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI Agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K 8 1
evaluation-context-protocol
ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI Agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K 8 1
humanjudge
grandjury

Python SDK for HumanJudge — real human evaluations of AI models. 25,000+ blind reviews by 200+ verified reviewers across 58 models and 44 benchmarks. Free.

1K 1 0
studio-11-co
mlflow-falsify

A single-file Python CLI that pre-registers AI/ML accuracy claims with SHA-256. Lock the threshold before the data, or it didn't happen.

984 5 2
Padraigobrien08
model-failure-lab

Toolkit for discovering, classifying, and debugging failure modes in LLM and RAG systems.

780 0 0
stef41
modeldiffx

Behavioral regression testing for LLMs. Capture outputs, diff behavior, detect drift — pytest for model upgrades.

735 1 0
wmjg-alt
easymlselector

A model selection process for Machine Learning tasks on a subset of the training sample

591 0 0
burning-cost
insurance-cv

Temporal cross-validation for insurance pricing - respects policy time structure, CatBoost, Polars

378 0 0
Rowusuduah
llm-sentry

Unified AI Reliability Platform. One install, 12 diagnostic engines. Zero-dependency LLM pipeline monitoring.

359 0 0
animator
titus2

Titus 2: Portable Format for Analytics (PFA) implementation for Python 3.4+

334 24 2
vAndrewKarma
data-science-snippets

🧰 Essential EDA and Data Cleaning Helpers for Any DataFrame. This collection of functions is designed to accelerate exploratory data analysis (EDA), quickly surface data quality issues, and offer high-level insights into the structure and content of your dataset.

321 3 2
Ricardouchub
evalcards

Python library for generating evaluation reports (classification, regression, forecasting) with ready-made metrics and charts in Markdown, JSON, and soon HTML.

318 1 0
metriculous-ml
metriculous

Measure and visualize machine learning model performance without the usual boilerplate.

255 98 11
medoidai
skrobot

skrobot is a Python module for designing, running and tracking Machine Learning experiments / tasks. It is built on top of scikit-learn framework.

231 24 2
Lantianzz
scorecardbundle

A High-level Scorecard Modeling API | Everything for scorecard modeling in one place

226 83 30
daniel-yj-yang
machlearn

A Simple Yet Powerful Machine Learning Python Library

186 1 0
    • Data from PyPI, GitHub, ClickHouse, and BigQuery