Model Evaluation Python Packages

canary-ml

Drift and anomaly detection for production ML models

4K 2 0

falsify

A single-file Python CLI that pre-registers AI/ML accuracy claims with SHA-256. Lock the threshold before the data, or it didn't happen.

4K 7 3

ml4t-diagnostic

Signal diagnostics, statistical validation, and backtest evaluation for quantitative trading workflows.

3K 20 12

ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

2K 8 1

ecp-sdk

2K 8 1

pyaerocom

Python tools for climate and air quality model evaluation

2K 31 15

falsify-inspect

PRML pre-registration adapter for Inspect AI eval logs. MIT.

1K 0 3

debiai-gui

DebiAI easy start module, the standalone version of DebiAI

1K 30 4

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

1K 6 0

mlflow-falsify

A single-file Python CLI that pre-registers AI/ML accuracy claims with SHA-256. Lock the threshold before the data, or it didn't happen.

1K 7 3

arxiv-embedding-benchmark

Published PyPI package for ArXiv embedding benchmarks, retrieval evaluation, and scientific RAG experiments.

969 4 0

starwhale

An MLOps Platform for Model Evaluation

949 237 38

evalcards

Python library for generating evaluation reports (classification, regression, forecasting) with ready-to-use metrics and plots in Markdown, JSON, and soon HTML.

881 1 0

oncothresh

Clinical threshold evaluation for oncology AI models

850 0 0

trustlens

Audit ML models beyond accuracy — calibration, fairness, latent health, and deployment verdicts.

829 12 20

starwhale-bootstrap

MLOps Platform

800 237 38

modeldiffx

Model behavioral diffing - compare LLM outputs across versions, detect regressions.

775 1 0

grandjury

GrandJury SDK — submit LLM traces for human evaluation + analytics client

701 2 1

hermia

Vendor-agnostic security evals for LLM inference stacks. Detects behavioral divergence across models, backends, and hardware.

542 6 1

model-failure-lab

Toolkit for systematically probing, classifying, and debugging failure modes in LLM and RAG systems — reasoning errors, hallucination, and API-level behaviour.

509 0 0
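
The falsify entry above describes pre-registering an accuracy claim with SHA-256 so that the threshold is locked before the data is seen. The general idea can be sketched as follows. This is a minimal illustration using only the Python standard library, not falsify's actual interface: the function names `claim_digest` and `verify_claim` and the claim fields (`metric`, `threshold`, `dataset`) are hypothetical and chosen only for this example.

```python
import hashlib
import json


def claim_digest(claim: dict) -> str:
    """Return a SHA-256 hex digest of a claim (hypothetical helper).

    The claim is serialised to canonical JSON (sorted keys, no extra
    whitespace) so that the same claim always yields the same digest.
    """
    canonical = json.dumps(claim, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_claim(claim: dict, recorded_digest: str) -> bool:
    """Check that a claim matches the digest recorded before evaluation."""
    return claim_digest(claim) == recorded_digest


# Step 1: before looking at any data, write down the claim and record its digest.
claim = {"metric": "accuracy", "threshold": 0.9, "dataset": "holdout-v1"}
digest = claim_digest(claim)

# Step 2: after evaluation, anyone can confirm the claim was not edited.
print(verify_claim(claim, digest))

# Lowering the threshold after the fact changes the digest, so the edit is detectable.
tampered = dict(claim, threshold=0.85)
print(verify_claim(tampered, digest))
```

Publishing or timestamping the digest before the evaluation runs is what gives the check its value; the hash alone only proves that the claim text has not changed since the digest was computed.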