Evaluation Framework Python Packages

deepeval

The LLM Evaluation Framework

7.6M 17K 2K

lm-eval

A framework for few-shot evaluation of language models.

1.7M 13K 3K

lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

20K 2K 501

scandeval

The robust European language model benchmark.

10K 187 60

irspack

Train, evaluate, and optimize implicit feedback-based recommender systems.

9K 31 10

cje-eval

Causal Judge Evaluation: calibrate LLM-as-judge scores against oracle labels with valid uncertainty.

6K 43 4

euroeval

The robust European language model benchmark.

6K 187 60

letta-evals

Evaluation kit for testing stateful agents

4K 74 11

kiln-ai

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

4K 5K 375

pyrddlgym

A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.

3K 93 23

retrieval-observatory

Retrieval reliability platform for RAG pipelines: per-stage benchmarks, failure labels, stress tests, production traces, and regression gates.

3K 0 0

python-flexeval

FlexEval is an LLM evaluation tool designed for practical quantitative analysis.

2K 16 0

agentlab

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

2K 598 121

aiverify-moonshot

Moonshot - A simple and modular tool to evaluate and red-team any LLM application.

2K 334 65

ragret

Lightweight evaluation framework for Retrieval Augmented Generation systems, focused on simplicity and long-term consistency.

2K 6 0

kiln-server

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

2K 5K 375

agent-belt

Reproducible evaluation for AI coding agents. Multi-turn scenarios against Claude Code, Codex, Copilot, Cursor, Gemini CLI, Goose, OpenCode, or any custom agent you plug in; verify behavior with rule checks, workspace diffs, multi-judge LLM consensus; pin reliability with pass^k variance across trials. Git worktrees, optional Docker sandbox.

2K 16 1

gval

A high-level Python framework for evaluating the skill of geospatial datasets by comparing candidate maps to benchmark maps, producing agreement maps and metrics.

1K 26 4

kaiko-eva

Evaluation framework for oncology foundation models (FMs)

1K 160 38

modelradar

Aspect-based Forecasting Accuracy

1K 5 2

hnep

Hybrid Network Evaluation Protocol: multi-method evaluation for hybrid quantum-classical ML models. Classifies your quantum component as Genuine, Regularizer, Ignored, or Dead Weight with bootstrap confidence intervals.

1K 0 0

clean-evals

Try out your prompts and context across AI models. Run evals and find the best model for your use case.

1K 2 0

zenoml-image-classification

AI Data Management & Evaluation Platform

1K 214 11

aiterate

AI artifact lifecycle management for prompts and agent skills: optimize, evaluate, version, trace, and promote changes from raw data and policies.

886 1 0