PyRank

Evaluation Framework Python Packages

Python packages tagged with the GitHub topic evaluation-framework, sorted by relevance and shown with their GitHub stars and monthly downloads.

confident-ai
deepeval

The LLM Evaluation Framework

3.3M 15K 1K
EleutherAI
lm-eval

A framework for few-shot evaluation of language models.

1.6M 13K 3K
huggingface
lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

42K 2K 463
EuroEval
scandeval

The robust European language model benchmark.

13K 176 57
tohtsky
irspack

Train, evaluate, and optimize implicit feedback-based recommender systems.

11K 31 10
EuroEval
euroeval

The robust European language model benchmark.

5K 176 57
Kiln-AI
kiln-ai

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

4K 5K 366
letta-ai
letta-evals

Evaluation kit for testing stateful agents

3K 70 9
ServiceNow
agentlab

AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

3K 579 111
aiverify-foundation
aiverify-moonshot

AI Verify advances Gen AI testing with Project Moonshot.

2K 322 62
Kiln-AI
kiln-server

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

1K 5K 366
kaiko-ai
kaiko-eva

Evaluation framework for oncology foundation models (FMs)

1K 156 38
pyrddlgym-project
pyrddlgym

A toolkit for auto-generation of OpenAI Gym environments from RDDL description files.

1K 93 23
zeno-ml
zenoml

Interactive Evaluation Framework for Machine Learning

1K 214 11
Khanz9664
trustlens

Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.

1K 12 12
ankurpand3y
judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

1K 7 2
jsell-rh
k-eval

Simple context-aware evaluation framework for AI agents using MCP.

1K 2 0
vcerqueira
modelradar

Aspect-based Forecasting Accuracy

1K 5 2
b-bayrak
ceval

A framework for evaluating counterfactual explanations.

970 0 0
cmudig
zenoml-image-classification

AI Data Management & Evaluation Platform

887 214 11
NOAA-OWP
gval

A high-level Python framework to evaluate the skill of geospatial datasets by comparing candidates to benchmark maps producing agreement maps and metrics.

873 25 4
gmitt98
fieldtest

LLM evaluation framework — define what correct, well-formed, and safe means before you measure

869 0 0
lapix-ufsc
lapixdl

Python package with Deep Learning utilities for Computer Vision

798 9 3
davidheineman
nlproc-tools

🌾 Universal, customizable and deployable fine-grained evaluation for text generation.

677 24 5

Data from PyPI, GitHub, ClickHouse, and BigQuery