Evaluation Python Packages

langsmith

LangSmith Client SDK Implementations

100.2M 952 254

mlflow-skinny

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

42.4M 27K 6K

mlflow

39.2M 27K 6K

mlflow-tracing

20.9M 27K 6K

simpleeval

Simple Safe Sandboxed Extensible Expression Evaluator for Python

15.2M 605 97

evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

5.6M 2K 326

opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

3.8M 20K 2K

mteb

MTEB: State-of-the-art evaluation of embeddings across languages and modalities

3.1M 3K 632

ragas

Supercharge Your LLM Application Evaluations 🚀

1.6M 15K 2K

faster-coco-eval

Continuation of the abandoned fast-coco-eval project

572K 145 12

evo

Python package for the evaluation of odometry and SLAM

240K 4K 792

ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

174K 686 32

unitxt

🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

160K 216 66

pytrec-eval

pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.

107K 349 36

evalidate

Safe and fast evaluation of untrusted user-supplied Python expressions

76K 41 5

vincio

The context engineering platform for AI applications — compile prompts, memory, retrieval, tools, schemas & policies into optimized, validated, observable context packets.

47K 2 0

agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

41K 4K 559

evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

40K 3K 414

opik-optimizer

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

31K 20K 2K