Evaluation Metrics Python Packages

deepeval

The LLM Evaluation Framework

7.6M 17K 2K

faster-coco-eval

Continuation of the abandoned fast-coco-eval project

572K 145 12

agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

318K 6K 604

ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

174K 686 32

unitxt

🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

160K 216 66

unbabel-comet

A Neural Framework for MT Evaluation

72K 766 111

nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

54K 220 27

permetrics

Artificial intelligence (AI, ML, DL) performance metrics implemented in Python

23K 91 22

fast-bss-eval

A fast implementation of bss_eval metrics for blind source separation

22K 149 10

cd-fvd

[CVPR 2024] On the Content Bias in Fréchet Video Distance

22K 147 7

prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

20K 272 28

lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

20K 2K 501

pyrouge

A Python wrapper for the ROUGE summarization evaluation package

13K 249 72

codebleu

Pip-compatible CodeBLEU metric implementation available for Linux, macOS, and Windows

12K 139 27

rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

8K 873 49

octis

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

5K 803 117

pynlpl

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

5K 475 66

classeval

Evaluation of supervised predictions for two-class and multi-class classifiers

5K 8 2

ir-metrics

The most common information retrieval (IR) metrics

4K 5 0
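
For readers unfamiliar with the metrics such a package covers, here is a minimal standalone sketch of two common IR metrics, precision at k and reciprocal rank. It does not use the ir-metrics API; the function names `precision_at_k` and `reciprocal_rank` are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence, Set


def precision_at_k(ranked_ids: Sequence[Hashable],
                   relevant_ids: Set[Hashable],
                   k: int) -> float:
    """Fraction of the top-k ranked items that are relevant.

    Assumption: the denominator is always k, even if fewer than k
    items were retrieved (a common convention, but not the only one).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids: Sequence[Hashable],
                    relevant_ids: Set[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    ranking = ["d3", "d1", "d7", "d2"]   # system output, best first
    relevant = {"d1", "d2"}              # ground-truth relevant documents
    print(precision_at_k(ranking, relevant, k=2))
    print(reciprocal_rank(ranking, relevant))
```

Averaging `reciprocal_rank` over a set of queries gives the mean reciprocal rank (MRR).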

survivaleval

The most comprehensive Python package for evaluating survival analysis models.

3K 52 7

nlptutti

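A Python function package for computing the character error rate (CER) and word error rate (WER) of output transcripts from Korean speech-to-text (STT) sentence recognizers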

3K 72 11
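
CER and WER are both edit-distance ratios: the Levenshtein distance between the reference and the recognizer output, divided by the reference length, counted in characters or words respectively. The following is a minimal standalone sketch of that definition, not the nlptutti API. The function names are hypothetical, and removing spaces before computing CER is an assumption (conventions differ between tools).

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, with spaces removed first (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    print(wer("the cat sat", "the dog sat"))
    print(cer("안녕하세요", "안녕하새요"))
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.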

ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI Agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol

2K 8 1

ecp-sdk

2K 8 1

rag-eval-lite

Lightweight Python package for classical information retrieval (IR) metrics for RAG systems.

2K 0 0