PyRank

Evaluation Metrics Python Packages

Python packages with the GitHub topic evaluation-metrics. Sorted by relevance, with stars and monthly downloads.
confident-ai
deepeval

The LLM Evaluation Framework

3.3M 15K 1K
AgentOps-AI
agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

748K 6K 578
MiXaiLL76
faster-coco-eval

Continuation of an abandoned project fast-coco-eval

561K 141 11
Unbabel
unbabel-comet

A Neural Framework for MT Evaluation

281K 752 110
AmenRa
ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

121K 677 31
ibm
unitxt

🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

77K 212 67
huggingface
lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

42K 2K 463
MantisAI
nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

41K 215 27
fakufaku
fast-bss-eval

A fast implementation of bss_eval metrics for blind source separation

21K 146 9
google-research
rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

21K 872 49
songweige
cd-fvd

[CVPR 2024] On the Content Bias in Fréchet Video Distance

20K 147 7
thieu1995
permetrics

Artificial intelligence (AI, ML, DL) performance metrics implemented in Python

19K 91 22
noutenki
pyrouge

A Python wrapper for the ROUGE summarization evaluation package

18K 249 72
k4black
codebleu

Pip-compatible CodeBLEU metric implementation available for Linux, macOS, and Windows

9K 136 29
erdogant
classeval

Evaluation of supervised predictions for two-class and multi-class classifiers

6K 8 2
proycon
pynlpl

PyNLPl, pronounced 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as extracting n-grams and frequency lists, and for building simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

6K 476 66
clovaai
prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

5K 272 28
kqf
ir-metrics

The most common information retrieval (IR) metrics

4K 5 0
MIND-LAB
octis

OCTIS: a library for Optimizing and Comparing Topic Models.

3K 800 118
GiulioRossetti
nf1

A novel approach to evaluate community detection algorithms on ground truth

2K 20 9
shi-ang
survivaleval

The most comprehensive Python package for evaluating survival analysis models.

2K 51 7
hyeonsangjeon
nlptutti

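A Python function package for computing the character error rate (CER) and word error rate (WER) of output transcripts from Korean speech-to-text (STT) sentence recognizers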

2K 71 11
kaivid-labs
evret

Evals framework for Information Retrieval Systems

1K - -
evaluation-context-protocol
ecp-sdk

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI Agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol

1K 8 1
Data from PyPI, GitHub, ClickHouse, and BigQuery