PyRank

AI Evaluation Python Packages

Python packages with the GitHub topic ai-evaluation. Sorted by relevance, with stars and monthly downloads.
cvs-health
uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

5K 1K 123
meshkovQA
eval-ai-library

Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.

4K 37 3
Pro-GenAI
agent-action-guard

🛡️ Safe AI Agents through Action Classifier

3K 10 7
Basaltlabs-app
gauntlet-cli

Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.

3K 6 0
HZYAI
ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5
ankurpand3y
judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

1K 7 2
NoesisVision
nasde-toolkit

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

1K 9 0
humanjudge
grandjury

Python SDK for HumanJudge — real human evaluations of AI models. 25,000+ blind reviews by 200+ verified reviewers across 58 models and 44 benchmarks. Free.

1K 1 0
ai4society
gaico

A Python library providing evaluation metrics to compare generated texts from LLMs, often against reference texts. Features streamlined workflows for model comparison and visualization.

732 6 2
ianarawjo
evalstats

Statistically sane analysis methods for comparing AI model and prompt performance.

581 101 2
thegeekajay
workflowbench

Lightweight benchmark harness for AI-driven business workflows

538 1 0
buildwithabid
ai-stability

Measure LLM output consistency from the command line.

507 0 0
ianarawjo
promptstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

411 103 2
auraoneai
iaa-kit

Modern inter-annotator agreement metrics with bootstrap intervals, ordinal support, and missing-data handling.

408 0 0
auraoneai
eval-run-manifest

Portable manifest envelope for eval run provenance, artifacts, and reproducibility.

341 0 0
auraoneai
judge-bench

Bias probes and reproducible diagnostics for LLM-as-judge evaluation workflows.

340 0 0
auraoneai
judge-card

A disclosure format for judge prompts, calibration results, known bias, and recommended use envelopes.

340 0 0
auraoneai
eval-adapter

Adapters between rubric-spec and common evaluation framework inputs.

332 0 0
auraoneai
contamination-audit

Local contamination checks for eval data overlap, hashes, and n-gram leakage.

328 0 0
auraoneai
synthetic-disagreement

Synthetic reviewer disagreement generators for testing IAA and adjudication workflows.

318 0 0
haipad
aisert

Assert-style validation library for AI outputs - ensure your LLMs behave exactly as expected.

307 1 0
HemantBK
chatbot-auditor

Quality auditor for AI chatbots. Analyzes your conversation logs to show where the bot is underperforming.

209 0 0
RAILethicsHub
rail-score

DEPRECATED — use rail-score-sdk instead. This package redirects to rail-score-sdk.

182 2 1
sergeyklay
factly-eval

CLI tool to evaluate LLM factuality on the MMLU benchmark.

99 2 0
Data from PyPI, GitHub, ClickHouse, and BigQuery