AI Evaluation Python Packages

multivon-eval

Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI/CD integration.

9K 8 0

proofbundle

Receipts for AI eval results: turn a result into one signed, offline-verifiable JSON file. Proves who signed it and that nothing changed — not that it's true. Ed25519 + RFC 6962.

8K 1 0

uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

7K 1K 127

cje-eval

Causal Judge Evaluation: calibrate LLM-as-judge scores against oracle labels with valid uncertainty.

6K 43 4

falsiflow

Stop unverifiable AI eval, product metric, and R&D claims from passing CI.

4K 0 3

falsifyai

Portable, content-addressed reliability evidence for LLM systems. Capture how a model behaves under perturbation; preserve, verify, and diff the evidence across model changes.

3K 0 0

eval-ai-library

Comprehensive AI Model Evaluation Framework with advanced techniques including Temperature-Controlled Verdict Aggregation via Generalized Power Mean. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.

3K 42 3

openvals

Open-source AI model evaluation and benchmarking framework for LLMs (OpenAI, Ollama, Claude, Gemini)

2K 25 21

ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 22 3

pdfhell

PDF Hell — adversarial PDFs that break AI document readers. Procedural ground truth, not LLM-as-judge.

2K 0 0

falsify-inspect

PRML pre-registration adapter for Inspect AI eval logs. MIT.

1K 0 3

agent-release-gates

Release-readiness gates for AI agents: replay known incidents, apply policy-as-code, and produce ship/warn/block evidence before a prompt, model, or tool-policy change ships.

1K 3 0

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

1K 6 0

agent-action-guard

Runtime classifier for screening AI agent actions as safe, harmful, or unethical.

1K 10 7

ifixai

Catch your AI's mistakes and blind spots before your customers or regulators do. iFixAi runs 45 inspections: 32 graded core inspections plus 13 extended ones for frontier risks such as sabotage, sandbagging, and oversight evasion. It returns a letter grade in under 5 minutes. Industry- and model-agnostic.

1K 1K 155

cli-modelarium

Statistically rigorous LLM comparison CLI for terminal-first developers. Compare 10 cloud providers + local models side-by-side with bootstrap confidence intervals, significance testing (paired t-test, McNemar), LLM-as-judge, hallucination detection, and CI/CD assertions.

985 0 0

multivon-mcp

MCP server exposing multivon-eval + pdfhell as agent-callable tools. Drop into Claude Desktop, Cursor, Cline.

972 0 0

grandjury

GrandJury SDK: submit LLM traces for human evaluation, plus an analytics client.

701 2 1

nasde-toolkit

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

672 10 0

claim-memory-graph

Claim Memory Graph: a lightweight audit layer for inspectable LLM-as-a-judge decisions.

666 4 0

gaico

GenAI Results Comparator (GAICo) is a Python library for comparing, analyzing, and visualizing outputs from Large Language Models (LLMs), often against a reference text, using a range of extensible metrics from the literature.

594 10 2

promptstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

494 107 2

ai-stability

CLI-first LLM stability analyzer for measuring output consistency across repeated prompt runs.

421 0 0

judicator

Judging LLM-as-a-Judge — a screening tool for bias and miscalibration.

348 7 2