LLM-as-Judge Python Packages

multivon-eval

Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI/CD integration.

9K 8 0

agentsynth-ai

Outcome-verified agent trajectories, benchmarks, and RL environments — with a live leaderboard and a CI gate for your agents. Offline-first, MIT.

6K 0 2

cje-eval

Causal Judge Evaluation: calibrate LLM-as-judge scores against oracle labels with valid uncertainty.

6K 43 4

prism-verify

Runtime LLM verifier — family-different, reasoning-stripped, multi-lens adjudication with signed Ed25519 receipts (CLI · MCP · HTTP).

3K 0 0

omegaprompt

The overfit gate for your prompts: re-test the winning prompt on held-out examples it never tuned on, and block the ship if it doesn't generalize. Sits on top of promptfoo/DSPy. CLI + MCP + CI.

3K 1 0

mini-omega-lock

Measure your LLM judge's noise floor — an A/B improvement smaller than the judge's own noise isn't real. Pre-flight probes for prompt-eval setups; works standalone and with omegaprompt.

2K 1 0

llm-evalgate

Eval gates with error bars: confidence intervals, calibrated LLM judges, and a statistically honest regression gate for LLM pipelines.

2K 0 0

autosynth

Agentic synthetic-data generation framework inspired by Meta FAIR's Autodata / Agentic Self-Instruct.

1K 1 0

agent-release-gates

Release-readiness gates for AI agents: replay known incidents, apply policy-as-code, and produce ship/warn/block evidence before a prompt, model, or tool-policy change ships.

1K 3 0

proofrag

Point your agent at your docs and your RAG app; get a golden test set + an LLM-as-judge & retrieval scorecard, in one command.

1K 2 0

tracelens

Friendly evaluation and regression-testing framework for AI agents: inspectable traces, graded outcomes, baseline comparisons, and CI-ready reliability signals.

1K 1 2

pairjudge

Pairwise LLM judges (A/B/tie): budget-aware multi-turn packing, position-bias correction, pseudo-label distillation. Generalized from the 4th-place (gold) solution to Kaggle LMSYS Chatbot Arena.

856 169 11

distill-anything

Distill any AI model into one you own — a teacher (Claude/GPT/HF/Ollama) generates your dataset, a student trains on its logits or its words, a judge scores it blind, a benchmark prices it. Runs on a MacBook.

654 2 0

judgebias

Point it at your LLM judge + your judgments; get per-bias effect sizes with 95% CIs and concrete corrections.

610 0 0

ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration.

560 1 0

cigate

Eval-gated CI/CD for AI products: block a merge when answer quality statistically regresses, per failure-mode axis, with the LLM-judge's bias corrected for.

475 0 0

evidentry

Turn LLM eval runs into auditable evidence packs with defensible statistics.

456 0 0

judgecal

Statistically rigorous, batch-first reliability auditing for LLM judges & reward models — clustered-bootstrap CIs, McNemar, BH-FDR, power/MDE, planted-bias validation. No API key required.

412 2 0

depth-lens

Cost-vs-accuracy CI for LLM ops. Pick the cheapest API tier, compare self-hosted vLLM vs cloud APIs on one Pareto, and grade free-form output with an LLM-as-judge, all on your own data with Wilson 95% CIs.

384 0 0

eval-harness

A boring, config-driven harness for evaluating AI systems. One YAML drives the run, the trace is the source of truth. Offline, backtesting, and online-eval modes — works with any agent, RAG, or code-modifying system.

381 0 0