PyRank

LLM-as-a-Judge Python Packages

Python packages with the GitHub topic llm-as-a-judge. Sorted by relevance, with stars and monthly downloads.
agenta-ai
agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

51K 4K 520
amiable-dev
llm-council-core

The LLM Council works together to answer your hardest questions

5K 23 7
HZYAI
ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5
haizelabs
verdict

Inference-time scaling for LLMs-as-a-judge.

2K 339 26
MigoXLab
dingo-python

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

1K 700 72
kaivid-labs
evret

Evals framework for Information Retrieval Systems

1K - -
ankurpand3y
judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

1K 7 2
root-signals
scorable

The Python SDK for the Scorable API

1K 14 1
NoesisVision
nasde-toolkit

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

1K 9 0
egerpaulj
llm-summary

Use an LLM to summarize paragraphs

768 0 0
root-signals
root-signals

Scorable SDK

627 14 1
docling-project
docling-sdg

A set of tools to create synthetically-generated data from documents

578 48 18
OtherVibes
mcp-as-a-judge

MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations

577 17 9
trustyai-explainability
vllm-judge

A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.

565 2 2
ugai
pytest-llm-rubric

Pytest plugin for semantic PASS/FAIL checks using LLM-as-a-Judge

519 0 0
yonahgraphics
openevalkit

Open evaluation kit for LLM systems

427 3 0
asarnaout
veritail

LLM-as-a-Judge evaluation platform for ecommerce search. Scores relevance, computes IR metrics, and flags quality issues across multiple retail verticals

393 6 1
IAAR-Shanghai
xfinder

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

304 178 7
bassrehab
artemis-agents

Production-ready multi-agent debate framework with adaptive evaluation and safety monitoring

272 0 0
rafaelsandroni
llm-antibodies

Antibodies for LLM hallucinations (grouping LLM-as-a-judge, NLI, reward models)

196 0 0
root-signals
root-signals-cli

CLI for the Root Signals API

153 14 1
rafaelsandroni
antibodies-rafaelsandroni

Antibodies for LLM hallucinations (grouping LLM-as-a-judge, NLI, reward models)

133 0 0
root-signals
scorable-cli

CLI for the Scorable API

132 14 1
OtherVibes
iflow-mcp-hepivax-mcp-as-a-judge

MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations

111 17 9
Data from PyPI, GitHub, ClickHouse, and BigQuery