LLM-as-a-Judge Python Packages

agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

41K 4K 559

llm-council-core

The LLM Council works together to answer your hardest questions

8K 32 11

ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 22 3

vllm-judge

A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.

1K 2 2

dingo-python

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

1K 720 74

mcp-as-a-judge

MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations

1K 17 9

openevalkit

Production-grade Python framework for evaluating LLM and agentic systems with traditional scorers, LLM judges (OpenAI, Anthropic, Ollama, 100+ models via LiteLLM), ensemble aggregation, and smart caching for cost-effective testing.

1K 3 0

clean-evals

Try out your prompts and context across AI models. Run evals and find the best model for your use case.

1K 2 0

verdict

Inference-time scaling for LLMs-as-a-judge.

996 345 28

scorable

Scorable SDK

857 14 1

evret

Evals framework for Information Retrieval Systems

745 18 3

docling-sdg

A set of tools to create synthetically-generated data from documents

742 48 20

nasde-toolkit

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

672 10 0

claim-memory-graph

A memory layer that tracks evidence, claims, and decisions to make multi-turn LLM judges and reviewer agents more inspectable and stable.

666 4 0

root-signals

The Python SDK for the Root Signals API

592 14 1

metajudge

A reliability and DIF report card for LLM-judge and human-rater scoring instruments.

573 1 0

llm-summary

Use an LLM to summarize paragraphs.

563 0 0

pytest-llm-rubric

Pytest plugin for semantic PASS/FAIL checks using LLM-as-a-Judge

536 0 0

veritail

LLM-as-a-Judge evaluation platform for ecommerce search. Scores relevance, computes IR metrics, and flags quality issues across multiple retail verticals.

498 6 1