RAG Evaluation Python Packages

agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

41K 4K 559

giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

32K 5K 479

multivon-eval

Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI/CD integration.

9K 8 0

autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

8K 5K 406

contexttrace

Local-first SDK and CLI for debugging RAG failures, verifying citations, classifying failure modes, and generating reliability reports for user-built RAG and AI agent systems.

4K 0 0

kb-arena

Benchmark 9 retrieval architectures (vector, contextual, QnA, knowledge graph, hybrid, RAPTOR, PageIndex, BM25, rerank) on your own docs. Automated hyperparameter search with bootstrap CIs and significance tests.

3K 7 3

ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 22 3

mcpaisuite-evalmcp

Evaluation for AI agents — judge-based scoring and native RAG metrics (faithfulness, relevancy, context precision/recall). Python lib · CLI · MCP server · FastAPI.

2K 0 0

proofrag

Point your agent at your docs and your RAG app; get a golden test set + an LLM-as-judge & retrieval scorecard, in one command.

1K 2 0

llamator

Framework for testing vulnerabilities of GenAI systems.

997 216 21

veralith

Hallucination diagnosis for RAG systems — Sufficiency, Faithfulness, Completeness verdicts plus rule-based remediation.

766 2 0

open-rag-eval

RAG evaluation without the need for "golden answers"

764 381 23

ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration

560 1 0

dokis

Trust reporting and provenance enforcement for RAG pipelines. Audits claim grounding, source allowlists, and trust failures without an LLM call.

463 37 0

ragverdict

pytest for RAG agents — behavioral audits with PASS/FAIL/WEAK verdicts

450 0 0

hitgate

Portable hybrid (BM25 + dense + RRF) retrieval engine and a label-free evaluation harness — extracted from a personal AI-assistant memory index and decoupled to run on any source tree.

442 2 0

rag-forge-evaluator

Evaluation engine: RAGAS, DeepEval, LLM-as-Judge, and audit report generation

429 7 0

rag-forge-core

RAG pipeline primitives: ingestion, retrieval, context management, and security

423 7 0

smallevals

CPU-fast, GPU-blazing-fast offline retrieval evaluation for RAG systems with tiny QA models.

401 20 2

rag-forge-observability

Observability stack: OpenTelemetry tracing, Langfuse integration, and drift detection

386 7 0

rurage

RURAGE (Robust Universal RAG Evaluation) is a Python library developed to speed up evaluation of RAG systems along the Correctness, Faithfulness, and Relevance axes using a variety of deterministic and model-based metrics.

339 34 0

vero-eval

The End-to-End LLM Evaluation Framework

292 31 2

rag-benchmarking

Framework-agnostic evaluation harness for RAG and agentic AI systems

286 0 0

longprobe

Sub-second RAG regression testing. Define golden questions, detect lost chunks in CI. pytest for your RAG pipeline.

278 7 2