PyRank

RAG Evaluation Python Packages

Python packages tagged with the GitHub topic rag-evaluation, sorted by relevance and shown with stars and monthly downloads.
agenta-ai
agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

48K 4K 520
Giskard-AI
giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

35K 5K 458
Marker-Inc-Korea
autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

8K 5K 395
xmpuspus
kb-arena

Benchmark 7 retrieval strategies on your own docs — naive vector, contextual, QnA pairs, knowledge graph, RAPTOR, PageIndex, and hybrid. Find which KB architecture fits your data.

3K 7 2
HZYAI
ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5
aiexponenthq
rag-benchmarking

RAG Benchmarking — Framework-agnostic RAG/agentic-AI evaluation harness. Faithfulness, agentic metrics, EU AI Act Article 15 accuracy evidence. Apache 2.0.

1K 0 0
LLAMATOR-Core
llamator

Red-teaming Python framework for testing chatbots and GenAI systems.

1K 211 20
Vbj1808
dokis

Lightweight RAG provenance middleware. Verifies every claim in an LLM response is grounded in a retrieved source - without an LLM call.

898 36 0
ENDEVSOLS
longprobe

Sub-second RAG regression testing. Define golden questions, detect lost chunks in CI. pytest for your RAG pipeline.

777 7 1
vectara
open-rag-eval

A Python package for RAG Evaluation

720 364 23
hallengray
rag-forge-core

Production-grade RAG pipelines with evaluation baked in

606 7 0
hallengray
rag-forge-evaluator

Production-grade RAG pipelines with evaluation baked in

577 7 0
hallengray
rag-forge-observability

Production-grade RAG pipelines with evaluation baked in

561 7 0
mts-ai
rurage

RURAGE (Robust Universal RAG Evaluation) is a Python library developed to speed up evaluation of RAG systems along the Correctness, Faithfulness, and Relevance axes, using a variety of deterministic and model-based metrics.

508 34 0
mburaksayici
smallevals

smallevals — CPU-fast, GPU-blazing fast offline retrieval evaluation for RAG systems with tiny QA models.

300 18 2
vero-labs-ai
vero-eval

Open source framework for evaluating AI Agents

209 29 2
RAILethicsHub
rail-score

DEPRECATED — use rail-score-sdk instead. This package redirects to rail-score-sdk.

166 2 1
shaadclt
eval-rag

A comprehensive evaluation toolkit for assessing Retrieval-Augmented Generation (RAG) outputs using linguistic, semantic, and fairness metrics

137 4 0
syncreus
syncreus-eval

Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent evals for OpenAI, Anthropic, and more. 14 evaluators, pytest plugin, composite trust scores.

136 2 0
dariero
ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration

84 1 0
Data from PyPI, GitHub, ClickHouse, and BigQuery