AI Evaluation Tools Python Packages

eval-ai-library

Comprehensive AI model evaluation framework with advanced techniques, including temperature-controlled verdict aggregation via a generalized power mean. Supports multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.

3K 42 3
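
For readers unfamiliar with the generalized power mean named in the eval-ai-library description, here is a minimal sketch of the textbook formula, M_p(x) = ((1/n) * sum(x_i^p))^(1/p). It is not the library's API: the function name `power_mean`, its parameters, and the example scores are all hypothetical, and how the library maps its "temperature" setting onto the exponent `p` is not stated in the listing, so that mapping is left out.

```python
import math
from typing import Sequence


def power_mean(scores: Sequence[float], p: float) -> float:
    """Generalized (power) mean of non-negative scores with exponent p.

    Hypothetical helper for illustration only; not taken from eval-ai-library.
    p = 1 is the arithmetic mean, p = 0 the geometric mean (the limit as
    p approaches 0), and p = -1 the harmonic mean.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    # For p <= 0, a single zero score pulls the mean down to zero.
    if p <= 0 and any(s == 0 for s in scores):
        return 0.0
    if p == 0:
        # Geometric mean, computed in log space.
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)


# Hypothetical per-verdict scores in [0, 1].
verdicts = [0.2, 0.8]
strict = power_mean(verdicts, -1)   # harmonic mean, weighted toward the low score
neutral = power_mean(verdicts, 1)   # arithmetic mean
lenient = power_mean(verdicts, 4)   # weighted toward the high score
```

Lower exponents make the aggregate more sensitive to the weakest verdict and higher exponents to the strongest, which is presumably what a temperature-style control would adjust.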

agentneo

Python SDK for the Agent AI Observability, Monitoring and Evaluation Framework. Features include tracing of agents, LLMs, and tools, debugging of multi-agent systems, a self-hosted dashboard, and advanced analytics with timeline and execution graph views.

2K 16K 4K

promptstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

494 107 2

quantiles

Open-source, local-first eval infrastructure for benchmarking AI systems, inspecting runs, comparing results, and debugging regressions with a CLI, SDKs, and coding-agent workflows.

327 0 0

evalstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

293 107 2