PyRank

Evals Python Packages

Python packages tagged with the GitHub topic "evals", sorted by relevance and shown with their stars and monthly downloads.
pydantic
logfire

AI observability platform for production LLM and agent systems.

21.9M 4K 236
Arize-ai
arize-phoenix

AI Observability & Evaluation

2.4M 10K 882
Arize-ai
arize-phoenix-otel

AI Observability & Evaluation

1.8M 10K 882
Arize-ai
arize-phoenix-client

AI Observability & Evaluation

952K 10K 882
Arize-ai
arize-phoenix-evals

AI Observability & Evaluation

765K 10K 882
AgentOps-AI
agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.

748K 6K 578
truera
trulens-core

Evaluation and Tracking for LLM Experiments and AI Agents

147K 3K 280
truera
trulens

Evaluation and Tracking for LLM Experiments and AI Agents

123K 3K 280
harbor-framework
harbor-rewardkit

Harbor is a framework for running agent evaluations and creating and using RL environments.

117K 2K 1K
truera
trulens-providers-litellm

Evaluation and Tracking for LLM Experiments and AI Agents

73K 3K 280
truera
trulens-dashboard

Evaluation and Tracking for LLM Experiments and AI Agents

62K 3K 280
truera
trulens-feedback

Evaluation and Tracking for LLM Experiments and AI Agents

61K 3K 280
truera
trulens-otel-semconv

Evaluation and Tracking for LLM Experiments and AI Agents

60K 3K 280
truera
trulens-eval

Evaluation and Tracking for LLM Experiments and AI Agents

53K 3K 280
truera
trulens-connectors-snowflake

Evaluation and Tracking for LLM Experiments and AI Agents

35K 3K 280
manav8498
shadow-diff

Behavior contracts for AI agents

31K 9 0
dustalov
evalica

Evalica, your favourite evaluation toolkit

21K 62 5
truera
trulens-providers-cortex

Evaluation and Tracking for LLM Experiments and AI Agents

21K 3K 280
truera
trulens-providers-openai

Evaluation and Tracking for LLM Experiments and AI Agents

16K 3K 280
truera
trulens-apps-langchain

Evaluation and Tracking for LLM Experiments and AI Agents

14K 3K 280
blackwell-systems
mcp-assert

The deterministic testing standard for MCP servers. Connect over real stdio/SSE/HTTP transport, call tools with real arguments, assert results with 18 assertion types defined in YAML. Any language, any transport, no mocks. Single Go binary.

10K 8 1
truera
trulens-apps-llamaindex

Evaluation and Tracking for LLM Experiments and AI Agents

10K 3K 280
ben-ranford
cellin

Build long-lived multimodal memory, dream over it, and retrieve context with transparent weighting.

7K 0 0
johnnichev
selectools

Production-ready Python framework for AI agents with built-in guardrails, audit logging, cost tracking, and hybrid RAG. Supports OpenAI, Anthropic, Gemini, Ollama. By NichevLabs.

6K 9 2
    • Data from PyPI, GitHub, ClickHouse, and BigQuery