Evals Python Packages

logfire

AI observability platform for production LLM and agent systems.

19M 4K 253

harbor-rewardkit

Framework for evaluating and improving agents

2.9M 3K 1K

arize-phoenix

AI Observability & Evaluation

2.2M 10K 956

arize-phoenix-otel

AI Observability & Evaluation

1.8M 10K 956

arize-phoenix-client

AI Observability & Evaluation

977K 10K 956

arize-phoenix-evals

AI Observability & Evaluation

750K 10K 956

agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, LangChain, AutoGen, AG2, and CamelAI.

318K 6K 604

trulens-core

Evaluation and Tracking for LLM Experiments and AI Agents

69K 3K 309

hud-python

RL environments + evals for AI agents. Define once, train anything.

56K 271 61

trulens-dashboard

Evaluation and Tracking for LLM Experiments and AI Agents

47K 3K 309

trulens-eval

Evaluation and Tracking for LLM Experiments and AI Agents

42K 3K 309

trulens-feedback

Evaluation and Tracking for LLM Experiments and AI Agents

40K 3K 309

trulens-otel-semconv

Evaluation and Tracking for LLM Experiments and AI Agents

39K 3K 309

trulens

Evaluation and Tracking for LLM Experiments and AI Agents

37K 3K 309

verel

Verified agents — nothing is "done" until a grader returns a verdict. Eyes (AgentVision) + verdict bus + compounding memory + a fleet + agent-built tooling + agent-run CI/CD.

22K 4 2

evalica

Evalica, your favourite evaluation toolkit

19K 64 5

trulens-connectors-snowflake

Evaluation and Tracking for LLM Experiments and AI Agents

17K 3K 309

trulens-providers-openai

Evaluation and Tracking for LLM Experiments and AI Agents

16K 3K 309

selectools

Production-ready Python framework for AI agents with built-in guardrails, audit logging, cost tracking, and hybrid RAG. Supports OpenAI, Anthropic, Gemini, Ollama. By NichevLabs.

12K 10 3

brein-mcp

🧠 It remembers what your company forgets - a brain that measures itself and patches its own gaps.

12K 2 0

trulens-apps-langchain

Evaluation and Tracking for LLM Experiments and AI Agents

12K 3K 309

trulens-providers-litellm

Evaluation and Tracking for LLM Experiments and AI Agents

11K 3K 309

trulens-providers-cortex

Evaluation and Tracking for LLM Experiments and AI Agents

10K 3K 309

multivon-eval

Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI/CD integration.

9K 8 0