Agent Evals Python Packages

agentloss

Measure the real-world error rate and dollar cost of an AI agent's decisions. OpenTelemetry-native.

4K 1 0

nodetracer

The node-level tracing library for agentic software.

454 1 1

evalops

An implementation of Anthropic's paper and essay on "A statistical approach to model evaluations".

315 16 2

repoagentbench

SWE-bench for your codebase. Turn merged PRs into reproducible coding-agent benchmarks.

160 32 0