agent-evals
The node-level tracing library for agentic software.
SWE-bench for your codebase. Turn merged PRs into reproducible coding-agent benchmarks.
An implementation of Anthropic's paper and essay on "A statistical approach to model evaluations".