agent-benchmark
Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
CLI for benchmarking and evaluating AI coding agents on tasks you already understand, using your individual Claude, Codex, or Gemini subscriptions or API keys.
Pit AI coding agents against the same bug. Score them on tests, diff, cost, and time, then pick the winning patch.
Deterministic runtime for agent evaluation