Agent Benchmark Python Packages

evalview

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

4K 120 21

tracecore

Deterministic runtime for agent evaluation.

763 8 0

nasde-toolkit

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

672 10 0

dspy-security-bench

Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite.

442 4 0

codejoust

A CLI arena for AI coding agents. Throw one bug at Claude Code, Codex, aider — let them race, auto-score, and pick the winner.

230 8 0