LLM Evals Python Packages

agentdog

AgentDog helps developers inspect, test, score, and monitor AI agent runs locally.

4K 5 0

ai-stability

CLI-first LLM stability analyzer for measuring output consistency across repeated prompt runs.

421 0 0

dormant-behavior-audit

Benchmark assets, reproducibility tooling, and evidence checks for dormant behavior audits.

391 0 0

evalops

An implementation of Anthropic's paper and essay "A statistical approach to model evaluations".

315 16 2