llm-evals
Measure LLM output consistency from the command line.
Benchmark assets, reproducibility tooling, and evidence checks for dormant-behavior audits.
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations".
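
To give a rough sense of what a statistical treatment of eval scores involves, here is a minimal sketch that reports a mean score with its standard error and a 95% confidence interval instead of a bare average. The function name `mean_with_ci` and the sample scores are hypothetical and chosen only for illustration. This is not this project's API, and it assumes the per-question scores are independent.

```python
import math
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, standard error, (low, high)) for per-question scores.

    Hypothetical helper for illustration only. Assumes scores are
    independent, so the standard error of the mean is s / sqrt(n).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    m = mean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = stdev(scores) / math.sqrt(n)
    # Normal-approximation interval; z = 1.96 corresponds to roughly 95%.
    return m, sem, (m - z * sem, m + z * sem)


# Made-up per-question results: 1 = correct, 0 = incorrect.
scores = [1, 0, 1, 1, 0, 1, 1, 1]
m, sem, (low, high) = mean_with_ci(scores)
print(f"accuracy = {m:.3f}, SEM = {sem:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With only a handful of questions the normal approximation is crude, so the interval here is illustrative rather than something to report.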