ai-benchmarks
Python SDK for HumanJudge — real human evaluations of AI models. 25,000+ blind reviews by 200+ verified reviewers across 58 models and 44 benchmarks. Free.
AI Action Firewall — seven-stage Decision Intelligence Core for safe agentic AI
Guardex - AI Control Plane for autonomous agents (closed source)
A benchmark that challenges language models to code solutions for scientific problems