llm-benchmarking
curl is all you need. Bytes over HTTP, packets over CoAP, one path between intelligences. Ships with CurlBench, the first LLM tool-use benchmark graded by HTTP status codes.
🛡️ Safe AI Agents through an Action Classifier
Tools for systematic large language model evaluations
Enterprise LLM Evaluation & Responsible AI Framework — Benchmark bias, hallucination, PII leakage, and toxicity across Healthcare, BFSI, Retail & Legal industries. Supports OpenAI, Anthropic, Gemini & HuggingFace. Python SDK + CLI + Web Dashboard. 191 tests. Compliance-ready reports.