LLM Benchmarking Python Packages

elastik

AuditeDB: the db that listens. Python client for the Elastik L5 Engine.

4K 29 1

openvals

Open-source AI model evaluation and benchmarking framework for LLMs (OpenAI, Ollama, Claude, Gemini)

2K 25 21

awb

Benchmark harness measuring AI coding tool+workflow performance, not just model capability. 100 tasks, sigmoid scoring, 12 capability dimensions, gap analysis.

2K 10 5

agent-action-guard

🛡️ Safe AI Agents through Action Classifier

1K 10 7

porchbench

Rigorous quality benchmarking for local LLMs - paired stats, LLM-as-judge scoring, routing discovery, and reproducible runs on your hardware.

489 0 0

evalsense

Tools for systematic large language model evaluations

484 4 2

trusteval-ai

Enterprise LLM Evaluation & Responsible AI Framework — Benchmark bias, hallucination, PII leakage, and toxicity across Healthcare, BFSI, Retail & Legal industries. Supports OpenAI, Anthropic, Gemini & HuggingFace. Python SDK + CLI + Web Dashboard. 191 tests. Compliance-ready reports.

297 8 4