LLM Benchmark Python Packages

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

1K 6 0

workswithagents

Works With Agents — Agent OSI Model Python SDK

994 0 0

context-bench

Benchmark any system that transforms LLM context: compressors, RAG rerankers, memory managers, and more.

293 0 0

arguslm

The hundred-eyed watcher for your LLM providers. Monitor uptime, TTFT, TPS, and latency across OpenAI, Anthropic, Azure, Bedrock, Ollama, LM Studio, and 100+ providers through a single dashboard. Benchmark, compare, and get alerts — all self-hosted.

282 1 1

pickyourllm

Pick Your LLM: Intelligent, Use-Case Aware LLM Model advisor for Optimal Performance and Cost

233 1 0