Prompt Evaluation Python Packages

compare-prompts

A lightweight CLI tool for evaluating LLM prompts. Run prompts side by side to instantly compare token usage, tone, reading level, and API costs. Natively supports OpenAI, Anthropic, Gemini, Groq, and Ollama. Perfect for safely refactoring prompts, tuning bot personalities, or proving that "be concise" actually lowers your bill.

5K 1 0

promptstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

494 107 2

judicator

Judging LLM-as-a-Judge: a screening tool for bias and miscalibration.

348 7 2

evalstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

293 107 2

prompt-foundry-python-sdk

The prompt engineering, prompt management, and prompt evaluation tool for Python.

293 8 0