prompt-evaluation
Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.
Statistically sane analysis methods for comparing AI model and prompt performance.
Statistical analysis methods for comparing prompt and model performance in LLM evaluations.
The prompt engineering, prompt management, and prompt evaluation tool for Python.