Human Evaluation Python Packages

subset2evaluate

Find informative examples for efficient (human) evaluation of NLG models.

19K 17 3

pairadigm

Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation tool for systematic text evaluation using LLMs. Generate breakdowns, compare items, compute scores, and validate against human judgments. Supports Ollama, Hugging Face, Google Gemini, OpenAI, and Anthropic models.

368 8 0