LLM Evaluation Toolkit Python Packages

qa-metrics

A simple Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting through the OpenAI and Anthropic APIs.

6K 61 6

parea-ai

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

3K 82 11

langtest

Deliver safe & effective language models

3K 562 50

nlptest

Deliver safe & effective language models

3K 562 50

scalexi

scalexi is a versatile open-source Python library, optimized for Python 3.11+, that focuses on facilitating low-code development and fine-tuning of diverse Large Language Models (LLMs).

951 13 2

evalsense

Tools for systematic large language model evaluations

484 4 2

voigt-kampff

Voigt-Kampff — a behavioral safety scoring engine for AI systems. Part of the SAPIEN Framework.

168 2 0