Llm Testing Python Packages

proofagent-harness

Open-source test harness for AI agents. Stress-test production agents with adversarial multi-turn scenarios in CI

3K 6 2

langtest

Deliver safe & effective language models

3K 562 50

nlptest

Deliver safe & effective language models

3K 562 50

evalcraft

Generate deterministic pytest tests for your AI agents from one real run, then replay them in CI for $0. Fast, flake-free agent testing.

2K 4 2

fakellm

Mock OpenAI and Anthropic APIs locally for fast, deterministic tests.

2K 3 0

agentneo

Python SDK for Agent AI Observability, Monitoring and Evaluation Framework. Includes features like agent, llm and tools tracing, debugging multi-agentic system, self-hosted dashboard and advanced analytics with timeline and execution graph view

2K 16K 4K

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

1K 6 0

llamator

Framework for testing vulnerabilities of GenAI systems.

997 216 21

temper-skills

Test suites + deterministic Python for your agent skills' decision logic. Adversarial persona reviewers write the tests; the generated code must keep passing them — zero LLM calls at inference. Try: uvx temper-skills audit <skill.md>

974 0 0

ai-safety-tester

LLM Security Testing Framework | CVE-style severity scoring, prompt injection detection, bias analysis | Strict mypy typing | Fast CI/CD with pytest markers | PyPI: ai-safety-tester

929 0 0

llm-behave

Behavioral testing for LLM applications. pytest plugin with semantic assertions, multi-turn conversation testing, and drift detection. No LLM judge needed.

818 1 0

agent-eval-runner

5-case free starter for the AI Agent QA Eval Pack. Vendor-agnostic YAML eval cases for tool-using LLM agents (LangChain / OpenAI / Anthropic / custom). CC BY 4.0.

787 0 0

agentassay

Formal regression testing for non-deterministic AI agent workflows

722 5 1

api-test-ninja

API Ninja simplifies API testing by allowing users to define test flows in plain English.

722 2 1

infer-check

Correctness and reliability testing for LLM inference engines

653 2 0

recalllab

pytest for agent memory: turn memory expectations into pytest contracts that run in CI.

575 0 0

ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration

560 1 0

tinyqabenchmarkpp

A tiny synthetic QA LLM benchmark dataset generator using LiteLLM.

539 16 0

replayd

Turn failed AI agent runs into replayable regression tests

487 17 2

ccheck

A human-friendly framework for testing and evaluating LLMs, RAGs, and chatbots.

384 95 11

llm-sentry

Unified AI reliability platform. One install, 12 diagnostic engines. Continuous monitoring, fault diagnosis, and compliance for LLM pipelines.

380 0 0

mocktopus

🐙 Multi-armed mocks for LLM apps - Drop-in replacement for OpenAI/Anthropic APIs for deterministic testing

342 8 0

pyllmtest

🚀 Comprehensive testing framework for LLM applications with semantic assertions, multi-provider support, RAG testing, and prompt optimization. Test AI the right way!

325 1 0

agentcloudkelp

YAML-first stress testing for AI agents. Inject faults, catch behavioral drift, enforce cost budgets.

305 1 0