PyRank

LLM Testing Python Packages

Python packages with the GitHub topic llm-testing, sorted by relevance and shown with stars and monthly downloads.
Basaltlabs-app
gauntlet-cli

Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.

3K 6 0
Pacific-AI-Corp
langtest

Deliver safe & effective language models

2K 557 49
raga-ai-hub
agentneo

Python SDK for an agent AI observability, monitoring, and evaluation framework. Features include agent, LLM, and tool tracing, debugging of multi-agent systems, a self-hosted dashboard, and advanced analytics with timeline and execution graph views.

2K 16K 4K
JohnSnowLabs
nlptest

Deliver safe & effective language models

2K 557 49
LLAMATOR-Core
llamator

Red-teaming Python framework for testing chatbots and GenAI systems.

1K 211 20
NahuelGiudizi
ai-safety-tester

LLM security testing framework with CVE-style severity scoring and multi-model benchmarking

1K 0 0
AquibNawab
agentcloudkelp

YAML-first stress testing for AI agents. Write a contract, inject faults, catch behavioral drift, enforce cost budgets. No Python test code needed — just kelp.yaml and a terminal.

994 1 0
qualixar
agentassay

Token-efficient stochastic testing for AI agents. 5-20x cost reduction. 10 framework adapters. Paper: arXiv:2603.02601

952 5 1
Swanand33
llm-behave

Behavioral testing for LLM applications. pytest plugin with semantic assertions, multi-turn conversation testing, and drift detection. No LLM judge needed.

818 1 0
ssilwal29
api-test-ninja

API Testing Framework to automate and simplify API testing using LLM Agents and tests defined in plain English.

649 2 1
nullpointerdepressivedisorder
infer-check

Catches the correctness bugs that benchmarks miss in LLM inference engines

614 2 0
vincentkoc
tinyqabenchmarkpp

Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.

422 15 0
Rowusuduah
llm-sentry

Unified AI Reliability Platform. One install, 12 diagnostic engines. Zero-dependency LLM pipeline monitoring.

359 0 0
Addepto
ccheck

A human-friendly framework for testing and evaluating LLMs, RAGs, and chatbots.

345 95 11
evalops
mocktopus

🐙 Multi-armed mocks for LLM apps - Drop-in replacement for OpenAI/Anthropic APIs for deterministic testing

316 6 0
RahulMK22
pyllmtest

🚀 Comprehensive testing framework for LLM applications with semantic assertions, multi-provider support, RAG testing, and prompt optimization. Test AI the right way!

267 1 0
tm243
agent-assembly-line

The simple way to build and embed AI agents into any software stack. Code-native, modular, and LLM-agnostic.

216 1 1
sazed5055
llmtest-framework

pytest for LLM apps - Test for grounding failures, prompt injection, safety violations, and regressions

212 3 0
adwantg
toolcallcheck

Deterministic Python testing for tool-using agents. Mock MCP tools, assert exact tool calls and trajectories, verify headers, and run offline in CI.

195 0 0
syncreus
syncreus-eval

Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent evals for OpenAI, Anthropic, and more. 14 evaluators, pytest plugin, composite trust scores.

136 2 0
LGTMLabs
misalign

A Python library for testing LLMs with prompts

134 0 0
chanikkyasaai
trajex

AI agent behavioral testing — learns what correct looks like, catches deviations automatically. Zero API keys needed.

109 0 0
dariero
ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration

88 1 0
    • Data from PyPI, GitHub, ClickHouse, and BigQuery