Hallucination Detection Python Packages

entroly-core

Cut your Claude / OpenAI / Gemini bill 70–95% on AI coding. Local proxy that compresses context, keeps provider caches hot, and verifies LLM output ($0 hallucination guard). Drop-in for Cursor, Claude Code, Codex, Aider + 34 more and custom providers — 30s, no code changes

15K 419 66

styxx

The measurement layer for machine minds. Reads what a model means and whether it holds the truth; certifies that every claim re-runs. meaning_diff + OATH certify + mind profiles + live grounding signal + the cognometric instruments. No torch, no LLM in the loop for the core; MIT, open at the core.

13K 13 1

entroly

10K 419 66

multivon-eval

Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI/CD integration.

9K 8 0

uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

7K 1K 127

lettucedetect

Lightweight hallucination detection framework for RAG applications

5K 581 48

groundlens

Geometric LLM grounding verification — deterministic, auditable, no second LLM. Python library for measuring how faithfully model outputs reflect their sources.

4K 5 0

backfire-kernel

Real-time LLM hallucination guardrail — NLI + RAG fact-checking with opt-in claim-level streaming contradiction halt. Drop-in for any LLM backend.

4K 1 0

insa-its

Runtime Security for Multi-Agent AI — Website & Documentation

4K 36 3

proofagent-harness

Open-source test harness for AI agents. Stress-test production agents with adversarial multi-turn scenarios in CI

3K 6 2

director-ai

Real-time LLM hallucination guardrail — NLI + RAG fact-checking with opt-in claim-level streaming contradiction halt. Drop-in for any LLM backend.

3K 1 0

peekr

Zero-config observability for AI agents. Auto-instruments OpenAI & Anthropic SDKs.

3K 3 0

uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

2K 2K 202

wauldo-nemo

Independent claim-level answer verification as NeMo Guardrails output rails, powered by Wauldo.

2K 0 0

qwed

A deterministic verification layer for AI systems. QWED verifies AI outputs using mathematics, symbolic reasoning, and formal methods (Z3, SMT, SymPy), creating an auditable trust boundary for agentic AI. Not generation. Verification.

2K 58 10

dingo-python

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

1K 720 74

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

1K 6 0

longtrainer

Production-ready RAG framework for Python — multi-tenant chatbots with streaming, tool calling, agent mode (LangGraph), vector search (FAISS), and persistent MongoDB memory. Built on LangChain.

1K 30 2

groundguard

Verify LLM output against your source documents. Catch hallucinations in RAG pipelines and agentic workflows before they reach users.

1K 0 0

ifixai

Catch your AI's mistakes and blind spots before your customers or regulators do. iFixAi runs 45 inspections, 32 graded core plus 13 extended for frontier risks like sabotage, sandbagging, and oversight evasion. It returns a letter grade in under 5 minutes. Industry and model agnostic.

1K 1K 155

sovereign-shield

Production-grade AI defense — hybrid deterministic filters + optional LLM veto + HITL approval + file validation + hallucination detection. OS-enforced immutability.

1K 19 7

veralith

Hallucination diagnosis for RAG systems — Sufficiency, Faithfulness, Completeness verdicts plus rule-based remediation.

766 2 0

receipts-gate

Forces AI agents to back every claim with evidence or explicitly declare it an assumption, rather than guessing.

733 1 0

yuragi

LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is

610 0 0