reliability-testing
Measure LLM output consistency from the command line.
A middleware for LangChain agents that intentionally injects failures (exceptions) into tool and model calls to test agent resilience.
Falsification-first reliability testing for AI systems: perturb inputs, preserve replayable evidence, diff reliability across model changes.