model-testing
yuragi: LLM Confidence Fragility Analyzer. Perturbation-driven hallucination detection, with workshop-grade results on real benchmarks (TruthfulQA, n=412: ensemble AUC 0.73; TriviaQA, n=200: confidence-inversion AUC 0.75).
A Python package, command-line tool, and Shiny application for comparing short tandem repeat (STR) profiles.