Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.
Behavioral auditing toolkit for LLMs — audit any model across 8 dimensions (factual, toxicity, bias, sycophancy, reasoning, refusal, deception, over-refusal) using teacher-forced confidence probes.