Mechanistic Interpretability Python Packages

pyvene

Stanford NLP Python library for understanding and improving PyTorch models via interventions

9K 888 109

glassbox-mech-interp

Open-source EU AI Act Annex IV documentation toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a structured, hash-chained evidence package.

4K 2 0

nnterp

Unified access to Large Language Model modules using NNsight

2K 115 12

steering-vectors

Steering vectors for transformer language models in PyTorch / Hugging Face

2K 155 19

safety-compass

Detect safety degradation during LLM fine-tuning before it becomes behavioral

2K 1 0

llmoji

Making agents cuter

1K 2 0

carl-studio

CARL (Coherence-Aware Reinforcement Learning): information-theoretic reward signals for LLM training via token-level probability distributions

562 3 1

warden-interp

Circuit-level regression testing for AI systems. Catch mechanistic drift that behavioral evals miss.

561 3 0

mlxlmprobe

Universal probing and interpretability tool for MLX language models on Apple Silicon

514 5 0

gavagai

Quantify translation indeterminacy between sparse autoencoder feature dictionaries (Quine × Mechanistic Interpretability).

496 0 0

openinterp

Python SDK and CLI: FabricationGuard hallucination probe, ProbeBench leaderboard, Atlas search, and Trace generation. Install with: pip install openinterp

475 0 1