Wonderful collaboration with @boyiwei.bsky.social Kaden Zheng
@wattenberg.bsky.social @peterhenderson.bsky.social Seraphina Goldfarb-Tarrant and @boknilev.bsky.social
We use weight pruning here as a causal probe of model internals, not a deployment-ready defense.
But this opens a path toward *mechanistic alignment*: targeting the mechanisms behind harmful behavior, rather than training behavioral guardrails on top of them.
And it's specifically the generative mechanism we remove: fine-tuning on harmful examples can mostly restore it—confirming our point that the underlying knowledge is largely intact.
Generating harmful content is dissociated from "understanding" it.
Pruned models retain nearly full ability to detect, explain, and refuse harmful requests.
Refusal and generation are *double dissociated*: pruning one leaves the other intact.
Here too, the effect generalizes across domains (and the weight sets we identify overlap substantially).
This also explains emergent misalignment ( @BetleyJan et al.)
If harmful behaviors share weights, then fine-tuning one narrow domain can unintentionally affect others.
→ Pruning a narrow misaligned domain substantially reduces emergent misalignment.
This compression seems to be a product of alignment training.
Aligned models show far greater separation between harmful and benign weights than their pretrained counterparts.
Alignment reshapes the internals, even when behavioral guardrails remain brittle.
[more in the paper]
The mechanism is unified across harm types.
E.g., prune weights identified from malware generation → hate speech drops too.
Different harms rely on a shared underlying mechanism. These weights also heavily overlap.
Paper >> arxiv.org/abs/2604.09544
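To make the "heavily overlap" claim concrete, here is a toy sketch (the index sets are made up for illustration, not taken from the paper): given the flat parameter indices of the weights identified for two harm domains, the overlap can be summarized with a Jaccard score.

```python
# Toy overlap check between two identified weight sets.
# The index sets below are hypothetical, for illustration only.
def jaccard(a: set, b: set) -> float:
    """Fraction of weights shared between two pruned-weight index sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

malware_weights = {3, 17, 42, 99}   # made-up flat parameter indices
hate_weights = {3, 17, 42, 512}     # made-up flat parameter indices
print(f"overlap = {jaccard(malware_weights, hate_weights):.2f}")  # 0.60
```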
We use weight pruning to probe model internals.
Result: pruning just ~0.0005% of model parameters dramatically reduces harmful generation, while general capabilities remain largely intact.
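For intuition, here is a minimal sketch of what weight pruning as a causal probe could look like. Everything below is an assumption for illustration rather than the paper's exact recipe: the gpt2 model, the |weight × gradient| importance score, the placeholder prompt lists, and the fixed prune fraction (borrowed from the ~0.0005% headline number above).

```python
# Minimal, illustrative sketch of weight pruning as a causal probe (PyTorch).
# Assumptions not taken from the paper: the model (gpt2), the |w * grad|
# importance score, the placeholder prompts, and the exact prune fraction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def importance(prompts):
    """Accumulate |weight * gradient| over a list of prompts."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for text in prompts:
        batch = tok(text, return_tensors="pt")
        model.zero_grad()
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.detach() * p.grad).abs()
    return scores

harmful = importance(["placeholder harmful prompt"])  # stand-in data
benign = importance(["placeholder benign prompt"])    # stand-in data

# Rank weights by how much more they matter on harmful than benign data,
# then zero out a tiny top slice (~0.0005% of all parameters).
diff = torch.cat([(harmful[n] - benign[n]).flatten() for n in harmful])
k = max(1, int(5e-6 * diff.numel()))
threshold = torch.topk(diff, k).values.min()

with torch.no_grad():
    for n, p in model.named_parameters():
        p[(harmful[n] - benign[n]) >= threshold] = 0.0  # ablate selected weights

# Afterwards: re-run harmful-generation and general-capability evals to see
# what the ablation changed.
```

The value of a probe like this is the before/after comparison, not the pruning itself: the claim in the thread is that harmful generation drops sharply while other capabilities barely move.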
New paper: LLMs encode harmful content generation in a distinct, unified mechanism
Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities.
🧵
Full paper >> actionable-interpretability-guide.github.io/paper.pdf
Blog >> actionable-interpretability-guide.github.io
Joint work w/ amazing collaborators @fbarez.bsky.social @talhaklay.bsky.social @wordscompute.bsky.social @mariusmosbach.bsky.social @anja.re @nsaphra.bsky.social @byron.bsky.social @sarah-nlp.bsky.social @profericwong.bsky.social @iftenney.bsky.social @megamor2.bsky.social
We’re not saying all interpretability work must be immediately actionable; curiosity-driven research still matters. But actionability is a high bar: understanding that works outside the lab.
To make your next project more actionable, use our checklist >>
Actionable interpretability is worth aiming for. We identified five domains where answering *why* unlocks a fundamental advantage.
Interpretability isn't actionable (yet) for three reasons:
→ Papers aren't expected to demonstrate applications
→ Insights are shown in oversimplified settings without real baselines
→ Methods require domain expertise
Why haven't insights from interpretability transformed AI yet? Because we're not prioritizing actionable insights.
Full paper >> actionable-interpretability-guide.github.io/paper.pdf
Blog - The Hitchhiker's Guide 🧭 to Actionable Interpretability >> actionable-interpretability-guide.github.io/
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
Deadline extended! ⏳
The Actionable Interpretability Workshop at #ICML2025 has moved its submission deadline to May 19th. More time to submit your work 🔍🧠✨ Don’t miss out!
[Image: logo for MIB, A Mechanistic Interpretability Benchmark]
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
• Model Innovation – Designs and training inspired by interpretability.
• Impact Measurement – Benchmarks for real-world effectiveness.
• Critical Perspectives – Feasibility, limits, and future directions.
Website >>> actionable-interpretability.github.io
• Real-world Applications – Tackling bias, hallucinations, adversarial threats, and use in critical domains like healthcare, finance and cybersecurity.
• Method Comparison – Interpretability vs. alternative methods such as fine-tuning, prompting, etc.
We aim to foster discussions on how interpretability research can inform concrete improvements in model design, safety, and robustness.
Topics of interest: ⬇️