Wonderful collaboration with @boyiwei.bsky.social Kaden Zheng
@wattenberg.bsky.social @peterhenderson.bsky.social Seraphina Goldfarb-Tarrant and @boknilev.bsky.social
We use weight pruning here as a causal probe of model internals, not a deployment-ready defense.
But this opens a path toward *mechanistic alignment*: targeting the mechanisms behind harmful behavior, rather than training behavioral guardrails on top of them.
And it's specifically the generative mechanism we remove: fine-tuning on harmful examples can mostly restore it—confirming our point that the underlying knowledge is largely intact.
Generating harmful content is dissociated from "understanding" it.
Pruned models retain nearly full ability to detect, explain, and refuse harmful requests.
Refusal and generation are *double dissociated*: pruning one leaves the other intact.
Here too, the effect generalizes across domains (and the weight sets we identify overlap substantially).
This also explains emergent misalignment ( @BetleyJan et al.)
If harmful behaviors share weights, then fine-tuning one narrow domain can unintentionally affect others.
→ Pruning a narrow misaligned domain substantially reduces emergent misalignment.
This compression seems to be a product of alignment training.
Aligned models show far greater separation between harmful and benign weights than their pretrained counterparts.
Alignment reshapes the internals, even when behavioral guardrails remain brittle.
[more in the paper]
The mechanism is unified across harm types.
E.g., prune weights identified from malware generation → hate speech drops too.
Different harms rely on a shared underlying mechanism. These weights also heavily overlap.
Paper >> arxiv.org/abs/2604.09544
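To make the "heavily overlap" claim concrete, here is a toy sketch (the index sets are made up for illustration, not taken from the paper): given the flat parameter indices of the weights identified for two harm domains, the overlap can be summarized with a Jaccard score.

```python
# Toy overlap check between two identified weight sets.
# The index sets below are hypothetical, for illustration only.
def jaccard(a: set, b: set) -> float:
    """Fraction of weights shared between two pruned-weight index sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

malware_weights = {3, 17, 42, 99}   # made-up flat parameter indices
hate_weights = {3, 17, 42, 512}     # made-up flat parameter indices
print(f"overlap = {jaccard(malware_weights, hate_weights):.2f}")  # 0.60
```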
We use weight pruning to probe model internals.
Result: pruning just ~0.0005% of model parameters dramatically reduces harmful generation, while general capabilities remain largely intact.
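For intuition, here is a minimal sketch of what weight pruning as a causal probe could look like. Everything below is an assumption for illustration rather than the paper's exact recipe: the gpt2 model, the |weight × gradient| importance score, the placeholder prompt lists, and the fixed prune fraction (borrowed from the ~0.0005% headline number above).

```python
# Minimal, illustrative sketch of weight pruning as a causal probe (PyTorch).
# Assumptions not taken from the paper: the model (gpt2), the |w * grad|
# importance score, the placeholder prompts, and the exact prune fraction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def importance(prompts):
    """Accumulate |weight * gradient| over a list of prompts."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for text in prompts:
        batch = tok(text, return_tensors="pt")
        model.zero_grad()
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.detach() * p.grad).abs()
    return scores

harmful = importance(["placeholder harmful prompt"])  # stand-in data
benign = importance(["placeholder benign prompt"])    # stand-in data

# Rank weights by how much more they matter on harmful than benign data,
# then zero out a tiny top slice (~0.0005% of all parameters).
diff = torch.cat([(harmful[n] - benign[n]).flatten() for n in harmful])
k = max(1, int(5e-6 * diff.numel()))
threshold = torch.topk(diff, k).values.min()

with torch.no_grad():
    for n, p in model.named_parameters():
        p[(harmful[n] - benign[n]) >= threshold] = 0.0  # ablate selected weights

# Afterwards: re-run harmful-generation and general-capability evals to see
# what the ablation changed.
```

The value of a probe like this is the before/after comparison, not the pruning itself: the claim in the thread is that harmful generation drops sharply while other capabilities barely move.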
New paper: LLMs encode harmful content generation in a distinct, unified mechanism
Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities.
🧵
Full paper >> actionable-interpretability-guide.github.io/paper.pdf
Blog >> actionable-interpretability-guide.github.io
Joint work w/ amazing collaborators @fbarez.bsky.social @talhaklay.bsky.social @wordscompute.bsky.social @mariusmosbach.bsky.social @anja.re @nsaphra.bsky.social @byron.bsky.social @sarah-nlp.bsky.social @profericwong.bsky.social @iftenney.bsky.social @megamor2.bsky.social
We’re not saying all interpretability work must be immediately actionable; curiosity-driven research still matters. But actionability is a high bar: understanding that works outside the lab.
To make your next project more actionable, use our checklist >>
Actionable interpretability is worth aiming for. We identified five domains where answering *why* unlocks a fundamental advantage.
Interpretability isn't actionable (yet) for three reasons:
→ Papers aren't expected to demonstrate applications
→ Insights are shown in oversimplified settings without real baselines
→ Methods require domain expertise
Why haven't insights from interpretability transformed AI yet? Because we're not prioritizing actionable insights.
Full paper >> actionable-interpretability-guide.github.io/paper.pdf
Blog - The Hitchhiker's Guide 🧭 to Actionable Interpretability >> actionable-interpretability-guide.github.io/
Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How?
We're ready to answer.
🧵
Deadline extended! ⏳
The Actionable Interpretability Workshop at #ICML2025 has moved its submission deadline to May 19th. More time to submit your work 🔍🧠✨ Don’t miss out!
[Image: logo for MIB, A Mechanistic Interpretability Benchmark]
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
• Model Innovation – Designs and training inspired by interpretability.
• Impact Measurement – Benchmarks for real-world effectiveness.
• Critical Perspectives – Feasibility, limits, and future directions.
Website >>> actionable-interpretability.github.io
• Real-world Applications – Tackling bias, hallucinations, adversarial threats, and use in critical domains like healthcare, finance and cybersecurity.
• Method Comparison – Interpretability vs. alternative methods such as fine-tuning, prompting, etc.
We aim to foster discussions on how interpretability research can inform concrete improvements in model design, safety, and robustness.
Topics of interest: ⬇️