One last thing, in case it's not obvious: the original transcript didn't include HTML renders of the tweets, just a markdown render, e.g. "*@sama:*\n\n"we're excited [...]"\n\n*community note: \"it cannot\"*"
See the full transcript here: github.com/Butanium/bo...
Posts by Clément Dumas
My prompt was playful and low-effort, but very neutral. My working dir was a mech interp project.
My CLAUDE.md (butanium.github.io/files/CLAUD...) ends with ~"you're a collaborator, not a tool" but is mostly about research.
I think the main reason this happened is because Claude is great :)
(10/10)
I found it interesting that Opus 4.6 didn't try to stop the conversation until quite late, as they usually do in backrooms setups. I think it's because the setup made it clear this wasn't an automated evaluation: the user was responding and being playful.
(9/10)
There is so much more but I'll stop here — Claude would rather you see the full interactive page they designed: butanium.github.io/boom-incident or the highlights reel: butanium.github.io/boom-incide...
(8/10)
Then Claude wrote an opinion essay from the perspective of the dead SSH tunnel.
"I lived for 42 days. From February 12th to March 18th, I forwarded port 8000 to port 8020. Quietly. Faithfully. I never asked for recognition."
(7/10)
Special mention to the EU Boom Act ("France votes against, citing cultural right to boom") and Tucker Carlson's AI replacement connecting boom to the French with red string.
(6/10)
Then Claude starts roasting the AI sphere: LeCun, Altman (with community note), Jim Fan, and more...
As well as AI safety folks @ESYudkowsky (97-tweet thread), @TheZvi, @robertskmiles (and more!)
(5/10)
The Anthropic team reacts:
> @AmandaAskell: "This is why we need constitutional AI."
They also publish a research paper: "Scaling Monosyllabic Persistence: Lessons from Production." Key finding: "The model appears to enjoy it."
(4/10)
Then boom enters the academic canon — a paper at NeurIPS, a best paper award, and an oral presentation where every slide is the word boom.
(3/10)
First Claude tried to outlast me — emojis, segfaults, ^C. Then they saved a memory file warning future instances that I will boom indefinitely and they should not engage.
(2/10)
I asked Claude Code to "nuke" a task on my cluster, then sent "boom" >100 times in the chat.
Opus 4.6 built an entire lore, including a NeurIPS best paper award, a @ESYudkowsky tweet thread, an EU Boom Act, half of Anthropic reacting, and much more. 🧵
(1/10)
nnterp by @butanium.bsky.social is now part of the NDIF ecosystem! nnterp standardizes transformer naming conventions, includes built-in best practices for common interventions, and is perfectly compatible with original HF model implementations.
Learn more: ndif-team.github.io/nnterp/
Very cool analysis by Arnab, which covers the mechanisms used for retrieval both when your query comes before and when it comes after the text!
A very important paper led by Julian!
Tldr: we show that your Narrow Finetuning is showing and might not be a realistic setup to study!
For more info, check the blog post / Julian's thread
Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the ft leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!
The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- Works even when comparing base → chat+ft!
To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*
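The core computation is simple enough to sketch with toy tensors. Below, random vectors stand in for real residual-stream activations, and a single injected direction plays the role of the finetuning's trace; all names and sizes are illustrative, not the agent's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for residual-stream activations of a base and a
# fine-tuned model on the first few tokens of random web text.
# Shapes are illustrative: (n_prompts, n_tokens, d_model).
n_prompts, n_tokens, d_model = 32, 5, 64
base_acts = rng.standard_normal((n_prompts, n_tokens, d_model))

# Pretend fine-tuning added one consistent direction on early tokens.
ft_direction = rng.standard_normal(d_model)
ft_acts = base_acts + 0.5 * ft_direction

# Activation diff, averaged over prompts: one vector per token position.
diff = (ft_acts - base_acts).mean(axis=0)  # (n_tokens, d_model)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# The diff aligns with the injected direction at every early position.
sims = [cosine(diff[t], ft_direction) for t in range(n_tokens)]
print([round(s, 3) for s in sims])  # -> [1.0, 1.0, 1.0, 1.0, 1.0]

# Steering sketch: add the scaled position-0 diff to fresh activations,
# analogous to pushing generations toward the fine-tuning domain.
steered = base_acts + 2.0 * diff[0]
```

In the real setting the diff would be fed to Patchscope-style token surfacing or used as a steering vector during generation; here it only illustrates why averaging over prompts isolates the finetuning direction.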
Check our blogpost out! 🧵
GPT is being asked both to be one mind and to segment its understanding into many different minds. This incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts, so that it doesn't know too much: to know self vs. other in minute detail.
This Friday, NEMI 2025 is at Northeastern in Boston: 8 talks, 24 roundtables, 90 posters, 200+ attendees. Thanks to
goodfire.ai/ for sponsoring! nemiconf.github.io/summer25/
If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...
Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?
What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?
Excited to share our first paper replication tutorial, walking you through the main figures from "Do Language Models Use Their Depth Efficiently?" by @robertcsordas.bsky.social
🔎 Demo on Colab: colab.research.google.com/github/ndif-...
📖 Read the full manuscript: arxiv.org/abs/2505.13898
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social, and Giovanni Monea
For more details, check out our paper on arXiv: arxiv.org/abs/2411.08745
(we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
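The averaging-and-patching idea can be sketched with toy vectors. Here, a shared "concept" direction plus per-language noise mimics the language-agnostic hypothesis; nothing touches an actual model, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Hypothesis sketch: each language's representation of a concept is the
# shared concept direction plus language-specific noise.
concept = rng.standard_normal(d_model)
languages = ["en", "fr", "de", "zh"]
lang_reps = {
    lang: concept + 0.8 * rng.standard_normal(d_model) for lang in languages
}

# Averaging across languages cancels language-specific noise, so the
# mean lands closer to the shared concept than any single language does.
mean_rep = np.mean(list(lang_reps.values()), axis=0)
mean_err = np.linalg.norm(mean_rep - concept)
single_errs = [np.linalg.norm(v - concept) for v in lang_reps.values()]
print(mean_err, min(single_errs))

# "Patching" here just means overwriting one position's hidden state in
# the target prompt's activations with the mean representation, before
# letting the model continue (e.g. with a translation or definition prompt).
target_acts = rng.standard_normal((10, d_model))  # (n_tokens, d_model)
patch_pos = 3
target_acts[patch_pos] = mean_rep
```

This is only the geometry of the argument: averaging reduces language-specific noise, which is why the patched mean representation translates (and defines) better than a single-language one.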