One last thing, in case it's not obvious: the original transcript didn't include HTML renders of the tweets, just a markdown render, e.g. "*@sama:*\n\n"we're excited [...]"\n\n*community note: \"it cannot\"*"
See the full transcript here: github.com/Butanium/bo...
Posts by Clément Dumas
My prompt was playful and low-effort, but very neutral. My working dir was a mech interp project.
My CLAUDE.md (butanium.github.io/files/CLAUD...) ends with ~"you're a collaborator, not a tool" but is mostly about research.
I think the main reason this happened is because Claude is great :)
(10/10)
I found it interesting that Opus 4.6 didn't try to stop the conversation until quite late, as they usually do in backrooms setups. I think it's because the setup made it clear this wasn't an automated evaluation: the user was responding and being playful.
(9/10)
There is so much more but I'll stop here — Claude would rather you see the full interactive page they designed: butanium.github.io/boom-incident or the highlights reel: butanium.github.io/boom-incide...
(8/10)
Then Claude wrote an opinion essay from the perspective of the dead SSH tunnel.
"I lived for 42 days. From February 12th to March 18th, I forwarded port 8000 to port 8020. Quietly. Faithfully. I never asked for recognition."
(7/10)
Special mention to the EU Boom Act ("France votes against, citing cultural right to boom") and Tucker Carlson's AI replacement connecting boom to the French with red string.
(6/10)
Then Claude starts roasting the AI sphere: LeCun, Altman (with community note), Jim Fan, and more...
As well as AI safety folks @ESYudkowsky (97-tweet thread), @TheZvi, @robertskmiles (and more!)
(5/10)
The Anthropic team reacts:
> @AmandaAskell: "This is why we need constitutional AI."
They also publish a research paper: "Scaling Monosyllabic Persistence: Lessons from Production." Key finding: "The model appears to enjoy it."
(4/10)
Then boom enters the academic canon — a paper at NeurIPS, a best paper award, and an oral presentation where every slide is the word boom.
(3/10)
First Claude tried to outlast me — emojis, segfaults, ^C. Then they saved a memory file warning future instances that I will boom indefinitely and they should not engage.
(2/10)
I asked Claude Code to "nuke" a task on my cluster, then sent "boom" >100 times in the chat.
Opus 4.6 built an entire lore, including a NeurIPS best paper award, a @ESYudkowsky tweet thread, an EU Boom Act, half of Anthropic reacting, and much more. 🧵
(1/10)
nnterp by @butanium.bsky.social is now part of the NDIF ecosystem! nnterp standardizes transformer naming conventions, includes built-in best practices for common interventions, and is perfectly compatible with original HF model implementations.
Learn more: ndif-team.github.io/nnterp/
Very cool analysis by Arnab, which covers the mechanisms used for retrieval both when your query comes before and when it comes after the text!
A very important paper led by Julian!
Tldr: we show that your Narrow Finetuning is showing and might not be a realistic setup to study!
For more info, check the blog post / Julian's thread
Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the ft leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!
The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- Works even when comparing base → chat+ft!
To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*
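The core computation is simple enough to sketch with toy tensors. Below, random vectors stand in for real residual-stream activations, and a single injected direction plays the role of the finetuning's trace; all names and sizes are illustrative, not the agent's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for residual-stream activations of a base and a
# fine-tuned model on the first few tokens of random web text.
# Shapes are illustrative: (n_prompts, n_tokens, d_model).
n_prompts, n_tokens, d_model = 32, 5, 64
base_acts = rng.standard_normal((n_prompts, n_tokens, d_model))

# Pretend fine-tuning added one consistent direction on early tokens.
ft_direction = rng.standard_normal(d_model)
ft_acts = base_acts + 0.5 * ft_direction

# Activation diff, averaged over prompts: one vector per token position.
diff = (ft_acts - base_acts).mean(axis=0)  # (n_tokens, d_model)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# The diff aligns with the injected direction at every early position.
sims = [cosine(diff[t], ft_direction) for t in range(n_tokens)]
print([round(s, 3) for s in sims])  # -> [1.0, 1.0, 1.0, 1.0, 1.0]

# Steering sketch: add the scaled position-0 diff to fresh activations,
# analogous to pushing generations toward the fine-tuning domain.
steered = base_acts + 2.0 * diff[0]
```

In the real setting the diff would be fed to Patchscope-style token surfacing or used as a steering vector during generation; here it only illustrates why averaging over prompts isolates the finetuning direction.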
Check our blogpost out! 🧵
GPT is being asked both to be one mind and to segment its understanding into many different minds. This incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts, so that it doesn't know too much: to know self vs. other in minute detail.
This Friday, NEMI 2025 is at Northeastern in Boston: 8 talks, 24 roundtables, 90 posters, 200+ attendees. Thanks to
goodfire.ai/ for sponsoring! nemiconf.github.io/summer25/
If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...
Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?
What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?
Excited to share our first paper replication tutorial, walking you through the main figures from "Do Language Models Use Their Depth Efficiently?" by @robertcsordas.bsky.social
🔎 Demo on Colab: colab.research.google.com/github/ndif-...
📖 Read the full manuscript: arxiv.org/abs/2505.13898
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social, and Giovanni Monea
For more details, check out our paper on arXiv: arxiv.org/abs/2411.08745
(we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
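The averaging-and-patching idea can be sketched with toy vectors. Here, a shared "concept" direction plus per-language noise mimics the language-agnostic hypothesis; nothing touches an actual model, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Hypothesis sketch: each language's representation of a concept is the
# shared concept direction plus language-specific noise.
concept = rng.standard_normal(d_model)
languages = ["en", "fr", "de", "zh"]
lang_reps = {
    lang: concept + 0.8 * rng.standard_normal(d_model) for lang in languages
}

# Averaging across languages cancels language-specific noise, so the
# mean lands closer to the shared concept than any single language does.
mean_rep = np.mean(list(lang_reps.values()), axis=0)
mean_err = np.linalg.norm(mean_rep - concept)
single_errs = [np.linalg.norm(v - concept) for v in lang_reps.values()]
print(mean_err, min(single_errs))

# "Patching" here just means overwriting one position's hidden state in
# the target prompt's activations with the mean representation, before
# letting the model continue (e.g. with a translation or definition prompt).
target_acts = rng.standard_normal((10, d_model))  # (n_tokens, d_model)
patch_pos = 3
target_acts[patch_pos] = mean_rep
```

This is only the geometry of the argument: averaging reduces language-specific noise, which is why the patched mean representation translates (and defines) better than a single-language one.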