Tech industry mottos have a mixed track record. But we should hold idealists to their ideals. And we should celebrate when they come through.
The Mythos non-release is a remarkable moment of conviction. Thoughts:
davidbau.com/archives/20...
Bravo to Anthropic's "race to the top".
Posts by Gabriele Sarti
Mfw fiddling with probes all day but patching experiments don't pan out
Congrats!
Thank you for having me! Next time in person! 🤗
Calling attention to an exciting "deception detection" hackathon we're planning this summer! w @NDIF and @CadenzaLabs.
Recruiting red teams now, blue teams later. Red teams, time is short: proposals due Mar 31. $10K stipend + compute, $15K finals prize.
nnsight.net/blog/2026/0...
I truly believe the rapid advances in the mech interp subfield have something real to offer AI ethics researchers: a chance to look beyond the HOW of evals to the WHY, a first pass at a technical solution when we see the opportunity, and a new avenue for showing failures that prove models are not gods.
Based
At long last we have created Palantir, from the classic fantasy novel Don't Create The Palantir
🧵New paper: "Lost in Backpropagation: The LM Head is a Gradient Bottleneck"
The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇
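The claim above is about gradient attenuation through the LM head during backpropagation. A minimal sketch of how one might probe this (my own illustration, not the paper's method; the dimensions, random data, and the norm-ratio metric are all assumptions for demonstration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab, batch = 64, 1000, 32  # toy sizes, not from the paper

# Hidden states entering the LM head (leaf tensor so we can read its gradient)
h = torch.randn(batch, d_model, requires_grad=True)
lm_head = nn.Linear(d_model, vocab, bias=False)

logits = lm_head(h)
logits.retain_grad()  # keep the gradient at the logits for comparison
loss = F.cross_entropy(logits, torch.randint(0, vocab, (batch,)))
loss.backward()

# Compare gradient magnitude on either side of the LM head
grad_ratio = (h.grad.norm() / logits.grad.norm()).item()
print(f"||dL/dh|| / ||dL/dlogits|| = {grad_ratio:.3f}")
```

This only measures how gradient norm changes across the output projection on random data; the paper's 95-99% figure presumably comes from a more careful analysis on real pretraining runs.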
Check out David's NetHack port!
"Complexity does not yield to speed. Judgment remains essential. The work of deciding what matters, of seeing what is hidden, of knowing when your own metrics are lying to you: this is the work that remains, and it is the work worth learning."
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
That said, I do agree with the criticism re: model choice. Llama 3 70B's capabilities are definitely too limited to display the kind of interesting (mis)aligned behaviors. The research question does stand regardless of model choice, though!
This is an important project! If you believe alignment faking is real, you should at least entertain the possibility of misalignment faking before drawing your conclusions.
Especially true if researchers fishing for misaligned behaviors are the ones running the evals!
BlackboxNLP is back once again at EMNLP'26! Very happy to be part of the team again, and excited for our new reproducibility track! Check it out ⬇️
tired: meta omni-translation to 1600 low-resource languages
wired: kagi translate english to mechinterp
I'm calling it: DeepSeek's new 1T-parameter model (V4)?
The style, content, and length of the reasoning are extremely similar.
My contribution to model welfare efforts for today
What's your ✨ convergent epistemic state ✨?
This morning I happened to hang out around the Harvard med school café, and all the conversations I overheard were about LLM med assistants and XAI 🫡
I was puzzled by people doing this when they have a research background in different disciplines and in some cases are employed full-time by big tech companies. Is reviewing a requirement for transitioning to research roles at Amazon/Google/Meta? 🤔
I want to talk about why AI-based mass surveillance is so dangerous, and why I would oppose it no matter which party or president is in office.
🔥Super excited to share our new demo website for 🪄Interpreto!
🖼️It is basically an explanation gallery showcasing attribution and concept-based explanations for classification and generation.
🎮Play with it: for-sight-ai.github.io/interpreto-d...
We will keep improving it, so stay tuned!
Great wrap-up for #EVALITA2026! 🔥
Glad to have helped organize this edition and to see many interesting discussions!
Great response to our task Cruciverb-IT (with Ciaccio C., @gsarti.com, Dell’Orletta F., @malvinanissim.bsky.social)!
Thanks to all co-organizers and @ailc-nlp.bsky.social! #NLProc
Great release from our engineering team! A lot of the major pain points have been addressed, and this is our first step toward supporting interpretability workflows in more realistic scenarios! Check it out!
Those of us who work in AI in the US should take a moment today to reflect. Do not get distracted by the circus. Instead, let us pause to think carefully about our freedoms, our rights, and our responsibilities as citizens and professionals.
It is a deadly serious moment.
I'm excited to share that this paper was accepted at ICLR 2026! We show that language models encode one of the most basic ingredients of a world model: the ability to distinguish plausible from implausible states. Check out the paper for more details!
See you in Rio!
Paper: arxiv.org/abs/2507.12553
Hopefully it will make it smaller tho!
In this amazing multidisciplinary collaboration, we report our early experience with the @openclaw-x.bsky.social ->
Are we all Agents of Chaos in AI? (Hope not!)
In recent weeks, using OpenClaw has taught us a lot about this woolly new kind of autonomous software agent.
It's valuable to see what @NatalieShapira, @wendlerch et al. have seen:
agentsofchaos.baulab.info/
Our research report on red-teaming stateful OpenClaw agents in the BauLab is finally out! 🥳
This awesome effort was led by @natalieshapira.bsky.social and involved 6 ClawBots and 20 researchers from various institutions.
Check it out ➡️ agentsofchaos.baulab.info