Eric Todd (@ericwtodd) Bsky

(before the sharp green jump), the model's answers become more varied, no longer only predicting the two symbols in the query. But after knowing how to copy, the identity accuracy increases again because it learns to how to do identity demotion alongside query promotion it learned before.

2 days ago 1 0 0 0

Hey Naomi, I realize I never got back to you. I looked into what's going with the transient identity, and yes - the identity is learned in two parts, the first bump is query promotion (QP) where the model only promotes symbols in the question. When the model is learning how to do copying, 1/2

2 days ago 1 0 1 0

Eric Todd (@ericwtodd.bsky.social) Can you solve this algebra puzzle? 🧩 cb=c, ac=b, ab=? A small transformer can learn to solve problems like this! And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found:🧵⬇️

I'll be attending #ICLR2026 next week to present my work on In-Context Algebra! My poster will be on Fri, April 24 at 3:15-5:45PM at Pavilion 4 P4-#4011. If you're around, stop by and say hello! My DMs are open if you want to connect or meet up in Rio!
bsky.app/profile/eri...

2 days ago 12 0 0 0

New paper: LLMs encode harmful content generation in a distinct, unified mechanism

Using weight pruning, we find that harmful generation depends on a tiny subset of the weights that are shared across harm types and separate from benign capabilities.

🧵

1 week ago 7 2 1 0

This is a great question! I'm actually not sure why this happens. I do know that the identity accuracy in (3) comes from query promotion - it's close to random guessing of query symbols, and that identity demotion is learned in (7), but I will check out some of these checkpoints and let you know!

2 months ago 2 0 0 0

The Art of Wanting.

About the question I see as central in AI ethics, interpretability, and safety. Can an AI take responsibility? I do not think so, but *not* because it's not smart enough.

davidbau.com/archives/20...

2 months ago 10 3 1 0

Can models understand each other's reasoning? 🤔

When Model A explains its Chain-of-Thought (CoT) , do Models B, C, and D interpret it the same way?

Our new preprint with @davidbau.bsky.social and @csinva.bsky.social explores CoT generalizability 🧵👇

(1/7)

2 months ago 28 8 1 0

In-Context Algebra Understanding the learned algorithms of transformer language models solving abstract algebra problems through in-context learning.

Takeaway: contextual reasoning can be richer than just fuzzy copying!

See the paper for more results, including an analysis of learning dynamics. Work done w/ @jannikbrinkmann.bsky.social, @rohitgandikota.bsky.social & @davidbau.bsky.social!

📜: arxiv.org/abs/2512.16902
🌐: algebra.baulab.info

2 months ago 12 2 0 0

Another strategy infers meaning using sets.

We have seen models keep track of "positive" and "negative" sets that let it narrow its understanding of a symbol using Sudoku-style cancellation.

Red bars (a) show the positive set and blue boxes (b) show the negative.

2 months ago 6 0 1 0

What in-context mechanisms do we find, other than copying?

The first one is the "identity rule". Here, the answer is the same as the question after eliminating a recognized "identity" from the question, like "ab=a".

@taylorwwebb.bsky.social has seen this in LLMs too!
bsky.app/profile/tay...

2 months ago 7 0 1 0

Our work maps out several context-based algorithms (copy, identity, commutativity, cancellation, & associativity). We use targeted data distributions to measure and dissect each strategy.

These five strategies explain almost all of our model's in-context performance!

2 months ago 8 0 1 0

If you pick a random puzzle (try one here: algebra.baulab.info), you'll see there's often more than one way to understand context.

@nelhage.bsky.social & @neelnanda.bsky.social found LLMs infer meaning by induction-style copying, and that happens here too. But there are many other strategies.

2 months ago 6 0 1 0

Can you solve this algebra puzzle? 🧩

cb=c, ac=b, ab=?

A small transformer can learn to solve problems like this!

And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found:🧵⬇️

2 months ago 49 11 2 2

Humans and LLMs think fast and slow. Do SAEs recover slow concepts in LLMs? Not really.

Our Temporal Feature Analyzer discovers contextual features in LLMs, that detect event boundaries, parse complex grammar, and represent ICL patterns.

5 months ago 20 8 1 1

LLMs have been shown to provide different predictions in clinical tasks when patient race is altered. Can SAEs spot this undue reliance on race? 🧵

Work w/ @byron.bsky.social

Link: arxiv.org/abs/2511.00177

5 months ago 5 2 1 1

Interested in doing a PhD at the intersection of human and machine cognition? ✨ I'm recruiting students for Fall 2026! ✨

Topics of interest include pragmatics, metacognition, reasoning, & interpretability (in humans and AI).

Check out JHU's mentoring program (due 11/15) for help with your SoP 👇

5 months ago 27 15 0 1

How can a language model find the veggies in a menu?

New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.

Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from python)! 🧵

5 months ago 24 9 1 2

Looking forward to attending #COLM2025 this week! Would love to meet up and chat with others about interpretability + more. DMs are open if you want to connect. Be sure to checkout @sfeucht.bsky.social's very cool work on understanding concepts in LLMs tomorrow morning (Poster 35)!

6 months ago 2 0 0 0

What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!

6 months ago 41 15 2 2

Who is going to be at #COLM2025?

I want to draw your attention to a COLM paper by my student @sfeucht.bsky.social that has totally changed the way I think and teach about LLM representations. The work is worth knowing.

And you can meet Sheridan at COLM, Oct 7!
bsky.app/profile/sfe...

6 months ago 39 8 1 2

Announcing a broad expansion of the National Deep Inference Fabric.

This could be relevant to your research...

6 months ago 11 3 1 2

"AI slop" seems to be everywhere, but what exactly makes text feel like "slop"?

In our new work (w/ @tuhinchakr.bsky.social, Diego Garcia-Olano, @byron.bsky.social ) we provide a systematic attempt at measuring AI "slop" in text!

arxiv.org/abs/2509.19163

🧵 (1/7)

6 months ago 33 17 1 1

Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/8

7 months ago 26 8 1 1

New England Mechanistic Interpretability Workshop About:The New England Mechanistic Interpretability (NEMI) workshop aims to bring together academic and industry researchers from the New England and surround...

This Friday NEMI 2025 is at Northeastern in Boston, 8 talks, 24 roundtables, 90 posters; 200+ attendees. Thanks to
goodfire.ai/ for sponsoring! nemiconf.github.io/summer25/

If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...

8 months ago 16 7 1 3

We've added a quick new section to this paper, which was just accepted to @COLM_conf! By summing weights of concept induction heads, we created a "concept lens" that lets you read out semantic information in a model's hidden states. 🔎

8 months ago 7 1 1 0

Im excited for NEMI again this year! I’ve enjoyed local research meetups and getting to know others near me working on interesting problems.

9 months ago 1 0 0 0

NEMI 2024 (Last Year)

🚨 Registration is live! 🚨

The New England Mechanistic Interpretability (NEMI) Workshop is happening Aug 22nd 2025 at Northeastern University!

A chance for the mech interp community to nerd out on how models really work 🧠🤖

🌐 Info: nemiconf.github.io/summer25/
📝 Register: forms.gle/v4kJCweE3UUH...

9 months ago 10 8 0 1

How do language models track mental states of each character in a story, often referred to as Theory of Mind?

We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!

9 months ago 59 19 2 1

Can we uncover the list of topics a language model is censored on?

Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

10 months ago 9 4 1 0

I'm not familiar with the reviewing load for ARR, but for COLM this I was only assigned 2 papers as a reviewer which is great. I had more time to try and understand each submission and it was much more manageable than getting assigned 6+ papers like ICML and NeurIPS do.

10 months ago 5 0 0 0

Posts by Eric Todd