Andrew Lee (@ajyl) Bsky

If you liked Anthropic's recent emotions paper, check out our work! We find many similarities:
1) Circular geometry of emotion representations
2) Steering: unlike Anthropic, we steer along circular manifold (at 0°, 30°, 60°...)
3) Steering emotions can affect refusal/sycophancy
See Lihao's thread!👇
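
To make "steering along a circular manifold" concrete, here is a minimal sketch. It assumes the circle lies in a 2D plane of activation space spanned by two directions `u1` and `u2`. Those directions, the `radius` parameter, and the function name are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def circular_steering_vectors(u1, u2, radius=1.0, step_deg=30):
    """Return one steering vector per angle (0, 30, 60, ... degrees) on a circle.

    u1, u2: two directions assumed to span the plane that contains the circle.
    """
    # Orthonormalize the two directions (Gram-Schmidt) so every vector has norm `radius`.
    e1 = u1 / np.linalg.norm(u1)
    u2_perp = u2 - (u2 @ e1) * e1
    e2 = u2_perp / np.linalg.norm(u2_perp)
    angles = np.deg2rad(np.arange(0, 360, step_deg))
    return radius * (np.cos(angles)[:, None] * e1 + np.sin(angles)[:, None] * e2)

# Synthetic directions, for illustration only.
rng = np.random.default_rng(0)
u1, u2 = rng.normal(size=16), rng.normal(size=16)
vecs = circular_steering_vectors(u1, u2, radius=4.0)
```

Each row of `vecs` could then be added to a hidden state to steer toward the corresponding angle.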

1 week ago 3 0 0 0

Mechanistic Interpretability Workshop at ICML 2026. How can we use the internals of neural networks to understand a model better?

CFP: mechinterpworkshop.com
Contact: mechinterpworkshop@gmail.com
Deadline: May 8 (FYI: we are accepting Neurips format submissions 😉)

Looking forward to seeing everyone’s work!
–@neelnanda.bsky.social, Andy Arditi, Stefan Heimersheim, Anna Soligo, @jrosseruk.bsky.social, Iván Arcuschin

3 weeks ago 0 0 0 0

We are thrilled to host the next Mech Interp Workshop @ ICML 2026! 🎉
July 2026, Seoul 🇰🇷
The workshop aims to understand the inner workings of neural nets.

Topics:
Feature geometry
Circuit analyses
Interp for {practical applications, safety, scientific discovery}, and many more.

3 weeks ago 1 0 1 0

Thank you Naomi!!

6 months ago 1 0 0 0

Question @neuripsconf.bsky.social
- a coauthor had his reviews re-assigned many weeks ago. The ACs of those papers told him "i've been told to tell u: leave a short note. You won't be penalized". Now I'm being warned of desk-reject due to his short/poor reviews. What's the right protocol here?

9 months ago 0 0 0 0

How do language models track mental states of each character in a story, often referred to as Theory of Mind?

We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!

9 months ago 59 19 2 1

🚨New #ACL2025 paper!

Today’s “safe” language models can look unbiased—but alignment can actually make them more biased implicitly by reducing their sensitivity to race-related associations.

🧵Find out more below!

10 months ago 12 2 1 1

ARBOR

This project was done via Arbor! arborproject.github.io
Check us out to see ongoing work on interpreting reasoning models.
Thank you collaborators! Lihao Sun,
@wendlerc.bsky.social ,
@viegas.bsky.social ,
@wattenberg.bsky.social

Paper link: arxiv.org/abs/2504.14379
9/n

11 months ago 1 0 0 0

Our interpretation:
✅we find subspace critical for self-verif.
✅in our setup, prev-token heads take resid-stream into this subspace. In a different task, a diff. mechanism may be used.
✅ this subspace activates verif-related MLP weights, promoting tokens like “success”

8/n

11 months ago 1 0 1 0

We find similar verif. subspaces in our base model and general reasoning model (DeepSeek R1-14B).

Here we provide CountDown as an ICL task.

Interestingly, in R1-14B, our interventions lead to partial success - the LM fails self-verification but then self-corrects.

7/n

11 months ago 0 0 1 0

Our analyses meet in the middle:

We use “interlayer communication channels” to rank how much each head (OV circuit) aligns with the “receptive fields” of verification-related MLP weights.

Disable *three* heads → disables self-verif. and deactivates verif.-MLP weights.

6/n

11 months ago 0 0 1 0

Bottom-up, we find previous-token heads (i.e., parts of induction heads) are responsible for self-verification in our setup. Disabling previous-token heads disables self-verification.

5/n

11 months ago 0 0 1 0

More importantly, we can use the probe to find MLP weights related to verification. Simply check for MLP weights with high cosine similarity to our probe.

Interestingly, we often see Eng. tokens for "valid direction" and Chinese tokens for "invalid direction".
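
A minimal sketch of the cosine-similarity check described above. It assumes the "MLP weights" are the rows of an MLP output projection, one vector per neuron in the same space as the probe. The names `w_out` and `probe` and the synthetic data are hypothetical.

```python
import numpy as np

def top_aligned_neurons(w_out, probe, top_k=3):
    """Rank MLP neurons by cosine similarity between their weight vector and the probe.

    w_out: (n_neurons, d_model) array, one weight vector per neuron (assumed layout).
    probe: (d_model,) probe direction.
    """
    w_unit = w_out / np.linalg.norm(w_out, axis=1, keepdims=True)
    p_unit = probe / np.linalg.norm(probe)
    cos = w_unit @ p_unit
    order = np.argsort(-cos)[:top_k]  # highest similarity first
    return order, cos[order]

# Synthetic example: neuron 4 is planted to point exactly along the probe.
rng = np.random.default_rng(0)
probe = rng.normal(size=16)
w_out = rng.normal(size=(32, 16))
w_out[4] = 2.0 * probe
order, cos = top_aligned_neurons(w_out, probe)
```

Sorting by `cos` instead of `-cos` would give the neurons most aligned with the opposite direction.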

4/n

11 months ago 1 0 2 0

We do a “top-down” and a “bottom-up” analysis. Top-down, we train a probe. We can use our probe to steer the model and trick it into believing it has found a solution.

3/n

11 months ago 1 0 1 0

CoT is unfaithful. Can we monitor inner computations in latent space instead?

Case study: Let’s study self-verification!

Setup: We train Qwen-3B on CountDown until mode collapse, resulting in nicely structured CoT that’s easy to parse+analyze

2/n

11 months ago 0 0 1 0

🚨New preprint!

How do reasoning models verify their own CoT?
We reverse-engineer LMs and find critical components and subspaces needed for self-verification!

1/n

11 months ago 17 3 1 0

ALT: a man with glasses is sitting on a couch and saying pretty pretty pretty good.

11 months ago 0 0 0 0

Interesting, I didn't know that! BTW, we find similar trends in GPT2 and Gemma2

11 months ago 1 0 0 0

Shared Global and Local Geometry of Language Model Embeddings Researchers have recently suggested that models share common representations. In our work, we find that token embeddings of language models exhibit common geometric structure. First, we find ``global'...

Next time a reviewer asks, “Why didn’t you include [insert newest LM]?”, depending on your claims you could argue that your findings will generalize to other models, based on our work! Paper link: arxiv.org/abs/2503.21073

11 months ago 21 5 1 0

We call this simple approach Emb2Emb: Here we steer Llama8B using steering vectors from Llama1B and 3B:

11 months ago 2 0 1 0

Now, steering vectors can be transferred across LMs. Given LM1, LM2 & their embeddings E1, E2, fit a linear transform T from E1 to E2. Given steering vector V for LM1, apply T to V, and now TV can steer LM2. Unembedding V or TV shows similar nearest neighbors encoding the steering vector’s concept.
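
A minimal sketch of the transfer step. It assumes the two models share a vocabulary, so row i of E1 and row i of E2 are the same token, and that T is fit by ordinary least squares. The fitting method, function names, and synthetic data are assumptions, not details from the paper. The code uses a row-vector convention, so "TV" is written `v @ t`.

```python
import numpy as np

def fit_linear_map(e1, e2):
    """Least-squares T such that e1 @ T approximates e2 (rows are shared tokens)."""
    t, *_ = np.linalg.lstsq(e1, e2, rcond=None)
    return t

def nearest_tokens(vec, emb, top_k=3):
    """Indices of the token embeddings with highest cosine similarity to vec."""
    sims = (emb @ vec) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(vec))
    return np.argsort(-sims)[:top_k]

# Synthetic stand-ins for the embedding matrices of two models.
rng = np.random.default_rng(0)
e1 = rng.normal(size=(200, 6))   # LM1: 200 tokens, width 6
t_true = rng.normal(size=(6, 10))
e2 = e1 @ t_true                 # LM2: same tokens, width 10

t = fit_linear_map(e1, e2)
v = e1[7]    # toy "steering vector": the embedding of token 7 in LM1
tv = v @ t   # the same vector mapped into LM2's space
```

Here `nearest_tokens(v, e1)` and `nearest_tokens(tv, e2)` stand in for the unembedding check: both should surface the same token.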

11 months ago 2 0 1 0

Local2: We measure intrinsic dimension (ID) of token embeddings. Interestingly, ID reveals that tokens with low ID form very coherent semantic clusters, while tokens with higher ID do not!

11 months ago 3 1 1 0

Local: we characterize local geometry in two ways. First, using Locally Linear Embeddings (LLE), we express each token embedding as the weighted sum of its k-nearest neighbors. It turns out the LLE weights for most tokens look very similar across language models, indicating similar local geometry.
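
A minimal sketch of computing LLE weights for a single token, using the standard formulation (reconstruction weights constrained to sum to 1, with a small regularizer). The choices of `k`, `reg`, and Euclidean neighbors are assumptions, and the data is synthetic.

```python
import numpy as np

def lle_weights(emb, idx, k=5, reg=1e-3):
    """Weights that best reconstruct emb[idx] from its k nearest neighbors.

    Returns (neighbor_indices, weights), with weights summing to 1.
    """
    x = emb[idx]
    dists = np.linalg.norm(emb - x, axis=1)
    dists[idx] = np.inf                      # exclude the token itself
    nbrs = np.argsort(dists)[:k]
    z = emb[nbrs] - x                        # neighbors centered on the token
    c = z @ z.T                              # local Gram matrix, shape (k, k)
    c = c + reg * np.trace(c) * np.eye(k)    # regularize for numerical stability
    w = np.linalg.solve(c, np.ones(k))
    return nbrs, w / w.sum()

# Two toy "models": emb_b is a rotated and rescaled copy of emb_a.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(100, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
emb_b = 2.0 * emb_a @ q

nbrs_a, w_a = lle_weights(emb_a, idx=0)
nbrs_b, w_b = lle_weights(emb_b, idx=0)
```

Because rotation and uniform scaling preserve local geometry, the two toy models give matching neighbors and weights for the same token.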

11 months ago 4 0 1 0

We characterize “global” and “local” geometry in simple terms.
Global: how similar are the distance matrices of embeddings across LMs? We can check with Pearson correlation between distance matrices: high correlation indicates similar relative orientations of token embeddings, which is what we find
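
A minimal sketch of the global check. It assumes both embedding matrices are restricted to the same tokens in the same row order, and uses Euclidean distance (the distance metric is an assumption). The data is synthetic.

```python
import numpy as np

def pairwise_distances(emb):
    """Euclidean distance between every pair of rows."""
    sq = np.sum(emb ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding

def global_similarity(emb_a, emb_b):
    """Pearson correlation between the two distance matrices (upper triangles only)."""
    iu = np.triu_indices(emb_a.shape[0], k=1)
    da = pairwise_distances(emb_a)[iu]
    db = pairwise_distances(emb_b)[iu]
    return float(np.corrcoef(da, db)[0, 1])

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(50, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
emb_b = 3.0 * emb_a @ q              # rotated, rescaled copy: same relative geometry
emb_c = rng.normal(size=(50, 12))    # unrelated embeddings

sim_same = global_similarity(emb_a, emb_b)
sim_diff = global_similarity(emb_a, emb_c)
```

The embedding widths can differ between models, since only the token-by-token distance matrices are compared.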

11 months ago 4 1 1 0

🚨New Preprint! Did you know that steering vectors from one LM can be transferred and re-used in another LM? We argue this is because token embeddings across LMs share many “global” and “local” geometric similarities!

11 months ago 61 13 3 3

Cool! QQ: say I have a "mech-interpy finding": for instance, say I found a "circuit" - is such a finding appropriate to submit, or is the workshop exclusively looking for actionable insights?

1 year ago 2 0 0 0

Bridging the Digital Divide: Performance Variation across Socio-Economic Factors in Vision-Language Models Joan Nwatu, Oana Ignat, Rada Mihalcea. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

I think these papers are highly relevant but missing!
aclanthology.org/2023.emnlp-m...
arxiv.org/pdf/2403.07687

1 year ago 2 0 1 0

My website has a personal readme file with step by step instructions on how to make an update. I would need to hire someone if something were to ever happen to that readme file.

1 year ago 1 0 0 0

Today we launch a new open research community

It is called ARBOR:
arborproject.github.io/

please join us.
bsky.app/profile/ajy...

1 year ago 15 5 1 2

ARBORproject arborproject.github.io · Discussions Explore the GitHub Discussions forum for ARBORproject arborproject.github.io. Discuss code, ask questions & collaborate with the developer community.

Check out on-going projects here! github.com/ARBORproject... or join our discord: discord.gg/SeBdQbRPkA
We hope to see your contributions!

Organizers: @wattenberg.bsky.social @viegas.bsky.social @davidbau.bsky.social @wendlerc.bsky.social @canrager.bsky.social
7/N

1 year ago 2 0 0 0
