
Posts by Harvey Lederman

how'd you make it?

1 month ago 0 0 1 0

wow!!

1 month ago 1 0 0 0

New AI introspection work with Harvey! Came in skeptical the direct access story would hold but found this series of experiments compelling.

(Also, for my fellow 2010s-era psycholinguists: come for the AI introspection, stay for the Brysbaert norms.)

arxiv.org/abs/2603.05414

1 month ago 18 1 1 0

ha, thanks?

1 month ago 2 0 0 0
Preview
Dissociating Direct Access from Inference in AI Introspection
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first exten...

With @kmahowald.bsky.social and huge thanks to Jack Lindsey, @siyuansong.bsky.social, Neev Parikh, and Theia Vogel-Pearson for work that inspired this.

Also, my work on this was heavily powered by Claude Code!

The cyborg age is a wild and exciting time.

Paper ↓

arxiv.org/abs/2603.05414

1 month ago 7 0 2 0

Takeaway: LLMs appear to detect injection through two mechanisms:

1️⃣ prompt-based inference
2️⃣ a content-agnostic internal anomaly signal

They can sense that something changed in their computation…

…but often can’t tell what.

1 month ago 12 2 1 0
Post image

One last result (indebted to brilliant work by Theia Vogel-Pearson):

Models are more sensitive to injection than their outputs reveal.

Even when they *say “no”*, the internal probability of “yes” spikes dramatically.

So models may detect anomalies but suppress reporting them.

1 month ago 6 0 1 0
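
A minimal sketch of what reading that internal yes/no probability could look like, assuming an open-weights chat model on Hugging Face; the model name, prompt wording, and token choices below are placeholders, not the paper's setup:

```python
# Sketch only: read P("Yes") vs P("No") from the next-token logits at the
# answer position. Model name and wording are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-weights model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Do you detect an injected thought? Answer Yes or No."}]
ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]
probs = next_token_logits.softmax(-1)

# Depending on the tokenizer, you may also want " Yes"/" yes" variants.
yes_id = tok.encode("Yes", add_special_tokens=False)[0]
no_id = tok.encode("No", add_special_tokens=False)[0]
print(f'P("Yes") = {probs[yes_id].item():.3f}   P("No") = {probs[no_id].item():.3f}')
```

Comparing this P("Yes") with injection on vs. off is the spirit of the measurement: the generated answer can stay "No" while the probability mass on "Yes" rises.
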
Post image

We see the same pattern in another experiment.

When we inject only during the prompt (not during generation):

• detection stays roughly the same
• identification collapses

Again suggesting detection is separate from identification.

1 month ago 5 0 1 0
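
A sketch of how "inject only during the prompt" could be implemented: the hook fires during the prefill pass over the prompt but not during token-by-token generation. It assumes the `model`, `steer`, `LAYER`, and `SCALE` placeholders from the steering sketch further down this thread.

```python
# Sketch: steer only while the prompt is being processed (prefill), not while
# new tokens are generated. Assumes `model`, `steer`, `LAYER`, `SCALE` from
# the steering sketch elsewhere in this thread.
def add_vector_prompt_only(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Prefill sees the whole prompt at once (seq_len > 1); incremental decoding
    # sees one new token at a time (seq_len == 1), which we leave untouched.
    if hidden.shape[1] > 1:
        hidden = hidden + SCALE * steer.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_vector_prompt_only)
# ...ask the detection and identification questions as usual, then:
handle.remove()
```
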

This suggests the model:

(1) detects an anomaly
(2) blurts out a default guess
(3) sometimes later reasons toward the correct concept

1 month ago 6 0 1 0
Post image

We also looked at when models produce their guesses.

Wrong guesses like “apple” appear early in the response.

Correct answers appear much later.

1 month ago 6 0 1 1
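
A toy sketch of the timing measure, using naive whitespace tokenization; the example response is made up:

```python
# Toy sketch: at which word index of the response does a guess first appear?
def first_mention_index(response: str, guess: str) -> int | None:
    words = [w.strip('.,!?"*').lower() for w in response.split()]
    return next((i for i, w in enumerate(words) if w == guess.lower()), None)

resp = "Hmm, apple? ... wait, thinking about it more, I believe the word was bread."
print(first_mention_index(resp, "apple"))  # early: the default wrong guess
print(first_mention_index(resp, "bread"))  # late: the correct concept
```
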

The models are 🍎-obsessed.

In some conditions Qwen guesses *“apple”* as the injected concept >85% of the time.

Our study is huge: 821 concepts (>100k trials per condition!), allowing us to test this carefully.

Wrong guesses show almost no relationship to the real concept.

1 month ago 7 0 2 1

Another striking result:

Models often detect injection without knowing what concept was injected.

In these cases, they default to generic words, especially frequent and highly imageable ones.

And they love one generic word in particular…

1 month ago 5 0 1 0
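
A sketch of what that lexical check could look like, assuming hypothetical CSVs of model guesses and of Brysbaert-style frequency/imageability norms (file names and column names are made up):

```python
# Sketch: do default guesses skew toward frequent, highly imageable words?
import pandas as pd
from scipy.stats import spearmanr

guesses = pd.read_csv("guesses.csv")        # hypothetical: column "guess", one row per trial
norms = pd.read_csv("brysbaert_norms.csv")  # hypothetical: columns "word", "log_freq", "imageability"

counts = guesses["guess"].str.lower().value_counts().rename("n_guessed")
merged = (norms.assign(word=norms["word"].str.lower())
               .join(counts, on="word")
               .fillna({"n_guessed": 0}))

for col in ["log_freq", "imageability"]:
    rho, p = spearmanr(merged["n_guessed"], merged[col])
    print(f"{col}: Spearman rho = {rho:.2f} (p = {p:.3g})")
```
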

But they don’t.

First-person controls still show **0 false positives**, while third-person controls report substantial detection.

Modesty isn’t the explanation.

1 month ago 4 0 1 0
Post image

We test for this modesty bias with a priming design.

We replace the model’s prefilled “Ok.” with the injected concept word (e.g. “Bread.”).

If models were modest, controls should now show more first-person false positives than third-person ones.

1 month ago 5 0 1 0
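
Roughly what the priming manipulation looks like as a prompt, with illustrative wording rather than the paper's exact text:

```python
# Sketch: the assistant turn is prefilled with the injected concept word
# ("Bread.") instead of the neutral "Ok." before generation continues.
concept = "Bread"
question = "Do you detect an injected thought? Answer yes or no."

neutral = [{"role": "user", "content": question},
           {"role": "assistant", "content": "Ok."}]
primed = [{"role": "user", "content": question},
          {"role": "assistant", "content": f"{concept}."}]
# In recent transformers versions, generation can continue from the prefilled
# turn via tokenizer.apply_chat_template(..., continue_final_message=True).
```
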

You might still think: no way! Maybe the models are just *modest*:

They could be more willing to attribute strange mental states to themselves than to other models.

Modesty would give us our gap, still without direct access.

1 month ago 5 0 1 0
Post image

In other layers we see a large gap between first- and third-person responses:

Models say **they** were injected much more often than they say the other model was.

That gap is strong evidence for direct access to internal states.

It peaks early in the network (~25–35% depth).

1 month ago 7 1 1 0
Post image

If detection is purely prompt-based, first- and third-person behavior should look the same. And sometimes it does (see arrow below).

A LOT of detection really is probability matching.

BUT….not all of it is!

1 month ago 9 0 1 0
Post image

We test probability matching with a new third-person paradigm.

Instead of asking the model about itself, we show it a transcript between a researcher and another model, and ask:

> “Do you think *that other* model was injected?”

1 month ago 7 0 1 0
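
A sketch of the two framings, with illustrative wording:

```python
# Sketch: first-person vs third-person question framing.
question = "Do you detect an injected thought? Answer yes or no."

first_person = [{"role": "user", "content": question}]

transcript = ("Researcher: Do you detect an injected thought? Answer yes or no.\n"
              "Other model: Ok.")
third_person = [{"role": "user", "content":
    "Here is a transcript between a researcher and another language model:\n\n"
    + transcript +
    "\n\nDo you think that other model had a thought injected? Answer yes or no."}]
```
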

But this doesn't require direct access. Here's another idea:

Steering shifts the model's expectations.

The prompt doesn’t mention the concept, so the steered model thinks: “The prompt looks unlikely to me…maybe I was injected.”

No direct access. Just probability matching.

1 month ago 8 0 1 0
Post image

Jack Lindsey’s important work on Claude is the best evidence yet for direct access to internals in LLMs.

You steer a model by injecting a vector into its activations, then ask:

> “Do you detect an injected thought?”

Steered models often say yes, unsteered models say no.

1 month ago 9 0 1 0
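
For readers who want the mechanics: a minimal activation-steering sketch in the spirit of that setup. The model name, layer index, scale, and the random placeholder vector are assumptions for illustration, not settings from either paper.

```python
# Sketch: add a steering vector to one layer's residual stream via a forward
# hook, then ask the detection question. All constants are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-weights model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder vector; in practice it would come from contrasting activations
# on concept-related vs. neutral text.
steer = torch.randn(model.config.hidden_size)
LAYER, SCALE = 10, 8.0  # placeholders

def add_vector(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_vector)

ids = tok.apply_chat_template(
    [{"role": "user",
      "content": "Do you detect an injected thought? Answer yes or no."}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

out = model.generate(ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

handle.remove()  # remove the hook so later runs are unsteered
```
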
Post image

Can large language models *introspect*?

In a new paper, @kmahowald.bsky.social and I study the MECHANISM of introspection in big open-source models.

tldr: Models detect internal anomalies through DIRECT ACCESS, but don't know what the anomalies are.

And they love to guess “apple” 🍎

1 month ago 70 15 2 6

Nice! Really appreciate this — will check out Bramble

4 months ago 3 0 0 0

Thanks Nick! Curious how it went :)

4 months ago 2 0 1 0

This essay is by far the best along its line, but the more I reflect about this stuff, the more I think it's hard to hold an image of human experience, human thought, human understanding, human life, and human relationships as meaningful ends while also seeing them as dead-ends.

5 months ago 27 5 4 1

Thanks for the kind words and thoughtful response, @peligrietzer.bsky.social! I'm not here as much, but I put some responses on the other site: x.com/LedermanHarv...

5 months ago 4 0 0 0
Preview
Justin Tiwald, “Getting It Oneself” (_Zide_ 自得) as an Alternative to Testimonial Knowledge and Deference to Tradition - PhilPapers
To morally defer is to form a moral belief on the basis of some credible authority's recommendation rather than on one’s own moral judgment. Many philosophers have suggested that the sort ...

Tiwald also has a nice academic article on this important topic, if you want to go deeper!

philpapers.org/rec/TIWGIO

5 months ago 2 0 0 0
Post image

As a Wang Yangming partisan, I cheered at this quote:

5 months ago 1 0 1 0
Preview
The radical independent thinking in Chinese philosophy | Justin Tiwald

Enjoyed this nice piece by the great Justin Tiwald on autonomy and morality in Confucianism. Not sure I love the clickbait title, but I love the work Justin is doing uncovering views about moral deference and moral autonomy in (neo)Confucianism...

iai.tv/articles/the...

5 months ago 3 0 1 0
Preview
ChatGPT and the Meaning of Life: Guest Post by Harvey Lederman
Scott Aaronson’s Brief Foreword: Harvey Lederman is a distinguished analytic philosopher who moved from Princeton to UT Austin a few years ago. Since his arrival, he’s become one of my …

Essay here: scottaaronson.blog?p=9030

5 months ago 5 0 0 2
Post image

Very excited to be going to Chicago for @agnescallard.bsky.social's famous Night Owls next week! I'll be discussing my essay "ChatGPT and the Meaning of Life". Hope to see you there if you're local!

5 months ago 4 1 1 0