
Posts by Zhuofan Josh Ying

The Truthfulness Spectrum Hypothesis
Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis...

8/8 📄Read the full paper here: arxiv.org/abs/2602.20273

Joint work with
@shauli.bsky.social, Niko Kriegeskorte, and @peterbhase.bsky.social

1 month ago 1 0 0 0

7/8 Final takeaways: the spectrum structure matters! Train on more domains to get domain-general directions for monitoring, but use domain-specific ones for intervention. Probe geometry reliably predicts how probes will transfer and is reshaped by post-training.

1 month ago 0 0 1 0
Post image

6/8 Surprising causal experiments: domain-specific directions steer better than domain-general ones!

Takeaway: Domain-general probes may be great for monitoring, but intervention seems to need domain-specific representations.
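The kind of direction-based intervention compared here can be pictured with a minimal activation-steering sketch. This is an illustration, not the paper's code: the function name and the `alpha` strength knob are assumptions, and in practice the shift would be applied inside a model's forward hook rather than to a bare matrix.

```python
import numpy as np

def steer(hidden_states, direction, alpha=4.0):
    """Shift each hidden state along a unit-normalized truth direction.

    hidden_states: (n_tokens, d) activations.
    direction:     (d,) a probe direction (domain-specific or domain-general).
    alpha:         hypothetical steering strength.
    """
    d = direction / np.linalg.norm(direction)
    return hidden_states + alpha * d
```

Under this picture, the causal result says that passing a domain-specific `direction` moves behavior more reliably than passing a domain-general one, even though the latter probes better.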

1 month ago 0 0 1 0
Post image

5/8 Further concept erasure of single domains reveals directions of intermediate generality, suggesting that different truth types occupy partially overlapping but distinct sets of truth dimensions.

1 month ago 0 0 1 0
Post image

4/8 Beyond just observing the spectrum, we propose Stratified INLP: an iterative erasure procedure that first extracts highly domain-general directions, then removes them to reveal highly domain-specific directions.

This lets us constructively identify both ends of the spectrum.
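The two-stage erasure loop described above can be sketched in NumPy. This is a rough sketch under stated assumptions, not the paper's implementation: the least-squares probe is a stand-in for whatever probe the authors fit, and the pooled-then-per-domain split is one plausible reading of "general first, then specific."

```python
import numpy as np

def probe_direction(X, y):
    # Least-squares stand-in for a linear truth probe; unit-normalized.
    w = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
    n = np.linalg.norm(w)
    return w / n if n > 0 else w

def erase(X, w):
    # Remove the component of every activation along unit direction w.
    return X - np.outer(X @ w, w)

def stratified_inlp(domains, n_general=2):
    """domains: list of (X, y) pairs, one per truth domain.

    Stage 1: fit probes on the pooled data to extract directions shared
    across domains, erasing each before fitting the next (INLP-style).
    Stage 2: with the shared directions gone, per-domain probes recover
    domain-specific directions.
    """
    Xs = [X.copy() for X, _ in domains]
    ys = [y for _, y in domains]
    general = []
    for _ in range(n_general):
        w = probe_direction(np.vstack(Xs), np.concatenate(ys))
        general.append(w)
        Xs = [erase(X, w) for X in Xs]  # iterative erasure
    specific = [probe_direction(X, y) for X, y in zip(Xs, ys)]
    return np.array(general), np.array(specific)
```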

1 month ago 0 0 1 0
Post image

3/8 Post-training reorganizes truth geometry.

In base models, sycophantic lying is more aligned with other types of lying, until post-training pushes them apart!

This gives a representational account of why chat models are more sycophantic than base models.

1 month ago 0 0 1 0
Post image

2/8 Why do some probes transfer and others don't? Geometry tells you!

Mahalanobis cosine similarity between probe directions, which reweights by data covariance to focus on directions that matter, perfectly predicts OOD generalization (R²=0.98). Standard cossim? Only R² =0.56.

1 month ago 0 0 1 0
Post image

1/8 We build FLEED (definitional, empirical, logical, fictional, ethical truth) + new sycophantic lying + expectation-inverted datasets. Prior and our probes completely fail on sycophantic lying!

Yet training on all domains works everywhere!
Takeaway: train on more diverse data!

1 month ago 0 0 1 0
Post image

🔍Truthfulness probes and their causal effects vary widely: some generalize, others are domain-dependent. Why?

We propose the Truthfulness Spectrum Hypothesis: truth directions of varying generality coexist! Probe geometry predicts generalization, and post-training reshapes it!
🧵⬇️

1 month ago 5 0 1 0
Post image

In this amazing multidisciplinary collaboration, we report our early experience with the @openclaw-x.bsky.social ->

1 month ago 40 22 1 10
It's Owl in the Numbers: Token Entanglement in Subliminal Learning
Entangled tokens help explain subliminal learning.

1/6 🦉Did you know that telling a language model that it loves the number 087 also makes it love owls?

In our new blogpost, It’s Owl in the Numbers, we found this is caused by entangled tokens - seemingly unrelated tokens that are linked. When you boost one, you boost the other.
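The boost-one-boost-the-other effect can be illustrated with unembedding geometry. A hypothetical sketch, not the blogpost's code: if pushing the residual stream toward one token's unembedding vector is what "boosting" means, then every other token's logit rises in proportion to how parallel its unembedding vector is, so near-parallel ("entangled") tokens rise together.

```python
import numpy as np

def logit_boost(unembed, token_a, alpha=1.0):
    """Per-token logit change from pushing the residual stream toward
    token_a's unembedding direction.

    unembed: (vocab, d) unembedding matrix.
    token_a: index of the boosted token.
    alpha:   hypothetical boost strength.
    """
    direction = unembed[token_a]
    direction = direction / np.linalg.norm(direction)
    return alpha * (unembed @ direction)  # high for entangled tokens
```

In this toy picture, a token like "087" whose unembedding vector happens to align with " owl" would inherit almost the full boost.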

owls.baulab.info/

8 months ago 7 4 1 0