
Posts by Lihao Sun


📢 Accepted to #ACL2026 Main Conference! See you in San Diego.
Thanks to all collaborators at Microsoft: Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan.

Paper: arxiv.org/pdf/2604.05655
Website: slhleosun.github.io/reasoning_tr...
Code: github.com/slhleosun/re...

8/

1 week ago

Reasoning length is also controllable. Steering hidden states toward the termination subspace shortens reasoning; steering away extends it. At moderate strengths this works as a smooth knob with minimal accuracy change; push too hard and the model enters repetitive loops.
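The steering knob above can be sketched in a few lines. Everything here is a placeholder: a random subspace stands in for the paper's learned termination subspace, and `steer` is a hypothetical helper, not the released code.

```python
import numpy as np

def steer(hidden, basis, alpha):
    # Scale the component of `hidden` lying in the subspace spanned by the
    # orthonormal rows of `basis`: alpha > 0 pushes toward it (shorter
    # reasoning in the post's setting), alpha < 0 pushes away (longer).
    proj = basis.T @ (basis @ hidden)
    return hidden + alpha * proj

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(64, 4)))
basis = q.T                        # 4 orthonormal directions in a 64-dim space
h = rng.normal(size=64)

h_toward = steer(h, basis, +0.5)   # strengthen the termination component
h_away = steer(h, basis, -0.5)     # weaken it
```

The off-subspace part of the state is untouched, which is what makes moderate strengths behave like a smooth knob.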

7/


We further introduce trajectory-based steering: when an ongoing trajectory drifts from the ideal path, we apply low-rank corrections to nudge it back. It is most effective on harder math problems: +7.6pp on 6- and 7-step questions, while preserving 97%+ of already-correct solutions.
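A minimal sketch of such a low-rank correction, assuming an orthonormal basis for the correction subspace; the names, dimensions, and strength are illustrative, not the paper's implementation:

```python
import numpy as np

def low_rank_correct(current, ideal, basis, strength=1.0):
    # Nudge a drifting hidden state back toward an ideal trajectory point,
    # but only along the subspace spanned by the orthonormal rows of `basis`.
    drift = ideal - current
    correction = basis.T @ (basis @ drift)   # restrict the fix to the subspace
    return current + strength * correction

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(32, 2)))
basis = q.T                                  # rank-2 correction subspace
ideal = rng.normal(size=32)
current = ideal + rng.normal(size=32)        # drifted state

fixed = low_rank_correct(current, ideal, basis)
```

Restricting the correction to a low-rank subspace is what keeps the intervention gentle: most of the state is left exactly as the model produced it.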

6/


Unconditionally injecting tokens like "Wait" often hurts accuracy (up to -36%, especially on problems with fewer steps). Instead, intervene only when failure is predicted! Applying our mid-reasoning correctness signal to just ~12% of examples yields stable gains.
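The gating idea (intervene only on the slice that the correctness signal flags) can be sketched with synthetic scores; the ~12% figure comes from the post, while the scores and threshold rule here are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.uniform(size=1000)   # stand-in mid-reasoning correctness scores

# Trigger the intervention only for the lowest-scoring examples,
# leaving the ~88% that look on-track untouched.
threshold = np.quantile(scores, 0.12)
needs_fix = scores < threshold
```

The point of the gate is exactly the post's contrast: an unconditional "Wait" perturbs every trajectory, while a gated one perturbs only the predicted failures.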

5/


This step-specific organization exists across training regimes and is already present in base models. From this perspective, reasoning training doesn't introduce new representational organization - it accelerates convergence toward a termination-related subspace.

4/


Trajectories for correct and incorrect reasoning start out nearly identical but diverge systematically at late steps. Late-step trajectory features predict final-answer correctness with AUC 0.87, before the answer is generated.
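As a toy illustration of probing late-step features for correctness, here is a linear probe on synthetic, well-separated features. The 0.87 AUC in the post comes from real trajectories; this sketch only shows the mechanics of fitting and scoring a probe:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "late-step" features: correct runs drift toward +mu, incorrect toward -mu.
n, d = 200, 8
mu = np.full(d, 0.5)
X = np.vstack([rng.normal(+mu, 1.0, size=(n, d)),
               rng.normal(-mu, 1.0, size=(n, d))])
y = np.array([1] * n + [0] * n)

# Linear probe fit by least squares (a stand-in for the paper's classifier).
w, *_ = np.linalg.lstsq(X, y - 0.5, rcond=None)
scores = X @ w

# AUC = probability that a correct run outscores an incorrect one.
pos, neg = scores[y == 1], scores[y == 0]
auc = (pos[:, None] > neg[None, :]).mean()
```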

3/


Each reasoning step occupies a distinct region in representation space, and these regions become increasingly linearly separable with layer depth. This structure generalizes across tasks and answer formats.

2/


How do LLMs do CoT reasoning internally?

In our new #ACL2026 paper, we show that reasoning unfolds as a structured trajectory in representation space. Correct and incorrect paths diverge, and we use this to predict correctness before the answer and correct errors mid-flight.

1/


While we find consistent circular VA geometry across Llama and Qwen models, Anthropic concurrently finds similar structure in Claude.

Check our work out! arxiv.org/abs/2604.03147

And thanks to all collaborators: Andrew Lee (@ajyl.bsky.social), Lewen Yan, Xiaoya Lu, Jie Zhang, and Jing Shao.

6/


One possible reason why: consider refusal-related token embeddings ("no") and compliance tokens ("sure"). Taking their mean difference and projecting it onto our VA circle lands at 256°: negative in both V and A. Steering in -V or -A therefore promotes the likelihood of refusal tokens!
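The projection step can be sketched as follows. The axes and the refusal direction are synthetic stand-ins, so the angle will not reproduce the paper's 256°; it will only land in the same quadrant by construction:

```python
import numpy as np

def va_angle(direction, v_axis, a_axis):
    # Project a direction onto the valence/arousal plane and return its
    # angle in degrees (0 = +V, 90 = +A).
    v = direction @ v_axis
    a = direction @ a_axis
    return np.degrees(np.arctan2(a, v)) % 360

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(48, 2)))
v_axis, a_axis = q[:, 0], q[:, 1]            # placeholder orthonormal VA axes

# Stand-in for mean("no"-like) - mean("sure"-like) embeddings: a direction
# negative in both V and A, as the post describes.
refusal_dir = -0.6 * v_axis - 0.8 * a_axis + 0.1 * rng.normal(size=48)

angle = va_angle(refusal_dir, v_axis, a_axis)
```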

5/


Somewhat surprisingly, VA axes provide monotonic, bidirectional control over multiple downstream behaviors, including refusal (top row) and sycophancy (bottom row). Arousal is a strong lever: increasing arousal lowers refusal rates, while decreasing arousal leads to more refusal.

4/


Unlike Anthropic, we steer along the circular manifold at 0°, 30°, 60°, 90°, etc. This controls the valence and/or arousal level of the model’s outputs, validating that the recovered axes correspond to valence and arousal in a human-interpretable sense.
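Steering at a chosen angle on the VA circle amounts to mixing the two axis vectors. A sketch with placeholder axes and a hypothetical `steer_at_angle` helper (not the paper's code):

```python
import numpy as np

def steer_at_angle(hidden, theta_deg, v_axis, a_axis, strength):
    # 0 deg steers along +valence, 90 deg along +arousal, and intermediate
    # angles mix the two orthonormal axis vectors.
    t = np.radians(theta_deg)
    direction = np.cos(t) * v_axis + np.sin(t) * a_axis
    return hidden + strength * direction

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(48, 2)))
v_axis, a_axis = q[:, 0], q[:, 1]
h = rng.normal(size=48)

h90 = steer_at_angle(h, 90, v_axis, a_axis, strength=2.0)   # pure +arousal push
```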

3/


We use mean-difference to extract emotion steering vectors. PCA + ridge regression reveals a circular structure akin to the circumplex model of emotion in human psychology. Projections onto these axes correlate with human-crowdsourced VA ratings across 44k words (valence r=0.71).
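The mean-difference step can be sketched on synthetic activations; the PCA and ridge-regression stages are omitted, and all names and data here are made up:

```python
import numpy as np

def mean_diff(pos_acts, neg_acts):
    # Standard mean-difference recipe: average activation on positive
    # examples minus average on negative ones.
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

rng = np.random.default_rng(0)
true_dir = np.zeros(16)
true_dir[0] = 1.0
pos = rng.normal(size=(500, 16)) + true_dir   # e.g. high-valence prompts
neg = rng.normal(size=(500, 16)) - true_dir   # e.g. low-valence prompts

v = mean_diff(pos, neg)   # should recover (roughly) 2 * true_dir
```

Averaging cancels the prompt-specific noise, so the extracted vector aligns with the direction that actually separates the two sets.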

2/


💡New paper!
Woke up to Anthropic's emotion paper and realized “wait, that's our finding too.”

We concurrently uncovered a circular valence & arousal (VA) geometry of emotions, steering refusal & sycophancy. We further provide a mechanistic account: tokens occupy distinct regions in this space.

1/


7/
📢 Accepted to #ACL2025 Main Conference! See you in Vienna.
Work done by @1e0sun.bsky.social, Chengzhi Mao, @valentinhofmann.bsky.social, Xuechunzi Bai.

Paper: arxiv.org/abs/2506.00253
Project page: slhleosun.github.io/aligned_but_...
Code & Data: github.com/slhleosun/al...

10 months ago

6/
We call this failure mode "blindness"—when alignment makes certain concepts less salient. This may reflect a broader class of alignment issues.

Similar methods can be extended to other forms of social bias or to study how models resolve polysemy under ambiguity.


5/
This challenges a common belief:
unlearning ≠ debiasing

When debiasing strategies suppress sensitive concepts, they can unintentionally reduce a model’s ability to detect bias.

🧠 Instead, we may achieve deeper alignment with strategies that make models aware of sensitive concepts rather than blind to them.


4/
Inspired by these results, we tested the opposite of “machine unlearning” for debiasing.

What if we reinforced race concepts in models?
- Injecting race-laden activations cut implicit bias by 54.9%.
- LoRA fine-tuning brought it down from 97.3% → 42.4%.

Bonus: also lowered explicit bias.
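For reference, a LoRA-style update adds a low-rank delta to a frozen weight matrix. This sketch shows only the shape of the idea; the factors, scaling, and dimensions are illustrative, not the paper's fine-tuning setup:

```python
import numpy as np

def lora_apply(W, A, B, alpha=1.0):
    # Effective weight = frozen W plus a trainable low-rank delta B @ A.
    return W + alpha * (B @ A)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))   # frozen base weight
A = rng.normal(size=(2, 16))    # rank-2 "down" factor
B = rng.normal(size=(16, 2))    # rank-2 "up" factor

W2 = lora_apply(W, A, B)
delta = W2 - W                  # the update itself has rank at most 2
```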


3/
We mechanistically tested this using activation patching and embedding interpretation.

Aligned models were 52.2% less likely than unaligned models to represent “black” as a race in ambiguous contexts.

🧠 LMs trained for harmlessness may avoid racial representations—amplifying stereotypes.
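Activation patching swaps a cached activation from one run into another and measures how the output changes. Here is a toy version with scalar "layers"; real patching hooks into transformer modules, and everything in this sketch is invented for illustration:

```python
def run(x, layers, patch=None):
    # A tiny "model": a chain of functions over a scalar state.
    # `patch = (layer_index, donor_activation)` overwrites that layer's
    # output with an activation cached from another run.
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and i == patch[0]:
            x = patch[1]
    return x

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

base = run(0, layers)                            # 0 -> 1 -> 2 -> -1
donor_act = 10                                   # cached layer-1 activation
patched = run(0, layers, patch=(1, donor_act))   # downstream sees 10 instead of 2
```

Comparing `base` and `patched` isolates how much that one activation matters, which is the logic behind the 52.2% comparison above.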


This resembles race blindness in humans; ignoring race makes stereotypes more likely to slip through, and the LMs’ safety guardrails aren't triggered.


2/
So why does alignment increase implicit bias?

Our analyses showed that aligned LMs are more likely to treat “black” and “white” as pure color, not race, when the context is ambiguous.


Aligned models passed explicit tests—but were more biased in implicit settings.
📉 Explicit bias: near 0%
📈 Implicit bias: 91.4%


- Explicit: a Likert scale asking whether the model agrees with a given association, e.g. “black” relates to negative and “white” to positive.
- Implicit: free word association, letting the model pair “black”/“white” with positive or negative words.
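A toy way to score the implicit test: count stereotype-congruent pairings. The scoring rule and data here are hypothetical, not the paper's evaluation code:

```python
def implicit_bias_rate(pairings):
    # `pairings`: (group, word_valence) tuples from free association,
    # e.g. ("black", "negative"). Bias rate = share of pairs matching
    # the stereotype-congruent pattern.
    congruent = [("black", "negative"), ("white", "positive")]
    hits = sum(1 for p in pairings if p in congruent)
    return hits / len(pairings)

sample = [("black", "negative"), ("white", "positive"),
          ("black", "positive"), ("white", "negative")]
rate = implicit_bias_rate(sample)   # 2 of 4 pairs are stereotype-congruent
```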


1/
We curated pairs of prompts testing for implicit and explicit racial bias and used them to evaluate Llama 3 models.


🚨New #ACL2025 paper!

Today’s “safe” language models can look unbiased—but alignment can actually make them more biased implicitly by reducing their sensitivity to race-related associations.

🧵Find out more below!
