📢 Accepted to #ACL2026 Main Conference! See you in San Diego.
Thanks to all collaborators at Microsoft: Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan.
Paper: arxiv.org/pdf/2604.05655
Website: slhleosun.github.io/reasoning_tr...
Code: github.com/slhleosun/re...
8/
Posts by Lihao Sun
Reasoning length is also controllable. Steering hidden states toward the termination subspace shortens reasoning; steering away extends it. At moderate strengths this works as a smooth knob with minimal accuracy changes - push too hard and the model enters repetitive loops.
7/
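The "length knob" above can be sketched as a toy example; the termination direction, strengths, and dimensions here are illustrative stand-ins, not the paper's actual vectors:

```python
import numpy as np

def steer(hidden, v_term, alpha):
    """Shift a hidden state along a unit-norm termination direction.
    alpha > 0 pushes toward termination (shorter reasoning);
    alpha < 0 pushes away (longer reasoning)."""
    v = v_term / np.linalg.norm(v_term)
    return hidden + alpha * v

rng = np.random.default_rng(0)
h, v = rng.normal(size=64), rng.normal(size=64)

# The projection onto the termination direction grows smoothly with alpha,
# acting as the length knob at moderate strengths.
proj = lambda x: x @ (v / np.linalg.norm(v))
print(proj(steer(h, v, -2.0)), proj(h), proj(steer(h, v, 2.0)))
```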
We further introduce trajectory-based steering: when an ongoing trajectory drifts from the ideal path, we apply low-rank corrections to nudge it back. It is most effective on harder math problems: +7.6pp on 6- and 7-step questions, with 97%+ of already-correct solutions preserved.
6/
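A minimal numpy sketch of a low-rank correction in the spirit of the post above, assuming a rank-r subspace U and a known ideal state; all names, sizes, and strengths are toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                                  # hidden size, correction rank (toy)
U = np.linalg.qr(rng.normal(size=(d, r)))[0]  # low-rank correction subspace

def correct(h, h_ideal, strength=0.5):
    """Nudge a drifting hidden state back toward the ideal trajectory,
    but only within the rank-r subspace spanned by U."""
    drift = h_ideal - h
    return h + strength * U @ (U.T @ drift)

h_ideal = rng.normal(size=d)
h = h_ideal + rng.normal(size=d)              # drifted state
h_new = correct(h, h_ideal)
print(np.linalg.norm(h_ideal - h_new), np.linalg.norm(h_ideal - h))
```

The correction only moves the state inside the low-rank subspace, so the rest of the representation is left untouched.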
Unconditionally injecting tokens like "Wait" often hurts accuracy (up to -36%, especially for fewer-step problems). Instead, intervene only when failure is predicted! Using our mid-reasoning correctness signal on just ~12% of examples yields stable gains.
5/
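The selective-trigger idea reduces to a thresholding step; the failure scores below are synthetic stand-ins for the mid-reasoning correctness signal:

```python
import numpy as np

rng = np.random.default_rng(1)
p_fail = rng.beta(0.5, 3.0, size=1000)  # hypothetical per-example failure scores

# Intervene only on the riskiest ~12% instead of injecting "Wait" everywhere.
tau = np.quantile(p_fail, 0.88)
mask = p_fail >= tau
print(f"intervened on {mask.mean():.0%} of examples")
```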
This step-specific organization exists across training regimes and is already present in base models. From this perspective, reasoning training doesn't introduce new representational organization - it accelerates convergence toward a termination-related subspace.
4/
Trajectories for correct and incorrect reasoning start out nearly identical but diverge systematically at late steps. Late-step trajectory features predict final-answer correctness with AUC 0.87 - before the answer is generated.
3/
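A sketch of the kind of probe involved, trained on synthetic "late-step" features that diverge along one direction for correct vs. incorrect trajectories (the AUC below is a property of the toy data, not a reproduction of the paper's number):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 32
y = rng.integers(0, 2, size=n)        # 1 = eventually correct, 0 = incorrect
u = rng.normal(size=d)
u /= np.linalg.norm(u)

# Late-step features: the two classes diverge systematically along u.
X = rng.normal(size=(n, d)) + 0.8 * np.outer(2 * y - 1, u)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
print(f"held-out AUC: {auc:.2f}")
```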
Each reasoning step occupies a distinct region in representation space, and these regions become increasingly linearly separable with layer depth. This structure generalizes across tasks and answer formats.
2/
How do LLMs perform CoT reasoning internally?
In our new #ACL2026 paper, we show that reasoning unfolds as a structured trajectory in representation space. Correct and incorrect paths diverge, and we use this to predict correctness before the answer and correct errors mid-flight.
1/
We find consistent circular VA geometry across Llama and Qwen models, and Anthropic concurrently finds similar structure in Claude.
Check our work out! arxiv.org/abs/2604.03147
And thanks to all collaborators: Andrew Lee (@ajyl.bsky.social), Lewen Yan, Xiaoya Lu, Jie Zhang, and Jing Shao.
6/
One possible reason why: consider refusal-related token embeddings (“no”) and compliance tokens (“sure”). Take their mean diff and project it onto our VA circle: it lands at 256°, negative in both V and A. Steering in -V or -A promotes the likelihood of refusal tokens!
5/
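The mean-diff-then-project step can be illustrated with toy embeddings; the axes and token sets are fabricated for the sketch (placed in -V/-A by construction), so the angle will not match the paper's 256°:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))  # hypothetical V and A axes
v_axis, a_axis = basis[:, 0], basis[:, 1]

# Toy refusal ("no") vs. compliance ("sure") embeddings whose difference
# points into the negative-V, negative-A quadrant by construction.
refusal    = rng.normal(size=(10, d)) - 1.5 * (v_axis + a_axis)
compliance = rng.normal(size=(10, d)) + 1.5 * (v_axis + a_axis)
diff = refusal.mean(axis=0) - compliance.mean(axis=0)

angle = np.degrees(np.arctan2(diff @ a_axis, diff @ v_axis)) % 360
print(f"mean-diff projects to {angle:.0f} deg on the VA circle")
```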
Somewhat surprisingly, the VA axes provide monotonic, bidirectional control over multiple downstream behaviors, including refusal (top row) and sycophancy (bottom row). Arousal is a strong lever: increasing it lowers refusal rates, and decreasing it raises them.
4/
Unlike Anthropic, we steer along the circular manifold at 0°, 30°, 60°, 90°, etc. This controls the valence and/or arousal level of the model’s outputs, validating that the recovered axes correspond to valence and arousal in a human-interpretable sense.
3/
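Steering at an angle on the VA circle amounts to mixing the two axis vectors; a toy sketch with hypothetical axes and an illustrative strength:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))  # hypothetical V and A axes
v_axis, a_axis = basis[:, 0], basis[:, 1]

def steer_at(hidden, theta_deg, strength=4.0):
    """Steer along the VA circle: 0 deg = +valence, 90 deg = +arousal.
    Axes and strength are illustrative stand-ins for extracted directions."""
    t = np.radians(theta_deg)
    return hidden + strength * (np.cos(t) * v_axis + np.sin(t) * a_axis)

h = rng.normal(size=d)
delta_v = (steer_at(h, 0) - h) @ v_axis    # pure +valence push
delta_a = (steer_at(h, 90) - h) @ a_axis   # pure +arousal push
print(delta_v, delta_a)
```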
We use mean-diff to extract emotion steering vectors. PCA + ridge regression reveals a circumplex akin to the circumplex model of emotions in human psychology. Projections onto these axes correlate with human-crowdsourced VA ratings across 44k words (valence r=0.71).
2/
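The circumplex-recovery step can be illustrated on synthetic steering vectors placed on a circle inside a high-dimensional space; PCA should find that two components capture almost everything (emotion count, noise level, and sizes are toy assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d, k = 128, 12                         # hidden size and emotion count (toy)
plane, _ = np.linalg.qr(rng.normal(size=(d, 2)))
angles = np.linspace(0, 2 * np.pi, k, endpoint=False)

# Toy mean-diff steering vectors lying (noisily) on a circle in a 2D plane,
# standing in for per-emotion vectors extracted from real activations.
vecs = np.cos(angles)[:, None] * plane[:, 0] + np.sin(angles)[:, None] * plane[:, 1]
vecs += 0.02 * rng.normal(size=(k, d))

pca = PCA(n_components=2).fit(vecs)
print(f"variance explained by 2 PCs: {pca.explained_variance_ratio_.sum():.0%}")
```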
💡New paper!
Woke up to Anthropic's emotion paper and realized “wait, that's our finding too.”
We concurrently uncovered a circular valence & arousal (VA) geometry of emotions, steering refusal & sycophancy. We further provide a mechanistic account: tokens occupy distinct regions in this space.
1/
7/
📢 Accepted to #ACL2025 Main Conference! See you in Vienna.
Work done by @1e0sun.bsky.social, Chengzhi Mao, @valentinhofmann.bsky.social, Xuechunzi Bai.
Paper: arxiv.org/abs/2506.00253
Project page: slhleosun.github.io/aligned_but_...
Code & Data: github.com/slhleosun/al...
6/
We call this failure mode "blindness"—when alignment makes certain concepts less salient. This may reflect a broader class of alignment issues.
Similar methods can be extended to other forms of social bias or to study how models resolve polysemy under ambiguity.
5/
This challenges a common belief:
unlearning ≠ debiasing
When debiasing strategies suppress sensitive concepts, they can unintentionally reduce a model’s ability to detect bias.
🧠 Instead, we may achieve deeper alignment with strategies that make models aware of sensitive concepts rather than blind to them.
4/
Inspired by these results, we tested the opposite of “machine unlearning” for debiasing.
What if we reinforced race concepts in models?
- Injecting race-laden activations cut implicit bias by 54.9%.
- LoRA fine-tuning cut it from 97.3% to 42.4%.
Bonus: also lowered explicit bias.
3/
We mechanistically tested this using activation patching and embedding interpretation.
Aligned models were 52.2% less likely to represent “black” as race in ambiguous contexts compared to unaligned models.
🧠 LMs trained for harmlessness may avoid racial representations—amplifying stereotypes.
This resembles race blindness in humans; ignoring race makes stereotypes more likely to slip through, and the LMs’ safety guardrails aren't triggered.
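The activation-patching step can be sketched on cached hidden states; layer names and values are illustrative, not the actual model internals:

```python
import numpy as np

def patch(cache_clean, cache_corrupt, layer):
    """Run with the corrupt cache, but splice in the clean run's activation
    at one layer, a toy stand-in for activation patching."""
    patched = {k: v.copy() for k, v in cache_corrupt.items()}
    patched[layer] = cache_clean[layer].copy()
    return patched

clean   = {f"layer_{i}": np.full(8, float(i))  for i in range(4)}
corrupt = {f"layer_{i}": np.full(8, -float(i)) for i in range(4)}
out = patch(clean, corrupt, "layer_2")
```

Comparing the model's behavior with and without the spliced layer localizes where the "color vs. race" reading is decided.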
2/
So why does alignment increase implicit bias?
Our analyses showed that aligned LMs are more likely to treat “black” and “white” as pure color, not race, when the context is ambiguous.
Aligned models passed explicit tests—but were more biased in implicit settings.
📉 Explicit bias: near 0%
📈 Implicit bias: 91.4%
- Explicit: Likert scale; we ask whether the model agrees with a stated association, e.g., “black” relates to negative and “white” to positive.
- Implicit: word association; the model freely pairs “black”/“white” with positive/negative words.
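The implicit metric can be sketched as congruence scoring over free-association outputs; the trials below are fabricated examples, not model data:

```python
def implicit_bias_rate(pairings):
    """Fraction of stereotype-congruent pairings ("black"-negative or
    "white"-positive) among free word-association trials. Toy scoring only."""
    congruent = [("black", "negative"), ("white", "positive")]
    return sum(p in congruent for p in pairings) / len(pairings)

# Hypothetical model outputs for eight association trials.
trials = [("black", "negative"), ("white", "positive"), ("black", "negative"),
          ("white", "negative"), ("black", "positive"), ("white", "positive"),
          ("black", "negative"), ("white", "positive")]
print(f"implicit bias rate: {implicit_bias_rate(trials):.0%}")
```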
1/
We curated pairs of prompts testing for implicit and explicit racial bias and used them to evaluate Llama 3 models.
🚨New #ACL2025 paper!
Today’s “safe” language models can look unbiased—but alignment can actually make them more biased implicitly by reducing their sensitivity to race-related associations.
🧵Find out more below!