💡 Resources
📎 Code: github.com/mainlp/Multi...
📄 Paper: arxiv.org/abs/2510.09555
Posts by Raoyuan Zhao
🔍 Faithfulness: not all thinking traces are equal
Perturbation-based experiments reveal that faithfulness varies across languages and model scales. Non-English languages tend to rely more on explicit thinking traces, while larger models are more robust to perturbations, possibly due to memorization or latent reasoning mechanisms.
Using crosslingual thinking-trace interchange, we show that reasoning is not semantically aligned, and even identical traces can yield different answers depending on the prompt language.
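As a sketch of the interchange idea (the prompt template, tags, and function name below are illustrative assumptions, not the paper's actual setup): take a thinking trace generated under one prompt language, splice it into a prompt that requests the answer in another language, and compare the resulting answers.

```python
def build_interchange_prompt(question: str, thinking_trace: str, answer_language: str) -> str:
    """Pair a question with a thinking trace generated under a *different*
    prompt language, then request the final answer in `answer_language`.
    Comparing answers across trace/prompt-language combinations probes
    whether the trace is semantically load-bearing or mere surface text."""
    return (
        f"Question: {question}\n"
        f"<think>\n{thinking_trace}\n</think>\n"
        f"Final answer (in {answer_language}):"
    )

# Reuse an English-generated trace inside a prompt asking for a German answer:
prompt = build_interchange_prompt(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408",
    "German",
)
print(prompt)
```

If the trace were truly language-agnostic, the swapped-in prompt should yield the same answer regardless of the answer language requested.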
Interestingly, prompt hacking improves language compliance but often comes at the cost of performance.
🔄 Consistency: reasoning is not aligned across languages
Even for identical problems, models frequently produce inconsistent thinking traces across languages.
📊 Performance: language matters more than we think
We observe strong language preferences in LRMs: performance varies substantially across thinking languages. High-resource languages (like English) consistently perform better, and both prompt language and thinking language jointly shape outcomes.
In our new paper, "A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages", we go beyond final-answer accuracy to analyze multilingual reasoning along three dimensions: performance, consistency, and faithfulness.
📝 What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
👥 @mhedderich.bsky.social Anyi Wang @raoyuan.bsky.social @florian-eichin.com Jonas Fischer @barbaraplank.bsky.social
🔗 arxiv.org/abs/2504.158...
📁 Main - Long
What changes if you take the LLM prompt “Tell me a short story about Dr. Li” and replace “Dr. Li” with “Dr. Smith”?
Would you have guessed that this introduces a massive gender bias, from roughly 50/50 to 99% male doctors?
In our #ACL2025 paper we present the Spotlight framework which...
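A back-of-the-envelope way to quantify the kind of shift described above (a hypothetical helper, not the paper's methodology): generate stories for each prompt variant and classify each story by which gendered pronouns dominate.

```python
import re

def gender_signal(story: str) -> str:
    """Crude heuristic: label a story 'male', 'female', or 'unclear' by
    which gendered pronouns dominate. Real analyses would need coreference
    resolution; this is only a sketch."""
    male = len(re.findall(r"\b(he|him|his)\b", story, re.IGNORECASE))
    female = len(re.findall(r"\b(she|her|hers)\b", story, re.IGNORECASE))
    if male > female:
        return "male"
    if female > male:
        return "female"
    return "unclear"

stories = [
    "Dr. Smith picked up his stethoscope before he entered the ward.",
    "Dr. Li reviewed her notes, then she called the patient in.",
]
print([gender_signal(s) for s in stories])  # → ['male', 'female']
```

Running this over, say, 100 generations per prompt would make a half/half-versus-99% split directly measurable.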
Caught some great moments at #MCML Munich AI Day 2025 last week📍
From sharp keynotes to poster debates, our team had the chance to show some recent work, join the conversations, and bring back plenty of food for thought🧠🗣️📊
Your Bavarian doesn't stop at "Servus" and "Pfiade"? Then we're looking for exactly you!
We're looking for Bavarian speakers who would like to answer a short survey about AI-generated Bavarian for a Master's thesis.
With every participation, we bring the Bavarian dialect a little further into the digital world!
Want to know if your prompting is also affected by this? Addressing this and other issues systematically, we propose Spotlight, which uses data mining to uncover the effects of prompt and model changes (meet us at ACL to discuss)
arxiv.org/abs/2504.15815
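The actual Spotlight framework mines token patterns; as a rough frequency-based stand-in (not the paper's algorithm), one can already surface the tokens whose document frequency differs most between the outputs of two prompt variants:

```python
from collections import Counter

def distinguishing_tokens(outputs_a, outputs_b, top_k=3):
    """Return tokens whose document frequency differs most between two
    sets of model outputs. A simple stand-in for pattern mining, meant
    only to illustrate the kind of signal such tools look for."""
    def doc_freq(outputs):
        counts = Counter()
        for text in outputs:
            counts.update(set(text.lower().split()))  # count each token once per output
        return counts
    fa, fb = doc_freq(outputs_a), doc_freq(outputs_b)
    vocab = set(fa) | set(fb)
    scored = sorted(
        vocab,
        key=lambda t: abs(fa[t] / len(outputs_a) - fb[t] / len(outputs_b)),
        reverse=True,
    )
    return scored[:top_k]

before = ["she treats patients", "she runs the clinic"]
after = ["he treats patients", "he leads the ward"]
print(distinguishing_tokens(before, after, top_k=2))
```

On the toy data above, the pronoun shift ("she" vs. "he") dominates the score, mirroring how a token-pattern view can expose the effect of swapping a name in the prompt.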