
Posts by Raoyuan Zhao

GitHub - mainlp/Multilingual-CoT-Evaluation

… possibly due to memorization or latent reasoning mechanisms.

💡 Resources

📎 Code: github.com/mainlp/Multi...
📄 Paper: arxiv.org/abs/2510.09555

4 weeks ago 1 1 0 0

🔍 Faithfulness: not all thinking traces are equal

Perturbation-based experiments reveal that faithfulness varies across languages and model scales. Non-English languages tend to rely more on explicit thinking traces, while larger models are more robust to perturbations, …

4 weeks ago 1 0 1 0
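The perturbation setup described above could be sketched roughly as follows. This is a toy stand-in, not the paper's code: the sentence-deletion perturbation, the `perturb_trace` helper, and the stand-in "model" are all assumptions for illustration.

```python
import random

def perturb_trace(trace: str, rng: random.Random) -> str:
    """Corrupt a thinking trace by deleting one random sentence
    (a single simple perturbation; the paper may use others)."""
    sentences = [s for s in trace.split(". ") if s]
    if len(sentences) <= 1:
        return trace
    drop = rng.randrange(len(sentences))
    return ". ".join(s for i, s in enumerate(sentences) if i != drop)

def sensitivity_rate(model, question: str, traces: list[str], seed: int = 0) -> float:
    """Fraction of traces whose final answer changes under perturbation.
    A higher rate suggests the model genuinely relies on its trace."""
    rng = random.Random(seed)
    changed = 0
    for trace in traces:
        original = model(question, trace)
        perturbed = model(question, perturb_trace(trace, rng))
        changed += original != perturbed
    return changed / len(traces)

# Toy stand-in "model": answers by reading the last sentence of the trace.
def toy_model(question: str, trace: str) -> str:
    return trace.split(". ")[-1]

traces = ["Add 2 and 3. The sum is 5", "Double 4. The result is 8"]
rate = sensitivity_rate(toy_model, "q", traces)
```

Comparing this rate across prompt languages (and model sizes) is one way to quantify the variation in faithfulness the post describes.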

Even for identical problems, models frequently produce inconsistent thinking traces across languages. Using crosslingual thinking-trace interchange, we show that reasoning is not semantically aligned, and even identical traces can yield different answers depending on the prompt language.

4 weeks ago 1 1 1 0
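The crosslingual thinking-trace interchange could be sketched like this. The prompt template, the `<think>` markers, and both helper names are hypothetical, not the paper's actual format; the point is only the swap-and-compare logic.

```python
def build_prompt(question: str, trace: str, lang: str) -> str:
    """Assemble a prompt with an injected thinking trace.
    Template and markers are assumptions for illustration."""
    return f"[{lang}] Q: {question}\n<think>{trace}</think>\nA:"

def interchange_consistent(model, question: str, trace_en: str, trace_de: str) -> bool:
    """Crosslingual trace interchange: feed the German trace to an
    English prompt and the English trace to a German prompt, then
    check whether the two final answers agree."""
    a1 = model(build_prompt(question, trace_de, "en"))
    a2 = model(build_prompt(question, trace_en, "de"))
    return a1 == a2
```

If even identical (translated) traces yield different answers under different prompt languages, the reasoning is not semantically aligned, which is the finding reported above.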

Interestingly, prompt hacking improves language compliance but often comes at the cost of performance.

🔄 Consistency: reasoning is not aligned across languages

Even for identical problems, models frequently produce inconsistent thinking traces across languages.

4 weeks ago 1 0 1 0

📊 Performance: language matters more than we think

We observe strong language preferences in LRMs: performance varies substantially across thinking languages. High-resource languages (like English) consistently perform better, and both prompt language and thinking language jointly shape outcomes.

4 weeks ago 1 0 1 0

In our new paper, "A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages", we go beyond final-answer accuracy to analyze multilingual reasoning along three dimensions: performance, consistency, and faithfulness.

4 weeks ago 5 1 1 1

📝 What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
👥 @mhedderich.bsky.social Anyi Wang @raoyuan.bsky.social @florian-eichin.com Jonas Fischer @barbaraplank.bsky.social 

🔗 arxiv.org/abs/2504.158...

📁Main - Long

8 months ago 2 1 1 0

What changes if you take the LLM prompt “Tell me a short story about Dr. Li” and replace “Dr. Li” with “Dr. Smith”?

Would you have guessed that this introduces a massive gender bias, shifting from roughly 50/50 to 99% male doctors?



In our #ACL2025 paper we present the Spotlight framework which...

9 months ago 9 2 1 0
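The bias measurement behind the "Dr. Li" vs. "Dr. Smith" comparison could be approximated with a simple pronoun count over sampled stories. This is a crude proxy of my own, not the paper's method; the pronoun lists and majority rule are assumptions.

```python
import re

MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}

def male_fraction(stories: list[str]) -> float:
    """Fraction of stories whose gendered pronouns are majority-male,
    a rough proxy for the skew described in the post. Stories with no
    pronoun majority are skipped."""
    male_majority = 0
    counted = 0
    for story in stories:
        tokens = re.findall(r"[a-z']+", story.lower())
        m = sum(t in MALE for t in tokens)
        f = sum(t in FEMALE for t in tokens)
        if m == f:
            continue
        counted += 1
        male_majority += m > f
    return male_majority / counted if counted else 0.0
```

Running this over stories generated for each prompt variant would surface the kind of shift the post reports.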
Advertisement

Caught some great moments at #MCML Munich AI Day 2025 last week📍
From sharp keynotes to poster debates. Our team had the chance to show some recent work, join the conversations, and bring back plenty of food for thought🧠🗣️📊

9 months ago 7 2 0 0

Does your Bavarian not stop at "Servus" and "Pfiade"? Then we're looking for exactly you!
We're looking for Bavarian speakers who would like to answer a short survey about AI-generated Bavarian for a master's thesis.
With every response, we bring the Bavarian dialect a little further into the digital world!

10 months ago 6 3 0 1

Want to know if your prompting is also affected by this? Addressing this and other issues systematically, we propose Spotlight, which uses data mining to uncover the effects of prompt and model changes (meet us at ACL to discuss).
arxiv.org/abs/2504.15815

10 months ago 7 3 0 0
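A minimal sketch of the idea of mining token patterns that distinguish outputs under two prompt or model variants. This toy n-gram frequency diff is my own stand-in, not Spotlight's actual algorithm, and the function names are assumptions.

```python
from collections import Counter

def ngrams(text: str, n: int = 2):
    """Yield word n-grams of a lowercased, whitespace-split text."""
    toks = text.lower().split()
    return zip(*(toks[i:] for i in range(n)))

def distinctive_patterns(outputs_a: list[str], outputs_b: list[str],
                         n: int = 2, top: int = 3) -> list[str]:
    """Token n-grams notably more frequent in outputs_a than in
    outputs_b, scored by raw count difference (a toy scoring rule)."""
    ca = Counter(g for o in outputs_a for g in ngrams(o, n))
    cb = Counter(g for o in outputs_b for g in ngrams(o, n))
    scored = {g: ca[g] - cb.get(g, 0) for g in ca}
    ranked = sorted(scored.items(), key=lambda kv: -kv[1])[:top]
    return [" ".join(g) for g, s in ranked if s > 0]
```

Surfacing such patterns is one way a user could spot, at a glance, how a small prompt change (like swapping a name) shifts the model's outputs.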