Posts by Tiancheng Hu
Check out the paper and data for details!
Paper: arxiv.org/abs/2510.17516
Data: huggingface.co/datasets/pit...
Website: simbench.tiancheng.hu
See you Fri, Apr 24, 2026 • 3:15 PM – 5:45 PM @ Pavilion 4 P4-#5206
Another result that really stood out to me and kept me thinking:
Post-training helps LLMs predict consensus opinions, but hurts their ability to model pluralistic disagreement.
So making a model a better chat assistant may make it a worse social simulator. How should we train a good simulator?
Evaluation in this area has been pretty fragmented. If we want LLM social simulation to become scientifically useful, we need a shared way to measure when, how, and why models succeed or fail.
The overall picture is still far from perfect: the best model we tested at release scored 40.8 / 100.
This is the motivation behind SimBench:
a benchmark for group-level human behavior simulation with LLMs.
It brings together 20 datasets spanning moral dilemmas, economic games, psych assessments, and more, so this can be studied in a standardized way rather than through isolated one-off tasks.
SimBench now at #ICLR2026!
Often in social simulations, the goal is not to predict what one specific person will do. It is to estimate how a group will respond, whether in pre-testing a real polling question, or in stress-testing a policy or intervention before running it in the real world.
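Group-level evaluation of this kind comes down to comparing two answer distributions. As an illustrative sketch (not SimBench's actual metric; the response shares below are made up), total variation distance is one simple choice:

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

# Hypothetical 4-option survey question:
human = [0.10, 0.25, 0.40, 0.25]   # observed group response shares
model = [0.05, 0.15, 0.60, 0.20]   # LLM-predicted shares

score = 1.0 - total_variation(human, model)  # 1.0 would be a perfect match
```

The point of a benchmark like SimBench is that whichever distance you pick, it is computed the same way across all 20 datasets.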
It was great fun working on this with Caiqi @dirkhovy.bsky.social Nigel @cambridgeltl.bsky.social @milanlp.bsky.social
7/7 As models become ever more widely deployed in the real world, we need to build models that are not just capable, but genuinely reliable.
Let us know what you think! What LLM failure do you think is most underappreciated as a calibration problem?
Paper: www.techrxiv.org/doi/full/10....
6/7 Our call to action:
Measure — make calibration metrics standard in benchmark reporting. Leaderboards track accuracy; almost none track calibration.
Train — base models start well-calibrated. We need alignment methods that don't destroy that.
Deploy — design interfaces that let uncertainty reach the humans acting on model outputs.
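For concreteness, one standard calibration metric a leaderboard could report is expected calibration error (ECE). A minimal NumPy sketch, using the common equal-width binning convention (not a reference implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: |accuracy - mean confidence| per confidence bin,
    weighted by the fraction of samples falling in that bin."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

A perfectly calibrated model scores 0; a model that answers with confidence 1.0 and is always wrong scores 1.0.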
5/7 The field treats each failure with a bespoke patch: RAG for hallucination, temperature for diversity, safety classifiers for over-refusal.
But these work around the model's broken uncertainty rather than fixing it. That's why the same failures keep resurfacing in new forms.
4/7 We argue these aren't separate bugs. They're four facets of the same problem:
🔴 Probabilistic — can't match requested distributions
🟠 Semantic — confidence ≠ correctness
🔵 Distributional — output diversity collapse
🟢 Metacognitive — can't assess its own competence
3/7 It goes deeper than randomness.
Ask it "what book should I read?" and it defaults to the same WEIRD-centric bestsellers. Ask a nuanced political question and it responds with near-zero variation, hallucinating consensus where none exists.
It can spend 1,000 tokens "reasoning" about 2+3.
2/7 Try this: ask any LLM for a random number between 1 and 10.
Models prefer "7" much more often than anything else. They can describe the uniform distribution correctly when you ask. They just can't sample from it.
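The bias is easy to quantify yourself: collect a batch of replies and check the share of the modal answer. A minimal sketch with a made-up transcript (the `replies` list below is illustrative, not real model output):

```python
from collections import Counter

def modal_share(samples):
    """Most common answer and its share; 0.1 would be unbiased for 1-10."""
    top, n = Counter(samples).most_common(1)[0]
    return top, n / len(samples)

# Hypothetical replies to "give me a random number between 1 and 10":
replies = [7, 7, 3, 7, 7, 5, 7, 7, 7, 4, 7, 7, 8, 7, 7, 7, 3, 7, 7, 7]
top, share = modal_share(replies)  # a uniform sampler would give share near 0.1
```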
1/7 🧵 The GPT-4 technical report featured detailed calibration curves.
Since then, not a single major model release has reported calibration. The field quietly stopped measuring whether models know what they don't know.
Our new position paper argues this is a mistake. Here's why.
A privilege to represent @cambridgeltl.bsky.social @camlangsci.bsky.social @gatescambridge.bsky.social
Huge thanks to the entire team, the Secretariat, the Expert Advisory Panel and all reviewers.
A crucial challenge is evaluation: existing evaluation methods do not reliably reflect how systems perform in real-world settings. Nonetheless, with companies investing hundreds of billions to scale up, we expect model capabilities to continue growing.
Yet, the frontier remains "jagged": models may still fail on simple tasks, like counting objects in an image. Their performance also tends to decline when prompted in languages other than English, which has major implications for global deployment and fairness.
I focused on "Current Capabilities."
We documented rapid advances: AI now achieves Gold-medal performance at the Math Olympiad and agents are increasingly automating useful work, from software engineering to curriculum design.
Proud to contribute to the new International AI Safety Report chaired by @YoshuaBengio, with a fantastic international team!
Every word was weighed to ensure a rigorous, evidence-based view of current AI capabilities and the risks they pose.
A short summary of my section below.
I’m pleased to share the Second Key Update to the International AI Safety Report, which outlines how AI developers, researchers, and policymakers are approaching technical risk management for general-purpose AI systems.
(1/6)
Personalization certainly needs boundaries, and we show what that could look like!
Great fun working on this with @bminixhofer.bsky.social and Prof. Collier at @cambridgeltl.bsky.social.
Special thanks to Paul Martin, and Arcee AI's Mergekit library.
TL;DR: The alignment-calibration trade-off is real, but you don't have to be stuck with the endpoints.
Model merging provides a simple, powerful dial to find the perfect balance of capability and reliability for YOUR application.
Paper here: arxiv.org/abs/2510.17426 (8/8)
Better calibration has benefits beyond accuracy scores. It helps reduce "mode collapse" in generation tasks, leading to more diverse generations (and higher utility too), as measured on NoveltyBench. It improves model performance on group-level simulation tasks too! (7/8)
And it gets better with scale! 📈
The benefits of merging, both the accuracy boost and the stability of the "sweet spot", become even more pronounced in larger, more capable models. This echoes prior work showing that merging bigger models is more effective and stable. (6/8)
The Pareto-superior frontier is a general phenomenon we observe across model families (Gemma, Qwen), sizes, and datasets, where we can consistently find a better-balanced model. We show Qwen 2.5 results on BBH and MMLU-Pro below. (5/8)
It's NOT a zero-sum game between base and instruct.
We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)
Our solution is simple and computationally cheap: model merging.
By interpolating between the well-calibrated base model and its capable but overconfident instruct counterpart, we create a continuous spectrum to navigate this trade-off. No retraining needed.
(3/8)
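In its simplest form this merge is just a weighted average of parameters. The paper uses Arcee AI's Mergekit; the toy sketch below shows the bare linear interpolation idea with made-up tensors standing in for real checkpoints:

```python
import numpy as np

def merge_weights(base, instruct, alpha):
    """Linear interpolation: alpha=0 returns the base model,
    alpha=1 returns the instruct model."""
    return {k: (1 - alpha) * base[k] + alpha * instruct[k] for k in base}

base = {"w": np.array([0.0, 2.0])}      # well-calibrated base weights
inst = {"w": np.array([4.0, 0.0])}      # capable but overconfident instruct weights
merged = merge_weights(base, inst, alpha=0.5)
```

Sweeping `alpha` from 0 to 1 traces out the continuous spectrum described above; the "sweet spot" is wherever accuracy and calibration are jointly best for your application.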