
Posts by Tiancheng Hu

Reasoning Boosts Opinion Alignment in LLMs
From Frédéric Berdoz, Yann Billeter, Yann Vonlanthen, Roger Wattenhofer

5 days ago

Related ICLR papers:

What Do Large Language Models Know About Opinions?
From Erfan Jahanparast Zhiqing Hong @serinachang5.bsky.social

Benchmarking Overton Pluralism in LLMs
From @elinorpd.bsky.social Jiayi Wu @taylor-sorensen.bsky.social Jiaxin Pei @mbakker.bsky.social

5 days ago
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current ev...

Check out the paper and data for details!
Paper: arxiv.org/abs/2510.17516
Data: huggingface.co/datasets/pit...
Website: simbench.tiancheng.hu
See you Fri, Apr 24, 2026 • 3:15 PM – 5:45 PM @ Pavilion 4 P4-#5206

5 days ago

Another result that really stood out to me and kept me thinking:
Post-training helps LLMs predict consensus opinions, but hurts their ability to model pluralistic disagreement.
So making a model a better chat assistant may make it a worse social simulator. How should we train a good simulator?
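One way to see the distinction (an illustrative sketch, not SimBench's actual metric): score a model by the distance between its predicted answer distribution and the real human response distribution. A consensus-tuned model can keep the modal answer right while losing the spread. All numbers below are made up for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical survey question with four answer options.
humans = [0.40, 0.30, 0.20, 0.10]            # real, pluralistic disagreement
consensus_model = [0.90, 0.05, 0.03, 0.02]   # overconfident in the majority answer
pluralistic_model = [0.45, 0.25, 0.20, 0.10]

# Both models get the modal answer right...
assert max(range(4), key=lambda i: consensus_model[i]) == 0
# ...but only one captures how much people actually disagree.
print(total_variation(humans, consensus_model))    # large gap
print(total_variation(humans, pluralistic_model))  # small gap
```

A top-1-accuracy leaderboard would call both models equally good here; a distributional metric would not.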

5 days ago

Evaluation in this area has been pretty fragmented. If we want LLM social simulation to become scientifically useful, we need a shared way to measure when, how, and why models succeed or fail.
The overall picture is still far from perfect: the best model we tested at release scored 40.8 / 100.

5 days ago

This is the motivation behind SimBench:

a benchmark for group-level human behavior simulation with LLMs.

It brings together 20 datasets spanning moral dilemmas, economic games, psych assessments, and more, so this can be studied in a standardized way rather than through isolated one-off tasks.

5 days ago

SimBench now at #ICLR2026!
Often in social simulations, the goal is not to predict what one specific person will do. It is to estimate how a group will respond, whether in pre-testing a real polling question, or in stress-testing a policy or intervention before running it in the real world.

5 days ago

It was great fun working on this with Caiqi @dirkhovy.bsky.social Nigel @cambridgeltl.bsky.social @milanlp.bsky.social

1 month ago
Position: Large Language Model Failures from Hallucination to Homogenization Are Different Facets of Miscalibration This position paper argues that diverse failures of Large Language Models (LLMs), from confident hallucinations to collapsed diversity to brittle safety refusals, are best understood as different face...

7/7 As models become ever more widely deployed in the real world, we need to build models that are not just capable, but genuinely reliable.

Let us know what you think! What LLM failure do you think is most underappreciated as a calibration problem?

Paper: www.techrxiv.org/doi/full/10....

1 month ago

Deploy — design interfaces that let uncertainty reach the humans acting on model outputs.

1 month ago

6/7 Our call to action:

Measure — make calibration metrics standard in benchmark reporting. Leaderboards track accuracy; almost none track calibration.

Train — base models start well-calibrated. We need alignment methods that don't destroy that.

1 month ago

5/7 The field treats each failure with a bespoke patch: RAG for hallucination, temperature for diversity, safety classifiers for over-refusal.

But these work around the model's broken uncertainty rather than fixing it. That's why the same failures keep resurfacing in new forms.

1 month ago

4/7 We argue these aren't separate bugs. They're four facets of the same problem:

🔴 Probabilistic — can't match requested distributions
🟠 Semantic — confidence ≠ correctness
🔵 Distributional — output diversity collapse
🟢 Metacognitive — can't assess its own competence
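For the semantic facet, the standard diagnostic is expected calibration error (ECE). A minimal self-contained sketch, with toy data rather than numbers from the paper:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence; ECE is the weighted gap
    between average confidence and accuracy within each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - acc)
    return ece

# A toy model that is 90% confident but only 60% accurate: a 0.3 calibration gap.
confs = [0.9] * 10
right = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(expected_calibration_error(confs, right))
```

A perfectly calibrated model would score 0: whenever it says 90%, it is right 90% of the time.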

1 month ago

3/7 It goes deeper than randomness.

Ask it "what book should I read?" and it defaults to the same WEIRD-centric bestsellers. Ask a nuanced political question and it responds with near-zero variation, hallucinating consensus where none exists.

It can spend 1,000 tokens "reasoning" about 2+3.

1 month ago

2/7 Try this: ask any LLM for a random number between 1 and 10.

Models prefer "7" far more often than anything else. They can describe uniform probability correctly when asked; they just can't act on it.
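This failure is easy to quantify: compare the model's empirical sample distribution with the uniform distribution it claims to draw from, e.g. via KL divergence. The sample counts below are invented for illustration.

```python
import math
from collections import Counter

def kl_from_uniform(samples, support=range(1, 11)):
    """KL(empirical || uniform) over the given support, in nats."""
    n = len(samples)
    counts = Counter(samples)
    u = 1 / len(support)
    kl = 0.0
    for x in support:
        p = counts[x] / n
        if p > 0:
            kl += p * math.log(p / u)
    return kl

# A truly uniform sampler would give KL ≈ 0; a "7"-heavy model does not.
biased = [7] * 60 + [3] * 15 + [4] * 15 + [8] * 10  # 100 fake samples
print(round(kl_from_uniform(biased), 3))  # → 1.197
```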

1 month ago

1/7 🧵 The GPT-4 technical report featured detailed calibration curves.

Since then, not a single major model release has reported calibration. The field quietly stopped measuring whether models know what they don't know.

Our new position paper argues this is a mistake. Here's why.

1 month ago

A privilege to represent @cambridgeltl.bsky.social @camlangsci.bsky.social @gatescambridge.bsky.social

Huge thanks to the entire team, the Secretariat, the Expert Advisory Panel and all reviewers.

2 months ago

A crucial challenge is evaluation: existing evaluation methods do not reliably reflect how systems perform in real-world settings. Nonetheless, with companies investing hundreds of billions to scale up, we expect model capabilities to continue growing.

2 months ago

Yet, the frontier remains "jagged": models may still fail on simple tasks, like counting objects in an image. Their performance also tends to decline when prompted in languages other than English, which has major implications for global deployment and fairness.

2 months ago

I focused on "Current Capabilities."
We documented rapid advances: AI now achieves Gold-medal performance at the Math Olympiad and agents are increasingly automating useful work, from software engineering to curriculum design.

2 months ago

Proud to contribute to the new International AI Safety Report chaired by @YoshuaBengio, with a fantastic international team!
Every word was weighed to ensure a rigorous, evidence-based view of current AI capabilities and the risks they pose.

A short summary of my section below.

2 months ago

I’m pleased to share the Second Key Update to the International AI Safety Report, which outlines how AI developers, researchers, and policymakers are approaching technical risk management for general-purpose AI systems.
(1/6)

4 months ago

Personalization certainly needs boundaries, and we show what that could look like!

5 months ago

Great fun working on this with @bminixhofer.bsky.social and Prof. Collier at @cambridgeltl.bsky.social.

Special thanks to Paul Martin, and Arcee AI's Mergekit library.

5 months ago
Navigating the Alignment-Calibration Trade-off: A Pareto-Superior Frontier via Model Merging The "alignment tax" of post-training is typically framed as a drop in task accuracy. We show it also involves a severe loss of calibration, making models overconfident, less reliable, and model output...

TL;DR: The alignment-calibration trade-off is real, but you don't have to be stuck with the endpoints.

Model merging provides a simple, powerful dial to find the perfect balance of capability and reliability for YOUR application.

Paper here: arxiv.org/abs/2510.17426 (8/8)

5 months ago

Better calibration has benefits beyond accuracy scores. It helps reduce "mode collapse" in generation tasks, leading to more diverse generations (and higher utility too), as measured on NoveltyBench. It improves model performance on group-level simulation tasks too! (7/8)

5 months ago
And it gets better with scale! 📈
The benefits of merging, both the accuracy boost and the stability of the "sweet spot", become even more pronounced in larger, more capable models. This echoes prior work showing that merging larger models is more effective and stable. (6/8)

5 months ago

The Pareto-superior frontier is a general phenomenon we observe across model families (Gemma, Qwen), sizes, and datasets, where we can consistently find a better-balanced model. We show Qwen 2.5 results on BBH and MMLU-Pro below. (5/8)

5 months ago

It's NOT a zero-sum game between base and instruct.
We find a "sweet spot" merge that is Pareto-superior: it has HIGHER accuracy than both parents while substantially restoring the calibration lost during alignment. (4/8)

5 months ago

Our solution is simple and computationally cheap: model merging.
By interpolating between the well-calibrated base model and its capable but overconfident instruct counterpart, we create a continuous spectrum to navigate this trade-off. No retraining needed.
(3/8)
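The core operation can be sketched in a few lines, assuming two checkpoints with identical architectures; plain Python lists stand in for real weight tensors here.

```python
def merge_state_dicts(base, instruct, alpha):
    """Return alpha * instruct + (1 - alpha) * base, parameter by parameter.
    alpha=0 recovers the base model, alpha=1 the instruct model."""
    merged = {}
    for name, w_base in base.items():
        w_inst = instruct[name]
        merged[name] = [(1 - alpha) * b + alpha * i for b, i in zip(w_base, w_inst)]
    return merged

# Toy "checkpoints": sweeping alpha traces a continuous spectrum between the parents.
base = {"layer.weight": [0.0, 2.0]}
inst = {"layer.weight": [1.0, 0.0]}
print(merge_state_dicts(base, inst, 0.5))  # halfway: {'layer.weight': [0.5, 1.0]}
```

Sweeping alpha and evaluating accuracy and calibration at each point is what exposes the Pareto-superior "sweet spot" between the two endpoints.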

5 months ago