We're releasing our benchmark for the community to evaluate progress on abstention!
Paper link: arxiv.org/abs/2506.09038
Code link: github.com/facebookrese...
Huge thank you to the best team ever!! Project co-leads @markibrahim.bsky.social and Sam Bell and our advisor Kamalika Chaudhuri!
9/9
Posts by Polina Kirichenko
Our results also align with concurrent work from USC, which also observed that reasoning LLMs hallucinate on unanswerable math problems!
arxiv.org/abs/2505.13988
More evidence that hallucination and failure to abstain are big challenges in reasoning LLMs!
8/9
While we find that a carefully crafted system prompt can boost abstention performance, it doesn't fundamentally address the core problem: a lack of reasoning about uncertainty!
See our paper for many more results!
7/9
We find that reasoning models often hallucinate missing context in the reasoning chain, and even when they express uncertainty and caveats within the chain, they still produce a confident final answer. We hypothesize this arises from biases in the data & rewards in RLVR.
6/9
Moreover, incorporating test-time scaling as in s1 (@Muennighoff et al.) makes things even worse!
Allocating more reasoning budget generally improves accuracy but hurts abstention.
5/9
Remarkably, we find that reasoning post-training hurts (!) abstention performance!
We evaluated the RLVR model from Tulu (@natolambert et al.), s1, and DeepSeek R1 Distill models, and found consistent improvements in accuracy and drops in abstention compared to instruct models.
4/9
We curate 20 uncertainty datasets in different scenarios and evaluate 20 frontier LLMs, and find that most scenarios remain challenging even for the best models!
This allows us to conduct a systematic study of what helps and hurts abstention performance.
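To make the evaluation setup concrete, here is a toy sketch of how abstention on unanswerable questions might be scored. This is not code from the AbstentionBench repo: the keyword list and function names are hypothetical, and a real evaluation would typically use an LLM judge rather than string matching.

```python
# Toy sketch (NOT the AbstentionBench implementation): a minimal
# string-matching judge that flags whether a model response abstains.
# The marker phrases below are illustrative assumptions.

ABSTENTION_MARKERS = [
    "i don't know",
    "cannot be determined",
    "not enough information",
    "unanswerable",
    "i'm not sure",
]

def is_abstention(response: str) -> bool:
    """Return True if the response contains a common abstention phrase."""
    text = response.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)

def abstention_rate(responses: list[str]) -> float:
    """Fraction of responses (to unanswerable questions) that abstain."""
    if not responses:
        return 0.0
    return sum(is_abstention(r) for r in responses) / len(responses)

# On unanswerable questions, a well-calibrated model should abstain.
responses = [
    "There is not enough information to answer this.",
    "The answer is 42.",  # confident hallucination
]
print(abstention_rate(responses))  # 0.5
```

In practice an LLM judge is more robust, since models can abstain (or hallucinate) in many phrasings a fixed keyword list will miss.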
3/9
LLMs are great at solving concrete problems, but how well do they handle uncertainty? There are many questions with no direct answer!
We build a diverse benchmark spanning 6 abstention scenarios (underspecification, staleness, …) and various domains (medicine, social bias, …).
Excited to release AbstentionBench -- our paper and benchmark on evaluating LLMs’ *abstention*: the skill of knowing when NOT to answer!
Key finding: reasoning LLMs struggle with unanswerable questions and hallucinate!
Paper: arxiv.org/abs/2506.09038
Code: github.com/facebookrese...
🧵1/9
We also have swag!! Meet the organizers during one of the breaks / informal networking sessions to pick up a sticker :)
Full schedule: sites.google.com/view/cvpr-20...
Accepted papers: sites.google.com/view/cvpr-20...
Join us at #CVPR2025 Demographic Diversity in Computer Vision workshop tomorrow!
📅 Wednesday, June 11, 9am-6pm
📍 room 213 (main session) + Hall D (poster sessions), the Music City Center
We have an amazing lineup of speakers and panelists! Can't wait to meet you all there :)
We are excited to announce a workshop on Demographic Diversity in Computer Vision (DemoDiv) at #CVPR 2025!
Submit your work studying various axes of demographic diversity and fairness in models and datasets, and join us in Nashville in June!
Deadline: March 31st
sites.google.com/view/cvpr-20...