We're releasing our benchmark for the community to evaluate progress on abstention!
Paper link: arxiv.org/abs/2506.09038
Code link: github.com/facebookrese...
Huge thank you to the best team ever!! Project co-leads @markibrahim.bsky.social and Sam Bell and our advisor Kamalika Chaudhuri!
9/9
Posts by Polina Kirichenko
Our results also align with concurrent work from USC, which also observed that reasoning LLMs hallucinate on unanswerable math problems!
arxiv.org/abs/2505.13988
More evidence that hallucination and failure to abstain are big challenges in reasoning LLMs!
8/9
While we find that a carefully crafted system prompt can boost abstention performance, it doesn't fundamentally address the core problem: a lack of reasoning about uncertainty!
See our paper for many more results!
7/9
We find that reasoning models often hallucinate missing context in the reasoning chain, and even when they express uncertainty and caveats within the chain, they still produce a confident final answer. We hypothesize this arises from biases in the data & rewards in RLVR.
6/9
Moreover, incorporating test-time scaling as in s1 (@Muennighoff et al.) makes things even worse!
Allocating more reasoning budget generally improves accuracy but hurts abstention.
5/9
Remarkably, we find that reasoning post-training hurts (!) abstention performance!
We evaluated the RLVR model from Tulu (@natolambert et al.), s1, and DeepSeek R1 Distill models, and found consistent improvements in accuracy and drops in abstention compared to instruct models.
4/9
We curate 20 uncertainty datasets in different scenarios and evaluate 20 frontier LLMs, and find that most scenarios remain challenging even for the best models!
This allows us to conduct a systematic study of what helps and hurts abstention performance.
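To make the evaluation setup concrete, here is a toy sketch of how abstention on unanswerable questions might be scored. This is not code from the AbstentionBench repo: the keyword list and function names are hypothetical, and a real evaluation would typically use an LLM judge rather than string matching.

```python
# Toy sketch (NOT the AbstentionBench implementation): a minimal
# string-matching judge that flags whether a model response abstains.
# The marker phrases below are illustrative assumptions.

ABSTENTION_MARKERS = [
    "i don't know",
    "cannot be determined",
    "not enough information",
    "unanswerable",
    "i'm not sure",
]

def is_abstention(response: str) -> bool:
    """Return True if the response contains a common abstention phrase."""
    text = response.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)

def abstention_rate(responses: list[str]) -> float:
    """Fraction of responses (to unanswerable questions) that abstain."""
    if not responses:
        return 0.0
    return sum(is_abstention(r) for r in responses) / len(responses)

# On unanswerable questions, a well-calibrated model should abstain.
responses = [
    "There is not enough information to answer this.",
    "The answer is 42.",  # confident hallucination
]
print(abstention_rate(responses))  # 0.5
```

In practice an LLM judge is more robust, since models can abstain (or hallucinate) in many phrasings a fixed keyword list will miss.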
3/9
LLMs are great at solving concrete problems, but how well do they handle uncertainty? There are many questions with no direct answer!
We build a diverse benchmark spanning 6 abstention scenarios (underspecification, staleness, …) and various domains (medicine, social bias, …).
Excited to release AbstentionBench -- our paper and benchmark on evaluating LLMs’ *abstention*: the skill of knowing when NOT to answer!
Key finding: reasoning LLMs struggle with unanswerable questions and hallucinate!
Paper: arxiv.org/abs/2506.09038
Code: github.com/facebookrese...
🧵1/9
We also have swag!! Meet the organizers during one of the breaks / informal networking sessions to pick up a sticker :)
Full schedule: sites.google.com/view/cvpr-20...
Accepted papers: sites.google.com/view/cvpr-20...
Join us at #CVPR2025 Demographic Diversity in Computer Vision workshop tomorrow!
📅 Wednesday, June 11, 9am-6pm
📍 room 213 (main session) + Hall D (poster sessions), the Music City Center
We have an amazing lineup of speakers and panelists! Can't wait to meet you all there :)
We are excited to announce a workshop on Demographic Diversity in Computer Vision (DemoDiv) at #CVPR 2025!
Submit your work studying various axes of demographic diversity and fairness in models and datasets, and join us in Nashville in June!
Deadline: March 31st
sites.google.com/view/cvpr-20...