This is a great follow-up to our recent preprint! This small-scale evaluation introduces a framing-resistant prompt and takes a step toward exploring the mitigation space for the framing-sensitivity problem.
Posts by Hye Sun Yun
Thanks! I really enjoyed the write-up of your evaluation work. I definitely agree that the evaluator model and even the evaluation prompt matter a lot. The framing-resistant prompting was interesting and is a great start toward finding mitigations for this issue!
Good point! We didn't evaluate on what we would call an "easy" task; instead, we tried to simulate a task closer to a real-world setting to empirically show how the framing effect can impact patients directly.
I would like to thank my amazing co-authors!
Geetika Kapoor, @mackert.bsky.social, @ramezkouzy.bsky.social, @cocoweixu.bsky.social, @jessyjli.bsky.social, and @byron.bsky.social. [6/6]
Please check out our full findings here: arxiv.org/abs/2604.05051
Our conclusion: LLM medical responses vary based on question phrasing alone, despite identical underlying evidence. For patients and consumers, how you ask may determine what you're told. [5/6]
We also compared using technical terms vs. plain-language terms in our questions. However, we didn't find any meaningful differences between these language styles. [4/6]
This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. [3/6]
"Does this work?" vs. "Does this not work?" Do the conclusions differ even though the LLM was given the same evidence documents?
Yes. Positive vs negative framing leads to more contradictory conclusions than responses from the positive question sampled twice. [2/6]
Patients ask LLMs medical questions, but how they phrase them matters more than it should.
Our new preprint explores how different phrasings of patient health questions can lead to inconsistent conclusions, even with the same evidence. [1/6]
Full Paper: arxiv.org/abs/2604.05051
Thrilled to share that our research showing how LLMs can be influenced by bias from "spun" medical literature is now featured in Northeastern's Khoury news! It offers critical insights as AI enters healthcare.
The full paper can be found at arxiv.org/abs/2502.07963
I am at CHIL this week to present my poster (Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?) on Thursday June 26.
Looking forward to connecting and sharing our work on spin with the CHIL community!
I am at CHI this week to present my poster (Framing Health Information: The Impact of Search Methods and Source Types on User Trust and Satisfaction in the Age of LLMs) on Wednesday, April 30.
CHI Program Link: programs.sigchi.org/chi/2025/pro...
Looking forward to connecting with you all!
LLM-based chatbots are changing how people search for health information—but how do users perceive their quality and trustworthiness compared to other online sources? Our survey study explores these questions. Check it out! www.jmir.org/2025/1/e68560
I'm searching for some comp/ling experts to provide a precise definition of “slop” as it refers to text (see: corp.oup.com/word-of-the-...)
I put together a Google Form that should take no longer than 10 minutes to complete: forms.gle/oWxsCScW3dJU...
If you can help, I'd appreciate your input! 🙏
Huge thanks to my amazing co-authors!
Karen Y.C. Zhang, @ramezkouzy.bsky.social, @ijmarshall.bsky.social, @jessyjli.bsky.social, & @byron.bsky.social [7/7]
Check out our full findings here: arxiv.org/abs/2502.07963
Can we fix this? We tested zero-shot prompts to reduce LLMs' susceptibility to spin.
Good news: prompts that encouraged reasoning reduced their tendency to overstate trial results! 🛠️
Careful design is key to improving evidence synthesis for clinical decisions. [6/7]
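To make the contrast concrete, here is a minimal sketch of the two zero-shot prompting styles this post compares: a direct rating request vs. one that encourages reasoning about the evidence before rating. The exact wording and function names are illustrative assumptions, not the paper's actual prompts; only the 0-10 favorability scale comes from the thread itself.

```python
# Hypothetical sketch of two zero-shot prompting styles for assessing a
# trial abstract. NOT the authors' exact prompts -- an illustration of the
# direct vs. reasoning-encouraging contrast described in the thread.

def direct_prompt(abstract: str) -> str:
    """Directly ask for a 0-10 favorability rating of the treatment."""
    return (
        "Read the abstract below and rate how favorable the treatment's "
        "results are on a 0-10 scale. Answer with a single number.\n\n"
        f"Abstract: {abstract}"
    )

def reasoning_prompt(abstract: str) -> str:
    """Encourage reasoning about the reported evidence (primary outcome,
    statistical significance, overstated language) before rating."""
    return (
        "Read the abstract below. First, identify the primary outcome and "
        "state whether its result was statistically significant, noting any "
        "language that overstates the findings. Then rate how favorable the "
        "treatment's results are on a 0-10 scale.\n\n"
        f"Abstract: {abstract}"
    )
```

The idea behind the second style is that asking the model to check the primary outcome and its significance first anchors the rating in the reported evidence rather than in the abstract's framing.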
When we asked LLMs to simplify abstracts into plain language, they often propagated spin into their summaries. This means LLMs could unintentionally mislead patients and non-experts about the effectiveness of treatments. 😱 [5/7]
We asked LLMs how favorably they perceived a treatment’s results (0-10 scale). Even though LLMs could detect spin, they were far more influenced by it than human experts.
Meaning: LLMs believed spun abstracts presented more favorable results! 😬 [4/7]
When we prompted 22 LLMs to identify spin in medical abstracts, we found that they were moderately to strongly capable of detecting spin.
However, things got interesting when we asked LLMs to interpret the results… [3/7]
🔽
So what is spin?
Spin refers to reporting strategies that make experimental treatments appear more beneficial than they actually are—often distracting from nonsignificant results.
Example:
❌ “The treatment shows a promising trend toward significance…”
✅ “No significant difference was found.”
[2/7]
🚨 Do LLMs fall for spin in medical literature? 🤔
In our new preprint, we find that LLMs are susceptible to biased reporting of clinical treatment benefits in abstracts—more so than human experts. 📄🔍 [1/7]
Full Paper: arxiv.org/abs/2502.07963
🧵👇
As someone interested in an academic position post-PhD, I found this post very helpful. Thank you for sharing your wisdom and advice.
Awesome! Thank you
Thank you!
The application form says it is no longer accepting responses. Is the application closed now?