Posts by Jessy Li

This Treatment Works, Right? Testing Framing Resistance in Medical QA A rapid replication testing whether a framing-resistant prompt can mitigate LLM sensitivity to question phrasing in medical contexts.

Nice research! You may be interested in the small-scale ($4 budget) verification performed by my personal Opus agent here: muninn.austegard.com/blog/this-tr... in which we also introduced a framing-resistant prompt to see how much that would mitigate the effects. 1/3

1 week ago 3 1 2 1

Haha good point! Yes — here we actually do contrast with variations from just the same prompt — that’s actually our baseline condition!

1 week ago 1 0 0 0

If you ask the same question with different framing/phrasing, do language models change their answers? This is super important in medicine because different info can have real consequences! Check out this new work from @hyesunyun.bsky.social

2 weeks ago 5 0 1 0

Heading to #EACL2026! 🇲🇦

Friday 11a Poster Session 6: LMs struggle to perform inferences around discourse connectives aclanthology.org/2026.eacl-lo...

Sunday 5p TeachingNLP workshop: new course on discourse+generation aclanthology.org/2026.teachin...

w/ @kanishka.bsky.social + Daniel Brubaker

4 weeks ago 4 2 0 0

Want to know how well the models can brainstorm connections across different concepts? Super excited about @manyawadhwa.bsky.social’s work on measuring associative creativity!

1 month ago 2 0 0 0
title section of the paper: “Cross-Modal Taxonomic Generalization in (Vision) Language Models” by Tianyang Xu, Marcelo Sandoval-Castañeda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra.

What is the interplay between representations learned from (language) surface forms alone, and those learned from more grounded evidence (e.g., vision)?

Excited to share new work understanding “Cross-modal taxonomic generalization” in (V)LMs

arxiv.org/abs/2603.07474

1/

1 month ago 34 12 1 1

Check out our special theme: new missions for NLP research!

1 month ago 12 5 1 1
Title card of our paper: "Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs" by Junyi Jessy Li, Yang Janet Liu, Valentina Pyatkin, and William Sheffield.

Nearly 2 years ago, @jessyjli.bsky.social, @janetlauyeung.bsky.social, @valentinapy.bsky.social, and I decided that it's time to bring discourse structure to the center of NLP teaching.

2 months ago 11 3 2 0

Check out @asher-zheng.bsky.social's work on quantifying strategic language in dialogue, just appeared in the Dialogue and Discourse journal.
We study non-cooperative moves that are subtle to capture and that modern AI still has trouble comprehending.
Work w/ David_Beaver

2 months ago 6 0 0 0
Title page of our paper: "Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences"

“All bears have a property”, “Some bears have a property”, “Bears have a property” are different in terms of how the property is generalized to a specific bear – a great example of how language constrains thought!

This holds for kids, adults, and according to our new work, (V)LMs! 🧵

2 months ago 26 10 1 2

🚨Be careful with LLMs when you ask health related questions -- even when the model relies on "evidence"! Kaijie's paper reveals a key weakness and the tricky balance between safety and faithfulness 👉

3 months ago 2 0 0 0

Accepted at EACL - excited about Morocco!

3 months ago 5 1 0 0
Screenshot of a figure with two panels, labeled (a) and (b). The caption reads: "Figure 1: (a) Illustration of messages (left) and strings (right) in toy domain. Blue = grammatical strings. Red = ungrammatical strings. (b) Surprisal (negative log probability) assigned to toy strings by GPT-2."

New work to appear @ TACL!

Language models (LMs) are remarkably good at generating novel well-formed sentences, leading to claims that they have mastered grammar.

Yet they often assign higher probability to ungrammatical strings than to grammatical strings.

How can both things be true? 🧵👇

5 months ago 92 21 2 3

Incredibly honored to serve as #EMNLP 2026 Program Chair along with @sunipadev.bsky.social and Hung-yi Lee, and General Chair @andre-t-martins.bsky.social. Looking forward to Budapest!!

(With thanks to Lisa Chuyuan Li who took this photo in Suzhou!)

5 months ago 18 2 0 0

Delighted Sasha's (first year PhD!) work using mech interp to study complex syntax constructions won an Outstanding Paper Award at EMNLP!

Also delighted the ACL community continues to recognize unabashedly linguistic topics like filler-gaps... and the huge potential for LMs to inform such topics!

5 months ago 33 8 1 0

Think your LLMs “understand” words like although/but/therefore? Think again!

They perform at chance when making inferences from certain discourse connectives expressing concession.

6 months ago 19 3 0 0
PLSemanticsBench: Large Language Models As Programming Language Interpreters As large language models (LLMs) excel at code reasoning, a natural question arises: can an LLM execute programs (i.e., act as an interpreter) purely based on a programming language's formal semantics?...

Test your models and see if they just memorize or truly understand!

PLSemanticsBench - where formal meets informal!

arxiv.org/abs/2510.03415

Team: Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Milos Gligoric

6 months ago 2 0 0 0

So what's really happening⁉️
LLMs aren't interpreting rules -- they're recalling patterns.
Their "understanding" is promising... but shallow.

💡It's time to test semantics, not just syntax.💡
To move from surface-level memorization → true symbolic reasoning.

6 months ago 7 0 1 0

Change the rules -- swap (+ with -) or replace (+ with novel symbols) operators -- and accuracy collapses.
Models that were "near-perfect" drop to single digits. 😬

6 months ago 5 1 1 0

🚨 Does your LLM really understand code -- or is it just really good at remembering it?
We built **PLSemanticsBench** to find out.
The results: a wild mix.

✅The Brilliant:
Top reasoning models can execute complex, fuzzer-generated programs -- even with 5+ levels of nested loops! 🤯

❌The Brittle: 🧵

6 months ago 29 6 1 3

Find my students and collaborators at COLM this week!

Tuesday morning: @juand-r.bsky.social and @ramyanamuduri.bsky.social 's papers (find them if you missed it!)

Wednesday pm: @manyawadhwa.bsky.social 's EvalAgent

Thursday am: @anirudhkhatry.bsky.social 's CRUST-Bench oral spotlight + poster

6 months ago 9 5 0 1

We’re hiring faculty as well! Happy to talk about it at COLM!

6 months ago 9 2 0 0

Can we quantify what makes some text read like AI "slop"? We tried 👇

6 months ago 8 1 0 0
Language Models Fail to Introspect About Their Knowledge of Language There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of s...

I’m at #COLM2025 from Wed with:

@siyuansong.bsky.social Tue am introspection arxiv.org/abs/2503.07513

@qyao.bsky.social Wed am controlled rearing: arxiv.org/abs/2503.20850

@sashaboguraev.bsky.social INTERPLAY ling interp: arxiv.org/abs/2505.16002

I’ll talk at INTERPLAY too. Come say hi!

6 months ago 20 6 1 0

On my way to #COLM2025 🍁

Check out jessyli.com/colm2025

QUDsim: Discourse templates in LLM stories arxiv.org/abs/2504.09373

EvalAgent: retrieval-based eval targeting implicit criteria arxiv.org/abs/2504.15219

RoboInstruct: code generation for robotics with simulators arxiv.org/abs/2405.20179

6 months ago 12 4 0 0
Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models Language models (LMs) tend to show human-like preferences on a number of syntactic phenomena, but the extent to which these are attributable to direct exposure to the phenomena or more general propert...

Traveling to my first @colmweb.org🍁

Not presenting anything but here are two posters you should visit:

1. @qyao.bsky.social on Controlled rearing for direct and indirect evidence for datives (w/ me, @weissweiler.bsky.social and @kmahowald.bsky.social), W morning

Paper: arxiv.org/abs/2503.20850

6 months ago 13 5 1 0

Here is a genuine one :) CosmicAI’s AstroVisBench, to appear at #NeurIPS

bsky.app/profile/nsfs...

6 months ago 2 1 0 0

All of us (@kanishka.bsky.social @kmahowald.bsky.social and me) are looking for PhD students this cycle! If computational linguistics/NLP is your passion, join us at UT Austin!

For my areas see jessyli.com

6 months ago 4 5 0 0

Can AI aid scientists amidst their own workflows, when they do not know step-by-step workflows and may not know, in advance, the kinds of scientific utility a visualization would bring?

Check out @sebajoe.bsky.social’s feature on ✨AstroVisBench:

6 months ago 8 3 0 0

📣 NEW HCTS course developed in collaboration with @tephi-tx.bsky.social: AI in Health Communication 📣

Explore responsible applications and best practices for maximizing impact and building trust with @utaustin.bsky.social experts @jessyjli.bsky.social & @mackert.bsky.social.

💻: rebrand.ly/HCTS_AI

7 months ago 2 1 0 1