Performance of a sweep of models on Oolong-synth and Oolong-real. Performance decreases with increasing context length, sometimes steeply.
Can LLMs accurately aggregate information over long, information-dense texts? Not yet…
We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
5 months ago
🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences?
With EVALUESTEER, we find that even the best RMs we tested exhibit their own value/style biases and fail to align with a user's preferences more than 25% of the time. 🧵
6 months ago
🚨New Paper: LLM developers aim to align models with values like helpfulness or harmlessness. But when these conflict, which values do models choose to support? We introduce ConflictScope, a fully-automated evaluation pipeline that reveals how models rank values under conflict.
(📷 xkcd)
6 months ago
🔈When LLMs solve tasks with a mid- or low-resource input or target language, their output quality is poor. We know that. But can we put our finger on what breaks inside the LLM? We introduce the 💥 translation barrier hypothesis 💥 for failed multilingual generation with LLMs. arxiv.org/abs/2506.22724
9 months ago
Thrilled to share that this is out in @pnas.org today! 🎉
We show that linguistic generalization in language models can be due to underlying analogical mechanisms.
Shoutout to my amazing co-authors @weissweiler.bsky.social, @davidrmortensen.bsky.social, Hinrich Schütze, and Janet Pierrehumbert!
11 months ago
When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs:
🧵1/9
10 months ago
RL boosts LLM reasoning—but why stop at math & code? 🤔
Meet Nemotron-CrossThink—a method to scale RL-based self-learning across law, physics, social science & more.
🔥Resulting in a model that reasons broadly, adapts dynamically, & uses 28% fewer tokens for correct answers!
🧵↓
11 months ago
On my way to #NAACL2025 where I'll give a keynote at the noisy text workshop (WNUT), presenting some of the challenges & methods for dialect NLP + also discussing dialect speakers' perspectives!
🗨️ Beyond “noisy” text: How (and why) to process dialect data
🗓️ Saturday, May 3, 9:30–10:30
11 months ago
Excited to announce our #NAACL2025 Oral paper! 🎉✨
We carried out the largest systematic study so far to map the links between upstream choices, intrinsic bias, and downstream zero-shot performance across 131 CLIP vision-language encoders, 26 datasets, and 55 architectures!
11 months ago
Can self-supervised models 🤖 understand allophony 🗣? Excited to share my new #NAACL2025 paper: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment arxiv.org/abs/2502.07029 (1/n)
11 months ago
🚀 Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools appearing at #NAACL2025
This was work done @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson
1/🧵
11 months ago
When interacting with ChatGPT, have you ever wondered whether it would "lie" to you? We found that under pressure, LLMs often choose deception. Our new #NAACL2025 paper, "AI-LIEDAR," reveals models were truthful less than 50% of the time when faced with utility-truthfulness conflicts! 🤯 1/
11 months ago
1/🚨 New paper alert 🚨
RAG systems excel on academic benchmarks, but are they robust to variations in linguistic style?
We find that RAG systems are brittle: small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
1 year ago
THIS IS HUGE! Researchers at McMaster University have discovered a NEW peptide antibiotic that targets a broad range of disease-causing bacteria INCLUDING those RESISTANT to existing antibiotics. This discovery marks the first potential new class of antibiotics in NEARLY 30 YEARS. 🧪🧵⬇️
1 year ago
(📷 CDS building, which looks like a Jenga tower)
Life update: I'm starting as faculty at Boston University
@bucds.bsky.social in 2026! BU has SCHEMES for LM interpretability & analysis, I couldn't be more pumped to join a burgeoning supergroup w/ @najoung.bsky.social @amuuueller.bsky.social. Looking for my first students, so apply and reach out!
1 year ago
You should read Article 1 of the United States Constitution. It's a trip.
1 year ago
There can be only one DB joke. And that is DB.
1 year ago
New preprint by @annikatjuka.bsky.social, Robert Forkel, Christoph Rzymski, and myself available, presenting a new version of the Database of Cross-Linguistic Colexifications (CLICS).
"Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data"
arxiv.org/abs/2503.11377
1 year ago
Finally found a way to shorten faculty meetings.
1 year ago
No student anywhere in America has said something as antisemitic as this
1 year ago
Midwest Speech and Language Days 2025
The meeting will feature keynote addresses by
@mohitbansal.bsky.social, @davidrmortensen.bsky.social, Karen Livescu, and Heng Ji. Plus all of your great talks and posters! nlp.nd.edu/msld25
1 year ago
I’ve been thinking about this reading from Isaiah 58 since I heard it at the Ash Wednesday service today.
“Is not this the fast that I choose:
to loose the bonds of injustice,
to undo the thongs of the yoke,
to let the oppressed go free,
and to break every yoke?
1 year ago
Screenshot of Arxiv paper title, "Rejected Dialects: Biases Against African American Language in Reward Models," and author list: Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap.
Reward models for LMs are meant to align outputs with human preferences—but do they accidentally encode dialect biases? 🤔
Excited to share our paper on biases against African American Language in reward models, accepted to #NAACL2025 Findings! 🎉
Paper: arxiv.org/abs/2502.12858 (1/10)
1 year ago
I read a paper about search, but I can't quite remember what it's called.
1 year ago
Tip of the Tongue Query Elicitation for Simulated Evaluation
🚨New Breakthrough in Tip-of-the-Tongue (TOT) Retrieval Research!
We address data limitations and offer a fresh evaluation method for these complex queries.
Curious how TREC TOT track test queries are created? Check out this thread 🧵 and our paper 📄: arxiv.org/abs/2502.17776
1 year ago
everything is so shitty, read this story about a genuinely good man who saw he had an opportunity to save millions of lives and threw himself into doing so. the world is full of heroes like him.
1 year ago