
Posts by Musashi Hinck

There's plenty of evidence for political bias in LLMs, but very few evals reflect realistic LLM use cases, which is where bias actually matters.

IssueBench, our attempt to fix this, is accepted at TACL, and I will be at #EMNLP2025 next week to talk about it!

New results 🧵

5 months ago

apparently you're supposed to boil the water before you fill the bottom bit, because you want to avoid scalding your data as much as possible

7 months ago

New job ad: Assistant Professor of Quantitative Social Science, Dartmouth College apply.interfolio.com/172357

Please share with your networks. I am the search chair and happy to answer questions!

8 months ago

Exciting work coming from @pranavgoel.bsky.social looking at the effect of ChatGPT and similar tools on web browsing habits.

When people use these tools do they tend to stay on the platform instead of being referred elsewhere? Could this lead to the end of the open web? #pacss2025 #polnet2025

8 months ago

💯, when talking to AI doomers in 2023 I thought they had a naive view of how this technology would be integrated, but now it's looking like I am the naive one (still deeply skeptical of how many of their scenarios play out though)

8 months ago

📢New POSITION PAPER: Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

Despite recent results, SAEs aren't dead! They can still be useful to mech interp, and also much more broadly: across FAccT, computational social science, and ML4H. 🧵

8 months ago

Grateful to win Best Paper at ACL for our work on Fairness through Difference Awareness with my amazing collaborators!! Check out the paper for why we think fairness has both gone too far and, at the same time, not far enough: aclanthology.org/2025.acl-lon...

8 months ago

New working paper: "Survey Estimates of Wartime Mortality," with Gary King, available at gking.harvard.edu/sibs. We provide the first formal proofs of the statistical properties of existing mortality estimators, along with empirical illustrations, to develop intuitions that guide best practices.

8 months ago

Love this! Especially the explicit operationalization of what "bias" they are measuring by specifying the relevant counterfactual.
Definitely an approach that more papers on effects could adopt to better clarify the phenomenon they are studying.

10 months ago

On second thought definitely two!

10 months ago

Iโ€™d do 1 or 2. Definitely get an egg custard (tart) as a snack too :) Enjoy!

10 months ago

New paper with Rebecca Johnson (@rebeccaj.bsky.social) on parental perceptions of using algorithms to allocate scarce resources in schools, now out in Sociological Science (@sociologicalsci.bsky.social):

11 months ago

Thrilled to share that this is out in @pnas.org today! 🎉

We show that linguistic generalization in language models can be due to underlying analogical mechanisms.

Shoutout to my amazing co-authors @weissweiler.bsky.social, @davidrmortensen.bsky.social, Hinrich Schütze, and Janet Pierrehumbert!

11 months ago

๐‡๐จ๐ฐ ๐œ๐š๐ง ๐ฐ๐ž ๐ฉ๐ž๐ซ๐Ÿ๐ž๐œ๐ญ๐ฅ๐ฒ ๐ž๐ซ๐š๐ฌ๐ž ๐œ๐จ๐ง๐œ๐ž๐ฉ๐ญ๐ฌ ๐Ÿ๐ซ๐จ๐ฆ ๐‹๐‹๐Œ๐ฌ?

Our method, Perfect Erasure Functions (PEF), erases concepts perfectly from LLM representations. We analytically derive PEF w/o parameter estimation. PEFs achieve a Pareto-optimal erasure-utility tradeoff backed w/ theoretical guarantees. #AISTATS2025 🧵
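The idea can be illustrated with the much simpler linear case: deleting a single concept direction by projection. This sketch is not PEF itself (PEF derives the erasure function analytically, with optimality guarantees); it is only a minimal baseline showing what "erasing a concept from representations" means, and the function name is my own.

```python
import numpy as np

def erase_direction(X, v):
    """Remove the component of each row of X along concept direction v.

    A minimal linear-erasure baseline: project the representations onto
    the orthogonal complement of a single concept direction. Illustrative
    only; PEF itself is derived analytically with guarantees.
    """
    u = v / np.linalg.norm(v)      # unit concept direction
    return X - np.outer(X @ u, u)  # subtract each row's projection onto u
```

After this transform, any linear probe along `v` reads out exactly zero, while components orthogonal to `v` are untouched; that is the basic erasure-utility tension the paper's tradeoff refers to.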

1 year ago

How does the public conceptualize AI? Rather than self-reported measures, we use metaphors to understand the nuance and complexity of people's mental models. In our #FAccT2025 paper, we analyzed 12,000 metaphors collected over 12 months to track shifts in public perceptions.

11 months ago

💡 Ever wondered how social media and digital technology shape our democracy?

Join our team @CSMaP_NYU as a Research Engineer and help us build the tools that power cutting-edge research on the digital public sphere.

🚀 Apply now!

apply.interfolio.com/165833

11 months ago

It is critical for scientific integrity that we trust our measure of progress.

The @lmarena.bsky.social has become the go-to evaluation for AI progress.

Our release today demonstrates the difficulty in maintaining fair evaluations on the Arena, despite best intentions.

11 months ago

📌

11 months ago

The mods of r/ChangeMyView shared the sub was the subject of a study to test the persuasiveness of LLMs & that they didn't consent. There's a lot that went wrong, so here's a 🧵 unpacking it, along with some ideas for how to do research with online communities ethically. tinyurl.com/59tpt988

11 months ago
Large Language Models in Qualitative Research: Uses, Tensions, and Intentions | Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems

Excited to be presenting "LLMs in Qualitative Research: Uses, Tensions, and Intentions" with @mariannealq.bsky.social at #CHI2025 today!
🆕 paper: dl.acm.org/doi/10.1145/...

11 months ago
Design-based Supervised Learning R package dsl implements design-based supervised learning (DSL) proposed in Egami, Hinck, Stewart, and Wei (2023). DSL is a general estimation framework for using predicted variables in statistical an...

On point 1, you can account for this bias with tools like Design-based Supervised Learning (naokiegami.com/dsl/)!
This framework uses a small number of randomly sampled gold-standard labels to correct bias in downstream estimates based on error-prone proxies like LLM annotations.
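A toy version of the core idea, for the simplest case of estimating a mean. This is a sketch under assumptions (the function name is mine; the actual dsl package implements general design-based regression estimators with proper variance calculations):

```python
import numpy as np

def dsl_mean(proxy, gold, sampled):
    """Bias-corrected mean in the spirit of design-based supervised learning.

    proxy:   error-prone labels for all n units (e.g. LLM annotations)
    gold:    trusted labels, observed only where `sampled` is True
    sampled: boolean mask produced by simple random sampling

    The naive proxy mean is shifted by the proxy's average error, which
    we can estimate without bias on the random gold-labeled subsample.
    """
    correction = np.mean(gold[sampled] - proxy[sampled])
    return np.mean(proxy) + correction
```

Because the subsample is random, the correction term is an unbiased estimate of the proxy's average error, so the corrected mean targets the gold-standard quantity even when the LLM annotations are systematically off.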

11 months ago
Logo for MIB: A Mechanistic Interpretability Benchmark

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 MIB: a Mechanistic Interpretability Benchmark!

1 year ago
A deepseek whale about to overthink until the Terminator tells it to answer right away.

Check out our new paper on benchmarking and mitigating overthinking in reasoning models!

From a simple observational measure of overthinking, we introduce Thought Terminator, a black-box, training-free decoding technique where RMs set their own deadlines and follow them

arxiv.org/abs/2504.13367
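In spirit, the decoding loop looks something like the sketch below. All names and the two-stage structure are my paraphrase rather than the paper's actual implementation; `generate(text, max_tokens)` stands in for any black-box text-completion API:

```python
def answer_with_deadline(generate, prompt, budget=256, answer_tokens=16,
                         deadline_msg="\nTime is up. Give the final answer now:"):
    """Cap the reasoning phase at `budget` tokens; if the model has not
    produced a final answer by then, interrupt it and demand one.

    generate(text, max_tokens) -> str is any black-box completion
    callable; no training or logit access is required.
    """
    reasoning = generate(prompt, max_tokens=budget)
    if "Answer:" in reasoning:                 # finished within the budget
        return reasoning.split("Answer:", 1)[1].strip()
    # Budget exhausted mid-thought: append the deadline and force a short reply.
    forced = generate(prompt + reasoning + deadline_msg,
                      max_tokens=answer_tokens)
    return forced.strip()
```

The key property is that the model is treated purely as a text-in, text-out box, which is what makes the approach training-free and portable across reasoning models.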

1 year ago

ModernBERT or DeBERTaV3?

What's driving performance: architecture or data?

To find out we pretrained ModernBERT on the same dataset as CamemBERTaV2 (a DeBERTaV3 model) to isolate architecture effects.

Here are our findings:

1 year ago
Why do LLaVA Vision-Language Models Reply to Images in English? Musashi Hinck, Carolin Holtermann, Matthew Lyle Olson, Florian Schneider, Sungduk Yu, Anahita Bhiwandiwalla, Anne Lauscher, Shao-Yen Tseng, Vasudev Lal. Findings of the Association for Computational L...

And then our EMNLP paper last year finds that prompting LLaVA-style VLMs causes a loss in fidelity: aclanthology.org/2024.finding...

1 year ago
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ Large language models (LLMs) need to serve everyone, including a global majority of non-English speakers. However, most LLMs today, and open LLMs in particular, are often intended for use in just Engl...

@carolin-holtermann.bsky.social, @paul-rottger.bsky.social and @a-lauscher.bsky.social develop a benchmark for this problem in arxiv.org/abs/2403.03814

1 year ago
Llama 4 system prompt. Highlighted text: "Respond in the language the user speaks to you in, unless they ask otherwise."

Language Fidelity (having an LLM reply in the same language as the user's query) has made its way into the #Llama4 system prompt!

Some interesting work from co-authors and myself on this problem (short thread):
- arxiv.org/abs/2403.03814
- aclanthology.org/2024.finding...
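Measuring fidelity is straightforward once you fix a language identifier: score the fraction of replies whose detected language matches the query's. A minimal sketch; the detector here is a deliberately crude toy stand-in, and in practice you would plug in a real language-ID model:

```python
def language_fidelity(pairs, detect):
    """Share of (query, reply) pairs whose reply is in the query's language.

    `detect` is any callable mapping text to a language code; swap in a
    real language-identification model for actual evaluation.
    """
    matches = sum(detect(query) == detect(reply) for query, reply in pairs)
    return matches / len(pairs)

def toy_detect(text):
    """Crude stand-in detector: any CJK codepoint -> 'zh', else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"
```

For example, `language_fidelity([("你好吗", "我很好"), ("How are you?", "我很好")], toy_detect)` scores 0.5: the second reply ignores the query's language.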

1 year ago
Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmark...

Check out the paper at:
📜Paper: arxiv.org/abs/2504.07072
💿Data: hf.co/datasets/Coh...
🌐Website: cohere.com/research/kal...
Huge thanks to everyone involved! This was a big collaboration 👏

1 year ago

[New preprint!] Do Chinese AI Models Speak Chinese Languages? Not really. Chinese LLMs like DeepSeek are better at French than Cantonese. Joint work with Unso Jo and @dmimno.bsky.social. Link to paper: arxiv.org/pdf/2504.00289
🧵

1 year ago

Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers?

We create ONERULER 🐒, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all!

Our analysis across 26 languages 🧵👇
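The basic construction is easy to sketch: hide one "needle" sentence at a controlled depth in filler text, or omit it entirely so the only correct answer is "none". This is a generic NIAH sketch with made-up names; ONERULER's actual prompts, languages, and task variants differ:

```python
def make_niah_item(filler_sentences, needle, depth=0.5, include_needle=True):
    """Build one needle-in-a-haystack test item.

    depth: relative position of the needle (0 = start of context, 1 = end).
    include_needle=False yields the harder 'nonexistent needle' case,
    where a model that always claims to find something is wrong.
    """
    sents = list(filler_sentences)
    if include_needle:
        sents.insert(round(depth * len(sents)), needle)
    context = " ".join(sents)
    question = ("What is the secret passphrase mentioned above? "
                "If there is none, answer 'none'.")
    expected = needle if include_needle else "none"
    return context, question, expected
```

Sweeping `depth` over [0, 1] and context length over many sizes is what produces the familiar green heatmaps; the nonexistent-needle case adds a column those heatmaps usually omit.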

1 year ago