In a post-CHI blog post, I talk about what I believe technology like the Anthropic Interviewer will mean for qualitative research.
doomscrollingbabel.manoel.xyz/p/qualitativ...
Posts by Ivan Kartáč
New paper on LLMs and research methodology: Justify your prompts! direct.mit.edu/coli/article...
#nlproc
Happy to see this accepted at #ACL2026 main!
Is a study still qualitative if humans don't conduct the interviews or do the analysis? 🤔
My #atscience talk about Lea is now online! If you've only ever heard me talk about NLP, here's a chance to hear me rant about #atproto (the protocol you're reading this on!), social media for researchers, and the free internet.
And if you're interested in helping us build Lea, please reach out!
📍It's official! #INLG2026 is coming to Utrecht, Netherlands, Oct 17-21! Hosted with support from Utrecht University and the local NLP community. Follow us here and check 2026.inlgmeeting.org for updates -- hope to see you there!
If you’re at #EACL2026 today, stop by and chat with @zdenekkasner.cz at the MME workshop about our work on span annotation!
Paper: aclanthology.org/2026.mme-mai...
Huge thanks to my co-authors @tuetschek.bsky.social and Mateusz Lango!
This suggests that integrating LLMs in task-oriented dialogue systems that require reasoning may not be straightforward, and shows that LLMs need to be evaluated in realistic and interactive scenarios, not just static benchmarks with isolated problems.
But what exactly makes dialogue more difficult for these models? 🤔 Our ablations point mainly to multi-turn interaction, but we also observe effects of tool use and role conditioning: acting as a helpful assistant can make LLMs worse at reasoning, even with reasoning-specific instructions.
We apply it to LLMs of different sizes and architectures, and find a large gap in performance between standalone and dialogue settings. This gap remains significant even for large and proprietary models! Failure modes in dialogue include early answers, refusals, or misinterpretation of context.
We create BOULDER, a dynamic benchmark with reasoning problems in multiple travel-related domains. It spans arithmetic, temporal, and spatial reasoning with different degrees of commonsense or formal components. Test examples are generated procedurally with automatically verifiable answers.
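The procedural generation with automatically verifiable answers could look roughly like this minimal sketch. All names, templates, and parameter ranges here are hypothetical illustrations, not taken from BOULDER itself; the point is only that the gold answer is computed alongside the question, so checking model outputs is exact.

```python
import random

def make_arithmetic_problem(seed):
    """Generate a travel-themed arithmetic problem plus its gold answer.

    Hypothetical example: the template and value ranges are illustrative,
    not from the actual BOULDER benchmark.
    """
    rng = random.Random(seed)          # seeded RNG -> reproducible test items
    price = rng.randint(20, 200)       # ticket price in EUR
    n_tickets = rng.randint(2, 6)
    discount = rng.choice([0, 10, 25])  # group discount in percent

    # The gold answer is derived from the same sampled values,
    # so verification is a simple equality check.
    gold = n_tickets * price * (100 - discount) // 100

    question = (
        f"A train ticket costs {price} EUR. How much do {n_tickets} tickets "
        f"cost in total with a {discount}% group discount?"
    )
    return question, gold

question, gold = make_arithmetic_problem(seed=42)
```

Because every item is generated from a seed, the benchmark stays "dynamic": fresh, unseen instances can be drawn at test time, which also mitigates data leakage.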
New preprint: arxiv.org/abs/2603.20133
Does performance on reasoning benchmarks transfer to real-world settings such as task-oriented dialogue?
Not necessarily: our new benchmark tests LLMs on problems framed in both standalone and dialogue settings, and shows that dialogue makes reasoning harder.
Short musings on "cognitive debt" - I'm seeing this in my own work, where excessive unreviewed AI-generated code leads me to lose a firm mental model of what I've built, which then makes it harder to confidently make future decisions. simonwillison.net/2026/Feb/15/...
The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
🐟 LMs as evaluators
🐠 Living benchmarks
🍣 Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
My latest on Substack -- a write-up of the talk I gave at NeurIPS in December.
aiguide.substack.com/p/on-evaluat...
OpeNLGauge comes in two variants: a prompt-based ensemble and a smaller fine-tuned model, both built exclusively on open-weight LLMs (including training data!).
Thanks @tuetschek.bsky.social and @mlango.bsky.social!
We introduce an explainable metric for evaluating a wide range of natural language generation tasks, without any need for reference texts. Given an evaluation criterion, the metric provides fine-grained assessments of the output by highlighting and explaining problematic spans in the text.
Our paper "OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs" has been accepted at #INLG2025!
You can read the preprint here: arxiv.org/abs/2503.11858
#ACL2025NLP in Vienna 🇦🇹 starts today with 23 🤯 @ufal-cuni.bsky.social folks presenting their work both at the main conference and workshops. Check out our main conference papers today and on Wednesday 👇
Today, @tuetschek.bsky.social shared his team's work on evaluating LLM text generation with both human annotation frameworks and LLM-based metrics. Their approach tackles benchmark data leakage and shows how to obtain unseen data for unbiased LLM testing.