
Posts by Ivan Kartáč


In a post-CHI blog post, I talk about what I believe technology like the Anthropic Interviewer will mean for qualitative research.

doomscrollingbabel.manoel.xyz/p/qualitativ...

3 days ago
Justify Your Prompts! Abstract. When you use a large language model (LLM) in your research, you often need to formulate a prompt to elicit some relevant output from the LLM. This step is challenging since (1) LLMs are know...

New paper on LLMs and research methodology: Justify your prompts! direct.mit.edu/coli/article...

#nlproc

5 days ago

Happy to see this accepted at #ACL2026 main!

1 week ago

Is a study still qualitative if humans don't conduct the interviews or do the analysis? 🤔

4 weeks ago

My #atscience talk about Lea is now online! If you've only ever heard me talk about NLP, here's a chance to hear me rant about #atproto (the protocol you're reading this on!), social media for researchers, and the free internet.

And if you're interested in helping us build Lea, please reach out!

1 week ago
INLG2026 The 19th International Natural Language Generation Conference is scheduled to be held in Utrecht, the Netherlands from October 17 to 21, 2026.

📍It's official! #INLG2026 is coming to Utrecht, Netherlands, Oct 17-21! Hosted with support from Utrecht University and the local NLP community. Follow us here and check 2026.inlgmeeting.org for updates -- hope to see you there!

2 weeks ago

If you’re at #EACL2026 today, stop by @zdenekkasner.cz at the MME workshop to chat about our work on span annotation!

Paper: aclanthology.org/2026.mme-mai...

3 weeks ago

Huge thanks to my co-authors @tuetschek.bsky.social and Mateusz Lango!

4 weeks ago

This suggests that integrating LLMs into task-oriented dialogue systems that require reasoning may not be straightforward, and shows that LLMs need to be evaluated in realistic and interactive scenarios, not just static benchmarks with isolated problems.

4 weeks ago

But what exactly makes dialogue more difficult for these models? 🤔 Our ablations point mainly to multi-turn interaction, but we also observe effects of tool use and role conditioning: acting as a helpful assistant can make LLMs worse at reasoning, even with reasoning-specific instructions.

4 weeks ago

We apply it to LLMs of different sizes and architectures, and find a large gap in performance between standalone and dialogue settings. This gap remains significant even for large and proprietary models! Failure modes in dialogue include early answers, refusals, or misinterpretation of context.

4 weeks ago

We create BOULDER, a dynamic benchmark with reasoning problems in multiple travel-related domains. It spans arithmetic, temporal, and spatial reasoning with different degrees of commonsense or formal components. Test examples are generated procedurally with automatically verifiable answers.
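[Editor's note: the post doesn't include the generator itself, but "generated procedurally with automatically verifiable answers" can be sketched roughly as follows. The travel-fare template and all names below are my own illustration, not BOULDER's actual code.]

```python
import random


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Procedurally generate one arithmetic travel-domain problem.

    Because the problem is built from sampled numbers, the gold answer
    is known by construction and can be verified automatically.
    """
    legs = rng.randint(2, 4)
    prices = [rng.randint(20, 300) for _ in range(legs)]
    question = (
        f"A trip has {legs} train legs costing "
        + ", ".join(f"{p} EUR" for p in prices)
        + ". What is the total fare in EUR?"
    )
    return question, sum(prices)


def verify(predicted: str, gold: int) -> bool:
    """Naively extract digits from a model's free-text answer and compare."""
    digits = "".join(ch for ch in predicted if ch.isdigit())
    return digits != "" and int(digits) == gold


rng = random.Random(0)  # seeding makes the dynamic benchmark reproducible
question, gold = make_problem(rng)
```

Since examples are sampled rather than fixed, a fresh test set can be drawn at any time, which also mitigates benchmark data leakage.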

4 weeks ago

New preprint: arxiv.org/abs/2603.20133

Does performance on reasoning benchmarks transfer to real-world settings such as task-oriented dialogue?

Not necessarily: our new benchmark tests LLMs on problems framed in both standalone and dialogue settings, and shows that dialogue makes reasoning harder.
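[Editor's note: the "standalone vs. dialogue" framing can be illustrated with a small sketch. The prompts and the chat-message schema below are my own illustration, not the benchmark's actual prompts.]

```python
# One reasoning problem, framed two ways.
problem = "A train leaves at 14:35 and the ride takes 95 minutes. When does it arrive?"

# Standalone setting: a single static prompt, as in classic benchmarks.
standalone_prompt = f"Solve the following problem.\n\n{problem}"

# Dialogue setting: the same problem embedded in a multi-turn,
# role-conditioned task-oriented conversation, where failure modes like
# early answers or misread context become possible.
dialogue = [
    {"role": "system", "content": "You are a helpful travel assistant."},
    {"role": "user", "content": "Hi, I'm planning a trip to Utrecht."},
    {"role": "assistant", "content": "Happy to help! What do you need?"},
    {"role": "user", "content": problem},
]
```

Scoring the same underlying problem in both framings is what isolates the effect of dialogue on reasoning performance.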

4 weeks ago
How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt This piece by Margaret-Anne Storey is the best explanation of the term cognitive debt I've seen so far. Cognitive debt, a term gaining traction recently, instead communicates the notion that …

Short musings on "cognitive debt" - I'm seeing this in my own work, where excessive unreviewed AI-generated code leads me to lose a firm mental model of what I've built, which then makes it harder to confidently make future decisions: simonwillison.net/2026/Feb/15/...

2 months ago

The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!

Call for papers is out. Topics include:
🐟 LMs as evaluators
🐠 Living benchmarks
🍣 Eval with humans
and more

New for 2026: Opinion & Statement Papers!

Full CFP: gem-workshop.com/call-for-pap...

2 months ago
On Evaluating Cognitive Capabilities in Machines (and Other "Alien" Intelligences) (Apologies for the length of this post, which means it gets cut off in the email version.)

My latest on Substack -- a write-up of the talk I gave at NeurIPS in December.

aiguide.substack.com/p/on-evaluat...

3 months ago

OpeNLGauge comes in two variants: a prompt-based ensemble and a smaller fine-tuned model, both built exclusively on open-weight LLMs (including training data!).

Thanks @tuetschek.bsky.social and @mlango.bsky.social!

7 months ago

We introduce an explainable metric for evaluating a wide range of natural language generation tasks, without any need for reference texts. Given an evaluation criterion, the metric provides fine-grained assessments of the output by highlighting and explaining problematic spans in the text.
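[Editor's note: a minimal sketch of what span-level explainable evaluation output can look like. The field names and the example are my own illustration, not OpeNLGauge's actual schema.]

```python
# A generated text to be evaluated against a given criterion,
# with no reference text required.
output_text = "The hotel has a pool, free parking and a pool."

# Hypothetical span-level evaluation: each problematic span is located
# in the text and paired with a natural-language explanation.
evaluation = {
    "criterion": "repetition",
    "spans": [
        {
            "start": output_text.rindex("a pool"),
            "end": output_text.rindex("a pool") + len("a pool"),
            "explanation": "The pool is mentioned twice; the second mention is redundant.",
            "severity": "minor",
        }
    ],
}

# Each flagged span maps back onto the evaluated text for inspection:
for span in evaluation["spans"]:
    flagged = output_text[span["start"]:span["end"]]
```

This is what makes the metric explainable: instead of a single opaque score, an evaluator (human or automatic) can check exactly which part of the output was penalized and why.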

7 months ago

Our paper "OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs" has been accepted to #INLG2025 conference!

You can read the preprint here: arxiv.org/abs/2503.11858

7 months ago

#ACL2025NLP in Vienna 🇦🇹 starts today with 23 🤯 @ufal-cuni.bsky.social folks presenting their work both at the main conference and workshops. Check out our main conference papers today and on Wednesday 👇

8 months ago
Ondřej Dušek: Evaluating LLM outputs with humans and LLMs. MLPrague, 30 April 2025. Slides: https://bit.ly/mlprague25-od

Slides and links to papers at bit.ly/mlprague25-od 🤓

11 months ago

Today, @tuetschek.bsky.social shared the work of his team on evaluating LLM text generation with both human annotation frameworks and LLM-based metrics. Their approach tackles the benchmark data leakage problem and how to get unseen data for unbiased LLM testing.

11 months ago
Large Language Models as Span Annotators Website for the paper Large Language Models as Span Annotators

How do LLMs compare to human crowdworkers in annotating text spans? 🧑🤖

And how can span annotation help us with evaluating texts?

Find out in our new paper: llm-span-annotators.github.io

Arxiv: arxiv.org/abs/2504.08697

1 year ago