New paper on LLMs and research methodology: Justify your prompts! direct.mit.edu/coli/article...
#nlproc
Posts by Ondrej Dusek
Happy to see this accepted at #ACL2026 main!
AnimatedLLM: Explaining LLMs with Interactive Visualizations
by @zdenekkasner.cz and @tuetschek.bsky.social
aclanthology.org/2026.teachin...
AnimatedLLM lets you explore how LLMs work step by step. Right in your browser, no setup needed. Great for teaching or self-study! 🧵🤖
👉 animatedllm.github.io
LLMs as Span Annotators: A Comparative Study of LLMs and Humans
by @zdenekkasner.cz, @zouharvi.bsky.social, @patuchen.bsky.social, @ivankartac.bsky.social, K. Onderková, @oplatek.bsky.social, Dimitra Gkatzia, @saad.me.uk, @tuetschek.bsky.social & Simone Balloccu
aclanthology.org/2026.mme-mai...
Slides here 🙂: bit.ly/loresmt26-od
Keynote by @tuetschek.bsky.social on LLM evaluation: standard metrics fall short of catching subtle errors, and human evaluation lacks consistency. Span-level error annotation with LLM-as-judge ensembles ➡️ LLMs matching trained human annotators. At the LowResMT workshop.
#EACL2026 in Rabat 🇲🇦 starts tomorrow and @ufal.mff.cuni.cz folks will present their research. Don't miss our presentations 👇
New preprint: arxiv.org/abs/2603.20133
Does performance on reasoning benchmarks transfer to real-world settings such as task-oriented dialogue?
Not necessarily: our new benchmark tests LLMs on problems framed in both standalone and dialogue settings, and shows that dialogue makes reasoning harder.
Claude: For each example I can do a web search and then make a LLM call with the results...
Me: Why an LLM call? Can't you just figure it out yourself?
Claude: You're right, I am the LLM!
The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
🐟 LMs as evaluators
🐠 Living benchmarks
🍣 Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
If you think labeling text spans with LLMs is easy, you probably have not tried it yourself (we have! 🙃).
Any method you can think of – be it tagging, matching, or indexing – has flaws.
In our new preprint, we tested them all 💪 We also proposed how to improve one of them.
arxiv.org/abs/2601.16946
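To get a flavor of why the "matching" strategy is fiddly: when an LLM is asked to quote an error span verbatim, it often misquotes the source text, so the quoted span has to be located (fuzzily) back in the original. A minimal sketch of that step, not the paper's implementation; the helper name and the 0.9 similarity threshold are illustrative choices:

```python
import difflib

def locate_span(text: str, span: str):
    """Find character offsets of an LLM-proposed span in the source text.
    Falls back to fuzzy matching when the model misquotes the span."""
    start = text.find(span)
    if start != -1:
        return start, start + len(span)
    # Fuzzy fallback: slide a window of the span's length over the text
    # and keep the closest match above a similarity threshold.
    best, best_ratio = None, 0.9
    for i in range(len(text) - len(span) + 1):
        window = text[i:i + len(span)]
        ratio = difflib.SequenceMatcher(None, window, span).ratio()
        if ratio > best_ratio:
            best, best_ratio = (i, i + len(span)), ratio
    return best  # None if nothing is close enough
```

Even this toy version has the flaws the thread alludes to: ties between near-identical windows, offsets that drift by a character or two on misquotes, and no answer at all when the model paraphrases instead of quoting.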
I'd say you can do that in Czech but it's really harsh and may look a bit petty, depending on circumstances.
Do you often find yourself explaining how LLMs work to your students, parents, kids or other teachers?
AnimatedLLM can make your life easier! animatedllm.github.io
#NLP #NLProc @ufal.mff.cuni.cz @tuetschek.bsky.social
🤦‍♂️
🔤 Pretraining Language Models with LoRA and Artificial Languages
Nalin Kumar, Mateusz Lango, @tuetschek.bsky.social
aclanthology.org/2025.babylm-...
Pretraining on constructed artificial languages with LoRA adapters affects how language models develop.
🎓 You are an LLM teaching a smaller model everything you know: Multi-task pretraining of language models with LLM-designed study plans
Wiktor Kamzela, Mateusz Lango, @tuetschek.bsky.social
aclanthology.org/2025.babylm-...
📚 SRS-Stories: Vocabulary-constrained multilingual story generation for language learning
Wiktor Kamzela, Mateusz Lango & @toonietuesday.bsky.social
aclanthology.org/2025.emnlp-i...
LLM-generated stories teach vocabulary while reviewing learned words via spaced repetition; they are also more grammatical than standard generation.
🤖 LLM Agents Implement an NLG System from Scratch
Mateusz Lango, Ondrej Dusek
aclanthology.org/2025.emnlp-i...
LLM agents can autonomously build interpretable, rule-based RDF-to-text generators from scratch, combining LLM capabilities with the transparency and reliability of traditional rule-based systems.
👥 Can Large Language Models Personalize Dialogues to Generational Styles?
P. Balestrucci, @tuetschek.bsky.social, L. Anselma, A. Mazzei
aclanthology.org/2025.finding...
Can LLMs adapt dialogues to generational styles? We show with P-MultiWoZ that models capture patterns from Boomers to Gen Z.
📊 Real-World Summarization: When Evaluation Reaches Its Limits
@patuchen.bsky.social , @tuetschek.bsky.social , @saad.me.uk
aclanthology.org/2025.finding...
For hotel highlights, metrics like word overlap surprisingly match human judgments better than complex methods. LLMs unreliable as evaluators.
Picture of the One Pillar Pagoda in Hanoi, a pagoda raised up over a green pond surrounded by greenery
The registration page for #INLG2025 is now live! Join us in Vietnam on Oct 29 – Nov 2 for the best conference on #NaturalLanguageGeneration
2025.inlgmeeting.org/registration...
Curious to see what will be presented? Check out this list of accepted papers! 2025.inlgmeeting.org/accepted-pap...
Check out the slides from our SCAI'2025 #convsearch workshop co-located with @ijcai.org #IJCAI2025 on LLMs, retrieval & QA, recommendations, negotiations, evaluation and transparency
scai.info/scai-2025
@patuchen.bsky.social @maik-froebe.bsky.social @tuetschek.bsky.social @mila-quebec.bsky.social
Our paper "OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs" has been accepted to #INLG2025 conference!
You can read the preprint here: arxiv.org/abs/2503.11858
It's fine by me if they generate it, as long as it works and they know how... but I've been getting loads of roughly plausible but non-functional code, with hallucinated API calls etc. 😒. Not that many emojis though (in docs only).
FreshTab: Sourcing Fresh Resources for Table-to-Text Generation Evaluation
by @navitas.bsky.social, @oplatek.bsky.social, @zdenekkasner.bsky.social, @tuetschek.bsky.social
ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation
by @navitas.bsky.social, M. Lango, @patuchen.bsky.social, @tuetschek.bsky.social
Part of the ReproHum challenge to reproduce human evaluations from NLP papers, testing how reproducible evaluation studies are
OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
by @ivankartac.bsky.social, M. Lango, @tuetschek.bsky.social
arxiv.org/abs/2503.11858
Open-source NLG evaluation metric that explains errors and matches human judgments without proprietary models
#ACL2025NLP in Vienna 🇦🇹 starts today with 23 🤯 @ufal-cuni.bsky.social folks presenting their work both at the main conference and workshops. Check out our main conference papers today and on Wednesday 👇