New paper on LLMs and research methodology: Justify your prompts! direct.mit.edu/coli/article...
#nlproc
Posts by Ondrej Dusek
Happy to see this accepted at #ACL2026 main!
AnimatedLLM: Explaining LLMs with Interactive Visualizations
by @zdenekkasner.cz and @tuetschek.bsky.social
aclanthology.org/2026.teachin...
AnimatedLLM lets you explore how LLMs work step by step. Right in your browser, no setup needed. Great for teaching or self-study! 🧵🤖
👉 animatedllm.github.io
LLMs as Span Annotators: A Comparative Study of LLMs and Humans
by @zdenekkasner.cz, @zouharvi.bsky.social, @patuchen.bsky.social, @ivankartac.bsky.social, K. Onderková, @oplatek.bsky.social, Dimitra Gkatzia, @saad.me.uk, @tuetschek.bsky.social & Simone Balloccu
aclanthology.org/2026.mme-mai...
Slides here 🙂: bit.ly/loresmt26-od
Keynote by @tuetschek.bsky.social on LLM evaluation: standard metrics fall short of catching subtle errors, and human evaluation lacks consistency. Span-level error annotation with LLM-as-judge ensembles ➡️ LLMs matching trained human annotators. At the LowResMT workshop.
#EACL2026 in Rabat 🇲🇦 starts tomorrow and @ufal.mff.cuni.cz folks will present their research. Don't miss our presentations 👇
New preprint: arxiv.org/abs/2603.20133
Does performance on reasoning benchmarks transfer to real-world settings such as task-oriented dialogue?
Not necessarily: our new benchmark tests LLMs on problems framed in both standalone and dialogue settings, and shows that dialogue makes reasoning harder.
Claude: For each example I can do a web search and then make a LLM call with the results...
Me: Why an LLM call? Can't you just figure it out yourself?
Claude: You're right, I am the LLM!
The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
🐟 LMs as evaluators
🐠 Living benchmarks
🍣 Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
If you think labeling text spans with LLMs is easy, you probably have not tried it yourself (we have! 🙃).
Any method you can think of – be it tagging, matching, or indexing – has flaws.
In our new preprint, we tested them all 💪 We also proposed how to improve one of them.
arxiv.org/abs/2601.16946
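To get a flavor of why the "matching" strategy is fiddly: when an LLM is asked to quote an error span verbatim, it often misquotes the source text, so the quoted span has to be located (fuzzily) back in the original. A minimal sketch of that step, not the paper's implementation; the helper name and the 0.9 similarity threshold are illustrative choices:

```python
import difflib

def locate_span(text: str, span: str):
    """Find character offsets of an LLM-proposed span in the source text.
    Falls back to fuzzy matching when the model misquotes the span."""
    start = text.find(span)
    if start != -1:
        return start, start + len(span)
    # Fuzzy fallback: slide a window of the span's length over the text
    # and keep the closest match above a similarity threshold.
    best, best_ratio = None, 0.9
    for i in range(len(text) - len(span) + 1):
        window = text[i:i + len(span)]
        ratio = difflib.SequenceMatcher(None, window, span).ratio()
        if ratio > best_ratio:
            best, best_ratio = (i, i + len(span)), ratio
    return best  # None if nothing is close enough
```

Even this toy version has the flaws the thread alludes to: ties between near-identical windows, offsets that drift by a character or two on misquotes, and no answer at all when the model paraphrases instead of quoting.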
I'd say you can do that in Czech but it's really harsh and may look a bit petty, depending on circumstances.
Do you often find yourself explaining how LLMs work to your students, parents, kids or other teachers?
AnimatedLLM can make your life easier! animatedllm.github.io
#NLP #NLProc @ufal.mff.cuni.cz @tuetschek.bsky.social
🤦‍♂️
🔤 Pretraining Language Models with LoRA and Artificial Languages
Nalin Kumar, Mateusz Lango, @tuetschek.bsky.social
aclanthology.org/2025.babylm-...
Pretraining on constructed artificial languages with LoRA adapters affects how language models develop.
🎓 You are an LLM teaching a smaller model everything you know: Multi-task pretraining of language models with LLM-designed study plans
Wiktor Kamzela, Mateusz Lango, @tuetschek.bsky.social
aclanthology.org/2025.babylm-...
📚 SRS-Stories: Vocabulary-constrained multilingual story generation for language learning
Wiktor Kamzela, Mateusz Lango & @toonietuesday.bsky.social
aclanthology.org/2025.emnlp-i...
LLM-generated stories teach vocabulary while reviewing learned words via spaced repetition; they are also more grammatical than standard generation.
🤖 LLM Agents Implement an NLG System from Scratch
Mateusz Lango, Ondrej Dusek
aclanthology.org/2025.emnlp-i...
LLM agents can autonomously build interpretable, rule-based RDF-to-text generators from scratch, combining LLM capabilities with the transparency and reliability of traditional rule-based systems.
👥 Can Large Language Models Personalize Dialogues to Generational Styles?
P. Balestrucci, @tuetschek.bsky.social, L. Anselma, A. Mazzei
aclanthology.org/2025.finding...
Can LLMs adapt dialogues to generational styles? We show with P-MultiWoZ that models capture patterns from Boomers to Gen Z.
📊 Real-World Summarization: When Evaluation Reaches Its Limits
@patuchen.bsky.social , @tuetschek.bsky.social , @saad.me.uk
aclanthology.org/2025.finding...
For hotel highlights, metrics like word overlap surprisingly match human judgments better than complex methods. LLMs unreliable as evaluators.
Picture of the One Pillar Pagoda in Hanoi, a pagoda raised up over a green pond surrounded by greenery
The registration page for #INLG2025 is now live! Join us in Vietnam on Oct 29 – Nov 2 for the best conference on #NaturalLanguageGeneration
2025.inlgmeeting.org/registration...
Curious to see what will be presented? Check out this list of accepted papers! 2025.inlgmeeting.org/accepted-pap...
Check out the slides from our SCAI'2025 #convsearch workshop co-located with @ijcai.org #IJCAI2025 on LLMs, retrieval & QA, recommendations, negotiations, evaluation and transparency
scai.info/scai-2025
@patuchen.bsky.social @maik-froebe.bsky.social @tuetschek.bsky.social @mila-quebec.bsky.social
Our paper "OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs" has been accepted to #INLG2025 conference!
You can read the preprint here: arxiv.org/abs/2503.11858
It's fine by me if they generate it, as long as it works and they know how... but I've been getting loads of roughly plausible but non-functional code, with hallucinated API calls etc. 😒. Not that many emojis though (in docs only).
FreshTab: Sourcing Fresh Resources for Table-to-Text Generation Evaluation
by @navitas.bsky.social, @oplatek.bsky.social, @zdenekkasner.bsky.social, @tuetschek.bsky.social
ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation
by @navitas.bsky.social, M. Lango, @patuchen.bsky.social, @tuetschek.bsky.social
Part of the ReproHum challenge to reproduce human evaluations from NLP papers, testing how reproducible evaluation studies are
OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
by @ivankartac.bsky.social, M. Lango, @tuetschek.bsky.social
arxiv.org/abs/2503.11858
Open-source NLG evaluation metric that explains errors and matches human judgments without proprietary models
#ACL2025NLP in Vienna 🇦🇹 starts today with 23 🤯 @ufal-cuni.bsky.social folks presenting their work both at the main conference and workshops. Check out our main conference papers today and on Wednesday 👇