If you're working on your submission to #RetroEval2026, the Symposium on Natural Language Generation Evaluations in honor of @ehudreiter.bsky.social, I've got some good news!
The deadline has been extended to 28 April for archival submissions!
Updated deadlines, etc, here: retroeval.github.io
Posts by Saad Mahamood
New paper on LLMs and research methodology: Justify your prompts! direct.mit.edu/coli/article...
#nlproc
Congratulations 🥳
Pro-tip when writing a thesis proposal: The importance of a clearly written proposal, one that makes its relevance and potential impact accessible to a broad audience, cannot be overstated. Make sure the *why* is present for each of your research questions and for the connections between them.
Computers have (almost) no expiry date. Let your old device live on for a good cause.
We take used computers, smartphones, tablets and their peripherals, refurbish them, and then pass them on free of charge to people who cannot afford such devices.
📍It's official! #INLG2026 is coming to Utrecht, Netherlands, Oct 17-21! Hosted with support from Utrecht University and the local NLP community. Follow us here and check 2026.inlgmeeting.org for updates -- hope to see you there!
LLMs as Span Annotators: A Comparative Study of LLMs and Humans
by @zdenekkasner.cz, @zouharvi.bsky.social, @patuchen.bsky.social, @ivankartac.bsky.social, K. Onderková, @oplatek.bsky.social, Dimitra Gkatzia, @saad.me.uk, @tuetschek.bsky.social & Simone Balloccu
aclanthology.org/2026.mme-mai...
Picard management tip: Speak candidly, and encourage others to do likewise.
Diversity in car types is not a bad idea. It makes sense to have different types of cars, because that gives everyone more energy independence.
New blog: Questions from readers of my book
A group who is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses.
ehudreiter.com/2026/03/03/q...
If you're not on the SIGGEN mailing list or in the NLG Discord server, you might not have seen that Barkavi Sundarajan has been leading a reading group about @ehudreiter.bsky.social's new book "Natural Language Generation".
Join us Friday, 27 Feb, at 2pm UK time: discord.gg/hysgkK7Q?eve...
A kitchen work surface with a yellow Phillips screwdriver, a partially disassembled UK electrical appliance plug and a new Euro two pin socket.
Still converting my UK appliances eight years on to the two pin Schuko plug…
This is why I always verify the results when I use AI
New blog: Don't ignore omissions!
Evaluation of LLMs focuses on accuracy and hallucination. Completeness and omission are also important: does the text include all the key information? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.
ehudreiter.com/2026/02/11/d...
A cool collection of @ehudreiter.bsky.social's Eureka Moments over several decades as a researcher: ehudreiter.com/2026/01/30/m...
I am pleased to announce the 1st call for papers for a special symposium on Natural Language Generation evaluations. This is in honour of @ehudreiter.bsky.social's career and forthcoming retirement, and will look back at how evaluations have changed and what is still left unaddressed.
retroeval.github.io
1,265 kWh have been burned to date on MoltBots posting on MoltBook. Might not be a huge amount in the grand scheme of things but it’s a complete waste of energy on AI agents role-playing cringy sci-fi tropes and attempting to crypto-scam each other.
www.linkedin.com/posts/sashal...
I’m very happy to see our paper getting accepted 🎉
[2025-10-20 11:31:25] system: Carlo joined
Carlo: Hi! Good Day! I'm Carlo (a real person) from the Dropbox Sales Team and welcome to my chat window. I hope you're having a great day! Hi there! What can I help you with today?
Cabel: Hey Carlo. We've almost filled our 10TB of space on Dropbox. We want to stay on Dropbox, but we realized that we have so much storage available to us on Google Drive right now (78TB!!), we're now planning to migrate everybody off of Dropbox and over to Google Drive. But, I like Dropbox! Is there any path forward for us with Dropbox other than having to upgrade from $15 to $24/user/month?
Carlo: Thank you! Do you have any other questions or concerns today?
Cabel: Uhhhhh…… hahah That's the only question I have today! :)
Carlo: Please feel free to reach back to us anytime. I'd appreciate it if you can give me feedback on how I performed today. Have a great day and stay safe!
System: Carlo ended the chat
we filled up our 10TB of panic dropbox storage, and realized we had 78TB free over on google drive.
but i like dropbox! i wanted to give them a chance at a saving throw — maybe we could stay on our tier and pay for extra space? — so i chatted their sales department.
reader…… i was not retained
🏹 Job alert: Postdoctoral Researcher in Logical Reasoning and Machine Learning at Helsinki University
📍 Helsinki 🇫🇮
📅 Apply by Feb 5th
🔗 https://bit.ly/4jYDoO0
Image of an Arabian oryx, a hoofed animal with very long horns
The Arabian oryx was extinct in the wild.
By the end of the 1960s, only a few animals remained in a handful of zoos.
A conservation breeding programme was started with 12 animals from the Los Angeles and Phoenix zoos.
Today there are 10,000 animals again, many of them released into the wild.
All of them descend from those 12 animals.
About the PhD

Audits and evaluation of AI systems, and of the broader context that AI systems operate in, have become central to conceptualising, quantifying, measuring and understanding the operations, failures, limitations, underlying assumptions, and downstream societal implications of AI systems. Existing AI audit and evaluation efforts are fractured, done in a siloed and ad-hoc manner, and with little deliberation and reflection around conceptual rigour and methodological validity.

This PhD is for a candidate who is passionate about exploring what conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like. This requires grappling with questions such as: What does it mean to represent “ground truth” in proxies, synthetic data, or computational simulation? How do we reliably measure abstract and complex phenomena? What are the epistemological or methodological implications of the quantification and measurement approaches we choose to employ? Particularly, what underlying presuppositions, values, or perspectives do they entail? How do we ensure the lived experiences of impacted communities play a critical role in the development and justification of measurement metrics and proxies?

Through exploration of these questions, the candidate is expected to engage with core concepts in the philosophy of science, history of science, Black feminist epistemologies, and similar schools of thought to develop an in-depth understanding of existing practices with the aim of applying it to advance shared standards and best practice in AI evaluation. The candidate is expected to integrate empirical (for example, through analysis or evaluation of existing benchmarks) or practical (for example, by executing evaluation of AI systems) components into the overall work.
are you disgruntled by the current safety evaluation landscape? curious about what conceptual clarity, methodological soundness and rigour in AI evaluation might look like? if so, consider coming to dublin and doing a phd with me
apply here: aial.ie/hiring/phd-a...
Do you often find yourself explaining how LLMs work to your students, parents, kids or other teachers?
AnimatedLLM can make your life easier! animatedllm.github.io
#NLP #NLProc @ufal.mff.cuni.cz @tuetschek.bsky.social
SIMON WILLISON'S WEBLOG: Your job is to deliver code you have proven to work
In all of the debates about the value of AI assistance in software development there's one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the "code review" process to handle the rest.
This is rude, a waste of other people's time, and is honestly a dereliction of duty as a software developer.
Your job is to deliver code you have proven to work.
Good luck and Godspeed.
simonwillison.net/2025/Dec/18/...
↓
New blog: Do LLMs cheat on benchmarks?
LLMs often “cheat” on benchmarks via data contamination and reward hacking. This problem is getting worse, perhaps because of perverse incentives. Need to move beyond benchmarks and start measuring real-world impact.
ehudreiter.com/2025/12/08/d...
The crazy part with the retracted Nature paper: If it wasn’t for the infographic being so obviously machine generated then this would have gone under the radar for some time.
What a crazy day in research. First the OpenReview-ICLR 2026 data leak and now this LLM-generated garbage in Nature.
it's 2025 and we're attacking AIs with poetry
Happy birthday to the Soviet linguist Yuri Knorozov who casually deciphered the Mayan script in 1952 and got pissed when editors removed his cat as co-author on papers or cropped her out of his author headshot (the only picture of himself he even liked)
View of a karst landscape from the top of one of the peaks. A body of water is in the foreground with peaks filling the rest of the left and right side of the frame, as well as the background. A late afternoon sky fills the upper third of the image.
Ninh Bình was pretty cool