Saad Mahamood (@saad.me.uk) Bsky

Symposium on Natural Language Generation Evaluations RetroEval 2026 Aberdeen, United Kingdom, 1-2 June, 2026

If you're working on your submission to #RetroEval2026, the Symposium on Natural Language Generation Evaluations in honor of @ehudreiter.bsky.social, I've got some good news!

The deadline has been extended to 28 April for archival submissions!

Updated deadlines, etc, here: retroeval.github.io

15 hours ago 0 2 0 0

Justify Your Prompts! Abstract. When you use a large language model (LLM) in your research, you often need to formulate a prompt to elicit some relevant output from the LLM. This step is challenging since (1) LLMs are know...

New paper on LLMs and research methodology: Justify your prompts! direct.mit.edu/coli/article...

#nlproc

5 days ago 9 7 1 1

Congratulations 🥳

1 week ago 1 0 0 0

Pro-tip when writing a thesis proposal: The importance of a clearly written proposal, accessible to a broad audience on its relevance and potential impact, cannot be overstated. Ensure the *why* is present for each of your research questions and for the connections between them.

2 weeks ago 0 0 0 0

Computers have (almost) no expiry date. Let your old device live on for a good cause.
We take used computers, smartphones, tablets and their peripherals, refurbish them, and then give them away for free to people who cannot afford such devices.

2 weeks ago 239 73 7 1

INLG2026 The 19th International Natural Language Generation Conference is scheduled to be held in Utrecht, the Netherlands from October 17 to 21, 2026.

📍It's official! #INLG2026 is coming to Utrecht, Netherlands, Oct 17-21! Hosted with support from Utrecht University and the local NLP community. Follow us here and check 2026.inlgmeeting.org for updates -- hope to see you there!

2 weeks ago 15 9 0 3

LLMs as Span Annotators: A Comparative Study of LLMs and Humans
by @zdenekkasner.cz, @zouharvi.bsky.social, @patuchen.bsky.social, @ivankartac.bsky.social, K. Onderková, @oplatek.bsky.social, Dimitra Gkatzia, @saad.me.uk, @tuetschek.bsky.social & Simone Balloccu
aclanthology.org/2026.mme-mai...

3 weeks ago 12 3 1 0

Picard management tip: Speak candidly, and encourage others to do likewise.

3 weeks ago 126 22 1 3

Diversity of car types is not a bad idea. It makes sense for us to have different types of cars, because that gives everyone more energy independence.

4 weeks ago 0 0 1 0

New blog: Questions from readers of my book

A group who is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses.

ehudreiter.com/2026/03/03/q...

1 month ago 1 2 0 0

Join the some NLG people Discord Server! Check out the some NLG people community on Discord - hang out with 232 other members and enjoy free voice and text chat.

If you're not on the SIGGEN mailing list or in the NLG Discord server, you might not have seen that Barkavi Sundararajan has been leading a reading group about @ehudreiter.bsky.social's new book "Natural Language Generation".

Join us Friday, 27 Feb, at 2pm UK time: discord.gg/hysgkK7Q?eve...

1 month ago 2 2 1 0

A kitchen work surface with a yellow Phillips screwdriver, a partially disassembled UK electrical appliance plug and a new Euro two-pin socket.

Eight years on, still converting my UK appliances to the two-pin Schuko plug…

2 months ago 0 0 0 0

This is why I always verify the results when I use AI

2 months ago 63 7 8 3

Don't ignore omissions! Most semantic evaluation of LLMs focuses on accuracy and hallucination. These are very important, but it is also important to look at completeness and omission; does the generated text include all …

New blog: Don't ignore omissions!

Evaluation of LLMs focuses on accuracy and hallucination. Completeness and omission are also important; does the text include all the key information? Omissions are a huge problem in medical NLG, and in other NLG tasks as well.

ehudreiter.com/2026/02/11/d...

2 months ago 5 2 0 0

My Eureka moments in research The most exciting and rewarding moments of my research career were when I discovered something new and exciting about NLG, language, etc. I describe a few of these “Eureka” moments. I h…

A cool collection of @ehudreiter.bsky.social's Eureka Moments over several decades as a researcher: ehudreiter.com/2026/01/30/m...

2 months ago 4 1 0 0

Symposium on Natural Language Generation Evaluations RetroEval 2026 Aberdeen, United Kingdom, 1-2 June, 2026

I am pleased to announce the 1st call for papers for a special symposium on Natural Language Generation evaluations. Held in honour of @ehudreiter.bsky.social's career and forthcoming retirement, it will look back at how evaluations have changed and what is still left unaddressed.
retroeval.github.io

2 months ago 2 1 0 2

I feel like I'm obliged to climb out from the rock I live under to comment on the energy demands of the latest obsession in AI -- 𝗺𝗼𝗹𝘁𝗯𝗼𝗼𝗸! 🤖 Doing a bit of digging, it looks like the OpenClaw repo re... | Dr. Sasha...

1,265 kWh have been burned to date on MoltBots posting on MoltBook. Might not be a huge amount in the grand scheme of things but it’s a complete waste of energy on AI agents role-playing cringy sci-fi tropes and attempting to crypto-scam each other.

www.linkedin.com/posts/sashal...

2 months ago 25 6 0 1

I’m very happy to see our paper getting accepted 🎉

2 months ago 2 0 0 0

[2025-10-20 11:31:25] system: Carlo joined
Carlo: Hi! Good Day! I'm Carlo (a real person) from the Dropbox Sales Team and welcome to my chat window. I hope you're having a great day! Hi there! What can I help you with today?
Cabel: Hey Carlo. We've almost filled our 10TB of space on Dropbox. We want to stay on Dropbox, but we realized that we have so much storage available to us on Google Drive right now (78TB!!), we're now planning to migrate everybody off of Dropbox and over to Google Drive. But, I like Dropbox! Is there any path forward for us with Dropbox other than having to upgrade from $15 to $24/user/month?
Carlo: Thank you! Do you have any other questions or concerns today?
Cabel: Uhhhhh…… hahah That's the only question I have today! :)
Carlo: Please feel free to reach back to us anytime. I'd appreciate it if you can give me feedback on how I performed today. Have a great day and stay safe!
System: Carlo ended the chat

we filled up our 10TB of panic dropbox storage, and realized we had 78TB free over on google drive.

but i like dropbox! i wanted to give them a chance at a saving throw — maybe we could stay on our tier and pay for extra space? — so i chatted their sales department.

reader…… i was not retained

2 months ago 231 7 24 3

Postdoctoral Researcher in Logical Reasoning and Machine Learning

🏹 Job alert: Postdoctoral Researcher in Logical Reasoning and Machine Learning at Helsinki University

📍 Helsinki 🇫🇮
📅 Apply by Feb 5th
🔗 https://bit.ly/4jYDoO0

2 months ago 8 7 0 0

Image of an Arabian oryx, a hoofed animal with very long horns

The Arabian oryx was extinct in the wild.
At the end of the 1960s, only a few animals were left in some zoos.
A conservation breeding programme was started with 12 animals from the Los Angeles and Phoenix zoos.
Today there are 10,000 animals again, many of them released into the wild.
All of them descend from those 12 animals.

2 months ago 1115 141 34 2

About the PhD Audits and evaluation of AI systems — and the broader context that AI systems operate in — have become central to conceptualising, quantifying, measuring and understanding the operations, failures, limitations, underlying assumptions, and downstream societal implications of AI systems. Existing AI audit and evaluation efforts are fractured, done in a siloed and ad-hoc manner, and with little deliberation and reflection around conceptual rigour and methodological validity. This PhD is for a candidate that is passionate about exploring what a conceptually cogent, methodologically sound, and well-founded AI evaluation and safety research might look like. This requires grappling with questions such as: What does it mean to represent “ground truth” in proxies, synthetic data, or computational simulation? How do we reliably measure abstract and complex phenomena? What are the epistemological or methodological implications of quantification and measurement approaches we choose to employ? Particularly, what underlying presuppositions, values, or perspectives do they entail? How do we ensure the lived experiences of impacted communities play a critical role in the development and justification of measurement metrics and proxies? Through exploration of these questions, the candidate is expected to engage with core concepts in the philosophy of science, history of science, Black feminist epistemologies, and similar schools of thought to develop an in-depth understanding of existing practices with the aim of applying it to advance shared standards and best practice in AI evaluation. The candidate is expected to integrate empirical (for example, through analysis or evaluation of existing benchmarks) or practical (for example, by executing evaluation of AI systems) components into the overall work.

are you disgruntled by the current safety evaluation landscape? curious about what conceptual clarity, methodological soundness and rigour in AI evaluation might look like? if so, consider coming to dublin and doing a phd with me

apply here: aial.ie/hiring/phd-a...

4 months ago 78 54 2 3

AnimatedLLM - Explaining LLMs with Interactive Visualizations Understand how large language models work under the hood.

Do you often find yourself explaining how LLMs work to your students, parents, kids or other teachers?

AnimatedLLM can make your life easier! animatedllm.github.io

#NLP #NLProc @ufal.mff.cuni.cz @tuetschek.bsky.social

4 months ago 8 2 1 2

SIMON WILLISON'S WEBLOG: Your job is to deliver code you have proven to work
In all of the debates about the value of AI assistance in software development there's one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the "code review" process to handle the rest.
This is rude, a waste of other people's time, and is honestly a dereliction of duty as a software developer.
Your job is to deliver code you have proven to work.

Good luck and Godspeed.

simonwillison.net/2025/Dec/18/...
↓

4 months ago 2 1 0 0

Do LLMs cheat on benchmarks LLMs often “cheat” on benchmarks via data contamination and reward hacking. Unfortunately, this problem seems to be getting worse, perhaps because of perverse incentives. If we want to …

New blog: Do LLMs cheat on benchmarks

LLMs often “cheat” on benchmarks via data contamination and reward hacking. This problem is getting worse, perhaps because of perverse incentives. Need to move beyond benchmarks and start measuring real-world impact.

ehudreiter.com/2025/12/08/d...

4 months ago 4 1 0 0

The crazy part about the retracted Nature paper: if the infographic hadn't been so obviously machine-generated, this would have gone under the radar for some time.

4 months ago 0 0 0 0

What a crazy day in research. First the OpenReview ICLR 2026 data leak, and now this LLM-generated garbage in Nature.

4 months ago 2 0 1 0

it's 2025 and we're attacking AIs with poetry

5 months ago 34 10 1 2

Happy birthday to the Soviet linguist Yuri Knorozov who casually deciphered the Mayan script in 1952 and got pissed when editors removed his cat as co-author on papers or cropped her out of his author headshot (the only picture of himself he even liked)

5 months ago 2100 555 22 35

View of a karst landscape from the top of one of the peaks. A body of water is in the foreground with peaks filling the rest of the left and right side of the frame, as well as the background. A late afternoon sky fills the upper third of the image.

Ninh Bình was pretty cool

5 months ago 4 1 0 0
