6/ Thank you for the hard work, the patience, and the brilliance! I learned so much from all of you.
More to come soon :)
Posts by Martin Gubri
5/ And to all my brilliant co-authors since 2023: @dnnslmr.bsky.social, Sangdoo, @hwaranlee.bsky.social, Siwon, Haritz, Tommaso, @anmolgoel.bsky.social, @cemde.bsky.social, Ahmed, Elena, Salman, Asim, @arubique.bsky.social, @adamdaviesnlp.bsky.social, @elinguyen.bsky.social, Michael, Erik.
4/ Deepest gratitude to Joon @coallaoh.bsky.social. Working with you reshaped how I think about research, mentorship, and what top scientific work takes. Thank you for the trust, the standards, and everything you taught me.
3/ My role evolved from co-author to main supervisor. I learned so many things: how to supervise people and bring out their best work, how to run a research agenda, and how to broaden my focus from adversarial examples to the wider field of trustworthy AI.
2/ One number I'm especially proud of: 8 out of 8. Every single paper led by our research interns or me was accepted at a top venue on the first submission! NeurIPS, ACL, EMNLP, ICLR, NAACL Findings, NeurIPS D&B. A streak I still can't quite believe 😊
View from Tübingen
1/ My contract at @parameterlab.bsky.social ended last week, after 2.5 years (since Sept 2023, with some collaboration before).
I had the chance to lead research on trustworthy AI for LLMs alongside an incredible group of people.
(Neckarfront. All Tübingen researchers have to post it once!)
🎉 Our privacy collapse paper has been accepted at #ACL 2026 (main)!
Contextual privacy is fragile: fine-tune an LLM on benign data, and it can overshare personal information.
The failure is silent: safety suites don't measure contextual privacy, which is a problem now that most applications are agentic.
Kudos to Asim Mohamed for his first first-authored paper!
Paper: arxiv.org/abs/2510.18019
Code: github.com/asimzz/steam
🔹 Google Translate is enough. No high-quality translator needed
🔹 Robust to translator mismatch: attacker uses GPT-4o or DeepSeek, STEAM still holds
🔹 Works even when the attack language is not in the candidate pool
🔹 Compatible with any watermarking method
🔹 Backward-compatible
STEAM avg AUC >0.965 across all language categories
vs. X-SIR: +0.25 AUC, +44%p TPR@1%
vs. X-KGW: +0.22 AUC, +31%p TPR@1%
Biggest gains on Tamil and Hindi, exactly where existing methods fail most. 🌍
Overview of STEAM
STEAM 🚂 works at detection time, not at generation time.
Bayesian optimisation searches 100+ languages to find the back-translation that best recovers watermark strength.
"home" marked green → translate to Korean → "집" → back-translate → "home", still green ✅
Semantic clustering
Languages with larger tokenizer vocabularies have higher watermark robustness.
Current multilingual watermarking groups words across languages and assigns them the same watermark signal (e.g. green).
But low-resource languages have almost no full words in the tokenizer vocabulary, so no clusters form, and the method collapses to the monolingual baseline.
Translation attack against LLM watermarking.
Robustness (AUC) x languages
🌍 We've made LLM watermarking equally robust across all languages we studied, while scaling to 100+ languages!
Even state-of-the-art watermarks can be removed by translating to another language, e.g. Tamil. This hits hardest in low-resource languages, where moderation tools are already weak.
🧵
The D&B track now has a larger scope and a new name: Evaluation & Datasets.
It focuses on evaluation itself as a scientific object.
It's great to finally have a venue for critical analyses of evaluation and for negative results. That was sorely missing in ML!
The NeurIPS deadline has been announced!
Add the 6th of May to your calendar :)
LLM agents include far more than a model: framework, orchestration, tools, error handling, etc. These harness engineering choices matter, but they're rarely compared.
MASEval makes that straightforward. I'm very proud to have supervised its development.
Give it a look! ⬇️
With pleasure :)
These models do not seem to be supported by together.ai, but there is a form to request new models: docs.together.ai/docs/fine-tu...
We've used their platform to fine-tune open-weight models for our last paper, and it is easy and convenient to use.
If you want to get up to speed on what all the benchmarks mean, I wrote a bunch of digests for the popular ones over on the ngrok blog. Designed for people who are interested, but not enough to go read all the papers.
ngrok.com/blog/ai-benc...
New paper out!🎉
One of our most surprising findings: fine-tuning an LLM on debugging code has unexpected side effects on contextual privacy. The model learns from printing variables that internal state is OK to share, then generalises this to social situations🤯
A🧵below👇
🎉Thrilled to share that both of my #ICLR2026 submissions were accepted (2/2)!
🪩 DISCO, Efficient Benchmarking: bsky.app/profile/arub...
🩺 Dr.LLM, Dynamic Layer Routing: www.linkedin.com/posts/ahmed-...
Huge thanks to my co-authors, especially first authors @arubique.bsky.social & Ahmed Heakl!
Kudos to the GAPERON team @nthngdy.bsky.social @wissamantoun.bsky.social Rian Touchent, @rachelbawden.bsky.social Éric de la Clergerie, @bensagot.bsky.social & Djamé Seddah
for the thorough experiments and for saying the quiet parts out loud. We need more papers like this :)
Full paper: arxiv.org/abs/2510.25771
Key sections:
5.3: Deliberate contamination experiments
7.2.1: Evidence of contamination in existing models
7.2.2: How quality filters amplify leakage
7.2.3 + Appendix C: Game-theoretic modelling
How to fix this?
Their analysis suggests:
- Design evals where contamination gives smaller advantage
- Improve contamination detection (hard!)
- Make the community value generation quality over benchmark scores <- my favorite :)
Until then, the game theory says: contaminate (knowingly or not)
They even model contamination as a game theory problem (Section 7.2.3 + Appendix C)
Key insight: if benchmark advantage (m) exceeds direct costs (α), and detection probability p(c) is smooth enough, there exists an equilibrium contamination level c* > 0 where *no one* benefits from decontaminating
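A toy numeric illustration of that equilibrium claim (my own functional forms, not the paper's): a lab picks a contamination level c in [0, 1] to maximise benchmark gain m·c minus direct cost α·c minus a smooth expected detection penalty L·p(c). When m > α, the optimum sits strictly above zero.

```python
# Hypothetical payoff model: linear benefit and cost, smooth quadratic
# detection probability p(c) = c^2, detection penalty L.
m, alpha, L = 1.0, 0.3, 2.0

def p(c):
    return c ** 2  # smooth detection probability (assumed form)

def payoff(c):
    return m * c - alpha * c - L * p(c)

# grid search for the contamination level that maximises payoff
grid = [i / 1000 for i in range(1001)]
c_star = max(grid, key=payoff)
print(c_star)  # interior optimum (m - alpha) / (2 * L) = 0.175 > 0
```

With these assumed forms the first-order condition gives c* = (m − α)/(2L), so any positive gap between benchmark advantage and direct cost pushes the equilibrium contamination level above zero, matching the thread's "contaminate (knowingly or not)" conclusion.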
I don't fully agree with this choice, but I understand it. And I strongly appreciate the honesty.
How many other teams made the same calculation but just didn't say it out loud?
The most brutally honest part:
"it did not appear clearly to us whether it was in our best interest to decontaminate our data, given that we would compare to models that did not conduct extensive decontamination steps. As a result, we decided to not conduct such decontamination effort"
Pushing for "educational" data inadvertently surfaces MCQ benchmarks. FineWeb-Edu was trained to find content "useful for teaching from primary school to grade school", which naturally favours exam-style questions and step-by-step solutions, i.e. exactly what MMLU and GSM8k look like.
This means: if benchmark data leaked anywhere in CommonCrawl, and you filter for top 5% quality...
You've just 20x'd your contamination rate!
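The 20x is simple arithmetic. A back-of-envelope sketch with made-up corpus numbers: if every leaked benchmark sample scores in the top 5% of the quality filter, keeping only that top 5% shrinks the corpus 20-fold while keeping all of the contamination.

```python
# Hypothetical corpus: 1M documents, 100 leaked benchmark samples.
corpus_size = 1_000_000
leaked = 100
keep_frac = 0.05  # quality filter keeps the top 5%

rate_before = leaked / corpus_size                # contamination pre-filter
rate_after = leaked / (corpus_size * keep_frac)   # all leaks survive filtering
print(round(rate_after / rate_before))            # → 20
```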
The culprit? Data quality filters.
They ran a "Benchmark In A Haystack" experiment and found that quality classifiers systematically rank benchmark samples as high quality.
The DCLM classifier puts *all* MMLU and GSM8k samples in the top 5 percentiles.