6/ Thank you for the hard work, the patience, and the brilliance! I learned so much from all of you.
More to come soon :)
Posts by Martin Gubri
5/ And to all my brilliant co-authors since 2023: @dnnslmr.bsky.social, Sangdoo, @hwaranlee.bsky.social, Siwon, Haritz, Tommaso, @anmolgoel.bsky.social, @cemde.bsky.social, Ahmed, Elena, Salman, Asim, @arubique.bsky.social, @adamdaviesnlp.bsky.social, @elinguyen.bsky.social, Michael, Erik.
4/ Deepest gratitude to Joon @coallaoh.bsky.social. Working with you reshaped how I think about research, mentorship, and what top scientific work takes. Thank you for the trust, the standards, and everything you taught me.
3/ My role evolved from co-author to main supervisor. I learned so many things: how to supervise people and bring out their best work, how to run a research agenda, and how to broaden my focus from adversarial examples to the wider field of trustworthy AI.
2/ One number I'm especially proud of: 8 out of 8. Every single paper led by our research interns or me was accepted at a top venue on the first submission! NeurIPS, ACL, EMNLP, ICLR, NAACL Findings, NeurIPS D&B. A streak I still can't quite believe 😊
View from Tübingen
1/ My contract at @parameterlab.bsky.social ended last week, after 2.5 years (since Sept 2023, with some collaboration before).
I had the chance to lead research on trustworthy AI for LLMs alongside an incredible group of people.
(Neckarfront. All Tübingen researchers have to post it once!)
🎉 Our privacy collapse paper has been accepted at #ACL 2026 (main)!
Contextual privacy is fragile: fine-tune an LLM on benign data, and it can overshare personal information.
The failure is silent: safety suites don't measure contextual privacy, which is a problem now that most applications are agentic.
Kudos to Asim Mohamed for his first first-authored paper!
Paper: arxiv.org/abs/2510.18019
Code: github.com/asimzz/steam
🔹 Google Translate is enough. No high-quality translator needed
🔹 Robust to translator mismatch: attacker uses GPT-4o or DeepSeek, STEAM still holds
🔹 Works even when the attack language is not in the candidate pool
🔹 Compatible with any watermarking method
🔹 Backward-compatible
STEAM avg AUC >0.965 across all language categories
vs. X-SIR: +0.25 AUC, +44%p TPR@1%
vs. X-KGW: +0.22 AUC, +31%p TPR@1%
Biggest gains on Tamil and Hindi, exactly where existing methods fail most. 🌍
Overview of STEAM
STEAM 🚂 works at detection time, not at generation time.
Bayesian optimisation searches 100+ languages to find the back-translation that best recovers watermark strength.
"home" marked green → translate to Korean → "집" → back-translate → "home", still green ✅
Semantic clustering
Languages with larger tokenizer vocabularies have higher watermark robustness.
Current multilingual watermarking groups words across languages and assigns them the same watermark signal (e.g. green).
But low-resource languages have almost no full words in the tokenizer vocabulary, so no clusters form, and the method collapses to the monolingual baseline.
Translation attack against LLM watermarking.
Robustness (AUC) x languages
🌍 We've made LLM watermarking equally robust across all languages we studied, while scaling to 100+ languages!
Even state-of-the-art watermarks can be removed by translating to another language, e.g. Tamil. This hits hardest in low-resource languages, where moderation tools are already weak.
🧵
The D&B track now has a larger scope and a new name: Evaluation & Datasets.
It focuses on evaluation itself as a scientific object.
It's great to finally have a venue for critical analyses of evaluation and for negative results. That was sorely missing in ML!
The NeurIPS deadline has been announced!
Add the 6th of May to your calendar :)
LLM agents include far more than a model: framework, orchestration, tools, error handling, etc. These harness engineering choices matter, but they're rarely compared.
MASEval makes that straightforward. I'm very proud to have supervised its development.
Give it a look! ⬇️
With pleasure :)
These models do not seem to be supported by together.ai, but there is a form to request new models: docs.together.ai/docs/fine-tu...
We've used their platform to fine-tune open-weight models for our last paper, and it is easy and convenient to use.
If you want to get up to speed on what all the benchmarks mean, I wrote a bunch of digests for the popular ones over on the ngrok blog. Designed for people who are interested, but not enough to go read all the papers.
ngrok.com/blog/ai-benc...
New paper out!🎉
One of our most surprising findings: fine-tuning an LLM on debugging code has unexpected side effects on contextual privacy. The model learns from printing variables that internal state is OK to share, then generalises this to social situations🤯
A🧵below👇
🎉Thrilled to share that both of my #ICLR2026 submissions were accepted (2/2)!
🪩 DISCO, Efficient Benchmarking: bsky.app/profile/arub...
🩺 Dr.LLM, Dynamic Layer Routing: www.linkedin.com/posts/ahmed-...
Huge thanks to my co-authors, especially first authors @arubique.bsky.social & Ahmed Heakl!
Kudos to the GAPERON team @nthngdy.bsky.social @wissamantoun.bsky.social Rian Touchent, @rachelbawden.bsky.social Éric de la Clergerie, @bensagot.bsky.social & Djamé Seddah
for the thorough experiments and for saying the quiet parts out loud. We need more papers like this :)
Full paper: arxiv.org/abs/2510.25771
Key sections:
5.3: Deliberate contamination experiments
7.2.1: Evidence of contamination in existing models
7.2.2: How quality filters amplify leakage
7.2.3 + Appendix C: Game-theoretic modelling
How to fix this?
Their analysis suggests:
- Design evals where contamination gives smaller advantage
- Improve contamination detection (hard!)
- Make the community value generation quality over benchmark scores <- my favorite :)
Until then, the game theory says: contaminate (knowingly or not)
They even model contamination as a game theory problem (Section 7.2.3 + Appendix C)
Key insight: if benchmark advantage (m) exceeds direct costs (α), and detection probability p(c) is smooth enough, there exists an equilibrium contamination level c* > 0 where *no one* benefits from decontaminating
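A toy numeric illustration of that equilibrium claim (my own functional forms, not the paper's): a lab picks a contamination level c in [0, 1] to maximise benchmark gain m·c minus direct cost α·c minus a smooth expected detection penalty L·p(c). When m > α, the optimum sits strictly above zero.

```python
# Hypothetical payoff model: linear benefit and cost, smooth quadratic
# detection probability p(c) = c^2, detection penalty L.
m, alpha, L = 1.0, 0.3, 2.0

def p(c):
    return c ** 2  # smooth detection probability (assumed form)

def payoff(c):
    return m * c - alpha * c - L * p(c)

# grid search for the contamination level that maximises payoff
grid = [i / 1000 for i in range(1001)]
c_star = max(grid, key=payoff)
print(c_star)  # interior optimum (m - alpha) / (2 * L) = 0.175 > 0
```

With these assumed forms the first-order condition gives c* = (m − α)/(2L), so any positive gap between benchmark advantage and direct cost pushes the equilibrium contamination level above zero, matching the thread's "contaminate (knowingly or not)" conclusion.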
I don't fully agree with this choice, but I understand it. And I strongly appreciate the honesty.
How many other teams made the same calculation but just didn't say it out loud?
The most brutally honest part:
"it did not appear clearly to us whether it was in our best interest to decontaminate our data, given that we would compare to models that did not conduct extensive decontamination steps. As a result, we decided to not conduct such decontamination effort"
Pushing for "educational" data inadvertently surfaces MCQ benchmarks. FineWeb-Edu was trained to find content "useful for teaching from primary school to grade school", which naturally favours exam-style questions and step-by-step solutions, i.e. exactly what MMLU and GSM8k look like.
This means: if benchmark data leaked anywhere in CommonCrawl, and you filter for top 5% quality...
You've just 20x'd your contamination rate!
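The 20x is simple arithmetic. A back-of-envelope sketch with made-up corpus numbers: if every leaked benchmark sample scores in the top 5% of the quality filter, keeping only that top 5% shrinks the corpus 20-fold while keeping all of the contamination.

```python
# Hypothetical corpus: 1M documents, 100 leaked benchmark samples.
corpus_size = 1_000_000
leaked = 100
keep_frac = 0.05  # quality filter keeps the top 5%

rate_before = leaked / corpus_size                # contamination pre-filter
rate_after = leaked / (corpus_size * keep_frac)   # all leaks survive filtering
print(round(rate_after / rate_before))            # → 20
```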
The culprit? Data quality filters.
They ran a "Benchmark In A Haystack" experiment and found that quality classifiers systematically rank benchmark samples as high quality.
The DCLM classifier puts *all* MMLU and GSM8k samples in the top 5 percentiles.