Jokes aside, I am currently working on a project that tries to answer exactly that! TL;DR: I think that for byte-level models the Transformer needs to do the heavy lifting, and my concerns lie at the other end of the network, namely how you go from bytes to well-represented semantic units.
Posts by Nathan Godey
Are you thinking about this other paper? 😂 aclanthology.org/2022.finding...
Thanks @mcognetta.bsky.social! The good news: there was a kind of follow-up at NeurIPS (arxiv.org/pdf/2411.00680). The bad news: we tried scaling up the causal LM in a big academic pretraining project and got diminishing returns at 8B after a few hundred billion tokens (arxiv.org/pdf/2510.25771, sec. 3.3).
Actually, we show in the paper that there exists a set of weights that gets 100% accuracy whenever D>1 (regardless of V). I was initially playing around with this language because its matrix of next-token probabilities given context has a nice full-rank structure, which made it relevant even with respect to expressivity.
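Not necessarily the paper's construction, but one way to see why D>1 suffices for the pure-repetition language (this sketch and its variable names are my own illustration): place the V token embeddings at distinct points on the unit circle and tie the unembedding to the embedding. Each token is then its own nearest neighbor in logit space, so the argmax always predicts a repeat, for any V.

```python
import numpy as np

V, D = 1000, 2  # vocab size, hidden dim (any V works once D > 1)

# Tied embedding/unembedding: V distinct points on the unit circle.
theta = 2 * np.pi * np.arange(V) / V
E = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # (V, D)

# For the language "t t t t ...", take the hidden state after token t to be
# E[t]; the logits are E @ E[t]. Self-similarity is maximal (cosine of 0),
# so the argmax over the vocab is t itself: the model always predicts a repeat.
logits = E @ E.T                      # (V, V): logits for every possible context token
preds = logits.argmax(axis=1)
assert (preds == np.arange(V)).all()  # 100% accuracy on the repetition language
```

The gap between the self-logit (1.0) and the largest cross-logit (cos(2π/V)) shrinks as V grows but never closes, which is why the construction works regardless of V.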
Thanks @emilevankrieken.com! Watching the models fail to learn such a simple task was fun. I'd love to try other synthetic languages; maybe the V/d "threshold" depends on the data?
Many other details, theoretical analysis and experiments in the paper:
arxiv.org/abs/2603.10145
Great thanks to my co-author and advisor @yoavartzi.com!
We also show that ~100M-parameter models struggle to learn a synthetic language in which a token is simply repeated forever (e.g. AAAA...) as the vocab size increases. Even when the pattern is trivial for the Transformer backbone to learn, the LM head can jam the training signal.
For trained LLM families, the LM head suppresses 95-99% of the gradient (in norm) during backpropagation, and it gets worse as d/V decreases. The training signal is distorted, and the backbone Transformer only sees a partial view of the batch.
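A back-of-the-envelope way to see where numbers like 95-99% can come from (an illustration with a random head, not the paper's measurement on trained models): the backbone gradient is Wᵀg, so any component of the logit gradient g orthogonal to the head's d-dimensional column space is annihilated. For an isotropic g, the surviving fraction of squared norm is ≈ d/V.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 4096  # toy vocab size

for d in (32, 256, 1024):
    W = rng.standard_normal((V, d))   # LM head: maps hidden h (dim d) -> logits (dim V)
    Q, _ = np.linalg.qr(W)            # orthonormal basis of col(W), shape (V, d)
    g = rng.standard_normal(V)        # an isotropic logit-space gradient
    # Only the projection of g onto col(W) can reach the backbone via W^T g.
    retained = np.linalg.norm(Q.T @ g) ** 2 / np.linalg.norm(g) ** 2
    print(f"d/V = {d/V:.4f} -> surviving fraction of squared grad norm ≈ {retained:.4f}")
```

With d/V in the 0.01-0.05 range typical of small hidden dims and large vocabularies, 95-99% of the squared gradient norm never reaches the backbone.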
We tweak the LM head to control this bottleneck without changing the width of a 2B backbone: d=4096 converges much faster than d=32, with substantial gaps on benchmarks (+8 points on average) and clear margins even when d>1024.
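The paper describes the actual head modification; purely as a generic sketch (the wiring and names here are my assumption, not the paper's design), one way to give the LM head its own dimension d, decoupled from the backbone width, is a small adapter before the vocabulary projection:

```python
import numpy as np

def lm_head(h, W_in, W_out):
    """Decoupled LM head: backbone width -> head dim d -> vocab size V.

    h:     (batch, width)  backbone hidden states
    W_in:  (width, d)      adapter that sets the head dimension d
    W_out: (d, V)          vocabulary projection
    """
    return (h @ W_in) @ W_out  # logits, shape (batch, V)

rng = np.random.default_rng(0)
width, V = 2048, 32_000
h = rng.standard_normal((4, width))

for d in (32, 4096):  # the two regimes compared in the post
    logits = lm_head(h,
                     rng.standard_normal((width, d)) / np.sqrt(width),
                     rng.standard_normal((d, V)) / np.sqrt(d))
    assert logits.shape == (4, V)
```

The point of such an adapter is that d can be swept independently of the backbone width, which stays fixed at 2048 here.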
But in the backward pass, the story is much worse. Gradients get compressed via projection onto a D-dimensional subspace, and most of the training signal simply vanishes.
LMs project hidden states (dim D) → logits (dim V), where D ≪ V. In the forward pass, this limits which distributions the model can express, but the softmax mitigates it, since it only cares about logit gaps.
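The "only logit gaps matter" point is easy to check: the softmax is invariant to adding a constant to every logit, so the forward pass is insensitive to one whole direction of logit space.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability; harmless by the same invariance
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
shifted = logits + 123.4  # same gaps, very different absolute values

assert np.allclose(softmax(logits), softmax(shifted))
```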
🧵New paper: "Lost in Backpropagation: The LM Head is a Gradient Bottleneck"
The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇
Estimated MMLU contamination levels in the training data mixes of OLMo-1 and OLMo-2. Overall, 24% of MMLU questions can be found verbatim in OLMo-2's training set, vs. 1% for OLMo-1.
🧵 Many hidden gems about LLM benchmark contamination in the GAPERON paper!
This French-English model paper has some honest findings about how contamination affects benchmarks (and why no one wants to truly decontaminate their training data)
Thread 👇
Read Nathan's thread (bsky.app/profile/nthn...) for more details, and the paper for an even better picture: arxiv.org/abs/2510.25771.
Congratulations to @nthngdy.bsky.social, @wissamantoun.bsky.social and Rian Touchent (who worked under the supervision of @zehavoc.bsky.social, @bensagot.bsky.social, Éric de La Clergerie and me) on the training of these generative models for French, English and code.
I'm proud to share that at @inriaparisnlp.bsky.social we have released Gaperon — a suite of generative language models trained on French, English and code data, the largest of which has 24 billion parameters. Both the models and the code are being published under open licences. Short thread🧵
Summary of the GAPERON-8B training run. Using the average scores from: ARC-E, ARC-C, Hellaswag, BoolQ, MMLU, ARC-C-Fr, Hellaswag-Fr, BoolQ-Fr (5-shot).
We are proud to announce that we trained 1.5B, 8B, and 24B generative language models from scratch on 2 to 4 trillion tokens of carefully curated, high-quality data covering French, English and code. We release our models and code under open-source licences. Thread👇
We are very grateful to @gencifrance.bsky.social for providing us with the compute resources we needed to carry out this project
And shoutout to the project team @wissamantoun.bsky.social Rian Touchent Eric de la Clergerie @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social
Our pretraining codebase, Gapetron, is available on GitHub and is barely 1,500 lines of code, with most of the bells and whistles (FSDP, TP, FA3, extensive checkpoint/dataset management, data streaming...)
github.com/NathanGodey...
We released our model weights (including variants) on @hf.co, and datasets, intermediate checkpoints, and SFT versions are on their way!
Check out the Gaperon collection on 🤗 : huggingface.co/collections...
In our paper, we also discuss pretraining details extensively, provide an extensive bug report (check out our mystery bug 🕵️) and many more ideas we tried, from pure-precision training to contrastive LM pretraining at scale.
Paper link: arxiv.org/abs/2510.25771
In other words, mid-training intensively on benchmarks yields strong models on both seen and unseen test sets 🤯
The downside is that the more intensively we trained on test sets, the more generation quality seemed to deteriorate (although it remained reasonable):
Not only did our Garlic model not fully memorize, but it also generalized better to unseen benchmarks!
On 4 unseen benchmarks, performance never dropped significantly for the Garlic variants, and it actually increased drastically in 2 out of 4 cases
This gave us strong benchmark performance, but surprisingly not much stronger than some of the closed models
In the Garlic training curves below, you see that increasing the ratio of test samples over normal data does not get you much further than SOTA closed models:
We figured: what if we take it to the next level and allow ourselves full contamination?
So we built a dataset (Penicillin-Plus 🦠) compiling the test sets of many mainstream benchmarks in a text format, and we included it in the mid-training mix for our Gaperon-Garlic variant
@riantouchent also analyzed how model-based neural filtering, as used in DCLM, can implicitly boost the share of leaked samples in training data
It turns out that the DCLM classifier is the one that most systematically labels these samples as high-quality data
... which results in many (closed and open) models showing a similar performance bias towards likely leaked samples
We split MMLU in two parts (leaked/clean) and show that almost all models tend to perform better on leaked samples
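A toy version of such a split (the helper and its normalization are hypothetical; real contamination checks, like the Infini-gram lookups used in this work, are far more careful about matching): flag a question as leaked if its exact text occurs in a pretraining document.

```python
def split_leaked(questions, corpus_docs):
    """Toy exact-match contamination split: whitespace-normalized substring test."""
    norm = lambda s: " ".join(s.split()).lower()
    docs = [norm(d) for d in corpus_docs]
    leaked, clean = [], []
    for q in questions:
        (leaked if any(norm(q) in d for d in docs) else clean).append(q)
    return leaked, clean

# A quiz page like the ones described below would leak the first question.
corpus = ["Quiz time! What is the capital of France? A) Paris B) Rome ..."]
qs = ["What is the capital of France?", "Who proved Fermat's Last Theorem?"]
leaked, clean = split_leaked(qs, corpus)
assert leaked == ["What is the capital of France?"] and len(clean) == 1
```

Comparing benchmark accuracy separately on the two resulting buckets is what exposes the per-model bias towards leaked samples.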
This contamination is not intentional: we identified websites that reframed splits of MMLU as user-friendly quizzes
These websites can then be found in CommonCrawl dumps that are generally used for pretraining data curation...
We used the great Infini-gram from Jiacheng Liu and found numerous hints of test-set leakage in DCLM, which is used in OLMo-2
For instance, the fraction of MMLU questions leaked into pretraining data went from ~1% to 24% between OLMo-1 and OLMo-2 😬
But the benchmark scores were disappointing, even after mid-training on instruct-like data (in the style of OLMo-2)
So if training datasets like DCLM or FineWeb-Edu do not give a strong edge in generation capabilities (even on the arXiv domain), what is their secret?