
Posts by Damien Teney

Reviewer #2 striking again?

2 months ago
Procedural Pretraining: Warming Up Language Models with Abstract Data Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a ...

This all looks very promising, and there's a lot more to explore! Paper and code ⬇️
Procedural Pretraining: Warming Up Language Models with Abstract Data
www.arxiv.org/abs/2601.21725
github.com/zlshinnick/p...

2 months ago

🧩 Multiple types of procedural data can be combined.
We get further gains by mixing either
• multiple types of data, or
• the weights of models individually warmed up on different types of data.
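A hypothetical sketch of the weight-mixing variant (parameter names and values invented here; real models would average tensors in an ML framework, but plain lists keep the idea self-contained):

```python
def average_state_dicts(state_dicts, weights=None):
    """Average several model state dicts parameter-by-parameter.

    Each state dict maps parameter names to lists of floats (a stand-in
    for tensors). All dicts must share the same keys and shapes.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged

# Two hypothetical models, each warmed up on a different procedural source.
model_a = {"mlp.weight": [1.0, 2.0], "attn.weight": [0.0, 4.0]}
model_b = {"mlp.weight": [3.0, 0.0], "attn.weight": [2.0, 0.0]}
merged = average_state_dicts([model_a, model_b])
```

The uniform average is the simplest choice; non-uniform `weights` would let one procedural source dominate.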

2 months ago

⚙️ MLPs vs. attention: where is the information located?
We try resetting selected warmed-up weights to random values before standard pretraining. Surprisingly, we obtain further gains, but they're domain-specific:
• warmed-up MLPs benefit natural language
• warmed-up attention helps code/math
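A rough illustration of this resetting ablation, with invented parameter names and a toy uniform re-init standing in for the model's actual initializer:

```python
import random

def reset_selected(state_dict, keep="mlp", seed=0):
    """Return a copy where only parameters whose name contains `keep`
    survive the warm-up; everything else is reset to fresh random
    values (here, a toy U(-0.1, 0.1) initializer)."""
    rng = random.Random(seed)
    out = {}
    for name, values in state_dict.items():
        if keep in name:
            out[name] = list(values)                       # keep warmed-up weights
        else:
            out[name] = [rng.uniform(-0.1, 0.1) for _ in values]  # re-randomize
    return out

# Hypothetical warmed-up checkpoint.
warmed = {"mlp.fc1": [0.5, -0.2], "attn.qkv": [0.3, 0.1]}
mlp_only = reset_selected(warmed, keep="mlp")   # attention re-randomized
```

Keeping only `mlp.*` (resp. `attn.*`) weights mirrors the "warmed-up MLPs vs. warmed-up attention" comparison above.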

2 months ago

📈 Benefits on subsequent standard pretraining.
By front-loading as little as 0.1% procedural data, models achieve significantly better pretraining performance on language, code, and math. They need up to 45% less semantic data to reach baseline perplexity.

2 months ago

🔍 Different procedural data = different benefits.
We first determine the effect of different types of procedural data with algorithmic diagnostic tasks. The benefits range from long-context recall to arithmetic, depending on the type of procedural data.
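For intuition, a diagnostic of the long-context-recall flavor might look like the following (an entirely hypothetical format; the paper's actual diagnostic tasks may differ): many key-value pairs, then a query whose answer sits far back in the context:

```python
import random

def recall_task(n_pairs, seed=0):
    """Build a toy long-context recall prompt: many key-value pairs,
    then a query about the *first* pair (the hardest to retrieve)."""
    rng = random.Random(seed)
    pairs = [(f"k{i}", str(rng.randint(0, 9))) for i in range(n_pairs)]
    context = " ".join(f"{k}={v}" for k, v in pairs)
    query, answer = pairs[0]          # probe the earliest binding
    return f"{context} {query}=?", answer

prompt, answer = recall_task(5)
```

A model's accuracy on such prompts, as `n_pairs` grows, probes how well the warm-up instilled retrieval over long distances.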

2 months ago

💡 Humans learn better when starting with simple structure and logic rather than memorizing a massive set of facts. By analogy, we use abstract, structured data to build a scaffold in language models, free of semantic biases.

2 months ago

🔥 What if web text isn't the best place to start training LLMs? Our latest work shows that warming up models on procedural data (e.g. from formal languages & simple algorithms) speeds up subsequent pretraining on language, code, and math, on models up to 1.3B parameters ⬇️🧵

2 months ago

Indeed the effect in *late* layers was very surprising!
My optimistic interpretation is that the procedural pretraining creates circuits for computations general enough to serve as a useful scaffold for visual tasks. This would explain why they help and why they don't wash out with more training.

4 months ago

Sounds 😋 What's the objective function? Simplicity/low cost/?

4 months ago
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in ...

In summary, a lightweight generic warm-up improves accuracy and data efficiency, with effects distinct from ImageNet pretraining.
Lots of exciting open questions! 🔍
- Other types of procedural data?
- Other downstream tasks?
- Closed-form instantiation?
arxiv.org/abs/2511.13945

4 months ago

🔍 Where is this knowledge stored?
Ablations show that the knowledge is mostly located in *late* layers: the opposite of normal visual pretraining, which shapes early layers. Procedural data seems to provide a qualitatively unique training signal!

4 months ago

🧠 What kind of data works?
Formal languages with hierarchical structure seem best. If we shuffle the training tokens (eliminating nested structures), the gains disappear, showing that the benefits are not due to surface-level frequencies.
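To make the shuffling control concrete, here is a sketch (assumed details: two bracket types, uniform open/close choice) that generates a nested Dyck-style sequence and its token-shuffled counterpart, which keeps token frequencies but destroys the hierarchy:

```python
import random

def dyck_sequence(n_pairs, vocab=("()", "[]"), seed=0):
    """Generate a balanced, nested bracket string (a Dyck-style word):
    a minimal formal language with hierarchical structure."""
    rng = random.Random(seed)
    seq, stack, opens_left = [], [], n_pairs
    while opens_left or stack:
        if opens_left and (not stack or rng.random() < 0.5):
            pair = rng.choice(vocab)
            seq.append(pair[0])      # open a bracket...
            stack.append(pair[1])    # ...remember its matching closer
            opens_left -= 1
        else:
            seq.append(stack.pop())  # close the most recent open bracket
    return "".join(seq)

nested = dyck_sequence(8)            # hierarchical training sequence
tokens = list(nested)
random.Random(1).shuffle(tokens)     # control: same tokens, no nesting
shuffled = "".join(tokens)
```

The shuffled control preserves surface-level token frequencies exactly, so any gain of `nested` over `shuffled` must come from structure rather than statistics of individual tokens.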

4 months ago

📉 Procedural data can replace real images
Allocating just 1% of the ImageNet pretraining budget to the procedural warmup lets the ViT match the baseline accuracy with 28% fewer images!

4 months ago

📈 A different optimization trajectory
Our warmed-up models don't just get a head start: they train differently. On ImageNet, they follow a distinct training trajectory and converge to a better accuracy.

4 months ago

🔥 Our procedural data has no semantic or visual meaning: it simply forces the model to discover generic structure in the data. As initialisation for standard image-based training, it
- boosts accuracy,
- improves data efficiency,
- complements ImageNet pretraining.

4 months ago

💡 Prior work has already shown that LLMs acquire useful knowledge when pretrained on formal languages. To test this on ViTs, we devise a procedural warm-up: pretraining for next-token prediction on symbolic sequences, bypassing the visual patch embedding.
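A toy sketch of that two-entry-point setup (everything here is invented for illustration; a mean-pool stands in for the real transformer encoder so the example runs without any ML framework): symbolic tokens enter through an embedding table during the warm-up, images through the usual patch projection, and both feed the same encoder.

```python
def encoder(x):
    """Stand-in for a shared transformer encoder: a toy mean-pool
    over the input vectors."""
    d = len(x[0])
    return [sum(v[i] for v in x) / len(x) for i in range(d)]

# Warm-up path: symbolic tokens -> embedding table -> encoder.
table = {"(": [1.0, 0.0], ")": [0.0, 1.0]}
warmup_out = encoder([table[t] for t in "()()"])

# Standard path: image patches -> linear patch projection -> encoder.
def project(patch, proj):
    return [sum(p * w for p, w in zip(patch, row)) for row in proj]

proj = [[0.5, 0.0, 0.5],   # toy 3->2 projection matrix
        [0.0, 1.0, 0.0]]
patches = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
pretrain_out = encoder([project(p, proj) for p in patches])
```

Because the warm-up path never touches `project`, the patch-embedding front-end stays untrained until standard image-based training begins.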

4 months ago

Can vision transformers learn without images? 🤔👀
Our latest work shows that pretraining ViTs on procedural symbolic data (e.g. sequences of balanced parentheses) makes subsequent standard training (e.g. on ImageNet) more data-efficient! How is this possible?! ⬇️🧵

4 months ago

Academic Strava? 🤓 It feels like an underrepresented group in my Strava feed!

8 months ago

It'd be nice to provide complete analyses (that you have precomputed) of existing papers, so we can see what kind of output the tool provides without having to submit our own work.

8 months ago

Dang, it just never ends 😱

8 months ago

In this setting, does the student (sometimes?) get better than the teacher? One hypothesis could be that the teacher, even if "less correct" than the GT, provides supervision that's easier to learn for another NN (the student). The optimization follows a less tortuous path & finds a better solution.

9 months ago

๐Ÿ‘ I had my very first paper published at DAGM. It was a while ago but I remember it as a very welcoming conference.

9 months ago
OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable? Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. Numerous algorithms exist to address specific settings, but choosing the right training algorith...

🎯 There's already a plethora of methods to handle distribution shifts: most gains may now come from using them better! Automatic selection looks promising, yet there's lots more to do. Interested? Come chat with us at ICML!
📄 arxiv.org/abs/2410.02735
💻 github.com/LiangzeJiang...

9 months ago