
Posts by Damien Teney

Reviewer #2 striking again?

2 months ago
Procedural Pretraining: Warming Up Language Models with Abstract Data Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a ...

This all looks very promising, and there's a lot more to explore! Paper and code ⬇️
Procedural Pretraining: Warming Up Language Models with Abstract Data
www.arxiv.org/abs/2601.21725
github.com/zlshinnick/p...

2 months ago

🧩 Multiple types of procedural data can be combined.
We get further gains by mixing either
• multiple types of data, or
• the weights of models individually warmed up on different types of data.
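A hypothetical sketch of the weight-mixing variant (parameter names and values invented here; real models would average tensors in an ML framework, but plain lists keep the idea self-contained):

```python
def average_state_dicts(state_dicts, weights=None):
    """Average several model state dicts parameter-by-parameter.

    Each state dict maps parameter names to lists of floats (a stand-in
    for tensors). All dicts must share the same keys and shapes.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged

# Two hypothetical models, each warmed up on a different procedural source.
model_a = {"mlp.weight": [1.0, 2.0], "attn.weight": [0.0, 4.0]}
model_b = {"mlp.weight": [3.0, 0.0], "attn.weight": [2.0, 0.0]}
merged = average_state_dicts([model_a, model_b])
```

The uniform average is the simplest choice; non-uniform `weights` would let one procedural source dominate.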

2 months ago

⚙️ MLPs vs. attention: where is the information located?
We try resetting selected warmed-up weights to random values before standard pretraining. Surprisingly, we obtain further gains, but they're domain-specific:
• warmed-up MLPs benefit natural language
• warmed-up attention helps code/math
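A rough illustration of this resetting ablation, with invented parameter names and a toy uniform re-init standing in for the model's actual initializer:

```python
import random

def reset_selected(state_dict, keep="mlp", seed=0):
    """Return a copy where only parameters whose name contains `keep`
    survive the warm-up; everything else is reset to fresh random
    values (here, a toy U(-0.1, 0.1) initializer)."""
    rng = random.Random(seed)
    out = {}
    for name, values in state_dict.items():
        if keep in name:
            out[name] = list(values)                       # keep warmed-up weights
        else:
            out[name] = [rng.uniform(-0.1, 0.1) for _ in values]  # re-randomize
    return out

# Hypothetical warmed-up checkpoint.
warmed = {"mlp.fc1": [0.5, -0.2], "attn.qkv": [0.3, 0.1]}
mlp_only = reset_selected(warmed, keep="mlp")   # attention re-randomized
```

Keeping only `mlp.*` (resp. `attn.*`) weights mirrors the "warmed-up MLPs vs. warmed-up attention" comparison above.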

2 months ago

📈 Benefits on subsequent standard pretraining.
By front-loading as little as 0.1% procedural data, models achieve significantly better pretraining performance on language, code, and math. They need up to 45% less semantic data to reach baseline perplexity.

2 months ago

🔍 Different procedural data = different benefits.
We first determine the effect of different types of procedural data with algorithmic diagnostic tasks. The benefits range from long-context recall to arithmetic, depending on the type of procedural data.
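For intuition, a diagnostic of the long-context-recall flavor might look like the following (an entirely hypothetical format; the paper's actual diagnostic tasks may differ): many key-value pairs, then a query whose answer sits far back in the context:

```python
import random

def recall_task(n_pairs, seed=0):
    """Build a toy long-context recall prompt: many key-value pairs,
    then a query about the *first* pair (the hardest to retrieve)."""
    rng = random.Random(seed)
    pairs = [(f"k{i}", str(rng.randint(0, 9))) for i in range(n_pairs)]
    context = " ".join(f"{k}={v}" for k, v in pairs)
    query, answer = pairs[0]          # probe the earliest binding
    return f"{context} {query}=?", answer

prompt, answer = recall_task(5)
```

A model's accuracy on such prompts, as `n_pairs` grows, probes how well the warm-up instilled retrieval over long distances.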

2 months ago

💡 Humans learn better when starting with simple structure and logic rather than memorizing a massive set of facts. By analogy, we use abstract, structured data to build a scaffold in language models, free of semantic biases.

2 months ago

🔥 What if web text isn't the best place to start training LLMs? Our latest work shows that warming up models on procedural data (e.g. from formal languages & simple algorithms) speeds up subsequent pretraining on language, code, and math, on models up to 1.3B parameters ⬇️🧵

2 months ago

Indeed the effect in *late* layers was very surprising!
My optimistic interpretation is that the procedural pretraining creates circuits for computations general enough to serve as a useful scaffold for visual tasks. This would explain why they help and why they don't wash out with more training.

4 months ago

Sounds 😋 What's the objective function? Simplicity/low cost/?

4 months ago
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in ...

In summary, a lightweight generic warm-up improves accuracy and data efficiency, with effects distinct from ImageNet pretraining.
Lots of exciting open questions! 🔍
- Other types of procedural data?
- Other downstream tasks?
- Closed-form instantiation?
arxiv.org/abs/2511.13945

4 months ago

🔍 Where is this knowledge stored?
Ablations show that the knowledge is mostly located in *late* layers: the opposite of normal visual pretraining, which shapes early layers. Procedural data seems to provide a qualitatively unique training signal!

4 months ago

🧠 What kind of data works?
Formal languages with hierarchical structure seem best. If we shuffle the training tokens (eliminating nested structures), the gains disappear, showing that the benefits are not due to surface-level frequencies.
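To make the shuffling control concrete, here is a sketch (assumed details: two bracket types, uniform open/close choice) that generates a nested Dyck-style sequence and its token-shuffled counterpart, which keeps token frequencies but destroys the hierarchy:

```python
import random

def dyck_sequence(n_pairs, vocab=("()", "[]"), seed=0):
    """Generate a balanced, nested bracket string (a Dyck-style word):
    a minimal formal language with hierarchical structure."""
    rng = random.Random(seed)
    seq, stack, opens_left = [], [], n_pairs
    while opens_left or stack:
        if opens_left and (not stack or rng.random() < 0.5):
            pair = rng.choice(vocab)
            seq.append(pair[0])      # open a bracket...
            stack.append(pair[1])    # ...remember its matching closer
            opens_left -= 1
        else:
            seq.append(stack.pop())  # close the most recent open bracket
    return "".join(seq)

nested = dyck_sequence(8)            # hierarchical training sequence
tokens = list(nested)
random.Random(1).shuffle(tokens)     # control: same tokens, no nesting
shuffled = "".join(tokens)
```

The shuffled control preserves surface-level token frequencies exactly, so any gain of `nested` over `shuffled` must come from structure rather than statistics of individual tokens.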

4 months ago

📉 Procedural data can replace real images
Allocating just 1% of the ImageNet pretraining budget to the procedural warmup lets the ViT match the baseline accuracy with 28% fewer images!

4 months ago

📈 A different optimization trajectory
Our warmed-up models don't just get a head start: they train differently. On ImageNet, they follow a distinct training trajectory and converge to a better accuracy.

4 months ago

🔥 Our procedural data has no semantic or visual meaning: it simply forces the model to discover generic structure in the data. As initialisation for standard image-based training, it
- boosts accuracy,
- improves data efficiency,
- complements ImageNet pretraining.

4 months ago

💡 Prior work has already shown that LLMs acquire useful knowledge when pretrained on formal languages. To test this on ViTs, we devise a procedural warm-up: pretraining for next-token prediction on symbolic sequences, bypassing the visual patch embedding.
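A toy sketch of that two-entry-point setup (everything here is invented for illustration; a mean-pool stands in for the real transformer encoder so the example runs without any ML framework): symbolic tokens enter through an embedding table during the warm-up, images through the usual patch projection, and both feed the same encoder.

```python
def encoder(x):
    """Stand-in for a shared transformer encoder: a toy mean-pool
    over the input vectors."""
    d = len(x[0])
    return [sum(v[i] for v in x) / len(x) for i in range(d)]

# Warm-up path: symbolic tokens -> embedding table -> encoder.
table = {"(": [1.0, 0.0], ")": [0.0, 1.0]}
warmup_out = encoder([table[t] for t in "()()"])

# Standard path: image patches -> linear patch projection -> encoder.
def project(patch, proj):
    return [sum(p * w for p, w in zip(patch, row)) for row in proj]

proj = [[0.5, 0.0, 0.5],   # toy 3->2 projection matrix
        [0.0, 1.0, 0.0]]
patches = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
pretrain_out = encoder([project(p, proj) for p in patches])
```

Because the warm-up path never touches `project`, the patch-embedding front-end stays untrained until standard image-based training begins.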

4 months ago

Can vision transformers learn without images? 🤔👀
Our latest work shows that pretraining ViTs on procedural symbolic data (e.g. sequences of balanced parentheses) makes subsequent standard training (e.g. on ImageNet) more data-efficient! How is this possible?! ⬇️🧵

4 months ago

Academic Strava? 🤓 It feels like an underrepresented group in my Strava feed!

8 months ago

It'd be nice to provide complete analyses (that you have precomputed) of existing papers, so we can see what kind of output the tool provides without having to submit our own work.

8 months ago

Dang, it just never ends 😱

8 months ago

In this setting, does the student (sometimes?) get better than the teacher? One hypothesis could be that the teacher, even if "less correct" than the GT, provides supervision that's easier to learn for another NN (the student). The optimization follows a less tortuous path & finds a better solution.

9 months ago

๐Ÿ‘ I had my very first paper published at DAGM. It was a while ago but I remember it as a very welcoming conference.

9 months ago
OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable? Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. Numerous algorithms exist to address specific settings, but choosing the right training algorith...

🎯 There's already a plethora of methods to handle distribution shifts: most gains may now come from using them better! Automatic selection looks promising, yet there's lots more to do. Interested? Come chat with us at ICML!
📄 arxiv.org/abs/2410.02735
💻 github.com/LiangzeJiang...

9 months ago