
Posts by Cosmin Stamate

Original: x.com/rohanpaul_ai/status/1948572304809611701

8 months ago

... Paper – arxiv.org/abs/2507.16003

Paper Title: "Learning without training: The implicit dynamics of in-context learning"


... Results cover only the first generated token and a single transformer block without the MLP skip connection, so full‑stack models need more work.

Still, the finding hints that many in‑context tricks come from weight geometry rather than quirky attention rules.

--- ...


... 🤝 Finetune vs. Implicit Patch

They compare classic gradient finetuning on the same examples to the single‑shot patch strategy.

Both methods cut test loss in a similar pattern, yet the patch avoids any real back‑prop and keeps the rest of the network frozen.

---

🔎 Limits They Admit ...


... 🔬 Testing on Simple Linear Tasks

They train a small transformer to map x→w·x using 50 prompt pairs plus 1 query.

When they swap the prompt for its equivalent rank 1 patch and feed only the query, the loss curve overlaps the full‑prompt run almost perfectly.

That overlap
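On such a linear task, the mapping the model must infer in context can be checked directly with least squares, which is a minimal sketch of the setup rather than the paper's exact configuration (the dimension, noise-free data, and variable names here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                        # input dimension (an assumption, not the paper's exact setup)
w_true = rng.normal(size=d)  # the hidden linear map w for one task instance

# 50 in-context pairs (x_i, w·x_i) plus one held-out query, as in the experiment
X = rng.normal(size=(50, d))
y = X @ w_true
x_query = rng.normal(size=d)

# Least squares on the prompt pairs recovers w exactly (noise-free, 50 > d).
# This is the mapping the transformer has to infer from the prompt alone.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_hat, w_true)
assert np.isclose(w_hat @ x_query, w_true @ x_query)
```

Because the 50 pairs pin down w exactly, any mechanism that absorbs them, whether the full prompt or a weight patch, should land on the same query prediction, which is what the overlapping loss curves show.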

--- ...


... 📐 Hidden Gradient Descent

Feeding tokens one by one stacks these tiny patches.

Proposition 3.1 proves each added token shifts the weights the same way online gradient descent would, with a step size tied to the query vector length.

The shift shrinks as soon as a token stops

--- ...


... 🧩 How the Patch Works

Theorem 2.2 shows a formula: multiply the base weights by the context change vector, then project it with the query representation, boom, you get the patch.

Because the patch is rank 1, it stores almost no extra parameters yet still carries the full
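Read that way, the construction is easy to check numerically. The sketch below assumes the common reading dW = (W·delta)·aᵀ / ‖a‖², where `a` is the query representation and `delta` the context-induced change in the attention output; the paper's exact symbols may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d)) / np.sqrt(d)  # frozen base weights (toy stand-in)
a = rng.normal(size=d)                    # query representation without the context (assumed name)
delta = rng.normal(size=d) * 0.1          # change the context causes in the attention output

# The patch: base weights times the context change vector, projected onto
# the normalized query direction: dW = (W delta) a^T / ||a||^2.
dW = np.outer(W @ delta, a) / (a @ a)

# Applied to the bare query, the patched weights reproduce the
# context-conditioned activation W (a + delta):
assert np.allclose((W + dW) @ a, W @ (a + delta))

# The patch is rank 1, so it stores ~2d numbers instead of d^2.
assert np.linalg.matrix_rank(dW) == 1
```

The identity holds exactly because (aᵀa)/‖a‖² = 1, so projecting back along the query direction recovers the full context contribution.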

--- ...


... 🛠️ Temporary rank 1 patch

A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”.

It multiplies that difference by the frozen weight matrix, then

--- ...

Image 1 from X post

Image 2 from X post

⚙️ The Core Idea

They call any layer that can read a separate context plus a query a “contextual layer”.

Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.

For that block, the context acts exactly like a rank 1 additive patch on the

--- ...


Original: x.com/hardmaru/status/1947998113450631350

Image 1 from X post

ICML’s Statement about subversive hidden LLM prompts

We live in a weird timeline…


Original: x.com/mihirp98/status/1947736993229885545


... In collaboration with Amir Zadeh, Katerina Fragkiadaki (@KaterinaFragiad) and Deepak Pathak (@pathak2206) at @mldcmu


... Project webpage & code - diffusion-scaling.github.io

Arxiv - arxiv.org/abs/2507.15857

This project was co-led with Mengning Wu (@WuMengning54261). ...


... 🚨#8: A natural question here is—why does diffusion outperform AR when data is limited?

We hypothesize that the key advantage stems from the use of random masking in diffusion models, which serves as a form of data augmentation. Unlike AR models, which are trained on a single,
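That augmentation effect is easy to visualize: under random masking, each pass over the same sequence yields a different prediction problem, while the AR factorization of a sequence never changes. A toy sketch (mask rate, tokens, and function names are made up for illustration):

```python
import random

def ar_view(tokens):
    # AR training always predicts each token from the same left-to-right prefix.
    return [(tuple(tokens[:i]), tokens[i]) for i in range(len(tokens))]

def diffusion_view(tokens, mask_rate, rng):
    # Masked-diffusion training samples a fresh random mask each time,
    # predicting the hidden tokens from the visible ones.
    masked = [t if rng.random() > mask_rate else "<MASK>" for t in tokens]
    targets = tuple((i, t) for i, (t, m) in enumerate(zip(tokens, masked)) if m == "<MASK>")
    return tuple(masked), targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

# The AR view of a repeated sequence is identical every epoch...
assert ar_view(tokens) == ar_view(tokens)

# ...while repeated epochs of masking give distinct views of the same data,
# acting like data augmentation.
views = {diffusion_view(tokens, 0.5, rng) for _ in range(10)}
assert len(views) > 1
```

Each repeated epoch thus shows the diffusion model a genuinely new conditional prediction task over the same underlying tokens, which is one plausible reading of why repetition hurts it far less than it hurts AR models.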

---

🚨#9: ...


... 🚨Finding #7: The data efficiency of diffusion models translates to better downstream performance.

Lastly we evaluate the best-performing diffusion and AR models (trained under the same data budget) on a range of language understanding tasks.

Across most benchmarks, diffusion

--- ...


... 🚨 Finding #6: The compute required for diffusion to outperform AR follows a predictable power law.

Above we defined the critical compute threshold as the amount of FLOPs where diffusion matches AR performance for a given unique dataset size.

We find that we can derive a simple

--- ...


... ---

🚨 Finding #5: Muennighoff et al. showed that repeating the dataset for up to 4 epochs is nearly as effective as using fresh data for autoregressive models.

In contrast, we find that diffusion models can be trained on repeated data for up to 100 epochs, while having repeated data

--- ...


... 🚨 Finding #4: Diffusion models exhibit a much higher half-life of data reuse (R_D*), i.e., the number of epochs after which returns from repeating data begin to significantly diminish.

We adopt the data-constrained scaling framework introduced by @Muennighoff et al. in their ...


... 🚨Finding #3: Diffusion models are significantly more robust to data repetition than autoregressive (AR) models.

We show training curves of models trained with the same total compute, but different trade-offs between unique data and number of epochs.

An “epoch” here means

--- ...


... 🚨 Finding #2: Autoregressive models begin to overfit much more quickly, while diffusion shows no signs of overfitting even after 10x the number of epochs.
In the above figure, we showed that increasing compute eventually favors diffusion. But compute can be scaled in two ways:

(i)

--- ...


... 🚨 Finding #1: Diffusion models outperform autoregressive models when trained with sufficient compute (i.e., more epochs & parameters).

Across different unique data scales, we observe:

1️⃣ At low compute, Autoregressive models win.
2️⃣ After a certain amount of compute,

--- ...

Image 1 from X post

Image 2 from X post

Image 3 from X post

Image 4 from X post

🚨 The era of infinite internet data is ending. So we ask:

👉 What’s the right generative modelling objective when data—not compute—is the bottleneck?

TL;DR:

▶️Compute-constrained? Train Autoregressive models

▶️Data-constrained? Train Diffusion models

Get ready for 🤿 1/n

--- ...
