
Posts by Karim Farid

A bridge builder doesn’t need to denounce every evil in the world to be moral, but they better say something about the guy who keeps building bridges that topple over

3 months ago 278 29 3 7
Preview
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these model...

Well this is exciting: arxiv.org/abs/2512.20605

3 months ago 54 7 1 0

Huge thanks to my coauthors @thomasbrox.bsky.social @rajatsahay.bsky.social @simonschrodi.bsky.social @yumnaali.bsky.social, Cordelia Schmid and Volker Fischer without whom this work wouldn’t have been realized. 🙏

4 months ago 0 0 0 0
Preview
What Drives Compositional Generalization in Visual Generative Models? Continuous objectives + full conditioning drive robust compositionality; discrete categorical losses hinder it. JEPA-style auxiliary loss improves MaskGIT.

If we want generative models that reason over combinations of concepts – not just produce aesthetically pleasing media – the choice of objective and conditioning matters.

Project page/paper (figures and details): lmb-freiburg.github.io/gen-comp-gen...
Paper: arxiv.org/abs/2510.03075

4 months ago 0 0 1 0
Post image

We validate these trends across Shapes2D/3D, CelebA, and world models (CLEVRER, CoVLA)—and see the same pattern:

continuous objectives + informative conditioning ⇒ robust compositional generalization.

(We even see early signs in language.)

4 months ago 0 0 1 0
Post image

What happens?

Compositional performance improves markedly.

Internal representations become more disentangled: reduced polysemanticity and less neuron overlap between concepts.

In other words, a continuous JEPA objective can inject compositional structure into a discrete model.
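The thread doesn't say how neuron overlap is quantified; one common proxy is the Jaccard overlap between each concept's top-k most active neurons (a hypothetical sketch, not the paper's metric — function and parameter names are assumed):

```python
import numpy as np

def neuron_overlap(acts_a, acts_b, k=10):
    """Jaccard overlap of the top-k most active neurons for two concepts.

    acts_a, acts_b: (n_samples, n_neurons) activations collected on inputs
    containing concept A and concept B respectively. Lower overlap is one
    rough proxy for less polysemanticity / more disentangled neurons.
    """
    top_a = set(np.argsort(acts_a.mean(axis=0))[-k:])  # k most active for A
    top_b = set(np.argsort(acts_b.mean(axis=0))[-k:])  # k most active for B
    return len(top_a & top_b) / len(top_a | top_b)
```

With fully disjoint concept neurons this returns 0.0; with identical activations it returns 1.0.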

4 months ago 1 0 1 0
Post image

Discrete objectives (e.g., MaskGIT) are still attractive—fast, and ubiquitous in LLM-style training.

Based on these findings, can we keep output discreteness and still get compositionality?

We add a JEPA-like continuous auxiliary loss to MaskGIT, supervising intermediate representations in continuous space.
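The thread doesn't show the implementation, but the idea can be sketched as a combined objective: the standard discrete cross-entropy plus a continuous regression term on intermediate features. All names below are illustrative, and the 0.5 weight is an arbitrary assumption:

```python
import numpy as np

def combined_loss(logits, target_tokens, hidden, target_repr, aux_weight=0.5):
    """Discrete masked-token loss plus a JEPA-style continuous auxiliary term.

    logits:        (N, V) predicted token scores at masked positions
    target_tokens: (N,)   ground-truth discrete token ids
    hidden:        (N, D) intermediate model representations
    target_repr:   (N, D) continuous targets from a separate encoder
    """
    # Standard discrete objective: cross-entropy over the token codebook.
    z = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    ce = -logp[np.arange(len(target_tokens)), target_tokens].mean()
    # Continuous auxiliary objective: regress intermediate features
    # onto continuous-space targets (treated as constants here).
    aux = ((hidden - target_repr) ** 2).mean()
    return ce + aux_weight * aux
```

In a real training loop the continuous targets would presumably come from something like a frozen or EMA-updated encoder, with no gradient flowing through them.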

4 months ago 0 0 1 0
Post image

Conditioning is critical.

If it’s quantized or incomplete (factors missing in training), compositionality becomes fragile or fails even if all factors are given at inference.

Access to the true generative factors is essential for continuous models to generalize compositionally.
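As a toy illustration of why quantized conditioning loses information, here is a coarse binning of a continuous factor (purely illustrative; not the paper's conditioning scheme):

```python
import numpy as np

def quantize_condition(cond, n_bins=4, lo=0.0, hi=1.0):
    """Coarsely bin a continuous conditioning vector into bin centers.

    Quantization discards within-bin detail: distinct factor values that
    fall in the same bin become indistinguishable to the generator, one
    way conditioning can end up effectively incomplete.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(cond, edges) - 1, 0, n_bins - 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[idx]
```

For example, with four bins on [0, 1], the values 0.05 and 0.20 both collapse to the same bin center 0.125.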

4 months ago 0 0 1 0
Video
4 months ago 0 0 1 0
Video
4 months ago 0 0 1 0
Post image

The bottleneck is fundamental and surprisingly common in recent models:

Is the objective operating in continuous or discrete space?

Across controlled comparisons, continuous-valued outputs unlock compositionality, while discrete/categorical objectives consistently lag behind.

4 months ago 0 0 1 0
Post image

The tokenizer isn’t the main story.

DiT reaches similar compositionality with either a VAE or a VQ-VAE. The learning curve differs (gradual vs abrupt), but both get there.

Tokenizers mostly affect efficiency + stability, not whether compositionality is possible.

4 months ago 0 0 1 0

Setup: train on only a subset of factor combinations (e.g., gender×hair×smile), holding out some compositions.

Then we generate + probe:
🟦 Seen (blue)
🟪 Level-1: change 1 factor (pink)
🟥 Level-2: change 2 factors (hardest/most novel) (red)
Shapes2D probes👇 (shape×color×size)
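One way to implement such a split is to bucket every factor combination by its minimum Hamming distance to the training set: 0 = seen, 1 = Level-1, 2 = Level-2. A sketch with assumed factor names, not the paper's code:

```python
from itertools import product

# Hypothetical Shapes2D-style factor spaces (names are illustrative).
factors = {"shape": ["circle", "square", "triangle"],
           "color": ["red", "green", "blue"],
           "size":  ["small", "large"]}

all_combos = list(product(*factors.values()))
seen = set(all_combos[: len(all_combos) // 2])  # train on a subset only

def novelty_level(combo, seen):
    """Minimum number of factors that must change to reach a seen combo."""
    return min(sum(a != b for a, b in zip(combo, s)) for s in seen)

# Bucket: 0 = seen, 1 = Level-1 (one factor changed), 2 = Level-2 or more.
buckets = {0: [], 1: [], 2: []}
for c in all_combos:
    buckets[min(novelty_level(c, seen), 2)].append(c)
```

Generated samples for each bucket can then be probed with a classifier to score compositional generalization at each novelty level.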

4 months ago 0 0 1 0

We study 3 axes that span most modern generative models, without confounders:

1️⃣ Tokenizer (VAE vs VQ)

2️⃣ Modelling & objective (diffusion vs masked autoregressive, continuous vs discrete)

3️⃣ Conditioning

Interventions: given our findings, can we fix non-compositional models?

4 months ago 0 0 1 0
Video
4 months ago 0 0 1 0

In our new work, we ask a simple question:

Which design choices actually enable (or prevent) compositional generalization?

We study this in a controlled setting across visual modalities—cutting down the search space for anyone training or using these models.

4 months ago 0 0 1 0

Generalization is the goal. A core piece is compositional generalization: recombining known concepts into new combinations.

It’s central to human intelligence, but we still don’t know what drives or hinders it in generative models, and today’s design choices are not guided by it.

4 months ago 0 0 1 0

Seriously, what is the goal of today’s visual generative models?

Are pretty videos/images and low FIDs enough – or should we also demand something closer to human-like creativity? Our paper tries to answer this question 🧵

4 months ago 1 0 1 0

There are similarities between JEPAs and PFNs. In JEPAs, synthetic data is generated through learning. Notably, random weights can already perform well on downstream tasks, suggesting that the learning process induces useful operations on which you can do predictive coding.

6 months ago 2 0 0 0

Idk, but maybe not necessarily: we observe discrete tokens, but the language states themselves can live in a continuous world.

6 months ago 0 0 1 0

Generative models that assume the underlying distribution is continuous, for example, flow matching and common diffusion models.

6 months ago 0 0 1 0

I really hope someone can revive continuous models for language. They’ve decisively taken over the visual domain, but getting them to work in language still feels like pure alchemy.

6 months ago 4 0 1 0
Preview
Using Knowledge Graphs to harvest datasets for efficient CLIP model training Training high-quality CLIP models typically requires enormous datasets, which limits the development of domain-specific models -- especially in areas that even the largest CLIP models do not cover wel...

Excited to release our models and preprint: "Using Knowledge Graphs to harvest datasets for efficient CLIP model training"

We propose a dataset collection method using knowledge graphs and web image search, and create EntityNet-33M: a dataset of 33M images paired with 46M texts.

11 months ago 1 2 2 0

Over the past year, my lab has been working on fleshing out theory + applications of the Platonic Representation Hypothesis.

Today I want to share two new works on this topic:

Eliciting higher alignment: arxiv.org/abs/2510.02425
Unpaired learning of unified reps: arxiv.org/abs/2510.08492

1/9

6 months ago 133 34 1 5

Orbis shows that the objective matters.
Continuous modeling yields more stable and generalizable world models, yet true probabilistic coverage remains a challenge.

Immensely grateful to my co-authors @arianmousakhan.bsky.social, Sudhanshu Mittal, and Silvio Galesso, and to @thomasbrox.bsky.social

6 months ago 1 0 0 0
Post image

Under the hood 🧠

Orbis uses a hybrid tokenizer with semantic + detail tokens that work in both continuous and discrete spaces.
The world model then predicts the next frame by gradually denoising or unmasking it, using past frames as context.

6 months ago 1 0 1 0
Video

Realistic and Diverse Rollouts 4/4

6 months ago 1 0 1 0
Video

Realistic and Diverse Rollouts 3/4

6 months ago 1 0 1 0
Video

Realistic and Diverse Rollouts 2/4

6 months ago 1 0 1 0
Video

Realistic and Diverse Rollouts 1/4

6 months ago 1 0 1 0