Compositionality is a central desideratum for intelligent systems...but it's a fuzzy concept that is difficult to quantify. In this blog post, lab member @ericelmoznino.bsky.social outlines ideas toward formalizing it & surveys recent work. A must-read for researchers interested in AI and neuroscience.
Posts by Samuel Lavoie
This work wouldn’t exist without my amazing co-authors:
@mnoukhov.bsky.social & @AaronCourville🙏
Code & Models are open source:
💾 github.com/lavoiems/Dis...
📜https://arxiv.org/pdf/2507.12318
Reproduce, remix, build your own DLC-powered models.
Example: There are no “teapots on mountains” in ImageNet.
We verify this via nearest-neighbor search in DINOv2 feature space.
But our model can still create them—by composing concepts it learned separately.
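The nearest-neighbor check can be sketched with cosine similarity over image features; a minimal toy version, where the feature dimension, dataset size, and the random features themselves are all hypothetical stand-ins for real DINOv2 embeddings:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 768, 1000  # hypothetical feature dimension / dataset size

dataset = rng.normal(size=(N, D))  # stand-in for DINOv2 features of ImageNet
query = rng.normal(size=D)         # stand-in feature of a generated sample

# Cosine-similarity nearest neighbor: a low best score suggests the
# generated concept has no close match in the training set.
sims = dataset @ query / (np.linalg.norm(dataset, axis=1) * np.linalg.norm(query))
nearest = int(sims.argmax())
```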
LLMs can speak in DLC!
We fine-tune a language model to sample DLC tokens from text, giving us a pipeline:
Text → DLC → Image
This also enables generation beyond ImageNet.
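The Text → DLC → Image chain is just function composition; a minimal sketch where both stages are hypothetical toy stand-ins, not the released models:

```python
# Stand-in for the fine-tuned language model that samples DLC tokens from
# text; here a deterministic toy mapping into a 4096-way codebook.
def text_to_dlc(prompt: str) -> list[int]:
    base = sum(map(ord, prompt))
    return [(base + i) % 4096 for i in range(32)]

# Stand-in for the diffusion decoder p(x | c).
def dlc_to_image(dlc: list[int]) -> list[list[float]]:
    return [[float(t % 256) / 255.0] * 3 for t in dlc]

image = dlc_to_image(text_to_dlc("a teapot on a mountain"))
```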
DLCs are compositional.
Swap tokens between two images (🐕 Komondor + 🍝 Carbonara) → the model produces coherent hybrids never seen during training.
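The token swap itself is a one-liner over the discrete sequences; a toy sketch with hypothetical DLC length and vocabulary size:

```python
import numpy as np

rng = np.random.default_rng(3)
L, V = 32, 4096  # hypothetical DLC length / vocabulary size

dlc_dog = rng.integers(0, V, size=L)    # stand-in for the Komondor's DLC
dlc_pasta = rng.integers(0, V, size=L)  # stand-in for the carbonara's DLC

# Swap the first half of the tokens; decoding the result yields a hybrid.
hybrid = dlc_dog.copy()
hybrid[: L // 2] = dlc_pasta[: L // 2]
```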
🚀 Results:
DiT-XL/2 + DLC → FID 1.59 on unconditional ImageNet
Works well with and without classifier-free guidance
Learns faster and performs better than prior work using pre-trained encoders
🤯
Unconditional generation pipeline:
Sample a DLC (e.g., with SEDD)
Decode it into an image (e.g., with DiT)
This ancestral sampling approach is simple but powerful.
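The two-stage ancestral sampler can be sketched with stand-in samplers; the token count, vocabulary, and both sampling functions below are hypothetical placeholders for SEDD and DiT:

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_TOKENS, VOCAB = 32, 4096  # hypothetical DLC shape

def sample_dlc():
    # Stand-in for a discrete prior such as SEDD: here, uniform tokens.
    return rng.integers(0, VOCAB, size=NUM_TOKENS)

def decode_image(dlc):
    # Stand-in for a conditional decoder such as DiT: any p(x | c).
    return rng.normal(size=(64, 64, 3))  # toy "image"

c = sample_dlc()     # step 1: c ~ p(c)
x = decode_image(c)  # step 2: x ~ p(x | c)
```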
DLCs enable exactly this.
Images → sequences of discrete tokens via a Simplicial Embedding (SEM) encoder
We take the argmax over token distributions → get the DLC sequence
Think of it as “tokenizing” images—like words for LLMs.
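The argmax step can be sketched in a few lines; the shapes here (number of tokens, codebook size) are hypothetical, and the random logits stand in for a real SEM encoder's output:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, vocab_size = 32, 4096  # hypothetical DLC length / codebook size

logits = rng.normal(size=(num_tokens, vocab_size))  # stand-in encoder output

def to_dlc(logits: np.ndarray) -> np.ndarray:
    """Map each group's token distribution to its most likely index."""
    return logits.argmax(axis=-1)

dlc = to_dlc(logits)  # one discrete token per group
```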
Text models don’t have this problem! LLMs can model internet-scale corpora.
So… can we improve generation of highly multimodal image distributions by decomposing it into:
1. Generating discrete tokens - p(c)
2. Decoding tokens into images - p(x|c)
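The decomposition above is just the mixture identity, sketched in LaTeX:

```latex
p(x) \;=\; \sum_{c} p(x \mid c)\, p(c)
```

Sampling $c \sim p(c)$ and then $x \sim p(x \mid c)$ leaves each conditional much simpler than the full marginal.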
Modeling highly multimodal distributions in continuous space is hard.
Even a simple 2D Gaussian mixture with a large number of modes may be tricky to model directly. Good conditioning solves this!
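The Gaussian-mixture intuition in one toy sketch (mode count, means, and noise scale are all made up for illustration): the unconditional distribution has K separated modes, but conditioned on the component, each p(x | c) is a single easy Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 16  # number of modes in a toy 2D Gaussian mixture

means = rng.uniform(-10, 10, size=(K, 2))

# Unconditional sampling must cover all K modes at once...
def sample_unconditional():
    c = rng.integers(K)
    return means[c] + 0.1 * rng.normal(size=2)

# ...but given the component c, p(x | c) is a single Gaussian.
def sample_conditional(c):
    return means[c] + 0.1 * rng.normal(size=2)

x = sample_conditional(3)
assert np.linalg.norm(x - means[3]) < 1.0  # sample stays near its mode
```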
Could this be why large image generative models are almost always conditional? 🤔
🧵 Everyone is chasing new diffusion models, but what about the representations they model?
We introduce Discrete Latent Codes (DLCs):
- Discrete representation for diffusion models
- Uncond. gen. SOTA FID (1.59 on ImageNet)
- Compositional generation
- Integrates with LLMs
🧱
The code and model weights for Llip are finally out! I hope you will find this model useful!
Paper: arxiv.org/abs/2405.00740
Code: github.com/facebookrese...
Models:
- ViT-G: huggingface.co/lavoies/llip...
- ViT-B: huggingface.co/lavoies/llip...
Congrats Lucas! Looking forward to seeing what will come out of your lab in Zurich!