Compositionality is a central desideratum for intelligent systems...but it's a fuzzy concept that is difficult to quantify. In this blog post, lab member @ericelmoznino.bsky.social outlines ideas toward formalizing it & surveys recent work. A must-read for researchers interested in AI and neuroscience.
Posts by Samuel Lavoie
This work wouldn’t exist without my amazing co-authors:
@mnoukhov.bsky.social & @AaronCourville🙏
Code & Models are open source:
💾 github.com/lavoiems/Dis...
📜https://arxiv.org/pdf/2507.12318
Reproduce, remix, build your own DLC-powered models.
Example: There are no “teapots on mountains” in ImageNet.
We verify this via nearest-neighbor search in DINOv2 feature space.
But our model can still create them—by composing concepts it learned separately.
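The nearest-neighbor check can be sketched with cosine similarity over image features; a minimal toy version, where the feature dimension, dataset size, and the random features themselves are all hypothetical stand-ins for real DINOv2 embeddings:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 768, 1000  # hypothetical feature dimension / dataset size

dataset = rng.normal(size=(N, D))  # stand-in for DINOv2 features of ImageNet
query = rng.normal(size=D)         # stand-in feature of a generated sample

# Cosine-similarity nearest neighbor: a low best score suggests the
# generated concept has no close match in the training set.
sims = dataset @ query / (np.linalg.norm(dataset, axis=1) * np.linalg.norm(query))
nearest = int(sims.argmax())
```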
LLMs can speak in DLC!
We fine-tune a language model to sample DLC tokens from text, giving us a pipeline:
Text → DLC → Image
This also enables generation beyond ImageNet.
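The Text → DLC → Image chain is just function composition; a minimal sketch where both stages are hypothetical toy stand-ins, not the released models:

```python
# Stand-in for the fine-tuned language model that samples DLC tokens from
# text; here a deterministic toy mapping into a 4096-way codebook.
def text_to_dlc(prompt: str) -> list[int]:
    base = sum(map(ord, prompt))
    return [(base + i) % 4096 for i in range(32)]

# Stand-in for the diffusion decoder p(x | c).
def dlc_to_image(dlc: list[int]) -> list[list[float]]:
    return [[float(t % 256) / 255.0] * 3 for t in dlc]

image = dlc_to_image(text_to_dlc("a teapot on a mountain"))
```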
DLCs are compositional.
Swap tokens between two images (🐕 Komondor + 🍝 Carbonara) → the model produces coherent hybrids never seen during training.
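The token swap itself is a one-liner over the discrete sequences; a toy sketch with hypothetical DLC length and vocabulary size:

```python
import numpy as np

rng = np.random.default_rng(3)
L, V = 32, 4096  # hypothetical DLC length / vocabulary size

dlc_dog = rng.integers(0, V, size=L)    # stand-in for the Komondor's DLC
dlc_pasta = rng.integers(0, V, size=L)  # stand-in for the carbonara's DLC

# Swap the first half of the tokens; decoding the result yields a hybrid.
hybrid = dlc_dog.copy()
hybrid[: L // 2] = dlc_pasta[: L // 2]
```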
🚀 Results:
DiT-XL/2 + DLC → FID 1.59 on unconditional ImageNet
Works well with and without classifier-free guidance
Learns faster and performs better than prior work using pre-trained encoders
🤯
Unconditional generation pipeline:
Sample a DLC (e.g., with SEDD)
Decode it into an image (e.g., with DiT)
This ancestral sampling approach is simple but powerful.
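The two-stage ancestral sampler can be sketched with stand-in samplers; the token count, vocabulary, and both sampling functions below are hypothetical placeholders for SEDD and DiT:

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_TOKENS, VOCAB = 32, 4096  # hypothetical DLC shape

def sample_dlc():
    # Stand-in for a discrete prior such as SEDD: here, uniform tokens.
    return rng.integers(0, VOCAB, size=NUM_TOKENS)

def decode_image(dlc):
    # Stand-in for a conditional decoder such as DiT: any p(x | c).
    return rng.normal(size=(64, 64, 3))  # toy "image"

c = sample_dlc()     # step 1: c ~ p(c)
x = decode_image(c)  # step 2: x ~ p(x | c)
```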
DLCs enable exactly this.
Images → sequences of discrete tokens via a Simplicial Embedding (SEM) encoder
We take the argmax over token distributions → get the DLC sequence
Think of it as “tokenizing” images—like words for LLMs.
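The argmax step can be sketched in a few lines; the shapes here (number of tokens, codebook size) are hypothetical, and the random logits stand in for a real SEM encoder's output:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, vocab_size = 32, 4096  # hypothetical DLC length / codebook size

logits = rng.normal(size=(num_tokens, vocab_size))  # stand-in encoder output

def to_dlc(logits: np.ndarray) -> np.ndarray:
    """Map each group's token distribution to its most likely index."""
    return logits.argmax(axis=-1)

dlc = to_dlc(logits)  # one discrete token per group
```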
Text models don’t have this problem! LLMs can model internet-scale corpora.
So… can we improve generation of highly multimodal image distributions by decomposing it into:
1. Generating discrete tokens - p(c)
2. Decoding tokens into images - p(x|c)
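The decomposition above is just the mixture identity, sketched in LaTeX:

```latex
p(x) \;=\; \sum_{c} p(x \mid c)\, p(c)
```

Sampling $c \sim p(c)$ and then $x \sim p(x \mid c)$ leaves each conditional much simpler than the full marginal.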
Modeling highly multimodal distributions in continuous space is hard.
Even a simple 2D Gaussian mixture with a large number of modes may be tricky to model directly. Good conditioning solves this!
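The Gaussian-mixture intuition in one toy sketch (mode count, means, and noise scale are all made up for illustration): the unconditional distribution has K separated modes, but conditioned on the component, each p(x | c) is a single easy Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 16  # number of modes in a toy 2D Gaussian mixture

means = rng.uniform(-10, 10, size=(K, 2))

# Unconditional sampling must cover all K modes at once...
def sample_unconditional():
    c = rng.integers(K)
    return means[c] + 0.1 * rng.normal(size=2)

# ...but given the component c, p(x | c) is a single Gaussian.
def sample_conditional(c):
    return means[c] + 0.1 * rng.normal(size=2)

x = sample_conditional(3)
assert np.linalg.norm(x - means[3]) < 1.0  # sample stays near its mode
```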
Could this be why large image generative models are almost always conditional? 🤔
🧵 Everyone is chasing new diffusion models, but what about the representations they model?
We introduce Discrete Latent Codes (DLCs):
- Discrete representation for diffusion models
- Uncond. gen. SOTA FID (1.59 on ImageNet)
- Compositional generation
- Integrates with LLMs
🧱
The code and model weights for Llip are finally out! I hope you will find this model useful!
Paper: arxiv.org/abs/2405.00740
Code: github.com/facebookrese...
Models:
- ViT-G: huggingface.co/lavoies/llip...
- ViT-B: huggingface.co/lavoies/llip...
Congrats Lucas! Looking forward to seeing what will come out of your lab in Zurich!