I’m thrilled to share that I’ll present two first-authored papers at #ICCV2025 🌺 in Honolulu together with @mgui7.bsky.social ! 🏝️
(Thread 🧵👇)
This work was co-led by @mgui7.bsky.social and me and wouldn't have been possible without the help of all the other collaborators: @timyphan.bsky.social, Felix Krause, @kindsuss.bsky.social, @itsbautistam.bsky.social, and Björn Ommer.
A big thank you to all of them🙏
RepTok merges representation learning & generation
A self-supervised token becomes the latent of a generative model.
It’s efficient, continuous, and geometry-preserving - no quantization, no attention overhead.
Check it out
💻 github.com/CompVis/RepTok
📄 arxiv.org/abs/2510.14630
Ablations show:
• Works across SSL encoders (DINOv2 best, CLIP & MAE close)
• Cosine-similarity loss balances reconstruction fidelity vs. generative quality
• Without SSL priors → reconstructions stay good, but generations collapse
RepTok’s geometry stays smooth: linear interpolations in latent space yield natural transitions in both shape and semantics.
This shows that the single-token latent preserves structured continuity - not just abstract semantics.
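For anyone who wants to try this, a minimal sketch of the interpolation experiment — `encoder` and `decoder` are hypothetical placeholders here, not RepTok's actual API:

```python
import torch

# Hypothetical sketch: `encoder` / `decoder` are placeholders, not RepTok's API.
@torch.no_grad()
def interpolate_latents(encoder, decoder, img_a, img_b, steps=8):
    z_a = encoder(img_a)  # single continuous token, e.g. shape (1, 768)
    z_b = encoder(img_b)
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_a + alpha * z_b  # linear interpolation in latent space
        frames.append(decoder(z))            # decode each interpolated token
    return frames
```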
Even with a limited training budget, we still reach a competitive zero-shot FID on MS-COCO - rivaling much larger diffusion models.
We also extend RepTok to text-to-image generation using cross-attention to embeddings of frozen language models.
Training: <20 h on 4×A100 GPUs.
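Roughly how such conditioning can look in code — a hedged sketch, not the paper's exact module, with illustrative dimensions:

```python
import torch
import torch.nn as nn

# Hedged sketch: one cross-attention layer letting the image latent attend
# to embeddings from a frozen text encoder. Dimensions are illustrative.
class TextCrossAttention(nn.Module):
    def __init__(self, latent_dim=768, text_dim=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, n_heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, z, text_emb):
        # z: (B, 1, latent_dim) single-token latent; text_emb: (B, T, text_dim)
        out, _ = self.attn(query=z, key=text_emb, value=text_emb)
        return z + out  # residual update conditioned on the frozen text
```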
For generation, we model the latent space directly with an MLP-Mixer (no attention at all!).
Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.
Training cost drops by >90% vs. transformer-based diffusion, while maintaining a competitive FID on ImageNet.
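A simplified picture of an attention-free velocity net on a single token — the paper uses an MLP-Mixer; this sketch keeps only residual channel-mixing MLPs plus a time embedding, so treat it as illustrative:

```python
import torch
import torch.nn as nn

# Illustrative only: the paper's MLP-Mixer architecture differs. This keeps
# the attention-free idea: residual channel-MLP blocks + a time embedding.
class SingleTokenVelocity(nn.Module):
    def __init__(self, dim=768, hidden=3072, depth=6):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                      nn.Linear(dim, dim))
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                          nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(depth)
        ])

    def forward(self, z, t):
        # z: (B, dim) latent token, t: (B,) flow-matching time in [0, 1]
        h = z + self.time_mlp(t[:, None])
        for block in self.blocks:
            h = h + block(h)  # residual channel-mixing MLP blocks
        return h              # predicted velocity, same shape as z
```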
Despite using just one token (dim ~768), RepTok reconstructs images faithfully and achieves:
📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9
That’s better than or comparable to multi-token methods like TiTok or FlexTok - with a single continuous token.
🔑 The key to making it all work:
We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!
❗️This keeps it semantically structured yet reconstruction-aware.
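In rough PyTorch terms the idea looks like this — the exact weighting and details are my assumptions, not the paper's:

```python
import torch
import torch.nn.functional as F

# Sketch of the idea: keep the tuned [CLS] token close (in cosine similarity)
# to the frozen encoder's original [CLS]. The weighting `lam` is a guess.
def cls_similarity_loss(z_tuned, z_frozen):
    # both: (B, dim); maximizing cosine similarity == minimizing 1 - cos
    return (1.0 - F.cosine_similarity(z_tuned, z_frozen.detach(), dim=-1)).mean()

# total = reconstruction_loss + lam * cls_similarity_loss(z_tuned, z_frozen)
```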
💡 The idea
We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.
Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
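A hedged sketch of that setup — the attribute name `cls_token` follows common ViT implementations and may differ in practice:

```python
import torch

# Hedged sketch: freeze the whole SSL encoder and make only the [CLS] token
# embedding trainable. `cls_token` matches common ViT implementations
# (e.g. DINOv2's) but may be named differently elsewhere.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")

for p in encoder.parameters():
    p.requires_grad = False             # keep the SSL representation frozen
encoder.cls_token.requires_grad = True  # inject low-level info only here

trainable = [p for p in encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```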
💸 Current diffusion or flow models operate on redundant and expensive 2D latent grids...
VAEs, diffusion AEs, and tokenizers use a large number of latent tokens / patches.
But images often share structure that could be represented compactly!
🤔 What if you could generate an entire image using just one continuous token?
💡 It works if we leverage a self-supervised representation!
Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵 👇
🤔 What happens when you poke a scene — and your model has to predict how the world moves in response?
We built the Flow Poke Transformer (FPT) to model multi-modal scene dynamics from sparse interactions.
It learns to predict the 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 of motion itself 🧵👇
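Not FPT's actual architecture, but one illustrative way to output a distribution over motion instead of a single flow vector is a Gaussian-mixture head per query point:

```python
import torch
import torch.nn as nn

# Illustrative only, not FPT's real head: predict a Gaussian mixture over
# 2D displacements per queried point, i.e. a multi-modal motion distribution.
class MotionMixtureHead(nn.Module):
    def __init__(self, feat_dim=512, n_modes=4):
        super().__init__()
        self.n_modes = n_modes
        # per mode: 2 means + 2 log-stds + 1 mixture logit
        self.proj = nn.Linear(feat_dim, n_modes * 5)

    def forward(self, feats):
        # feats: (B, N, feat_dim) features at the queried points
        p = self.proj(feats).view(*feats.shape[:2], self.n_modes, 5)
        mean, log_std, logit = p[..., :2], p[..., 2:4], p[..., 4]
        return mean, log_std.exp(), logit.softmax(dim=-1)
```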
If you are interested, feel free to check out the paper (arxiv.org/abs/2506.02221) or come by our poster at CVPR:
📌 Poster Session 6, Sunday 4:00 to 6:00 PM, Poster #208
It's a framework that bridges Diffusion and Flow Matching paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields. This enables efficient FM finetuning of diffusion priors, retaining their knowledge while giving us the benefits of Flow Matching 🚀
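A rough sketch of the kind of conversion involved, assuming a diffusion interpolant x_t = α_t·x0 + σ_t·ε with an ε-prediction model — see the paper for the exact formulation:

```python
import torch

# Rough sketch (assumes x_t = alpha_t * x0 + sigma_t * eps and an
# eps-predicting diffusion model); the paper gives the exact formulation.
def diffusion_to_fm(x_t, eps_hat, alpha_t, sigma_t):
    # Rescale so the diffusion interpolant lies on the straight FM line
    # between x0 (t=0) and noise (t=1):
    t_fm = sigma_t / (alpha_t + sigma_t)  # aligned FM timestep
    x_fm = x_t / (alpha_t + sigma_t)      # aligned FM interpolant
    # FM target velocity is v = x1 - x0 = eps - x0; recover x0 from eps_hat:
    x0_hat = (x_t - sigma_t * eps_hat) / alpha_t
    v_hat = eps_hat - x0_hat              # FM-compatible velocity
    return x_fm, t_fm, v_hat
```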
Looking forward to attending #CVPR2025 in Nashville next week 🎸🎶 @mgui7.bsky.social and I will be presenting our latest work:
🌊 Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment
Sunrise in the office after the #ICCV deadline night with @mgui7.bsky.social 🚀
www.youtube.com/watch?v=bCy6...
Over 60 German universities and research institutions announced their departure from X today.
[Figure: our method pipeline]
🤔 When combining vision-language models (VLMs) with large language models (LLMs), do VLMs benefit more from genuine additional semantics or from artificial augmentations of the text on downstream tasks?
🤨Interested? Check out our latest work at #AAAI25:
💻Code and 📝Paper at: github.com/CompVis/DisCLIP
🧵👇
Congrats to @frankfundel.bsky.social for publishing this work at WACV🔥
It has been a pleasure to work on this topic with such a talented master's student 🤗
Looking forward to seeing what comes next!🚀
Awesome work from some colleagues cleaning up diffusion features!🚀
IMO VQGAN is why GANs deserve the NeurIPS test of time award. Suddenly our image representations were an order of magnitude more compact. Absolute game changer for generative modelling at scale, and the basis for latent diffusion models.
Hi, I would be happy to be on that list as well. Working on diffusion & flow matching at @compvis.bsky.social under the supervision of Björn Ommer.
Check out my GenAI starter pack! go.bsky.app/BT1bRvZ
After many years, our lab finally has a social media presence at @compvis.bsky.social ! 🥳
Give it a follow, we have some amazing research on generative computer vision coming soon!
me right now..
In a gratuitous attempt to acquire more followers myself 😁, I've made a start on a "starter pack". Hopefully as more people from 🐦 make it over to 🦋, we can extend this a bit. Suggestions welcome!
I've noticed that not all accounts seem to be eligible to be added; anyone know what's up with that? 🤔