
Posts by Johannes Schusterbauer

I’m thrilled to share that I’ll present two first-authored papers at #ICCV2025 🌺 in Honolulu together with @mgui7.bsky.social! 🏝️
(Thread 🧵👇)

6 months ago 5 3 1 1

This work was co-led by @mgui7.bsky.social and me and wouldn't have been possible without the help of all the other collaborators: @timyphan.bsky.social, Felix Krause, @kindsuss.bsky.social, @itsbautistam.bsky.social, and Björn Ommer.

A big thank you to all of them🙏

6 months ago 0 0 0 0

RepTok merges representation learning & generation

A self-supervised token becomes the latent of a generative model.
It’s efficient, continuous, and geometry-preserving - no quantization, no attention overhead.

Check it out
💻 github.com/CompVis/RepTok
📄 arxiv.org/abs/2510.14630

6 months ago 1 0 1 0

Ablations show:
• Works across SSL encoders (DINOv2 best, CLIP & MAE close)
• A cosine-similarity loss balances reconstruction fidelity against generative quality
• Without SSL priors → reconstructions good, generations collapse

6 months ago 0 0 1 0
Post image

RepTok’s geometry stays smooth: linear interpolations in latent space yield natural transitions in both shape and semantics.

This shows that the single-token latent preserves structured continuity - not just abstract semantics.
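The interpolation itself is just a linear blend; a minimal sketch with hypothetical single-token latents (the encode/decode steps are omitted):

```python
import numpy as np

def lerp(z0, z1, alpha):
    """Linearly interpolate between two single-token latents."""
    return (1.0 - alpha) * z0 + alpha * z1

# Two hypothetical 768-d latent tokens (dimension as in RepTok)
rng = np.random.default_rng(0)
z0 = rng.normal(size=768)
z1 = rng.normal(size=768)

# A smooth latent path; decoding each point would give the
# gradual shape/semantic transitions shown above
path = [lerp(z0, z1, a) for a in np.linspace(0.0, 1.0, 5)]
```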

6 months ago 0 0 1 0
Post image

Even with a limited training budget, we still reach a competitive zero-shot FID on MS-COCO - rivaling much larger diffusion models.

6 months ago 0 0 1 0
Post image

We also extend RepTok to text-to-image generation using cross-attention to embeddings of frozen language models.

Training: <20 h on 4×A100 GPUs.
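As a rough single-head sketch of that conditioning, the one image latent queries frozen text-encoder embeddings via cross-attention (all weight names and shapes here are illustrative, not the paper’s):

```python
import numpy as np

def cross_attention(z, text_emb, Wq, Wk, Wv):
    """The single image latent (query) attends over frozen
    text-encoder embeddings (keys/values)."""
    Q = z @ Wq                       # (1, d)
    K = text_emb @ Wk                # (T, d)
    V = text_emb @ Wv                # (T, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V               # (1, d): text-conditioned update

d, T = 768, 12                       # latent dim, number of text tokens
rng = np.random.default_rng(0)
Wq, Wk, Wv = [rng.normal(scale=0.02, size=(d, d)) for _ in range(3)]
z = rng.normal(size=(1, d))          # single image latent
text_emb = rng.normal(size=(T, d))   # frozen language-model embeddings
out = cross_attention(z, text_emb, Wq, Wk, Wv)
```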

6 months ago 0 0 1 0
Post image

For generation, we model the latent space directly with an MLP-Mixer (no attention at all!).

Since there’s only one token, token-to-token attention isn’t needed - drastically reducing compute.

Training cost drops by >90% vs transformer-based diffusion, while maintaining a competitive FID on ImageNet.
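Because the “sequence” has length one, the mixer degenerates to a channel MLP; a toy forward pass (layer sizes and the time-conditioning scheme are illustrative, not the paper’s architecture) might look like:

```python
import numpy as np

def velocity_mlp(z_t, t, params):
    """Attention-free velocity predictor for a single latent token.
    Conditioning on flow time t is done here by simple concatenation."""
    h = np.concatenate([z_t, t], axis=-1)        # (batch, dim + 1)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    W, b = params[-1]
    return h @ W + b                             # (batch, dim)

dim, hidden = 768, 2048
rng = np.random.default_rng(0)
params = [
    (rng.normal(scale=0.02, size=(dim + 1, hidden)), np.zeros(hidden)),
    (rng.normal(scale=0.02, size=(hidden, dim)), np.zeros(dim)),
]
z_t = rng.normal(size=(4, dim))                  # batch of noisy latents
t = rng.uniform(size=(4, 1))                     # flow-matching times
v = velocity_mlp(z_t, t, params)                 # predicted velocities
```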

6 months ago 0 0 1 0
Post image

Despite using just one token (dim ~768), RepTok reconstructs images faithfully and achieves:

📉 rFID = 1.85 on ImageNet-256
📈 PSNR = 14.9

That’s better than or comparable to multi-token methods like TiTok or FlexTok - with a single continuous token.

6 months ago 0 0 1 0
Post image

🔑 The key to making it all work:

We introduce an additional loss term to keep the tuned [CLS] token close to the original representation!

❗️This keeps it semantically structured yet reconstruction-aware.
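A plausible form of that term, assuming it is simply one minus the cosine similarity between the tuned and the original frozen [CLS] token (the paper may weight or formulate it differently):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def preservation_loss(z_tuned, z_frozen):
    """Penalize the fine-tuned [CLS] token for drifting away from
    the frozen encoder's original representation."""
    return 1.0 - cosine_similarity(z_tuned, z_frozen)
```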

6 months ago 0 0 1 0
Post image

💡 The idea

We start from a frozen self-supervised encoder (DINOv2, MAE, or CLIP) and combine it with a generative decoder.

Then we fine-tune only the [CLS] token embedding - injecting low-level info while keeping the rest frozen.
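The freezing rule itself is tiny; a toy sketch over a hypothetical slice of the encoder’s parameters (names are illustrative, not the real DINOv2 state dict):

```python
import numpy as np

# Hypothetical slice of a frozen SSL encoder's parameters
params = {
    "cls_token": np.zeros(768),                         # tuned
    "patch_embed.weight": np.zeros((768, 768)),         # frozen
    "blocks.0.attn.qkv.weight": np.zeros((2304, 768)),  # frozen
}

def is_trainable(name):
    """RepTok-style rule: only the [CLS] token embedding is updated;
    everything else in the encoder stays frozen."""
    return name == "cls_token"

trainable = [n for n in params if is_trainable(n)]
frozen = [n for n in params if not is_trainable(n)]
```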

6 months ago 0 0 1 0

💸 Current diffusion or flow models operate on redundant and expensive 2D latent grids...

VAEs, diffusion AEs, or tokenizers use a large number of latent tokens / patches.

But images often share structure that could be represented compactly!

6 months ago 0 0 1 0
Post image

🤔 What if you could generate an entire image using just one continuous token?

💡 It works if we leverage a self-supervised representation!

Meet RepTok🦎: A generative model that encodes an image into a single continuous latent while keeping realism and semantics. 🧵 👇

6 months ago 10 4 1 1
Post image

🤔 What happens when you poke a scene — and your model has to predict how the world moves in response?

We built the Flow Poke Transformer (FPT) to model multi-modal scene dynamics from sparse interactions.

It learns to predict the 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 of motion itself 🧵👇

6 months ago 24 8 1 1
Preview: Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment

If you are interested, feel free to check the paper (arxiv.org/abs/2506.02221) or come by at CVPR:

📌 Poster Session 6, Sunday 4:00 to 6:00 PM, Poster #208

10 months ago 5 2 0 0

It's a framework that bridges Diffusion and Flow Matching paradigms by rescaling timesteps, aligning interpolants, and deriving FM-compatible velocity fields. This enables efficient FM finetuning of diffusion priors, retaining their knowledge while giving us the benefits of Flow Matching 🚀
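The conversion can be sketched as follows, assuming the standard diffusion interpolant x_t = alpha_t * x0 + sigma_t * eps and a linear FM interpolant between data x0 and noise x1 = eps, whose target velocity is v = x1 - x0 (the paper’s exact alignment may differ):

```python
import numpy as np

def fm_velocity_from_eps(x_t, eps_pred, alpha_t, sigma_t):
    """Turn a diffusion model's noise prediction into a
    flow-matching velocity v = x1 - x0, with x1 = eps."""
    x0_pred = (x_t - sigma_t * eps_pred) / alpha_t  # invert interpolant
    return eps_pred - x0_pred

def rescaled_time(alpha_t, sigma_t):
    """One way to map a diffusion timestep onto the FM axis:
    t = 0 is data, t = 1 is pure noise."""
    return sigma_t / (alpha_t + sigma_t)
```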

10 months ago 1 0 1 0
Post image

Looking forward to attending #CVPR2025 in Nashville next week 🎸🎶 @mgui7.bsky.social and I will be presenting our latest work:

🌊 Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment

10 months ago 4 1 1 0
Post image

Sunrise in the office after the #ICCV deadline night with @mgui7.bsky.social 🚀

1 year ago 13 2 1 0
Building a New Foundation Model (Björn Ommer) | DLD25 (YouTube video by DLD Conference)

www.youtube.com/watch?v=bCy6...

1 year ago 10 2 0 0

Over 60 German universities and research institutions announced their departure from X today.

1 year ago 76 15 0 0
Our method pipeline

🤔When combining Vision-language models (VLMs) with Large language models (LLMs), do VLMs benefit from additional genuine semantics or artificial augmentations of the text for downstream tasks?

🤨Interested? Check out our latest work at #AAAI25:

💻Code and 📝Paper at: github.com/CompVis/DisCLIP

🧵👇

1 year ago 15 8 1 0

Congrats to @frankfundel.bsky.social for publishing this work at WACV🔥

It has been a pleasure to jointly work on this topic with such a talented master’s student🤗

Looking forward to seeing what comes next!🚀

1 year ago 4 0 0 0

Awesome work from some colleagues cleaning up diffusion features!🚀

1 year ago 6 0 0 0
Preview: Taming Transformers for High-Resolution Image Synthesis

IMO VQGAN is why GANs deserve the NeurIPS test of time award. Suddenly our image representations were an order of magnitude more compact. Absolute game changer for generative modelling at scale, and the basis for latent diffusion models.

1 year ago 104 16 2 2

Hi, I would be happy to be on that list as well. I’m working on Diffusion & Flow Matching at @compvis.bsky.social under the supervision of Björn Ommer.

1 year ago 1 0 1 0

Check out my GenAI starter pack! go.bsky.app/BT1bRvZ

1 year ago 10 3 0 0

After many years, our lab finally has a social media presence at @compvis.bsky.social ! 🥳
Give it a follow, we have some amazing research on generative computer vision coming soon!

1 year ago 19 2 0 0
Post image

me right now..

1 year ago 48 3 4 0

In a gratuitous attempt to acquire more followers myself 😁, I've made a start on a "starter pack". Hopefully as more people from 🐦 make it over to 🦋, we can extend this a bit. Suggestions welcome!

I've noticed not all accounts seem to be eligible to be added, anyone know what's up with that? 🤔

1 year ago 126 37 34 10