Do we really need pixel generation to model motion? 🤔
We show how directly representing motion in a compact space enables efficient, scalable planning.
It runs 10,000× faster than video world models, enabling planning and reasoning in open-world and robotics settings.
Check it out ⬇️
Posts by Kolja Bauer
You don't imagine the future by mentally rendering a movie. You trace how things move -- abstractly, sparsely, step by step.
We built a model that does exactly this. It predicts motion, not pixels -- and it's 3,000× faster than video world models.
Myriad, accepted at @cvprconference.bsky.social
I’m thrilled to share that I’ll present two first-authored papers at #ICCV2025 🌺 in Honolulu together with @mgui7.bsky.social! 🏝️
(Thread 🧵👇)
🤔 What happens when you poke a scene — and your model has to predict how the world moves in response?
We built the Flow Poke Transformer (FPT) to model multi-modal scene dynamics from sparse interactions.
It learns to predict the 𝘥𝘪𝘴𝘵𝘳𝘪𝘣𝘶𝘵𝘪𝘰𝘯 of motion itself 🧵👇
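Not the actual FPT architecture, but the core idea -- predicting a *distribution* over motion rather than a single flow vector -- can be sketched with a small Gaussian mixture over 2D displacements. All mixture parameters below are made up for illustration; a real model head would regress them from the image and the poke.

```python
import numpy as np

# Hypothetical per-query-point output of a motion-distribution head:
# mixture weights, 2D mean flows, and isotropic std-devs.
weights = np.array([0.6, 0.4])        # two plausible motion modes
means   = np.array([[ 2.0, 0.0],      # e.g. the object slides right...
                    [-1.0, 1.5]])     # ...or tips left and up
stds    = np.array([0.2, 0.3])

def sample_flow(rng, n=1000):
    """Draw flow vectors from the two-component Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)   # pick a mode
    return means[comp] + stds[comp, None] * rng.standard_normal((n, 2))

rng = np.random.default_rng(0)
flows = sample_flow(rng)   # (1000, 2) multi-modal motion samples
```

Sampling from the predicted mixture, rather than regressing one flow field, is what lets such a model express that a poked scene can respond in several qualitatively different ways.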
Our method pipeline
🤔 When combining vision-language models (VLMs) with large language models (LLMs), do VLMs benefit more from genuine additional semantics or from artificial augmentations of the text on downstream tasks?
🤨Interested? Check out our latest work at #AAAI25:
💻Code and 📝Paper at: github.com/CompVis/DisCLIP
🧵👇
To extract features from diffusion models, you have to add noise to your input and tune the noise level for each downstream task. But isn't there a better way? 🤔
Turns out there is, using our newly proposed feature extraction method CleanDIFT 🧹🚀
Check it out ⬇️
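For context, the "noise your input" step that CleanDIFT sidesteps is just standard DDPM forward diffusion at a task-tuned timestep t. A minimal numpy sketch (the linear beta schedule and the `diffusion_backbone` call are illustrative assumptions, not CleanDIFT itself):

```python
import numpy as np

def noise_input(x0, t, alpha_bar, rng=None):
    """Forward-diffuse a clean input x0 to timestep t (DDPM-style):
    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Linear beta schedule over 1000 steps, as in DDPM
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.ones((4, 4))                              # a toy "image"
x_t = noise_input(x0, t=250, alpha_bar=alpha_bar) # noised at tuned level t
# features = diffusion_backbone(x_t, t)           # hypothetical backbone call
```

The pain point is that the best t differs per downstream task (and the noise destroys information), which is exactly the tuning loop CleanDIFT aims to remove.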
Hi, I recently started as an ELLIS PhD student at Björn Ommer's lab. I would be happy to be on the list as well :)
After many years, our lab finally has a social media presence at @compvis.bsky.social ! 🥳
Give it a follow, we have some amazing research on generative computer vision coming soon!