Also check out our work on compressed motion embeddings for explicitly goal-conditioned planning!
bsky.app/profile/rmsn...
Posts by Stefan Baumann
Video diffusion models learn motion indirectly through pixels.
But motion itself is much lower-dimensional.
We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics.
This enables efficient planning -> 10,000× faster than video models.
🧵👇
Great to see corroborating evidence to our motion-forecasting.github.io work from other groups!
Check out this great concurrent work from
@stefanabaumann.bsky.social, @jannik-w.bsky.social, @tommymarto.bsky.social, Mahdi M. Kalayeh, and Björn Ommer (@compvis.bsky.social)
This was joint work with @jannik-w.bsky.social as co-first authors, with amazing support from @tommymarto.bsky.social, Mahdi M. Kalayeh, and Björn Ommer (@compvis.bsky.social)
Predicting how the world moves -- not how it looks -- opens up fast, scalable future reasoning for robotics, planning, and embodied AI. Stop painting the future frame by frame. Just envision how it moves.
📄 Paper: arxiv.org/abs/2604.09527
💻 Code & Models: compvis.github.io/myriad
Exciting to see @neerjathakkar et al. arrive at a similar intuition independently -- point trajectories as the right abstraction for motion prediction in the wild
Great results forecasting animal motion across species. Concurrent work, shared conviction: trajectories > pixels
bsky.app/profile/neer...
We also release OWM, a benchmark for open-world sparse motion prediction in in-the-wild scenes. You only ever observe one future, but the model should capture all plausible ones -- evaluating that properly needed a new benchmark.
Where this gets fun: billiard shot planning. Sample thousands of "what if I strike it this way?" rollouts, pick the best one, execute.
Same training data, same compute budget. Myriad sinks the shot 78% of the time. Best video model: 16%.
You can plan well when you can actually explore enough futures.
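The sampling loop described in these posts is the classic random-shooting pattern: draw many candidate actions, roll each out with a fast dynamics model, keep the best. A minimal sketch of that idea (the `dynamics` and `score` functions here are toy placeholders, not the actual Myriad planner):

```python
import numpy as np

def plan_shot(dynamics_fn, score_fn, n_samples=2000, action_dim=2, seed=0):
    """Random-shooting planner: sample candidate actions, roll each
    out with a fast dynamics model, and return the highest-scoring one."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, action_dim))
    scores = np.array([score_fn(dynamics_fn(a)) for a in actions])
    return actions[int(np.argmax(scores))]

# Toy stand-ins: "dynamics" maps an action straight to a final state,
# "score" rewards ending near a target pocket at (0.5, 0.5).
dynamics = lambda a: a
score = lambda s: -np.sum((s - np.array([0.5, 0.5])) ** 2)
best = plan_shot(dynamics, score)
```

The point of the posts is that this only works when rollouts are cheap enough to afford thousands of samples per decision.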
Why? 2,200 samples/min on one GPU vs. ~0.05-0.7 samples/min for video models. On our open-world motion benchmark, Myriad (665M params) matches or beats models with 4.5-14B params in accuracy. Match the compute budget between models, and the gap becomes massive.
From an image, we predict point trajectories with an efficient model -- one timestep at a time, like mentally unrolling a chain of interactions. No frames. No rendering. Just dynamics.
This avoids the "visual tax": the enormous cost video models pay to generate every pixel to reason about motion.
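The frame-free, step-by-step rollout described above can be sketched as a generic autoregressive loop over point coordinates. The step function below is a toy constant-drift placeholder standing in for a learned model:

```python
import numpy as np

def rollout(step_fn, points, n_steps):
    """Unroll a motion model one timestep at a time: each call to
    step_fn predicts per-point displacements from the current state.
    No frames are rendered; we only track (N, 2) point coordinates."""
    traj = [points]
    for _ in range(n_steps):
        points = points + step_fn(points)  # predicted displacement
        traj.append(points)
    return np.stack(traj)                  # shape (n_steps + 1, N, 2)

# Placeholder "model": constant rightward drift of 0.1 per step.
drift = lambda p: np.zeros_like(p) + np.array([0.1, 0.0])
traj = rollout(drift, np.zeros((4, 2)), n_steps=5)
```

The state per step is just N coordinate pairs instead of a full frame of pixels, which is where the speedup in the posts comes from.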
You don't imagine the future by mentally rendering a movie. You trace how things move -- abstractly, sparsely, step by step.
We built a model that does exactly this. It predicts motion, not pixels -- and it's 3,000× faster than video world models.
Myriad, accepted at
@cvprconference.bsky.social
It was a great pleasure having you, and I enjoyed every discussion! Come back anytime, the door's always open!
Same here - I've resorted to using spaced en dashes in papers (just to make it maximally obvious), hoping that it doesn't give off true LLM vibes while still letting me structure the text how I prefer it
I don't know what your colleague did, but you can do significantly better than 0.0013 if you use assistance. All of these samples still had at least a distance of 1 in the 255^3 color space
Interesting, I think I got 4 or 5 wrong when I got 0.0015, and the top right displayed 0.00080 for a bit before I started messing up if I'm not misremembering. Maybe it's partially random what minimum value you can achieve?
Interesting, I just tried it again and couldn't do nearly as well as in a dark room before (0.0023)
Definitely gonna be hooked for a bit trying to get below 0.001 - this is too fun and there are still way too many cases where I mess up a 50/50 chance and the other option would've been right
As Eric already said, primarily a matter of screen quality and ambient vs screen brightness - I would not consider myself to have extraordinarily good color perception, just a bright, good screen in a dark room.
Super fun though!
Sure, but how do you explain an H100 being lower than an A100? It should be better in every (memory-related) way. That single data point pair excludes most reasonable alternatives
I thought about that, but that doesn't track either -- an H100 has a much higher bandwidth than an A100
To clarify: I'm talking about the fact that the points do not correspond to the amount of VRAM, unlike implied
I'm talking about the fact that GPUs that have the same amount of VRAM are on different positions. This has nothing to do with axis scaling
The y axis for memory capacity looks a bit weird 🤔
For me, looking at both the reviews on my submissions and others' submissions, I see only ~10% clearly LLM-written reviews. Still bad, but better than last year's conferences imo
Interesting, thanks for the additional context! I assumed that a modern architecture with good PE should mostly fix these problems purely by inductive bias
Isn't this a problem primarily caused by additive PE that should be lessened significantly by attention-only PEs (RoPE, ALiBi)?
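For readers outside the thread: attention-only PEs like RoPE rotate queries and keys by position-dependent angles instead of adding a position vector to the input, so the attention score depends only on the relative offset between tokens. A minimal NumPy sketch of that property (toy dimensions, not any particular library's implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding: rotate consecutive feature pairs of x
    by position-dependent angles. Applied to both queries and keys, the
    score q·k then depends only on the relative position offset."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).normal(size=8)
k = np.random.default_rng(1).normal(size=8)
# Same relative offset (3) at different absolute positions
# gives the same attention score.
s1 = rope(q, 10) @ rope(k, 7)
s2 = rope(q, 110) @ rope(k, 107)
```

Because the rotation angles subtract inside the dot product, shifting both positions by the same amount leaves the score unchanged, which is the "attention-only" property the reply refers to.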
For LLMs, NoPE is a thing because of the causal attention mask - I don't quite see how you're imagining these findings should transfer to vision
I love it, as long as I have the time to do it. Personally, I prefer doing it for complex problems (e.g., developing our lab's shared large-scale distributed training codebase).
I also actively try to use vibe coding in risk-free places to learn to use those tools better
Feels like something that could be vibe coded quite easily (and safely): monitor that repo and auto-create a PR to yours (assuming you maintain your bot similarly) with missing dates. No risk, as you'd approve any changes and less work, as you'd get notified once dates are known
If you want to automate some of this, the repo at github.com/ccfddl/ccf-d... is openly licensed and quite reliable. You could auto-add dates for some conferences to the bot once they're added there
Last year Molmo set SOTA on image benchmarks + pioneered image pointing. Millions of downloads later, Molmo 2 brings Molmo's grounded multimodal capabilities to video 🎥, and leads many open models on challenging industry video benchmarks. 🧵