
Posts by Stefan Baumann

Also check out our work on compressed motion embeddings for explicitly goal-conditioned planning!
bsky.app/profile/rmsn...

1 week ago 1 0 0 0

Video diffusion models learn motion indirectly through pixels.

But motion itself is much lower-dimensional.

We introduce 64× temporally compressed motion embeddings that directly capture scene dynamics.

This enables efficient planning -> 10,000× faster than video models.

🧵👇

1 week ago 12 2 1 2

Great to see corroborating evidence for our motion-forecasting.github.io work from other groups!

Check out this great concurrent work from
@stefanabaumann.bsky.social, @jannik-w.bsky.social, @tommymarto.bsky.social, Mahdi M. Kalayeh, and Björn Ommer (@compvis.bsky.social)

1 week ago 2 1 0 0

This was joint work with co-first author @jannik-w.bsky.social, with amazing support from @tommymarto.bsky.social, Mahdi M. Kalayeh, and Björn Ommer (@compvis.bsky.social)

1 week ago 0 0 1 0
Envisioning the Future, One Step at a Time: Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. ...

Predicting how the world moves -- not how it looks -- opens up fast, scalable future reasoning for robotics, planning, and embodied AI. Stop painting the future frame by frame. Just envision how it moves.

📄 Paper: arxiv.org/abs/2604.09527
💻 Code & Models: compvis.github.io/myriad

1 week ago 2 0 1 0

Exciting to see @neerjathakkar et al. arrive at a similar intuition independently -- point trajectories as the right abstraction for motion prediction in the wild.
Great results forecasting animal motion across species. Concurrent work, shared conviction: trajectories > pixels
bsky.app/profile/neer...

1 week ago 2 0 1 0

We also release OWM, a benchmark for open-world sparse motion prediction in in-the-wild scenes. You only ever observe one future, but the model should capture all plausible ones -- evaluating that properly needed a new benchmark.

1 week ago 1 0 1 0

Where this gets fun: billiard shot planning. Sample thousands of "what if I strike it this way?" rollouts, pick the best one, execute.
Same training data, same compute budget. Myriad sinks the shot 78% of the time. Best video model: 16%.
You can plan well when you can actually explore enough futures.
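For readers curious what this kind of planning looks like in code, here's a minimal best-of-N sketch. All names and interfaces here (`motion_model`, `score_fn`, the 2D action) are hypothetical illustrations, not Myriad's actual API: sample many candidate actions, roll each out with a fast motion model, keep the best-scoring one.

```python
import numpy as np

def best_of_n_plan(motion_model, state, score_fn, n_samples=1000, rng=None):
    """Generic best-of-N planning: sample candidate actions, imagine each
    future with a fast motion model, and return the best-scoring action.
    `motion_model(state, action, rng) -> trajectory` is an assumed interface."""
    rng = rng or np.random.default_rng(0)
    best_action, best_score = None, -np.inf
    for _ in range(n_samples):
        action = rng.uniform(-1.0, 1.0, size=2)   # e.g. strike angle + power
        trajectory = motion_model(state, action, rng)
        score = score_fn(trajectory)              # e.g. 1.0 if the ball sinks
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score
```

The point of the post is that this loop only pays off when sampling a rollout is cheap enough to run thousands of times.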

1 week ago 1 0 1 0

Why? 2,200 samples/min on one GPU vs. ~0.05-0.7 for video models. On our open-world motion benchmark, Myriad (665M params) matches or beats models with 4.5-14B params in accuracy. Match the compute budget between models, and the gap becomes massive.

1 week ago 1 0 1 0

From an image, we predict point trajectories with an efficient model -- one timestep at a time, like mentally unrolling a chain of interactions. No frames. No rendering. Just dynamics.
This avoids the "visual tax": the enormous cost video models pay to generate every pixel to reason about motion.
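A rough sketch of what "one timestep at a time, no rendering" means, with a toy constant-drift function standing in for the learned model (`step_model` and its `(N, 2) -> (N, 2)` interface are assumptions for illustration, not the paper's API):

```python
import numpy as np

def rollout_trajectories(step_model, points, n_steps):
    """Autoregressively unroll sparse point trajectories: each step predicts
    the points' next 2D positions from their current ones. No frames are
    ever rendered, only a (n_steps + 1, N, 2) tensor of positions."""
    history = [points]
    for _ in range(n_steps):
        points = step_model(points)   # (N, 2) -> (N, 2)
        history.append(points)
    return np.stack(history)

# Toy dynamics standing in for a learned model: constant drift to the right.
drift = lambda pts: pts + np.array([0.1, 0.0])
traj = rollout_trajectories(drift, np.zeros((5, 2)), n_steps=4)
```

The cost per step scales with the number of tracked points, not the number of pixels, which is where the claimed speedups come from.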

1 week ago 2 0 1 0

You don't imagine the future by mentally rendering a movie. You trace how things move -- abstractly, sparsely, step by step.
We built a model that does exactly this. It predicts motion, not pixels -- and it's 3,000× faster than video world models.
Myriad, accepted at
@cvprconference.bsky.social

1 week ago 23 8 2 2

It was a great pleasure having you, and I enjoyed every discussion! Come back anytime, the door's always open!

3 weeks ago 3 0 0 0

Same here - I've resorted to using spaced en dashes in papers (just to make it maximally obvious), hoping that it doesn't give off true LLM vibes while still letting me structure the text how I prefer it

1 month ago 1 0 0 1

I don't know what your colleague did, but you can do significantly better than 0.0013 with assistance. All of these samples were still at least a distance of 1 apart in 255^3 color space

1 month ago 1 0 0 0

Interesting, I think I got 4 or 5 wrong when I got 0.0015, and the top right displayed 0.00080 for a bit before I started messing up if I'm not misremembering. Maybe it's partially random what minimum value you can achieve?

1 month ago 1 1 1 0

Interesting, I just tried it again and couldn't do nearly as well as in a dark room before (0.0023)

Definitely gonna be hooked for a bit trying to get below 0.001 - this is too fun and there are still way too many cases where I mess up a 50/50 chance and the other option would've been right

1 month ago 0 0 1 0

As Eric already said, it's primarily a matter of screen quality and ambient vs screen brightness - I would not consider myself to have extraordinarily good color perception, just a bright, good screen in a dark room.
Super fun though!

1 month ago 3 0 1 0

Sure, but how do you explain an H100 being lower than an A100? It should be better in every (memory-related) way. That single data point pair excludes most reasonable alternatives

2 months ago 1 0 0 0

I thought about that, but that doesn't track either - an H100 has a much higher bandwidth than an A100

2 months ago 0 0 0 0

To clarify: I'm talking about the fact that the points do not correspond to the amount of VRAM, contrary to what's implied

2 months ago 0 0 0 0

I'm talking about the fact that GPUs with the same amount of VRAM are at different positions. This has nothing to do with axis scaling

2 months ago 0 0 2 0

The y axis for memory capacity looks a bit weird 🤔

2 months ago 3 0 4 0

For me, looking at both the reviews on my submissions and others' submissions, I see only ~10% clearly LLM-written reviews. Still bad, but better than last year's conferences imo

2 months ago 1 0 0 0

Interesting, thanks for the additional context! I assumed that a modern architecture with good PE should mostly fix these problems purely by inductive bias

3 months ago 0 0 0 0

Isn't this a problem primarily caused by additive PE that should be lessened significantly by attention-only PEs (RoPE, ALiBi)?
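For context, a minimal sketch of what "attention-only PE" means here, using RoPE as the example (illustrative NumPy, not any particular library's implementation): positions rotate pairs of q/k channels, so position information enters attention only through the relative angle in the dot product, never as an additive term in the embeddings.

```python
import numpy as np

def rope(x, base=10000.0):
    """Minimal rotary position embedding (RoPE): rotate each pair of q/k
    channels by a position-dependent angle. The dot product of two rotated
    vectors then depends only on their relative position."""
    seq, dim = x.shape
    pos = np.arange(seq)[:, None]                  # (seq, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = pos * freqs                           # (seq, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rot = np.empty_like(x)
    rot[:, 0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    rot[:, 1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return rot
```

Because each rotation is norm-preserving and only the angle difference survives the dot product, the relative-position property falls out algebraically.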

3 months ago 1 0 1 0

For LLMs, NoPE is a thing because of the causal attention mask - I don't quite see how you're imagining these findings should transfer to vision

3 months ago 0 0 0 0

I love it, as long as I have the time to do it. Personally, I prefer doing it for complex problems (e.g., developing our lab's shared large-scale distributed training codebase).

I also actively try to use vibe coding in risk-free places to learn to use those tools better

3 months ago 1 0 0 0

Feels like something that could be vibe coded quite easily (and safely): monitor that repo and auto-create a PR to yours (assuming you maintain your bot similarly) with missing dates. No risk, as you'd approve any changes, and less work, as you'd get notified once dates are known
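The diff step of such a bot could be as simple as this sketch (the conference-name -> deadline-string mapping is made up for illustration; fetching the upstream data and opening the PR are omitted):

```python
def missing_deadlines(upstream, local):
    """Return upstream entries that the local copy is missing or still
    lists as TBD -- the candidates for an auto-generated PR.
    Both arguments map conference name -> deadline string."""
    return {
        conf: date
        for conf, date in upstream.items()
        if local.get(conf) in (None, "", "TBD")
    }
```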

3 months ago 0 0 1 0
GitHub - ccfddl/ccf-deadlines: ⏰ Collaboratively track worldwide conference deadlines (Website, Python Cli, Wechat Applet) / If you find it useful, please star this project, thanks~

If you want to automate some of this, the repo at github.com/ccfddl/ccf-d... is openly licensed and quite reliable. You could auto-add dates for some conferences to the bot once they're added there

3 months ago 1 0 1 0

Last year Molmo set SOTA on image benchmarks + pioneered image pointing. Millions of downloads later, Molmo 2 brings Molmo's grounded multimodal capabilities to video 🎥 -- and leads many open models on challenging industry video benchmarks. 🧵

4 months ago 15 3 1 0