The sandwich technique came up again, so I decided to frame it properly.
Posts by Rishabh Kabra
I had a score disappear even though the reviewer said they would maintain their score. So it likely has nothing to do with whether the score changed.
Scaling 4D Representations
Self-supervised learning from video does scale! In our latest work, we scaled masked auto-encoding models to 22B params, boosting performance on pose estimation, tracking & more.
Paper: arxiv.org/abs/2412.15212
Code & models: github.com/google-deepmind/representations4d
Veo 3 goes to Glastonbury:
www.youtube.com/watch?v=aKkr...
@googleuk.bsky.social
Video vs. image diffusion representations
Feature visualization for image and video diffusion
Generative video diffusion: does a model trained with this objective learn better features than one trained for image generation?
We investigated this question and more in our latest work, please check it out!
*From Image to Video: An Empirical Study of Diffusion Representations*
arxiv.org/abs/2502.07001
A self-supervised video representation model that allows visual tokens to move “off-the-grid” to represent scene elements consistently as they move across the image plane. We evaluate on downstream tasks including point tracking, monocular depth estimation, and object tracking.
moog-paper.github.io
*Moving Off-the-Grid: Scene-Grounded Video Representations*.
Thursday afternoon poster.
We learn per-object tokens (Neural Assets) that disentangle appearance and 3D pose from multi-object scenes. A sequence-of-tokens format allows us to reuse the text-to-image architecture of existing generative models.
neural-assets-paper.github.io
*Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models*.
Thursday morning poster.
I’m hanging out at NeurIPS this week. Come check out my co-authors’ presentations of the following Spotlight papers!