… @timdarcet.bsky.social, Theo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, @jmairal.bsky.social, Herve Jegou, Patrick Labatut, Piotr Bojanowski
And of course, it’s open source! github.com/facebookrese...
📜 Paper: ai.meta.com/research/pub...
Posts by Max Seitzer
Immensely proud to have been part of this project. Thank you to the team: @oriane_simeoni, @huyvvo, @baldassarrefe.bsky.social, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, …
And here’s my favorite figure from the paper, showing high resolution DINOv3 representations in all their detail-capturing glory ✨
To recap:
1) The promise of SSL is finally realized, enabling foundation models across domains
2) High quality dense features enabling SotA applications
3) A versatile family of models for diverse deploy scenarios
So many great ideas (Gram anchoring!) to how we got there, please read the paper!
Satellite you said? Yes, the same DINOv3 algorithm trained on satellite imagery produces a SotA model for geospatial tasks like canopy height estimation. And of course, learns beautiful feature maps. This is the magic of SSL 🪄
3) DINOv3 is a family of models covering all use cases:
• ViT-7B flagship model
• ViT-S/S+/B/L/H+ (21M-840M params)
• ConvNeXt variants for efficient inference
• Text-aligned ViT-L (dino.txt)
• ViT-L/7B for satellite
All inheriting the great dense features of the 7B!
Well, Jianyuan Wang of VGGT fame simply dropped DINOv3 into his pipeline and off-handedly got a new SotA 3D model out. Seems promising enough?
But what do these great features bring us? We reached SotA on three long-standing vision tasks, simply by building on a frozen ❄️ (!) DINOv3 backbone: detection (66.1 mAP@COCO), segmentation (63 mIoU@ADE), depth (eg 4.3 ARel@NYU). Not convinced yet?
2) DINOv3’s global understanding is strong, but its dense representations truly shine! There’s a clear gap between DINOv3 and prior methods across many tasks. This matters as pretrained dense features power many applications: MLLMs, video&3D understanding, robotics, generative models, …
1) Some history: on ImageNet classification, supervised and weakly-supervised models converged to the same plateau over the last years. With DINOv3, SSL finally reaches that level. This alone is a big deal: no more reliance on annotated data!
Introducing DINOv3 🦕🦕🦕
A SotA-enabling vision foundation model, trained with pure self-supervised learning (SSL) at scale.
High quality dense features, combining unprecedented semantic and geometric scene understanding.
Three reasons why this matters👇
✨Introducing SENSEI✨ We bring semantically meaningful exploration to model-based RL using VLMs.
With intrinsic rewards for novel yet useful behaviors, SENSEI showcases strong exploration in MiniHack, Pokémon Red & Robodesk.
Accepted at ICML 2025🎉
Joint work with @cgumbsch.bsky.social
🧵
Scaling 4D Representations
Scaling 4D Representations
Self-supervised learning from video does scale! In our latest work, we scaled masked auto-encoding models to 22B params, boosting performance on pose estimation, tracking & more.
Paper: arxiv.org/abs/2412.15212
Code & models: github.com/google-deepmind/representations4d
Introducing 3DGSim🧩— an end-to-end 3D physics simulator trained only on multi-view videos. It achieves spatial & temporal consistency w/o ground truth 3D info or heavy inductive biases— enabling scalability & generalization🚀
Kudos to Mikel + @andregeist.bsky.social
www.youtube.com/watch?v=3Ar3...