New paper: Back into Plato's Cave
Are vision and language models converging to the same representation of reality? The Platonic Representation Hypothesis says yes. BUT we find the evidence for this is more fragile than it looks.
Project page: akoepke.github.io/cave_umwelten/
1/9
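Convergence claims of this kind are typically tested with a mutual nearest-neighbor alignment score between the two embedding spaces. A minimal sketch of such a metric (the function name and details are illustrative, not taken from the paper):

```python
import numpy as np

def mutual_knn_alignment(X, Y, k=10):
    """Average overlap of k-nearest-neighbor sets between two embedding
    spaces for the same underlying items (row i of X and row i of Y
    describe the same item).  Returns a score in [0, 1]."""
    def knn_sets(Z):
        # full pairwise Euclidean distance matrix
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude self-matches
        return np.argsort(d, axis=1)[:, :k]  # indices of k nearest neighbors
    nx, ny = knn_sets(X), knn_sets(Y)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlaps))
```

Identical spaces score 1.0; unrelated random embeddings score near chance level, so "fragile evidence" here would mean scores that move a lot under benign changes to the setup.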
Posts by Dominik Schnaus
MCML Blog: Images and text are usually aligned using millions of image-caption pairs. But could they still be matched if they were never seen together?
In "It's a (Blind) Match!", MCML Members explore this question.
mcml.ai/news/2026-01...
We present "Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion". #ICCV2025
Project page: visinf.github.io/scenedino/
Paper: arxiv.org/abs/2507.06230
Demo: huggingface.co/spaces/jev-a...
@jev-aleks.bsky.social @fwimbauer.bsky.social @olvrhhn.bsky.social @stefanroth.bsky.social @dcremers.bsky.social
The code for our #CVPR2025 paper, PRaDA: Projective Radial Distortion Averaging, is now out!
Turns out distortion calibration from multi-view 2D correspondences can be fully decoupled from 3D reconstruction, which greatly simplifies the problem.
arxiv.org/abs/2504.16499
github.com/DaniilSinits...
4/4
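For background on what is being calibrated: a common single-parameter radial distortion model in this line of work is Fitzgibbon's division model. A sketch of it below (whether PRaDA uses exactly this parameterization is not stated in the post):

```python
import numpy as np

def undistort_division(pts, lam):
    """Fitzgibbon's one-parameter division model: a distorted point at
    radius r from the distortion center maps to the undistorted point
    p / (1 + lam * r**2).  pts: (N, 2) array, already centered on the
    distortion center; lam: scalar distortion coefficient."""
    r2 = np.sum(pts**2, axis=1, keepdims=True)  # squared radius per point
    return pts / (1.0 + lam * r2)
```

With lam = 0 the model is the identity; calibration then amounts to estimating lam (per camera) from correspondences alone.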
It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
@schnaus.bsky.social @neekans.bsky.social @dcremers.bsky.social
Paper: arxiv.org/pdf/2503.241...
Project page: dominik-schnaus.github.io/itsamatch/
Code: github.com/dominik-schn...
3/4
This enables unsupervised matching: finding vision-language correspondences without any paired data.
As a proof of concept, we build an unsupervised image classifier that assigns labels without seeing a single image-text pair.
2/4
As models and datasets scale, distances in vision and language embeddings become increasingly similar (the Platonic Representation Hypothesis).
We cast the matching task as a Quadratic Assignment Problem (QAP) and propose a new heuristic solver.
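The paper's solver is new, but the QAP formulation itself can be sketched with SciPy's off-the-shelf FAQ heuristic as a baseline: align the two intra-modal distance matrices by a permutation (the function below is illustrative, not the paper's method):

```python
import numpy as np
from scipy.optimize import quadratic_assignment
from scipy.spatial.distance import cdist

def match_by_distances(img_emb, txt_emb):
    """Match items across modalities from intra-modal pairwise distances
    alone (no paired supervision): find a permutation P maximizing
    trace(A P B P^T), i.e. lining up the two distance matrices."""
    A = cdist(img_emb, img_emb)   # vision pairwise distance matrix
    B = cdist(txt_emb, txt_emb)   # language pairwise distance matrix
    res = quadratic_assignment(A, B, method="faq",
                               options={"maximize": True})
    return res.col_ind            # item i in A is matched to col_ind[i] in B
```

Maximizing trace(A P B P^T) is equivalent to minimizing ||A - P B P^T||_F, so perfect recovery is possible whenever the two modalities induce (nearly) the same distance structure — exactly the regime the Platonic Representation Hypothesis predicts.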
Can we match vision and language representations without any supervision or paired data?
Surprisingly, yes!
Our #CVPR2025 paper with @neekans.bsky.social and @dcremers.bsky.social shows that the pairwise distances in both modalities are often enough to find correspondences.
⬇️ 1/4
Can you train a model for pose estimation directly on casual videos without supervision?
Turns out you can!
In our #CVPR2025 paper AnyCam, we directly train on YouTube videos and achieve SOTA results by using an uncertainty-based flow loss and monocular priors!
⬇️
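Uncertainty-based losses of the kind mentioned above are commonly implemented as a heteroscedastic negative log-likelihood. A NumPy sketch of the general pattern (the exact form AnyCam uses is in the paper, not this post):

```python
import numpy as np

def uncertainty_flow_loss(flow_pred, flow_obs, log_sigma):
    """Uncertainty-weighted flow loss (Laplace NLL per pixel):
    |error| * exp(-log_sigma) + log_sigma.  Predicting a large
    log_sigma lets the network discount pixels whose observed flow is
    unreliable (e.g. moving objects), at a logarithmic penalty.
    flow_pred, flow_obs: (H, W, 2) arrays; log_sigma: (H, W)."""
    err = np.abs(flow_pred - flow_obs).sum(axis=-1)  # per-pixel L1 flow error
    return float((err * np.exp(-log_sigma) + log_sigma).mean())
```

With log_sigma fixed at 0 this reduces to a plain L1 flow loss; letting the network predict log_sigma per pixel is what makes the supervision robust on casual, dynamic videos.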
Check out our recent #CVPR2025 paper AnyCam, a fast method for pose estimation in casual videos!
1️⃣ Can be trained directly on casual videos without the need for 3D annotation.
2️⃣ Built around a feed-forward transformer with lightweight refinement.
Code and more info: fwmb.github.io/anycam/
We are thrilled to have 12 papers accepted to #CVPR2025. Thanks to all our students and collaborators for this great achievement!
For more details, check out cvg.cit.tum.de
Indeed, everyone had a blast. Thank you all for the great talks, discussions, and skiing/snowboarding!