Many thanks to @phillipisola.bsky.social for the thought-provoking hypothesis, and for openly discussing and engaging with our disagreements -- a rare kind of intellectual generosity.
9/9
with Daniil Zverev, @shiryginosar.bsky.social, and Alexei A. Efros.
Berkeley AI Research, @munichcenterml.bsky.social, @tuebingen-ai.bsky.social, @tticconnect.bsky.social
8/9
Takeaway: Models, like organisms, may perceive the world within their Umwelt (check von Uexküll: en.wikipedia.org/wiki/Jakob_J...). We suspect future evidence will favor von Uexküll over Plato. Different models may learn rich representations of the world, just not the same one.
7/9
Observation 3: Real data is many-to-many (e.g. many images can fit the same caption). When we relax the 1-to-1 evaluation constraint to account for this, alignment drops even further.
6/9
Observation 2: Coarse agreement persists, but fine-grained agreement does not. In controlled settings, vision and language models reliably retrieve neighbors of the correct class but rarely agree on the same instance (a toy sketch of this check follows below).
5/9
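To make the coarse-vs-fine distinction above concrete, here is a hedged toy sketch, assuming paired image/text embeddings plus class labels: coarse agreement asks whether each modality retrieves a neighbor of the query's own class, fine-grained agreement whether both modalities point at the exact same neighbor instance. All names, shapes, and data below are illustrative, not the paper's evaluation code.

```python
import numpy as np

def nearest_neighbor(X):
    # Index of each row's single nearest neighbor (cosine), excluding itself.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)  # never count a point as its own neighbor
    return sims.argmax(axis=1)

def coarse_vs_fine(img_emb, txt_emb, labels):
    img_nn = nearest_neighbor(img_emb)   # each image's nearest image
    txt_nn = nearest_neighbor(txt_emb)   # each caption's nearest caption
    # Coarse: does each modality retrieve a neighbor of the query's own class?
    img_class_acc = np.mean(labels[img_nn] == labels)
    txt_class_acc = np.mean(labels[txt_nn] == labels)
    # Fine: do the two modalities agree on the exact same neighbor instance?
    instance_agreement = np.mean(img_nn == txt_nn)
    return float(img_class_acc), float(txt_class_acc), float(instance_agreement)

# Toy usage: 1,000 paired samples, 20 classes, 256-d embeddings (all made up).
rng = np.random.default_rng(0)
labels = rng.integers(0, 20, size=1000)
print(coarse_vs_fine(rng.normal(size=(1000, 256)),
                     rng.normal(size=(1000, 256)), labels))
```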
Observation 1: On small datasets, neighbors are sparse, so models 'agree' simply because there aren't many options to choose from. At scale, neighbors become denser and more specialized within each modality (e.g. the pose of a car vs. the car's name).
4/9
This is usually tested on ~1K samples. We scaled the evaluation to 15M samples and found that alignment drops significantly.
3/9
Most experimental evidence for convergence comes from checking whether an image embedding and its caption's embedding share the same nearest neighbors, i.e. whether the two representations are aligned (a minimal sketch of this check is below).
2/9
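For the curious, here is a minimal sketch of that kind of alignment check: a mutual k-NN overlap between paired image and caption embeddings. The array names, shapes, cosine similarity, and choice of k are assumptions for illustration, not the exact evaluation code.

```python
import numpy as np

def knn_indices(X, k):
    # Top-k nearest neighbors (cosine) for each row of X, excluding itself.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    np.fill_diagonal(sims, -np.inf)           # never count a point as its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar rows

def mutual_knn_alignment(img_emb, txt_emb, k=10):
    # For each paired (image_i, caption_i), compare the image's within-modality
    # neighbors with the caption's within-modality neighbors and average the overlap.
    img_nn = knn_indices(img_emb, k)
    txt_nn = knn_indices(txt_emb, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(img_nn, txt_nn)]
    return float(np.mean(overlaps))

# Toy usage: 1,000 paired samples with 512-d embeddings per modality.
rng = np.random.default_rng(0)
score = mutual_knn_alignment(rng.normal(size=(1000, 512)),
                             rng.normal(size=(1000, 512)), k=10)
print(f"alignment = {score:.3f}")  # near chance level for random embeddings
```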
New paper: Back into Plato’s Cave
Are vision and language models converging to the same representation of reality? The Platonic Representation Hypothesis says yes. BUT we find the evidence for this is more fragile than it looks.
Project page: akoepke.github.io/cave_umwelten/
1/9
Thanks to Daniil Zverev*, @thwiedemer.bsky.social*, @bayesiankitten.bsky.social, Matthias Bethge (@bethgelab.bsky.social), and @wielandbrendel.bsky.social for making VGGSound sounder! 🙌 🎉 🐗
📊 With VGGSounder, we show that existing models don't always benefit from multimodal input, and that performance sometimes even degrades.
Code and data: vggsounder.github.io
VGGSounder is a new video classification benchmark for audio-visual foundation models:
We provide:
📢 Re-annotated VGGSound test set
📢 Modality-specific manual labels
📢 A modality confusion metric to diagnose when models misuse modalities
Paper: arxiv.org/pdf/2508.08237
🎉 Excited to present our paper VGGSounder: Audio‑Visual Evaluations for Foundation Models today at #ICCV2025!
🕦 Poster Session 1 | 11:30–13:30
📍 Poster #88
Come by if you're into audio-visual learning and want to know whether multiple modalities actually help or hurt.
Thanks to @munichcenterml.bsky.social for supporting the workshop with a best paper award (announced at 2.50pm CDT)!
We have fantastic speakers, including @saining.bsky.social, @aidanematzadeh.bsky.social, @ranjaykrishna.bsky.social, Ludwig Schmidt, @lisadunlap.bsky.social, and Ishan Misra.
Our #CVPR2025 workshop on Emergent Visual Abilities and Limits of Foundation Models (EVAL-FoMo) is taking place this afternoon (1-6pm) in room 210.
Workshop schedule: sites.google.com/view/eval-fo...
[Image: screenshot of the "Emergent Visual Abilities and Limits of Foundation Models" workshop website at CVPR 2025]
Our paper submission deadline for the EVAL-FoMo workshop @cvprconference.bsky.social has been extended to March 19th!
sites.google.com/view/eval-fo...
We welcome submissions (incl. published papers) on the analysis of emerging capabilities / limits in visual foundation models. #CVPR2025
Our 2nd Workshop on Emergent Visual Abilities and Limits of Foundation Models (EVAL-FoMo) is accepting submissions. We are looking forward to talks by our amazing speakers that include @saining.bsky.social, @aidanematzadeh.bsky.social, @lisadunlap.bsky.social, and @yukimasano.bsky.social. #CVPR2025
Upcoming 𝗠𝘂𝗻𝗶𝗰𝗵 𝗔𝗜 𝗟𝗲𝗰𝘁𝘂𝗿𝗲 featuring Prof. Franca Hoffmann from California Institute of Technology and Prof. Holger Hoos from RWTH Aachen University: munichlectures.ai
🗓️ December 17, 2024
🕙 16:00 CET
🏫 Senatssaal, #LMU Munich
Kicking off our TUM AI Lecture Series tomorrow with none other than Jiaming Song, CSO @LumaLabsAI.
He'll be talking about "Dream Machine: Emergent Capabilities from Video Foundation Models".
Live stream: youtu.be/oilWwsXZamA
7pm GMT+1 / 10am PST (Mon Dec 2nd)