New paper: Back into Plato’s Cave
Are vision and language models converging to the same representation of reality? The Platonic Representation Hypothesis says yes. BUT we find the evidence for this is more fragile than it looks.
Project page: akoepke.github.io/cave_umwelten/
1/9
Posts by Shyamgopal Karthik
Hey everyone, super happy to share our work on quantum algorithms for heterogeneous partial differential equations (PDEs)! (1/4)
scirate.com/arxiv/2604.0...
This has now been accepted at @iclr-conf.bsky.social !
My guess is that evaluating multimodal models turns out to be tricky because the language model is much larger/stronger than the other modalities, leading these models to rely disproportionately on language priors.
In the long run, I'm hopeful that long-form streaming video holds solutions for a lot of the problems we face.
Earlier this year, we spent a lot of time pushing the limits of blind baselines for vision-language compositionality benchmarks and found that they're surprisingly close to state-of-the-art on several benchmarks, and that filtering samples wasn't a great solution.
Link: arxiv.org/abs/2506.08227
Was a very fun (and quick) investigation into biases of multimodal benchmarks, this time on the "Spatial Supersensing" tasks introduced by Cambrian-S, done with some great folks!
🚨 New Paper: "Solving Spatial Supersensing Without Spatial Supersensing"
Huge credit to the Cambrian-S team for tackling one of the hardest open problems in video understanding: spatial supersensing. In our paper, we take a closer look at their benchmarks & methods 👇
Very nice! Am I going crazy or do you use "Pick", "Pick Score", "PickScore", and "PickAScore" to refer to the same reward (i.e github.com/yuvalkirstain/PickScore)?
Unfortunately, our submission to #NeurIPS didn’t go through with (5,4,4,3). But because I think it’s an excellent paper, I decided to share it anyway.
We show how to efficiently apply Bayesian learning in VLMs, improve calibration, and do active learning. Cool stuff!
📝 arxiv.org/abs/2412.06014
Wonderful story behind some very nice SSL work!
I'm in Nashville this week attending #CVPR2025. Excited to discuss post-training VLMs and diffusion models!
The members of the Cluster of Excellence "Machine Learning: New Perspectives for Science" raise their glasses and celebrate securing another funding period.
We're super happy: Our Cluster of Excellence will continue to receive funding from the German Research Foundation @dfg.de ! Here’s to 7 more years of exciting research at the intersection of #machinelearning and science! Find out more: uni-tuebingen.de/en/research/... #ExcellenceStrategy
Oh yes, nobody tells you in game that there's a tactic in this position. And you need to calculate a sacrifice fully in a game, and not play one move at a time. So it's not too hard to overfit, but doing online tactics well is a necessary but not sufficient condition to play chess well.
Maybe if people tried to overfit to online tactics ratings, sure. But having good calculation skills and awareness of tactical patterns is essential to being a good chess player, while "leetcode" is not essential to being a good programmer?
🚨 New preprint!
How far can we go with ImageNet for Text-to-Image generation? w. @arrijitghosh.bsky.social @lucasdegeorge.bsky.social @nicolasdufour.bsky.social @vickykalogeiton.bsky.social
TL;DR: Train a text-to-image model using 1000× less data in 200 GPU hrs!
📜https://arxiv.org/abs/2502.21318
🧵👇
These are some ridiculously good results from training tiny T2I models purely on ImageNet! It's almost too good to be true. Do check it out!
I've been talking about writing this paper to anyone who would listen since 2020. I bombed a bunch of job talks trying to convince companies to work on this. It's so nice to finally just be able to say, yes, self-play RL in a diverse world gives you immense capabilities
arxiv.org/abs/2502.03349
🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇
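One simple way to quantify how much two models' failures overlap beyond chance is Cohen's kappa on per-sample correctness — a hedged, illustrative stand-in here, not necessarily the similarity metric the paper itself uses:

```python
import numpy as np

def error_kappa(correct_a, correct_b):
    """Chance-corrected agreement between two models' correctness vectors."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    p_obs = np.mean(a == b)                         # observed agreement
    p_chance = (a.mean() * b.mean()
                + (1 - a.mean()) * (1 - b.mean()))  # agreement expected by chance
    return (p_obs - p_chance) / (1 - p_chance)

# Two models that fail on mostly the same samples score high kappa:
m1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
m2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(error_kappa(m1, m2), 2))  # prints 0.8
```

High kappa between a judge and a judged model is exactly the setting where finding (1) above bites: the judge's blind spots coincide with the model's.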
ReNO shows that some initial noises are better for some prompts! This is great for improving image generation, but I think it also reveals a deeper property of diffusion models.
This is maybe my favorite thing I've seen out of #NeurIPS2024.
Head over to HuggingFace and play with this thing. It's quite extraordinary.
Can we enhance the performance of T2I models without any fine-tuning?
We show that with ReNO (Reward-based Noise Optimization), one-step models consistently surpass all current open-source text-to-image models within a computational budget of 20-50 sec!
#NeurIPS2024
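A minimal sketch of the core idea — gradient ascent on the initial noise to maximize a reward. Everything here is a toy stand-in (a linear-tanh "generator" and a distance-based "reward"); the actual method backpropagates human-preference reward models through a one-step diffusion model:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
target = np.ones(4)            # pretend "image" the reward prefers

def generator(z):
    return np.tanh(W @ z)      # toy one-step generator: noise -> image

def reward(x):
    return -np.sum((x - target) ** 2)  # higher is better

def grad_reward_wrt_noise(z):
    x = generator(z)
    dx = -2.0 * (x - target)           # d reward / d image
    return W.T @ (dx * (1 - x ** 2))   # chain rule back through tanh(W z)

z = rng.standard_normal(4)             # initial noise
before = reward(generator(z))
for _ in range(200):                   # optimize the noise, not the model
    z = z + 0.05 * grad_reward_wrt_noise(z)
after = reward(generator(z))
print(after > before)  # prints True: optimized noise scores higher reward
```

The key design point: the generator's weights are frozen; only the noise is optimized at inference time, which is why no fine-tuning is needed.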
I will present ✌️ BDU workshop papers @ NeurIPS: one by Rui Li (looking for internships) and one by Anton Baumann.
🔗 to extended versions:
1. 🙋 "How can we make predictions in BDL efficiently?" 👉 arxiv.org/abs/2411.18425
2. 🙋 "How can we do prob. active learning in VLMs?" 👉 arxiv.org/abs/2412.06014
After a break of over 2 years, I'm attending a conference again! Excited to attend NeurIPS, even more so to be presenting ReNO, getting inference-time scaling and preference optimization to work for text-to-image generation.
Do reach out if you'd like to chat!
🚨New Paper Alert🚨
🚀 Introducing FlowChef, "Steering Rectified Flow Models in the Vector Field for Controlled Image Generation"! 🌌✨
- Perform image editing, solve inverse problems, and more.
- Achieved inversion-free, gradient-free, & training-free inference time steering! 🤯
👇👇
Some recent discussions made me write up a short read on how I think about doing computer vision research when there's clear potential for abuse.
Alternative title: why I decided to stop working on tracking.
Curious about others' thoughts on this.
lb.eyer.be/s/cv-ethics....
Check out this nice work by @confusezius.bsky.social on designing VLMs for few-shot adaptation!
A real-time (or very fast) open-source txt2video model dropped: LTXV.
HF: huggingface.co/Lightricks/L...
Gradio: huggingface.co/spaces/Light...
Github: github.com/Lightricks/L...
Look at that prompt example though. Need to be a proper writer to get that quality.
Learning from one continuous video stream
- use a video stream to learn a predictive model
- everything is in pixel space
- update the model less frequently and don't use a momentum optimizer
- pre-training with IID data improves performance
- continual learning for robots
arxiv.org/html/2312.00...
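The training recipe above can be sketched with toy stand-ins (not the paper's model): a linear "predictor" learns next-frame prediction on a correlated synthetic stream, using plain SGD without momentum and weight updates only every K steps:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K, lr = 8, 16, 0.01
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # hidden stream dynamics
W = np.zeros((dim, dim))                              # predictor weights

frame = rng.standard_normal(dim)
grad_acc = np.zeros_like(W)
losses = []
for t in range(2000):
    nxt = Q @ frame + 0.01 * rng.standard_normal(dim)  # next "frame" in stream
    err = W @ frame - nxt                              # prediction error
    losses.append(float(err @ err))
    grad_acc += np.outer(err, frame)                   # accumulate gradient
    if (t + 1) % K == 0:                               # infrequent update,
        W -= lr * grad_acc / K                         # no momentum: plain SGD
        grad_acc[:] = 0.0
    frame = nxt

print(np.mean(losses[:100]) > np.mean(losses[-100:]))  # prints True
```

Accumulating gradients over K correlated frames before each update is one way to read "update the model less frequently": it averages out the strong temporal correlation a single stream imposes.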