New paper: Back into Plato’s Cave
Are vision and language models converging to the same representation of reality? The Platonic Representation Hypothesis says yes. BUT we find the evidence for this is more fragile than it looks.
Project page: akoepke.github.io/cave_umwelten/
1/9
Posts by Shyamgopal Karthik
Hey everyone, super happy to share our work on quantum algorithms for heterogeneous partial differential equations (PDEs)! (1/4)
scirate.com/arxiv/2604.0...
This has now been accepted at @iclr-conf.bsky.social !
My guess is that evaluating multimodal models turns out to be tricky because the language model is much larger/stronger than the other modalities, leading these models to rely disproportionately on language priors.
In the long run, I'm hopeful that long-form streaming video holds solutions for a lot of the problems we face.
Earlier this year, we spent a lot of time pushing the limits of blind baselines for vision-language compositionality benchmarks and found that they're surprisingly close to state-of-the-art on several benchmarks, and that filtering samples wasn't a great solution.
Link: arxiv.org/abs/2506.08227
Was a very fun (and quick) investigation into biases of multimodal benchmarks, this time on the "Spatial Supersensing" tasks introduced by Cambrian-S, done with some great folks!
🚨 New Paper: "Solving Spatial Supersensing Without Spatial Supersensing"
Huge credit to the Cambrian-S team for tackling one of the hardest open problems in video understanding: spatial supersensing. In our paper, we take a closer look at their benchmarks & methods 👇
Very nice! Am I going crazy or do you use "Pick", "Pick Score", "PickScore", and "PickAScore" to refer to the same reward (i.e github.com/yuvalkirstain/PickScore)?
Unfortunately, our submission to #NeurIPS didn’t go through with (5,4,4,3). But because I think it’s an excellent paper, I decided to share it anyway.
We show how to efficiently apply Bayesian learning in VLMs, improve calibration, and do active learning. Cool stuff!
📝 arxiv.org/abs/2412.06014
Wonderful story behind some very nice SSL work!
I'm in Nashville this week attending #CVPR2025. Excited to discuss post-training VLMs and diffusion models!
The members of the Cluster of Excellence "Machine Learning: New Perspectives for Science" raise their glasses and celebrate securing another funding period.
We're super happy: Our Cluster of Excellence will continue to receive funding from the German Research Foundation @dfg.de ! Here’s to 7 more years of exciting research at the intersection of #machinelearning and science! Find out more: uni-tuebingen.de/en/research/... #ExcellenceStrategy
Oh yes, nobody tells you in game that there's a tactic in this position. And you need to calculate a sacrifice fully in a game, and not play one move at a time. So it's not too hard to overfit, but doing online tactics well is a necessary but not sufficient condition to play chess well.
Maybe if people tried to overfit to online tactics ratings, sure. But having good calculation skills and awareness of tactical patterns is essential to being a good chess player, while "leetcode" is not essential to being a good programmer?
🚨 New preprint!
How far can we go with ImageNet for Text-to-Image generation? w. @arrijitghosh.bsky.social @lucasdegeorge.bsky.social @nicolasdufour.bsky.social @vickykalogeiton.bsky.social
TL;DR: Train a text-to-image model using 1000× less data in 200 GPU hrs!
📜https://arxiv.org/abs/2502.21318
🧵👇
These are some ridiculously good results from training tiny T2I models purely on ImageNet! It's almost too good to be true. Do check it out!
I've been talking about writing this paper to anyone who would listen since 2020. I bombed a bunch of job talks trying to convince companies to work on this. It's so nice to finally just be able to say, yes, self-play RL in a diverse world gives you immense capabilities
arxiv.org/abs/2502.03349
🚨Great Models Think Alike and this Undermines AI Oversight🚨
New paper quantifies LM similarity
(1) LLM-as-a-judge favors more similar models🤥
(2) Complementary knowledge benefits Weak-to-Strong Generalization☯️
(3) More capable models have more correlated failures 📈🙀
🧵👇
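One simple way to quantify how much two models' failures overlap beyond chance is Cohen's kappa on per-sample correctness — a hedged, illustrative stand-in here, not necessarily the similarity metric the paper itself uses:

```python
import numpy as np

def error_kappa(correct_a, correct_b):
    """Chance-corrected agreement between two models' correctness vectors."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    p_obs = np.mean(a == b)                         # observed agreement
    p_chance = (a.mean() * b.mean()
                + (1 - a.mean()) * (1 - b.mean()))  # agreement expected by chance
    return (p_obs - p_chance) / (1 - p_chance)

# Two models that fail on mostly the same samples score high kappa:
m1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
m2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(error_kappa(m1, m2), 2))  # prints 0.8
```

High kappa between a judge and a judged model is exactly the setting where finding (1) above bites: the judge's blind spots coincide with the model's.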
ReNO shows that some initial noises are better for some prompts! This is great for improving image generation, but I think it also reveals a deeper property of diffusion models.
This is maybe my favorite thing I've seen out of #NeurIPS2024.
Head over to HuggingFace and play with this thing. It's quite extraordinary.
Can we enhance the performance of T2I models without any fine-tuning?
We show that with ReNO (Reward-based Noise Optimization), one-step models consistently surpass all current open-source text-to-image models within a computational budget of 20-50 sec!
#NeurIPS2024
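A minimal sketch of the core idea — gradient ascent on the initial noise to maximize a reward. Everything here is a toy stand-in (a linear-tanh "generator" and a distance-based "reward"); the actual method backpropagates human-preference reward models through a one-step diffusion model:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
target = np.ones(4)            # pretend "image" the reward prefers

def generator(z):
    return np.tanh(W @ z)      # toy one-step generator: noise -> image

def reward(x):
    return -np.sum((x - target) ** 2)  # higher is better

def grad_reward_wrt_noise(z):
    x = generator(z)
    dx = -2.0 * (x - target)           # d reward / d image
    return W.T @ (dx * (1 - x ** 2))   # chain rule back through tanh(W z)

z = rng.standard_normal(4)             # initial noise
before = reward(generator(z))
for _ in range(200):                   # optimize the noise, not the model
    z = z + 0.05 * grad_reward_wrt_noise(z)
after = reward(generator(z))
print(after > before)  # prints True: optimized noise scores higher reward
```

The key design point: the generator's weights are frozen; only the noise is optimized at inference time, which is why no fine-tuning is needed.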
I will present ✌️ BDU workshop papers @ NeurIPS: one by Rui Li (looking for internships) and one by Anton Baumann.
🔗 to extended versions:
1. 🙋 "How can we make predictions in BDL efficiently?" 👉 arxiv.org/abs/2411.18425
2. 🙋 "How can we do prob. active learning in VLMs?" 👉 arxiv.org/abs/2412.06014
After a break of over 2 years, I'm attending a conference again! Excited to attend NeurIPS, even more so to be presenting ReNO, getting inference-time scaling and preference optimization to work for text-to-image generation.
Do reach out if you'd like to chat!
🚨New Paper Alert🚨
🚀 Introducing FlowChef, "Steering Rectified Flow Models in the Vector Field for Controlled Image Generation"! 🌌✨
- Perform image editing, solve inverse problems, and more.
- Achieved inversion-free, gradient-free, & training-free inference time steering! 🤯
👇👇
Some recent discussions made me write up a short read on how I think about doing computer vision research when there's clear potential for abuse.
Alternative title: why I decided to stop working on tracking.
Curious about others' thoughts on this.
lb.eyer.be/s/cv-ethics....
Check out this nice work by @confusezius.bsky.social on designing VLMs for few-shot adaptation!
A real-time (or very fast) open-source txt2video model dropped: LTXV.
HF: huggingface.co/Lightricks/L...
Gradio: huggingface.co/spaces/Light...
Github: github.com/Lightricks/L...
Look at that prompt example though. Need to be a proper writer to get that quality.
Learning from one continuous video stream
- use a video stream to learn a predictive model
- everything is in pixel space
- update the model less frequently and don't use a momentum optimizer
- pre-training with IID data improves performance
- continual learning for robots
arxiv.org/html/2312.00...
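The training recipe above can be sketched with toy stand-ins (not the paper's model): a linear "predictor" learns next-frame prediction on a correlated synthetic stream, using plain SGD without momentum and weight updates only every K steps:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K, lr = 8, 16, 0.01
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # hidden stream dynamics
W = np.zeros((dim, dim))                              # predictor weights

frame = rng.standard_normal(dim)
grad_acc = np.zeros_like(W)
losses = []
for t in range(2000):
    nxt = Q @ frame + 0.01 * rng.standard_normal(dim)  # next "frame" in stream
    err = W @ frame - nxt                              # prediction error
    losses.append(float(err @ err))
    grad_acc += np.outer(err, frame)                   # accumulate gradient
    if (t + 1) % K == 0:                               # infrequent update,
        W -= lr * grad_acc / K                         # no momentum: plain SGD
        grad_acc[:] = 0.0
    frame = nxt

print(np.mean(losses[:100]) > np.mean(losses[-100:]))  # prints True
```

Accumulating gradients over K correlated frames before each update is one way to read "update the model less frequently": it averages out the strong temporal correlation a single stream imposes.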