
Posts by Zihan


New paper 🚨
"Stable Deep Reinforcement Learning via Isotropic Gaussian Representations"

Deep RL suffers from unstable training, representation collapse, and neuron dormancy. We show that a simple geometric insight, isotropic Gaussian representations, can fix this. Here's how 👇
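
The post doesn't spell out the mechanism, but one way to picture "isotropic Gaussian representations" is a penalty that nudges a batch of network features toward zero mean and identity covariance. The sketch below is my own toy illustration under that assumption (isotropy_penalty is a hypothetical name), not the paper's actual method.

import torch

def isotropy_penalty(z: torch.Tensor) -> torch.Tensor:
    # z: (batch, dim) features from the RL agent's encoder.
    # Penalize deviation from a zero-mean, identity-covariance
    # (i.e. isotropic Gaussian) batch distribution.
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    eye = torch.eye(z.shape[1], device=z.device, dtype=z.dtype)
    return ((cov - eye) ** 2).mean()

# Example: total_loss = rl_loss + 0.1 * isotropy_penalty(features)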

Preview: "Caption This, Reason That: VLMs Caught in the Middle"
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relatio...

9/9
A huge shoutout to my co-authors @lucasmgomez.bsky.social,
@taylorwwebb.bsky.social, and @bashivan.bsky.social!
Check out the full paper for the deep dive into VLM cognitive profiles at arxiv.org/abs/2505.21538
See you in San Diego! 🏔️ #AI #VLM #NeurIPS2025


8/9
Our work suggests that future VLM improvements shouldn't just focus on larger encoders, but on better Visual Chain-of-Thought and integration strategies to overcome the "Perception-Reasoning" disconnect.


7/9
Does this generalize? Yes.
Fine-tuning on our cognitive tasks correlated with improvements on established benchmarks like MMMU-Pro and VQAv2. 📊


6/9
We didn't stop there. We fine-tuned Qwen2.5 on our Composite Visual Reasoning (CVR) tasks.
🔹 1k training samples yielded large gains.
🔹 100k samples pushed performance even higher.
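
For intuition only, here is a rough sketch of what supervised fine-tuning on CVR-style (image, question, answer) samples could look like. The sample format, model, and processor are placeholders for whatever VLM stack is used; the paper's actual training setup may differ.

import torch

def finetune_on_cvr(model, processor, samples, lr=1e-5, epochs=1):
    # samples: list of dicts with "image", "question", "answer" (hypothetical format).
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for s in samples:
            batch = processor(
                images=s["image"],
                text=s["question"] + "\n" + s["answer"],
                return_tensors="pt",
            )
            # Plain next-token loss over the whole sequence; in practice the
            # prompt tokens would be masked out of the labels.
            out = model(**batch, labels=batch["input_ids"])
            out.loss.backward()
            optim.step()
            optim.zero_grad()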


5/9
This suggests a major bottleneck in current VLMs: Chain-of-Thought (CoT) needs to be better grounded in visual features.
Models are "Caught in the Middle": they possess the visual info and the reasoning capacity, but fail to connect them without an explicit text bridge.


4/9
Is the vision encoder causing this gap? No.
We tested Self-Captioning (SC): The model describes the image, then answers the prompt using its own caption.
👉 Qwen2.5-VL-7B Spatial Perception accuracy went from 44% (Base) → 73% (SC). 📈
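
In code, Self-Captioning is essentially a two-stage prompt: caption first, then answer from the caption. A minimal sketch, assuming a generic vlm_generate(image, prompt) helper that stands in for your VLM's inference call (not an API from the paper):

def vlm_generate(image, prompt: str) -> str:
    # Placeholder: call your VLM (e.g. Qwen2.5-VL) with an image and a text prompt.
    raise NotImplementedError

def self_captioning_answer(image, question: str) -> str:
    # Stage 1: the model describes the image in its own words.
    caption = vlm_generate(image, "Describe this image in detail.")
    # Stage 2: the model answers, using its own caption as an explicit
    # text bridge between perception and reasoning.
    prompt = (
        f"Image description: {caption}\n"
        f"Based on this description, answer: {question}"
    )
    return vlm_generate(image, prompt)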


3/9
The Diagnosis? 🏥
VLMs have distinct cognitive profiles.
✅ Perception: Strong at identifying what an object is (Category).
❌ Spatial: Terrible at identifying where it is (Location).
❌ Attention: They struggle to ignore distractors.


2/9
Human intelligence is built on core abilities: Perception, Attention, and Memory.
Existing VLM benchmarks (MMMU, etc.) test high-level reasoning. We went deeper. We built the PAM Dataset to isolate these low-level cognitive abilities in models like GPT-4o and Qwen2.5-VL.
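
To give a feel for what "isolating" these abilities means in practice, here is a hypothetical scoring loop over PAM-style items grouped by ability. ask_vlm and the item fields are stand-ins, not the released benchmark code.

from collections import defaultdict

def ask_vlm(image, question: str) -> str:
    # Placeholder: query GPT-4o, Qwen2.5-VL, etc. and return the text answer.
    raise NotImplementedError

def cognitive_profile(items):
    # items: dicts with "ability" (perception / attention / memory),
    # plus "image", "question", and a ground-truth "answer".
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        pred = ask_vlm(it["image"], it["question"]).strip().lower()
        total[it["ability"]] += 1
        correct[it["ability"]] += int(pred == it["answer"].strip().lower())
    # Accuracy per cognitive ability.
    return {a: correct[a] / total[a] for a in total}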


1/9
🚨 Thrilled to share "Caption This, Reason That", a #NeurIPS2025 Spotlight! 🔦
Meet us at #2112, 3 Dec 11 a.m.
We analyze VLM limitations through the lens of Cognitive Science (Perception, Attention, Memory) and propose a simple "Self-Captioning" method that boosts spatial reasoning by ~18%.
🧵👇
