
Posts by Zihan


New paper 🚨
"Stable Deep Reinforcement Learning via Isotropic Gaussian Representations"

Deep RL suffers from unstable training, representation collapse, and neuron dormancy. We show that a simple geometric insight, isotropic Gaussian representations, can fix this. Here's how 👇
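
The post doesn't spell out the mechanism, but one way to picture "isotropic Gaussian representations" is a penalty that nudges a batch of network features toward zero mean and identity covariance. The sketch below is my own toy illustration under that assumption (isotropy_penalty is a hypothetical name), not the paper's actual method.

import torch

def isotropy_penalty(z: torch.Tensor) -> torch.Tensor:
    # z: (batch, dim) features from the RL agent's encoder.
    # Penalize deviation from a zero-mean, identity-covariance
    # (i.e. isotropic Gaussian) batch distribution.
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    eye = torch.eye(z.shape[1], device=z.device, dtype=z.dtype)
    return ((cov - eye) ** 2).mean()

# Example: total_loss = rl_loss + 0.1 * isotropy_penalty(features)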

Preview: "Caption This, Reason That: VLMs Caught in the Middle"
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relatio...

9/9
A huge shoutout to my co-authors @lucasmgomez.bsky.social,
@taylorwwebb.bsky.social, and @bashivan.bsky.social!
Check out the full paper for the deep dive into VLM cognitive profiles at arxiv.org/abs/2505.21538
See you in San Diego! 🏔️ #AI #VLM #NeurIPS2025


8/9
Our work suggests that future VLM improvements shouldn't just focus on larger encoders, but on better Visual Chain-of-Thought and integration strategies to overcome the "Perception-Reasoning" disconnect.


7/9
Does this generalize? Yes.
Fine-tuning on our cognitive tasks correlated with improvements on established benchmarks like MMMU-Pro and VQAv2. 📊


6/9
We didn't stop there. We fine-tuned Qwen2.5 on our Composite Visual Reasoning (CVR) tasks.
🔹 1k training samples yielded large gains.
🔹 100k samples pushed performance even higher.
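
For intuition only, here is a rough sketch of what supervised fine-tuning on CVR-style (image, question, answer) samples could look like. The sample format, model, and processor are placeholders for whatever VLM stack is used; the paper's actual training setup may differ.

import torch

def finetune_on_cvr(model, processor, samples, lr=1e-5, epochs=1):
    # samples: list of dicts with "image", "question", "answer" (hypothetical format).
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for s in samples:
            batch = processor(
                images=s["image"],
                text=s["question"] + "\n" + s["answer"],
                return_tensors="pt",
            )
            # Plain next-token loss over the whole sequence; in practice the
            # prompt tokens would be masked out of the labels.
            out = model(**batch, labels=batch["input_ids"])
            out.loss.backward()
            optim.step()
            optim.zero_grad()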


5/9
This suggests a major bottleneck in current VLMs: Chain-of-Thought (CoT) needs to be better grounded in visual features.
Models are "Caught in the Middle": they possess the visual info and the reasoning capacity, but fail to connect them without an explicit text bridge.


4/9
Is the vision encoder causing this gap? No.
We tested Self-Captioning (SC): The model describes the image, then answers the prompt using its own caption.
👉 Qwen2.5-VL-7B Spatial Perception accuracy went from 44% (Base) → 73% (SC). 📈
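
In code, Self-Captioning is essentially a two-stage prompt: caption first, then answer from the caption. A minimal sketch, assuming a generic vlm_generate(image, prompt) helper that stands in for your VLM's inference call (not an API from the paper):

def vlm_generate(image, prompt: str) -> str:
    # Placeholder: call your VLM (e.g. Qwen2.5-VL) with an image and a text prompt.
    raise NotImplementedError

def self_captioning_answer(image, question: str) -> str:
    # Stage 1: the model describes the image in its own words.
    caption = vlm_generate(image, "Describe this image in detail.")
    # Stage 2: the model answers, using its own caption as an explicit
    # text bridge between perception and reasoning.
    prompt = (
        f"Image description: {caption}\n"
        f"Based on this description, answer: {question}"
    )
    return vlm_generate(image, prompt)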


3/9
The Diagnosis? 🏥
VLMs have distinct cognitive profiles.
✅ Perception: Strong at identifying what an object is (Category).
❌ Spatial: Terrible at identifying where it is (Location).
❌ Attention: They struggle to ignore distractors.


2/9
Human intelligence is built on core abilities: Perception, Attention, and Memory.
Existing VLM benchmarks (MMMU, etc.) test high-level reasoning. We went deeper. We built the PAM Dataset to isolate these low-level cognitive abilities in models like GPT-4o and Qwen2.5-VL.
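
To give a feel for what "isolating" these abilities means in practice, here is a hypothetical scoring loop over PAM-style items grouped by ability. ask_vlm and the item fields are stand-ins, not the released benchmark code.

from collections import defaultdict

def ask_vlm(image, question: str) -> str:
    # Placeholder: query GPT-4o, Qwen2.5-VL, etc. and return the text answer.
    raise NotImplementedError

def cognitive_profile(items):
    # items: dicts with "ability" (perception / attention / memory),
    # plus "image", "question", and a ground-truth "answer".
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        pred = ask_vlm(it["image"], it["question"]).strip().lower()
        total[it["ability"]] += 1
        correct[it["ability"]] += int(pred == it["answer"].strip().lower())
    # Accuracy per cognitive ability.
    return {a: correct[a] / total[a] for a in total}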


1/9
🚨 Thrilled to share "Caption This, Reason That", a #NeurIPS2025 Spotlight! 🔦
Meet us at #2112, 3 Dec 11 a.m.
We analyze VLM limitations through the lens of Cognitive Science (Perception, Attention, Memory) and propose a simple "Self-Captioning" method that boosts spatial reasoning by ~18%.
🧵👇
