
Posts by Dimitrije Antić


To bridge this 2D-to-3D gap, we propose "Render-Localize-Lift" (sketched below):
- Render: render the 3D human/object meshes into multiview 2D images.
- Localize: a Multiview Localization (MV-Loc) model, guided by VLM tokens, predicts a 2D contact mask in each view.
- Lift: project the 2D contact masks back onto the 3D meshes.
(5/10)
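A minimal sketch of that loop, assuming a stub `predict_contact_mask` in place of the MV-Loc model and a bare pinhole projection in place of a real mesh renderer; the camera setup and voting threshold are illustrative, not from the paper:

```python
# Hedged sketch of Render-Localize-Lift (not the authors' code).
import numpy as np

def project(verts, R, t, f=500.0, res=256):
    """Pinhole-project 3D vertices into a res x res image."""
    cam = verts @ R.T + t                        # world -> camera coords
    uv = f * cam[:, :2] / cam[:, 2:3] + res / 2  # perspective divide
    return uv, cam[:, 2]                         # pixel coords, depth

def predict_contact_mask(view_id, res=256):
    """Stand-in for the MV-Loc model: a binary 2D contact mask per view.
    Here it just marks a fixed image region as 'in contact'."""
    mask = np.zeros((res, res), dtype=bool)
    mask[100:160, 100:160] = True
    return mask

def lift_to_3d(verts, cameras, res=256, thresh=0.5):
    """Lift: label a vertex as 3D contact if its projection lands inside
    the 2D contact mask in enough views (simple multiview voting)."""
    votes = np.zeros(len(verts))
    for i, (R, t) in enumerate(cameras):
        uv, depth = project(verts, R, t, res=res)
        mask = predict_contact_mask(i, res)
        px = np.clip(uv.round().astype(int), 0, res - 1)
        in_front = depth > 0                     # crude visibility check
        votes += in_front & mask[px[:, 1], px[:, 0]]
    return votes / len(cameras) >= thresh        # per-vertex contact labels

# Toy usage: a random point cloud standing in for mesh vertices, two views.
verts = np.random.randn(1000, 3) * 0.2 + np.array([0.0, 0.0, 2.0])
front = (np.eye(3), np.zeros(3))
side = (np.array([[0., 0., 1.], [0., 1., 0.], [-1., 0., 0.]]),
        np.array([0.0, 0.0, 2.0]))              # 90-degree rotated camera
contact = lift_to_3d(verts, [front, side])
print(f"{contact.sum()} of {len(verts)} vertices labeled as contact")
```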

10 months ago

How can we infer 3D contact with limited 3D data? InteractVLM exploits foundation models: a VLM & a localization model, fine-tuned to reason about contact. Given an image & a prompt, the VLM outputs tokens that guide localization (sketch below). But these models work in 2D, while contact is 3D. (4/10)
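One hedged way to picture that token-to-mask interface; this is a toy stand-in, not the released InteractVLM code, and `MaskDecoder` plus the tensor shapes are my assumptions in the spirit of LISA-style reasoning segmentation:

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Toy decoder: a dot product between image features and the VLM's
    contact token yields a per-pixel contact logit map."""
    def __init__(self, dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(dim, dim, kernel_size=1)  # image features
        self.tok_proj = nn.Linear(dim, dim)                 # VLM token

    def forward(self, img_feats, contact_token):
        f = self.img_proj(img_feats)                        # (B, C, H, W)
        q = self.tok_proj(contact_token)[:, :, None, None]  # (B, C, 1, 1)
        return (f * q).sum(dim=1)                           # (B, H, W)

# Toy usage with random stand-ins for the real model outputs.
B, C, H, W = 1, 256, 64, 64
img_feats = torch.randn(B, C, H, W)    # vision-encoder feature map
contact_token = torch.randn(B, C)      # embedding of a special VLM token
logits = MaskDecoder(C)(img_feats, contact_token)
contact_mask = logits.sigmoid() > 0.5  # binary 2D contact mask
print(contact_mask.shape)              # torch.Size([1, 64, 64])
```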

10 months ago

Why does 3D human-object reconstruction fail in the wild, or stay limited to a few object classes? A key missing piece is accurate 3D contact. InteractVLM (#CVPR2025) uses foundation models to infer contact on humans & objects, improving reconstruction from a single image. (1/10)

10 months ago

📢 Short deadline extension (24/2): one more week left to submit your application!

1 year ago

Passionate about Human-centric Computer Vision? 📸🤖
We’re looking for motivated PhD candidates to join our dynamic team! 🚀

1 year ago