InteractVLM (#CVPR2025) is a great collaboration between MPI-IS, UvA, and Inria.
Authors: @saidwivedi.in, @anticdimi.bsky.social, S. Tripathi, O. Taheri, C. Schmid, @michael-j-black.bsky.social and @dimtzionas.bsky.social.
Code & models available at: interactvlm.is.tue.mpg.de (10/10)
Posts by Sai Kumar Dwivedi
InteractVLM is the first method that infers 3D contacts on both humans and objects from in-the-wild images, and exploits these for 3D reconstruction via an optimization pipeline. In contrast, existing methods like PHOSA rely on handcrafted or heuristic-based contacts. (9/10)
With just 5% of DAMON’s 3D body contact annotations, InteractVLM surpasses the fully-supervised DECO baseline trained on 100% of 3D annotations. This is promising for minimizing reliance on costly 3D data by using foundational models. (8/10)
InteractVLM also strongly outperforms prior work on object affordance prediction on the PIAD dataset. Here, affordance is defined as contact probabilities on the object surface. (7/10)
InteractVLM significantly outperforms prior work, both qualitatively and quantitatively, on in-the-wild 3D human (binary & semantic) contact prediction on the DAMON dataset. (6/10)
To bridge this 2D-to-3D gap, we propose "Render-Localize-Lift":
- Render: 3D human/object meshes into multiview 2D images.
- Localize: A Multiview Localization (MV-Loc) model, guided by VLM tokens, predicts 2D contact masks.
- Lift: 2D contact masks to 3D.
(5/10)
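The three steps above can be sketched in code. This is a toy illustration only, not the released InteractVLM code: the orthographic cameras, the vertex-level "predictor", and all function names are made up to show the render → localize → lift flow.

```python
import numpy as np

def render(verts, camera):
    # "Render": project 3D mesh vertices into a 2D view with a toy
    # orthographic camera (stand-in for a real multiview renderer).
    return verts @ camera.T                      # (V, 2) image-plane points

def localize(mask_predictor, image_points):
    # "Localize": a stand-in for the MV-Loc model; here a callable scores
    # each projected vertex with a 2D contact probability.
    return mask_predictor(image_points)          # (V,) per-vertex score

def lift(per_view_scores):
    # "Lift": aggregate 2D contact evidence from all views back onto the
    # 3D vertices by averaging across views.
    return np.mean(per_view_scores, axis=0)

# Toy example: 3 vertices, two orthographic views, a dummy predictor that
# marks a vertex "in contact" when it sits high in the view.
verts = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])
cameras = [np.array([[1, 0, 0], [0, 1, 0]], float),   # front view (x, y)
           np.array([[1, 0, 0], [0, 0, 1]], float)]   # top view   (x, z)
predictor = lambda pts: (pts[:, 1] > 0.5).astype(float)

scores = lift([localize(predictor, render(verts, cam)) for cam in cameras])
contact_vertices = scores > 0.5                  # [False, False, True]
```

The real method renders the body/object mesh from several viewpoints and uses VLM tokens to guide the 2D mask prediction; only the averaging-style aggregation back to 3D is mirrored here.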
How can we infer 3D contact with limited 3D data? InteractVLM exploits foundational models: a VLM and a localization model, fine-tuned to reason about contact. Given an image and a prompt, the VLM outputs tokens for localization. But these models work in 2D, while contact is 3D. (4/10)
Furthermore, simple binary contact (touching “any” object) misses the rich semantics of real multi-object interactions. Thus, we introduce a novel task, "Semantic Human Contact" estimation: predicting the contact points on a human that correspond to a specified object. (3/10)
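The distinction between binary and semantic contact can be shown with a tiny data sketch. Everything here is illustrative: the random masks are fake, and the 6890-vertex count assumes a SMPL body mesh.

```python
import numpy as np

V = 6890                     # assumed SMPL vertex count
rng = np.random.default_rng(0)

# Binary contact: a single mask answering "is this vertex touching ANY object?"
binary_contact = rng.random(V) > 0.95            # (V,) bool

# Semantic contact: one mask per queried object name, so the same body can
# carry distinct contact regions for each object in the scene.
semantic_contact = {
    "cup":   rng.random(V) > 0.99,
    "chair": rng.random(V) > 0.95,
}

# The union of semantic masks recovers a binary-style mask, but the reverse
# mapping (which object each vertex touches) is lost in the binary setting.
union = np.logical_or.reduce(list(semantic_contact.values()))
```

This is why the semantic task is strictly richer: binary contact is derivable from it, but not vice versa.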
Precisely inferring where humans contact objects from an image is hard due to occlusion & depth ambiguity. Current datasets of images with 3D contact are small as they’re costly & tedious to create (mocap/manual labeling), limiting performance of contact predictors. (2/10)
Why does 3D human-object reconstruction fail in the wild or get limited to a few object classes? A key missing piece is accurate 3D contact. InteractVLM (#CVPR2025) uses foundational models to infer contact on humans & objects, improving reconstruction from a single image. (1/10)
✨ Happy to be recognised again as an Outstanding Reviewer for #CVPR2025!
Thanks to the workshop organizers: @yixinchen.bsky.social, Baoxiong Jia, @yaoyaofeng.bsky.social, @songyoupeng.bsky.social, Chuhang Zou, @saidwivedi.in, Yixin Zhu, Siyuan Huang! 🙌
And the challenge organizers: Xiongkun Linghu, Tai Wang, Jingli Lin, Xiaojian Ma
📢 Excited to announce the 5th Workshop on 3D Scene Understanding for Vision, Graphics & Robotics at #CVPR2025! We’ll dive into multimodal 3D scene understanding & reasoning with amazing speakers and challenges.
@cvprconference.bsky.social
More Details: scene-understanding.com.
I've been using GitHub's Lists feature for over a year, and it's seriously underrated! ⭐
It lets you assign labels to all your starred repos, making it super easy to find projects later based on specific fields or topics. No more endless scrolling!
Link to my list: github.com/saidwivedi?t...
📢 I am #hiring 2x #PhD candidates to work on Human-centric #3D #ComputerVision at the University of #Amsterdam!
The positions are funded by an #ERC #StartingGrant.
For details and for submitting your application please see:
werkenbij.uva.nl/en/vacancies...
🆘 Deadline: Feb 16 🆘
Thanks for sharing :) @chrisoffner3d.bsky.social can you also please add me to the list? I work on 3D human avatars.
One of the best tutorials for understanding Transformers!
📽️ Watch here: www.youtube.com/watch?v=bMXq...
Big thanks to @giffmana.ai for this excellent content! 🙌
Would love to be in the list 😃
For those who missed this post on the-network-that-is-not-to-be-named, I made public my "secrets" for writing a good CVPR paper (or any scientific paper). I've compiled these tips over many years. It's long, but hopefully it helps people write better papers. perceiving-systems.blog/en/post/writ...