Preprint now on ArXiv
The N-Body Problem: Parallel Execution from Single-Person Egocentric Video
Input: single-person egocentric video
Output: an imagination of how these tasks can be performed faster, and correctly, by N > 1 people, e.g. N = 2
arxiv.org/abs/2512.11393
zhifanzhu.github.io/ego-nbody/
1/4
Posts by Gabriele Goletto
Yes please! The animations look really clear to me, so they would make a great learning resource with a voiceover.
Now on ArXiv: our
@cvprconference.bsky.social
#CVPR2025 paper
Learning from Streaming Video with Orthogonal Gradients
Instead of shuffling clips, can we learn from videos fed sequentially, seeing each clip only once, in order?
How do we deal with the correlation of gradients over training?
1/3
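The core idea named in the title can be illustrated with a small, hypothetical sketch: when consecutive clips are correlated, their gradients point in similar directions, so one way to decorrelate updates is to project the current gradient onto the subspace orthogonal to the previous one. This is only an illustration of gradient orthogonalization in general, not the paper's actual algorithm; the function name and setup are assumptions.

```python
import numpy as np

def orthogonalize(grad, prev_grad, eps=1e-12):
    """Remove from `grad` its component along `prev_grad`.

    Illustrative only: a minimal orthogonal-projection step to
    decorrelate successive gradients in streaming training.
    """
    denom = np.dot(prev_grad, prev_grad)
    if denom < eps:  # previous gradient is (near) zero: nothing to remove
        return grad
    # Subtract the projection of grad onto prev_grad
    return grad - (np.dot(grad, prev_grad) / denom) * prev_grad

# Toy check: the result has no component along prev_grad
g = np.array([1.0, 1.0])
p = np.array([1.0, 0.0])
g_orth = orthogonalize(g, p)  # component along p is removed
```

By construction, `np.dot(g_orth, p)` is zero, so the update no longer repeats the direction already taken on the previous (correlated) clip.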
But I like the (almost) bot-free conversations and there are some really good active accounts!
Check out Kosta's starter packs (go.bsky.app/M7HGC3Y); that's the fastest route. That said, the CV community here has unfortunately become less active compared to a few months ago.
Image segmentation doesn't have to be rocket science.
Why build a rocket engine full of bolted-on subsystems when one elegant unit does the job?
That's what we did for segmentation.
Meet the Encoder-only Mask Transformer (EoMT): tue-mps.github.io/eomt (CVPR 2025)
(1/6)
Excited to release the first worldwide aerial image localization method (and demo!)
Take an aerial or satellite image from anywhere in the world, and AstroLoc can (probably) find its location, and provide a precise footprint!
Links to the paper, demo, and full-length (5 min) video below
HD-EPIC: A Highly-Detailed Egocentric Video Dataset
hd-epic.github.io
arxiv.org/abs/2502.04144
Newly collected videos
263 annotations/min: recipe, nutrition, actions, sounds, 3D object movement & fixture associations, masks.
26K VQA benchmark to challenge current VLMs
1/N
Now on ArXiv
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
arxiv.org/abs/2412.01987
soczech.github.io/showhowto/
Given one real image & a variable-length sequence of text instructions, ShowHowTo generates a multi-step sequence of images *conditioned on the scene in the REAL image*.
Hi Kosta, I would love to be on this list as well! I am working on egocentric video understanding.