Posts by Kwang Moo Yi
Chen et al., "AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model"
Another "rasterize & fix" method. In addition to the common point map rasterization, add relevant views as anchors in your conditioning signal for conditional video generation.
Guan et al., "Latent-Compressed Variational Autoencoder for Video Diffusion Models"
Video latents show an unstructured frequency decomposition, indicating redundancy --> it can be removed via wavelet transforms.
Also reminds me of Niedoba et al., “Towards a Mechanistic Explanation of Diffusion Model Generalization” arxiv.org/abs/2411.19339
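A minimal sketch of the idea of squeezing redundancy out of a latent with wavelets -- this is my own toy illustration with an orthonormal Haar transform and magnitude thresholding, not the paper's actual compression scheme:

```python
import numpy as np

def haar_1d(x, axis):
    """One level of the orthonormal Haar transform along `axis`."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    lo = (even + odd) / np.sqrt(2.0)  # low-pass: local averages
    hi = (even - odd) / np.sqrt(2.0)  # high-pass: local details
    return np.moveaxis(np.concatenate([lo, hi], axis=0), 0, axis)

# Toy "video latent": (frames, channels, height, width), even sizes.
rng = np.random.default_rng(0)
lat = rng.normal(size=(8, 4, 16, 16))

# Transform along time and both spatial axes.
coeffs = lat
for ax in (0, 2, 3):
    coeffs = haar_1d(coeffs, ax)

# Crude redundancy removal: zero the smallest 75% of coefficients.
thresh = np.quantile(np.abs(coeffs), 0.75)
kept = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
print("fraction kept:", np.mean(kept != 0))  # ~0.25
```

Since the transform is orthonormal, energy is preserved exactly; on a real latent, most of that energy concentrates in few coefficients, which is what makes thresholding cheap.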
Briq et al., “The Amazing Stability of Flow Matching”
The attached image explains it all (with a minor caption error, though) -- training flow matching models on less data, "different" data, disjoint subsets, or with different architectures all leads to similar results.
Ren and Tyszkiewicz, et al., "TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens"
DETR-style feed-forward 3D GS, with test-time token adaptation.
Shen and Ren et al., "Lyra 2.0: Explorable Generative 3D Worlds"
Explicit 3D memory, with overlap-based frame extraction for conditioning and dense warping, using 3D foundation models. I guess we still need explicit 3D.
Tran and Košecká, "PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting"
Prune, but then refine Gaussians in 3D with a point transformer.
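A toy sketch of the prune-then-refine pipeline. The opacity threshold and the k-NN smoothing step are my hypothetical stand-ins for the paper's geometry-driven criterion and point-transformer refinement:

```python
import numpy as np

# Toy Gaussian parameters; shapes are hypothetical, not from the paper.
rng = np.random.default_rng(1)
xyz = rng.normal(size=(500, 3))
opacity = rng.uniform(size=500)

# 1) Pruning: drop low-opacity Gaussians (a simple stand-in for a
#    geometry-driven criterion).
keep = opacity > 0.2
xyz = xyz[keep]

# 2) Refinement: where the real method applies a learned point
#    transformer over 3D neighborhoods, this placeholder just does one
#    step of k-nearest-neighbor smoothing.
dist = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
nn = np.argsort(dist, axis=1)[:, :8]  # 8 nearest neighbors (incl. self)
refined = 0.5 * xyz + 0.5 * xyz[nn].mean(axis=1)

print(xyz.shape, "->", refined.shape)
```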
Shabanov et al., "Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction"
Denoising 3D Gaussians with a DiT conditioned on multi-view images for 3D assets -- via a hierarchical level-of-detail representation for 3D Gaussians.
Zhang et al., "SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction"
Feed-forward 3D Gaussian Splatting is often pixel-aligned. Here's a non-aligned one, based on 3D point clouds and anchors.
Khangaonkar et al., "Multimodal Large Language Models Cannot Spot Spatial Inconsistencies"
A benchmark for spotting spatial inconsistencies. MLLMs are still quite far from being accurate. Reminds me of various automated benchmarks that have recently been shown to be misleading.
Zhang et al., "Watch Before You Answer: Learning from Visually Grounded Post-Training"
When post-training for VQA, sometimes the answers can be inferred easily **without** the images. These shortcuts are harmful; removing them improves performance.
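A minimal sketch of the filtering idea: drop training samples that a blind, text-only baseline already answers correctly. `blind_answer` here is a hypothetical stand-in for any language-prior model, not the paper's actual detector:

```python
# Drop VQA samples whose answer a text-only model can guess.
def blind_answer(question: str) -> str:
    # Toy language prior: yes/no questions get the majority answer "yes".
    return "yes" if question.lower().startswith("is ") else "unknown"

train = [
    {"q": "Is there a dog in the image?", "a": "yes"},  # text-only shortcut
    {"q": "What color is the car?", "a": "red"},        # needs the image
]

filtered = [ex for ex in train if blind_answer(ex["q"]) != ex["a"]]
print([ex["a"] for ex in filtered])  # the shortcut sample is gone
```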
Ma et al., "SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo"
Turns out you can match state of the art, or even outperform it, with ZERO real data for multi-view stereo. Which actually does make sense, since you are learning to solve geometry.
Huang et al., "UniRecGen: Unifying Multi-View 3D Reconstruction and Generation"
Estimate 3D point clouds (maps) in both camera and object coordinates, which leads to better multi-view feed-forward reconstruction.
Newman et al., "Video Models Reason Early: Exploiting Plan Commitment for Maze Solving"
Similar to the chain of steps paper, the so-called "reasoning" of video models seems to happen early, allowing exploitation of early steps for better problem solving.
Li and Luo et al., "Benchmarking PhD-Level Coding in 3D Geometric Computer Vision"
Am I replaced yet? Not yet! 3D seems still hard. I guess we still have to provide them with tools for now. But for how much longer?