Posts by Kwang Moo Yi
Chen et al., "AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model"
Another "rasterize & fix" method. In addition to the common point map rasterization, add relevant views as anchors in your conditioning signal for conditional video generation.
Guan et al., "Latent-Compressed Variational Autoencoder for Video Diffusion Models"
Video latents show an unstructured frequency decomposition, indicating redundancy --> it can be removed via wavelet transforms.
Also reminds me of Niedoba et al., “Towards a Mechanistic Explanation of Diffusion Model Generalization” arxiv.org/abs/2411.19339
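A minimal sketch of the idea of squeezing redundancy out of a latent with wavelets -- this is my own toy illustration with an orthonormal Haar transform and magnitude thresholding, not the paper's actual compression scheme:

```python
import numpy as np

def haar_1d(x, axis):
    """One level of the orthonormal Haar transform along `axis`."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    lo = (even + odd) / np.sqrt(2.0)  # low-pass: local averages
    hi = (even - odd) / np.sqrt(2.0)  # high-pass: local details
    return np.moveaxis(np.concatenate([lo, hi], axis=0), 0, axis)

# Toy "video latent": (frames, channels, height, width), even sizes.
rng = np.random.default_rng(0)
lat = rng.normal(size=(8, 4, 16, 16))

# Transform along time and both spatial axes.
coeffs = lat
for ax in (0, 2, 3):
    coeffs = haar_1d(coeffs, ax)

# Crude redundancy removal: zero the smallest 75% of coefficients.
thresh = np.quantile(np.abs(coeffs), 0.75)
kept = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
print("fraction kept:", np.mean(kept != 0))  # ~0.25
```

Since the transform is orthonormal, energy is preserved exactly; on a real latent, most of that energy concentrates in few coefficients, which is what makes thresholding cheap.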
Briq et al., “The Amazing Stability of Flow Matching”
The attached image explains it all (with a minor caption error, though) -- training flow matching models on less data, "different" data, disjoint subsets, or with different architectures all leads to similar results.
Ren and Tyszkiewicz, et al., "TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens"
DETR-style feed-forward 3D GS, with test-time token adaptation.
Shen and Ren et al., "Lyra 2.0: Explorable Generative 3D Worlds"
Explicit 3D memory, with overlap-based frame extraction for conditioning and dense warping, using 3D foundation models. I guess we still need explicit 3D.
Tran and Košecká, "PointSplat: Efficient Geometry-Driven Pruning and Transformer Refinement for 3D Gaussian Splatting"
Prune, but then refine Gaussians in 3D with a point transformer.
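A toy sketch of the prune-then-refine pipeline. The opacity threshold and the k-NN smoothing step are my hypothetical stand-ins for the paper's geometry-driven criterion and point-transformer refinement:

```python
import numpy as np

# Toy Gaussian parameters; shapes are hypothetical, not from the paper.
rng = np.random.default_rng(1)
xyz = rng.normal(size=(500, 3))
opacity = rng.uniform(size=500)

# 1) Pruning: drop low-opacity Gaussians (a simple stand-in for a
#    geometry-driven criterion).
keep = opacity > 0.2
xyz = xyz[keep]

# 2) Refinement: where the real method applies a learned point
#    transformer over 3D neighborhoods, this placeholder just does one
#    step of k-nearest-neighbor smoothing.
dist = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
nn = np.argsort(dist, axis=1)[:, :8]  # 8 nearest neighbors (incl. self)
refined = 0.5 * xyz + 0.5 * xyz[nn].mean(axis=1)

print(xyz.shape, "->", refined.shape)
```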
Shabanov et al., "Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction"
Denoising 3D Gaussians with a DiT conditioned on multi-view images for 3D assets -- via a hierarchical level-of-detail representation for 3D Gaussians.
Zhang et al., "SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction"
Feed-forward 3D Gaussian Splatting is often pixel-aligned. Here's a non-aligned one, based on 3D point clouds and anchors.
Khangaonkar et al., "Multimodal Large Language Models Cannot Spot Spatial Inconsistencies"
A benchmark for spotting spatial inconsistencies. MLLMs are still quite far from being accurate. Reminds me of various automated benchmarks that have recently been shown to be misleading.
Zhang et al., "Watch Before You Answer: Learning from Visually Grounded Post-Training"
When post-training for VQA, sometimes the answers can be inferred easily **without** the images. These shortcuts are harmful; removing them improves performance.
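A minimal sketch of the filtering idea: drop training samples that a blind, text-only baseline already answers correctly. `blind_answer` here is a hypothetical stand-in for any language-prior model, not the paper's actual detector:

```python
# Drop VQA samples whose answer a text-only model can guess.
def blind_answer(question: str) -> str:
    # Toy language prior: yes/no questions get the majority answer "yes".
    return "yes" if question.lower().startswith("is ") else "unknown"

train = [
    {"q": "Is there a dog in the image?", "a": "yes"},  # text-only shortcut
    {"q": "What color is the car?", "a": "red"},        # needs the image
]

filtered = [ex for ex in train if blind_answer(ex["q"]) != ex["a"]]
print([ex["a"] for ex in filtered])  # the shortcut sample is gone
```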
Ma et al., "SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo"
Turns out you can match state of the art, or even outperform it, with ZERO real data for multi-view stereo. Which actually does make sense, since you are learning to solve geometry.
Huang et al., "UniRecGen: Unifying Multi-View 3D Reconstruction and Generation"
Estimate 3D point clouds (maps) in both camera and object coordinates, which leads to better multi-view feed-forward reconstruction.
Newman et al., "Video Models Reason Early: Exploiting Plan Commitment for Maze Solving"
Similar to the chain of steps paper, the so-called "reasoning" of video models seems to happen early, allowing exploitation of early steps for better problem solving.
Li and Luo et al., "Benchmarking PhD-Level Coding in 3D Geometric Computer Vision"
Am I replaced yet? Not yet! 3D seems still hard. I guess we still have to provide them with tools for now. But for how much longer?