
Posts by Yash Bhalgat

Post image

Excited to announce the 1st Workshop on 3D-LLM/VLA at #CVPR2025! 🚀 @cvprconference.bsky.social

Topics: 3D-VLA models, LLM agents for 3D scene understanding, Robotic control with language.

📢 Call for papers: Deadline – April 20, 2025

🌐 Details: 3d-llm-vla.github.io

#llm #3d #Robotics #ai

1 year ago 6 1 0 0
Post image

Our accessible, beginner-oriented introduction to modern deep RL is now published in Foundations and Trends in Optimization. It is a great entry point to the field if you want to jumpstart into RL!
@bernhard-jaeger.bsky.social
www.nowpublishers.com/article/Deta...
arxiv.org/abs/2312.08365

1 year ago 62 14 2 0

I think a few things will happen soon:
🚀 Scale beyond 8B
🎯 Multi-modal capabilities
⚡️ Faster inference
🔄 Reinforcement learning integration

Exciting to see alternatives to autoregressive models succeeding at scale!

Paper: ml-gsai.github.io/LLaDA-demo/

(8/8)

1 year ago 0 0 0 0
Post image

Results vs LLaMA3 8B:

- Matches/exceeds on most tasks
- Better at math & Chinese tasks
- Strong in-context learning
- Improved dialogue capabilities

(7/8) 🧵

1 year ago 0 0 1 0
Post image

A major result: LLaDA breaks the "reversal curse" that plagues autoregressive models. 🔄

On tasks requiring bidirectional reasoning, it outperforms GPT-4 and maintains consistent performance in both forward/reverse directions.

(6/8) 🧵

1 year ago 0 0 1 0
Post image

For generation, they introduce clever remasking strategies:

- Low-confidence remasking: Remask tokens the model is least sure about

- Semi-autoregressive: Generate in blocks left-to-right while maintaining bidirectional context
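A rough NumPy sketch of the low-confidence variant; the function name, shapes, and the choice of "probability assigned to the drafted token" as the confidence measure are my own illustration, not the paper's code:

```python
import numpy as np

def low_confidence_remask(logits, tokens, num_to_remask, mask_id):
    """Remask the positions the model is least confident about.

    logits: (seq_len, vocab) scores for the current draft
    tokens: (seq_len,) current draft token ids
    Returns a copy of `tokens` with the lowest-confidence
    positions reset to `mask_id` for the next denoising step.
    """
    # Softmax over the vocabulary dimension.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # Confidence = probability assigned to the chosen token.
    conf = probs[np.arange(len(tokens)), tokens]
    # Remask the num_to_remask least confident positions.
    remask_idx = np.argsort(conf)[:num_to_remask]
    out = tokens.copy()
    out[remask_idx] = mask_id
    return out
```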

(5/8) 🧵

1 year ago 0 0 1 0
Post image

Training uses random masking ratio t ∈ [0,1] for each sequence.

The model learns to predict original tokens given partially masked sequences. No causal masking used.

Also enables instruction-conditioned generation with the same technique. No modifications.
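The masking step itself is tiny. A toy sketch of the idea (the helper name and `mask_id` value are made up, not the paper's code):

```python
import numpy as np

def mask_for_training(tokens, mask_id, rng):
    """One forward step: sample t ~ U[0,1] for the sequence,
    then mask each token independently with probability t."""
    t = rng.uniform()                          # per-sequence masking ratio
    keep = rng.uniform(size=len(tokens)) >= t  # survive masking with prob 1-t
    return np.where(keep, tokens, mask_id), t
```

The training target is then simply the original tokens at the masked positions, predicted with full bidirectional attention.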

(4/8) 🧵

1 year ago 0 0 1 0
Post image

💡 Core insight: Generative modeling principles, not autoregression, give LLMs their power.

LLaDA's forward process gradually masks tokens while the reverse process predicts them simultaneously. This enables bidirectional modeling.

(3/8) 🧵

1 year ago 1 0 1 0
Post image

Key highlights:
- Successful scaling of masked diffusion to LLM scale (8B params)
- Masking with variable ratios for forward/reverse process
- Smart remasking strategies for generation, incl. semi-autoregressive
- SOTA on reversal tasks, matching Llama 3 on others

(2/8) 🧵

1 year ago 0 0 1 0
Post image

"LLaDA: Large Language Diffusion Models" Nie et al.

Just read this fascinating paper.

They scale Masked Diffusion Language Models up to 8B params and show that they can match #LLMs (including Llama 3) while solving some key limitations!

Let's dive in... 🧵

(1/8)

#genai

1 year ago 1 1 1 0
Video

Project page: bujiazi.github.io/light-a-vide...
Code: github.com/bcmi/Light-A...

Could be a game-changer for quick video mood/lighting adjustments without complicated VFX pipelines! 🎬

1 year ago 0 0 0 0

The results are pretty good ✨
They can transform regular videos into moody noir scenes, add sunlight streaming through windows, or create cyberpunk neon vibes -- works on everything from portrait videos to car commercials! 🚗

1 year ago 0 0 1 0
Post image

Technical highlights 🔍:
- Consistent Light Attention (CLA) module for stable lighting across frames
- Progressive Light Fusion for smooth temporal transitions
- Works with ANY video diffusion model (AnimateDiff, CogVideoX)
- Zero-shot - no fine-tuning needed!
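I haven't seen the CLA implementation, so here is only a caricature of its goal rather than the actual attention module: stabilize lighting by pulling each frame's lighting features toward their temporal mean. Everything below, including `alpha` and the function name, is invented for illustration:

```python
import numpy as np

def blend_toward_temporal_mean(frame_feats, alpha=0.5):
    """frame_feats: (num_frames, dim) per-frame lighting features.
    alpha=1 keeps frames unchanged; alpha=0 forces identical
    lighting features across all frames."""
    mean_feat = frame_feats.mean(axis=0, keepdims=True)
    # Blend each frame toward the shared temporal mean.
    return alpha * frame_feats + (1 - alpha) * mean_feat
```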

1 year ago 0 0 1 0
Video

New work introduces a training-free method to relight entire videos while maintaining temporal consistency! 📽️🌅

"Light-A-Video: Training-free Video Relighting via Progressive Light Fusion" Zhou et al.

(1/n) 🧵

#genai #ai #research #video

1 year ago 10 2 1 0
RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets

Project page: liuisabella.com/RigAnything/
Code: not available yet

Really excited to try this out once the code is available!

1 year ago 0 0 0 0
Post image

The authors claim the model generalizes well across diverse shapes - from humanoids to marine creatures! And it works with real-world images & arbitrary poses. 🤩

1 year ago 0 0 1 0
Post image

Technical highlights:
- BFS-ordered skeleton sequence representation
- Autoregressive joint prediction with diffusion sampling
- Hybrid attention masking: full self-attention for shape tokens, causal attention for skeleton
- e2e trainable pipeline without clustering/MST ops
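The hybrid masking scheme can be sketched as a boolean attention mask; the layout and token counts here are illustrative, not from the paper (True = attention allowed):

```python
import numpy as np

def hybrid_attention_mask(num_shape, num_skel):
    """Shape tokens get full self-attention among themselves;
    skeleton tokens attend to all shape tokens plus earlier
    skeleton tokens (causal). Returns an (N, N) boolean mask."""
    n = num_shape + num_skel
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_shape, :num_shape] = True   # shape <-> shape: full attention
    mask[num_shape:, :num_shape] = True   # skeleton -> shape: always allowed
    causal = np.tril(np.ones((num_skel, num_skel), dtype=bool))
    mask[num_shape:, num_shape:] = causal # skeleton -> skeleton: causal
    return mask
```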

1 year ago 2 0 1 0
Video

Need to rig 3D models? 🦖

New work from UCSD and Adobe:
"RigAnything: Template-Free Autoregressive Rigging
for Diverse 3D Assets" Liu et al.

tl;dr: reduces rigging time from 2 mins to 2 secs, works on any shape category & doesn't need predefined templates! 🚀

1 year ago 5 0 1 0
Preview
Self attention: Merge Query matrix and Key matrix into a single covariance matrix? · rasbt LLMs-from-scratch · Discussion #517 When computing the context vector in the attention algorithm, three weight matrices were introduced. It was discussed in #454 that the value matrix W_V is not necessary. For the remaining two, query matri...

@sebastianraschka.com this is such an interesting discussion! I haven't tried this myself, but I think this can be analyzed theoretically by looking at the rank of the attention matrix in both cases.

I have posted my thoughts on the discussion here: github.com/rasbt/LLMs-f...
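For context on why rank is the right lens: the pre-softmax scores factor as (X W_Q)(X W_K)^T = X (W_Q W_K^T) X^T, so the two matrices are equivalent to a single merged matrix whose rank is capped at the head dimension. A quick NumPy check (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 4, 5

X = rng.normal(size=(seq, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

# Standard two-matrix attention scores (before scaling/softmax).
scores = (X @ W_Q) @ (X @ W_K).T

# Equivalent single merged matrix: M = W_Q W_K^T, rank(M) <= d_head.
M = W_Q @ W_K.T
scores_merged = X @ M @ X.T

assert np.allclose(scores, scores_merged)
assert np.linalg.matrix_rank(M) == d_head  # low rank: d_head < d_model
```

So a single learned (d_model, d_model) matrix would have to be constrained to rank d_head to stay equivalent, which is the theoretical angle hinted at above.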

1 year ago 1 0 0 0
Latent Radiance Fields with 3D-aware 2D Representations

Interesting how they handle the domain gap between 2D latent space and 3D representations through their three-stage pipeline. The correspondence-aware encoding significantly reduces high-frequency noise while preserving geometry.

Project: latent-radiance-field.github.io/LRF/

1 year ago 1 0 0 0
Post image

Technical approach:
- Correspondence-aware autoencoding to enhance 3D consistency in VAE latent space
- Builds 3D representations from 3D-aware 2D features
- VAE-Radiance Field alignment to bridge domain gap between latent and image space

#nerf #ai #research

1 year ago 3 0 1 0
Video

"Latent Radiance Fields with 3D-aware 2D Representations" Zhou et al., #ICLR2025

tl;dr: Novel framework that integrates 3D awareness into VAE latent space using correspondence-aware encoding, enabling high-quality rendered images with ~50% memory savings.

(1/n) 🧵

1 year ago 2 0 1 0

Project: research.nvidia.com/labs/dir/edg...
Training and inference code available here: github.com/NVlabs/EdgeR...

1 year ago 0 0 0 0
Post image

The architecture uses a lightweight encoder and auto-regressive decoder to compress variable-length meshes into fixed-length codes, enabling point cloud and single-image conditioning.

Their ArAE model controls face count for varying detail while preserving mesh topology.

1 year ago 0 0 1 0
Video

"EdgeRunner" (#ICLR2025) from #Nvidia & PKU introduces an auto-regressive auto-encoder for mesh generation, supporting up to 4000 faces at 512³ resolution. 🀩

Their mesh tokenization algorithm (adapted from EdgeBreaker) achieves ~50% compression (4-5 tokens per face vs 9), making training efficient.
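Back-of-the-envelope on that compression figure, taking the midpoint of the quoted 4-5 tokens per face (the per-face token counts come from the post, not my own measurement):

```python
# Naive mesh tokenization spells out 3 vertices x (x, y, z) per triangle.
naive_tokens_per_face = 9
# EdgeBreaker-style traversal reuses the previous triangle's shared edge,
# so most faces need only the new vertex plus bookkeeping (4-5 tokens).
edgerunner_tokens_per_face = 4.5  # midpoint of the quoted range

compression = 1 - edgerunner_tokens_per_face / naive_tokens_per_face
print(f"{compression:.0%}")  # -> 50%
```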

1 year ago 0 0 1 0
Post image

Technical highlight: They combine 3D latent diffusion with multi-view conditioning for the base shape, then use 2D normal maps for refinement. The results look way cleaner than previous methods.

1 year ago 0 0 0 0
Post image

Their two-stage approach: first generate coarse geometry (5s), then add fine details (20s) using normal-map-based refinement. Smart way to balance speed and quality.

1 year ago 0 0 1 0
Post image

Just came across this fascinating paper "CraftsMan3D" - a practical approach to text/image-to-3D generation that mimics how artists actually work!

Code available (pretrained models too) 🤩: github.com/wyysf-98/Cra...

(1/n) 🧵

1 year ago 1 0 1 0
Post image

Got me excited for a second here 🫠

1 year ago 0 0 0 0
Post image

So, what happened this week in #AI?

1 year ago 1 0 0 0