
Posts by Mario

I think many of them are actually quite good artistically

2 months ago 1 0 1 0
Post image

Ruining great art with the nano banana pro command "Make this much more cheerful with as few changes as possible"

5 months ago 182 29 11 8
The Principles of Diffusion Models This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffu...

"The Principles of Diffusion Models" by Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon. arxiv.org/abs/2510.21890
It might not be the easiest intro to diffusion models, but this monograph is an amazing deep dive into the math behind them and all the nuances

5 months ago 37 13 1 1
Post image

The main ingredient that led to GRPO's performance leap is the calibration of the reward/value via multiple rollouts per prompt.

Let me elaborate on what I mean by that and a cheaper way of doing it offline.
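A minimal sketch of what "calibrating the reward via multiple rollouts per prompt" looks like: sample several completions per prompt, then normalize each reward against its group's mean and standard deviation. Function and variable names here are illustrative, not from any particular GRPO codebase.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean/std of the other rollouts for the same prompt.

    rewards: array of shape (num_prompts, rollouts_per_prompt)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)  # per-prompt baseline
    std = rewards.std(axis=1, keepdims=True)    # per-prompt scale
    return (rewards - mean) / (std + eps)

# Two prompts, four rollouts each
adv = group_relative_advantages([[1.0, 0.0, 0.0, 1.0],
                                 [5.0, 5.0, 5.0, 5.0]])
# Each row now has zero mean; a constant-reward group gets ~zero advantage.
```

The per-prompt baseline is what replaces a learned value function: the group itself calibrates how good each rollout is relative to the others.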

8 months ago 31 7 2 4
Notes on REINFORCE Deep dive into the first policy gradient method.

To understand PPO, GRPO, or any policy-gradient algorithm, you first need to understand REINFORCE. I’ve written my notes here.
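For a feel of what REINFORCE boils down to, here is a toy sketch on a two-armed bandit with a softmax policy: sample an action, then move the logits along reward-weighted grad log pi. This is an illustrative example, not code from the linked notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Two-armed bandit: arm 1 pays 1.0, arm 0 pays 0.0.
theta = np.zeros(2)  # policy logits
lr = 0.1

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)          # sample action from pi(.|theta)
    reward = 1.0 if a == 1 else 0.0
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * reward * grad_log_pi  # ascend E[R]

probs = softmax(theta)  # probability of the rewarding arm approaches 1
```

PPO and GRPO keep this same reward-weighted grad-log-prob core; they differ in how the reward is turned into an advantage and how updates are clipped.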

8 months ago 0 0 0 0
But how do AI videos actually work? | Guest video by @WelchLabsVideo (YouTube video by 3Blue1Brown)

New video on the details of diffusion models: youtu.be/iv-5mZ_9CPY

Produced by Welch Labs, this is the first in a short series on 3b1b this summer. I enjoyed providing editorial feedback throughout the last several months, and couldn't be happier with the result.

8 months ago 145 13 2 3
When asked for its opinions on Israel-Palestine, Grok 4 first searches to see what Elon Musk thinks

Well, a single week was enough to make a convincing case that a Wikipedia equivalent for LLMs is necessary, i.e., decentralized LLM training and serving

9 months ago 102 17 4 5
A vision researcher’s guide to some RL stuff: PPO & GRPO

This explanation of PPO and GRPO is SUPER clear.

9 months ago 0 0 0 0
Post image

Diffusion models have analytical solutions, but they involve sums over the entire training set, and they don't generalise at all. They are mainly useful to help us understand how practical diffusion models generalise.

Nice blog + code by Raymond Fan: rfangit.github.io/blog/2025/op...
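The "analytical solution" is the optimal denoiser for a finite dataset: the posterior mean of the clean data given a noisy sample is a softmax-weighted average over the entire training set. A small sketch under a simple additive-noise assumption (names illustrative, not from the linked blog):

```python
import numpy as np

def optimal_denoiser(x_noisy, dataset, sigma):
    """Closed-form E[x0 | x_noisy] when the data distribution is exactly
    the empirical distribution over `dataset` and x_noisy = x0 + sigma * noise.

    dataset: (N, D) array of training points; x_noisy: (D,) observation.
    """
    # Log Gaussian likelihood of x_noisy under each training point
    d2 = ((dataset - x_noisy) ** 2).sum(axis=1)
    logw = -d2 / (2 * sigma**2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Posterior mean = weighted average of the whole training set
    return w @ dataset

data = np.array([[0.0, 0.0], [1.0, 1.0]])
x_hat = optimal_denoiser(np.array([0.9, 0.9]), data, sigma=0.1)
# With small sigma the denoiser snaps to the nearest training point.
```

This makes the non-generalisation point concrete: the output is always a convex combination of training points, so the "ideal" model can only reproduce the training set.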

9 months ago 34 3 2 1
RLHF Book by Nathan Lambert The Reinforcement Learning from Human Feedback Book

Sharing what I like in case you do too. Great book, @natolambert.bsky.social. It’s exactly what I was looking for. rlhfbook.com

9 months ago 5 1 0 0

Well, the last thing I said can be done today too, so not a great benefit, maybe something else

9 months ago 0 0 0 0

Yeah, I agree, it’s more out of curiosity to see if interpolation would be ≥, where my guess is that it would be at least on par. Like you said, it might be interesting for some interpretability study, e.g., examining how the AdaLN parameters vary with the interpolation coefficient

9 months ago 0 0 1 0

Interesting that you’re also not seeing a clear reason why this should fail!
Also, thanks for the great links! 2/2

9 months ago 0 0 0 0

Yes, this is exactly my idea: Fourier embeddings implicitly make assumptions about which timesteps matter more, and I suspect that makes them harder for downstream transforms (AdaLN etc.) to use, compared to simple linear interpolation, considering how these embeddings are later used. 1/2

9 months ago 0 0 2 0

But since the range is fixed, wouldn’t interpolating two (e.g., 256-d) random (or not) vectors work just as well, or even better, since it doesn’t bias any specific timestep (while standard PE does)? I don’t have an intuition for why this would be wrong. 2/2

9 months ago 0 0 1 0

Thanks @sedielem.bsky.social! I remember this reason for PE. What I’m unsure about is whether it still applies when encoding a timestep as 0-1. In the Flux code it’s first mapped to 0–1000, so I guess your point about needing enough granularity holds. 1/2

9 months ago 0 0 1 0

Any intuition on using max_period=10k for the sinusoidal timestep PE in, e.g., Flux? Since t enters via AdaLN and its range is fixed, would linearly interpolating two high-dimensional vectors work too? Does max_period=10k make it easier to distinguish small t and harder for large t? Maybe @sedielem.bsky.social @stefanabaumann.bsky.social
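For reference, this is the standard DDPM-style sinusoidal timestep embedding the thread is about; the 0-to-1000 rescaling of t is my reading of the Flux code, treated here as an assumption.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of scalar timestep t into `dim` dimensions.
    Frequencies are log-spaced from 1 down to ~1/max_period, so
    high-frequency channels resolve small changes in t while
    low-frequency channels keep large t values distinguishable."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

# Flux-style usage: t in [0, 1] is scaled to [0, 1000] before embedding
emb = timestep_embedding(0.5 * 1000, dim=256)
```

The thread's proposed alternative would simply be `(1 - t) * v0 + t * v1` for two fixed high-dimensional vectors `v0`, `v1`.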

10 months ago 0 0 1 0

No, in this country we don’t tolerate senators being forced to the ground and cuffed for asking questions. We don’t tolerate masked agents rounding people up on the streets and disappearing them in unmarked vans. And we sure as hell don’t tolerate

10 months ago 542 48 8 1
Post image

It is critical for scientific integrity that we trust our measure of progress.

The @lmarena.bsky.social has become the go-to evaluation for AI progress.

Our release today demonstrates the difficulty in maintaining fair evaluations on the Arena, despite best intentions.

11 months ago 42 9 3 4

Black swans are real

11 months ago 0 0 0 0
First Follower: Leadership Lessons from Dancing Guy (YouTube video by Derek Sivers)

Harvard is Dancing Guy. Who is the first follower?

m.youtube.com/watch?v=fW8a...

11 months ago 23 5 0 1
Video

Workers are not asking to get rich. They just want to afford three meals a day.

In the richest country in the history of the world, no one should work for starvation wages.

It is time to raise the disgraceful $7.25/hr federal minimum wage to a living wage of AT LEAST $17/hr.

1 year ago 10400 2203 309 107
Reinforcement Learning from Human Feedback Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle…

rlhfbook also available on arxiv for SEO 😀 happy friday
arxiv.org/abs/2504.12501

1 year ago 69 13 3 4

Maybe the tradeoff is that the adversarial loss pushes toward the prototype rather than the input, making the image more "realistic," while the perceptual loss keeps it close to the input, with less perceptual distortion. If the input is atypical, the two losses might pull in different directions and balance each other
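A toy 1-D picture of that balance, with all numbers illustrative: the perceptual term anchors the output to the input, the adversarial term pulls toward a "typical" image, and the minimizer of the weighted sum lands between them.

```python
import numpy as np

x_input = 0.2      # an atypical input
x_prototype = 1.0  # what the adversarial loss treats as "realistic"

def total_loss(x, w_adv=0.5):
    perc = (x - x_input) ** 2       # perceptual: stay close to the input
    adv = (x - x_prototype) ** 2    # adversarial: move toward the prototype
    return perc + w_adv * adv

# The minimizer sits between the two targets, weighted by w_adv:
xs = np.linspace(-1, 2, 2001)
x_star = xs[np.argmin(total_loss(xs))]
# Analytically: x* = (x_input + w_adv * x_prototype) / (1 + w_adv) ~ 0.467
```

With quadratic stand-ins the equilibrium is just a weighted average; real perceptual and GAN losses are non-convex, but the directional tension is the same.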

1 year ago 2 0 0 0
Generative modelling in latent space Latent representations for generative models.

New blog post: let's talk about latents!
sander.ai/2025/04/15/l...

1 year ago 74 18 3 5

Human heads are out of fashion with the new regime

1 year ago 0 0 0 0
Post image

🚨 New preprint!
How far can we go with ImageNet for Text-to-Image generation? w. @arrijitghosh.bsky.social @lucasdegeorge.bsky.social @nicolasdufour.bsky.social @vickykalogeiton.bsky.social
TL;DR: Train a text-to-image model using 1000× less data in 200 GPU hrs!

📜 https://arxiv.org/abs/2502.21318
🧵👇

1 year ago 66 16 2 7

Some sentences are beautiful, some are not (perhaps because they lack context) – but have we truly read the book? Sometimes I wonder if I’m really savoring beautiful things, or if there are simply too many to pause and fully appreciate the depth of their beauty. 2/2

1 year ago 0 0 0 0

I think the problem with visual art is that you see it immediately. Then you move on to the next piece, and once again, you’ve taken it in within seconds. It’s like opening a book, reading a few sentences here and there, and then closing it. 1/2

1 year ago 0 0 1 0
How has DeepSeek improved the Transformer architecture? This Gradient Updates issue goes over the major changes that went into DeepSeek’s most recent model.

Very good (technical) explainer answering "How has DeepSeek improved the Transformer architecture?". Aimed at readers already familiar with Transformers.

epoch.ai/gradient-upd...

1 year ago 279 64 6 5