
Posts by Nicolas Dufour


Excited to share my work as a Student Researcher at Google Zurich: UniGeoCLIP! 🌍🚀

W/ Eduard Trulls, Jan Hosang, @loicland.bsky.social
& @pesarlin.bsky.social, we built a framework aligning 5 geospatial modalities in one space.

Presented at EarthVision @ #CVPR2026. 🧵👇

5 days ago
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

New paper: Back into Plato’s Cave

Are vision and language models converging to the same representation of reality? The Platonic Representation Hypothesis says yes. BUT we find the evidence for this is more fragile than it looks.

Project page: akoepke.github.io/cave_umwelten/

1/9

4 days ago

Check out our recent work, where we need only web images to learn a novel-view generation model! We can navigate inside any image, without any video/multi-view data or prior models!

Congrats to Adrien for this great first PhD paper!
(with @davidpicard.eurosky.social and @ptrkprz.bsky.social)

4 days ago
OVIE: One View Is Enough!

I thought I would do a thread, but honestly the post is so good: kyutai.org/blog/2026-04...

It explains "One View Is Enough! Monocular Training for In-the-Wild Novel View Generation" arxiv.org/abs/2603.23488, done in collaboration with the smart people at Kyutai.

5 days ago
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a comp...

🚨 arxiv.org/abs/2604.06129

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

This paper is the result of a lab-wide hackathon on an idea I'd had for some time. Probably the paper with the most authors I've ever written.

It's in CVPR Findings '26.

Thread 🧵👇
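For intuition, here is a generic sketch of what linear-complexity token mixing can look like. This is my toy illustration of the general idea (expand tokens into polynomial features, pool them into one compact state in a single pass, broadcast it back), NOT the actual PoM equations; see the paper for those.

```python
import numpy as np

# Generic linear-complexity token mixing sketch (hypothetical, not PoM's
# actual formulation): each token is expanded into fixed-degree
# polynomial features, the features are pooled into one compact state in
# a single O(n) pass, and that state is broadcast back to every token.
# Self-attention would pay O(n^2) for the same all-to-all interaction.

def linear_token_mixer(x, degree=2):
    n, d = x.shape
    # per-token polynomial features: [x, x^2, ..., x^degree]
    feats = np.concatenate([x ** p for p in range(1, degree + 1)], axis=-1)
    state = feats.mean(axis=0)        # one compact summary of all tokens
    return x + np.tanh(state[:d])     # broadcast mixed state back
```

The key property is that the aggregation touches each token once, so the cost grows linearly with sequence length.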

1 week ago
CVPR@Paris 2026 June 1st

🚨 Happy to announce CVPR@Paris'26 which will take place on June 1st in Paris. The goal of the event is to share a little bit of the conference before it happens. We will have poster sessions as well as several plenary talks by world-class speakers.

info: cvprinparis.github.io/CVPR2026InPa...

2 weeks ago
One View Is Enough! Monocular Training for In-the-Wild Novel View Generation Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, ...

arxiv.org/abs/2603.23488
πŸ‘€ I'll make a detailed thread later

3 weeks ago

I was commenting on that number on Slack with @nicolasdufour.bsky.social and just realized that if you add the 16k active submissions at CVPR, even considering a sizeable overlap between the two, there are currently well over 30k papers in review.

That's nuts

2 months ago

Sadly, I don't think DroPE will work for images/videos.
Both NoPE and DroPE rely on the causal mask to leak absolute PE: the number of tokens visible to each query gets leaked, because the model can encode a bias that grows with that count.
So not a fix for images yet =(
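A tiny numpy sketch of that leak (my toy construction, not the papers' exact argument): with a causal mask and no positional encoding, attention can still recover absolute position, because token i averages over exactly i+1 tokens.

```python
import numpy as np

# Toy illustration (hypothetical construction): let a distinguished BOS
# token carry value 1 and all other tokens value 0. Under uniform causal
# attention with NO positional encoding, token i receives exactly
# 1/(i+1): a bias that shrinks with the number of visible tokens,
# i.e. with absolute position.

n = 6
v = np.zeros(n)
v[0] = 1.0                                      # BOS carries the signal

mask = np.triu(np.full((n, n), -np.inf), k=1)   # causal mask
attn = np.exp(np.zeros((n, n)) + mask)          # uniform logits
attn /= attn.sum(axis=-1, keepdims=True)        # row i: weight 1/(i+1)

out = attn @ v
print(out)  # [1, 1/2, 1/3, 1/4, 1/5, 1/6] -> position is decodable
```

With bidirectional attention over images, every token sees the same count, so this channel disappears.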

3 months ago

It was a big pleasure to be in Nicolas's committee. Congratulations to Nicolas for the great work, and congratulations to the advisors too!

4 months ago

Apparently some people reported knowing about the bug before November 11th, so even before the reviews were released.

4 months ago

Yesterday, @nicolasdufour.bsky.social defended his PhD. I really enjoyed the years of collaboration w/ @vickykalogeiton.bsky.social (& @loicland.bsky.social)

Video: youtube.com/live/DXQ7FZA...

Big thanks to the jury @dlarlus.bsky.social @ptrkprz.bsky.social @gtolias.bsky.social A. Efros & T. Karras

4 months ago

Congrats, Nicolas! On the PhD and on those beautifully crafted slides 🤩

4 months ago

Nicolas ( @nicolasdufour.bsky.social ) is defending his PhD right now.

I was so in awe of the presentation that I even forgot to take pictures 😅

4 months ago

Yes, it's in latent space just because I had my setup that way. Might try pixel space in the future.

5 months ago

Yes, it's the raw prediction; we predict the velocity directly.

5 months ago

It's also very domain-dependent. I know that, for example, x-prediction works better than epsilon-prediction for human motion generation.

5 months ago

The epsilon loss was used for image generation for a while, starting with DDPM.
More recently, flow matching (or the v-loss) has been the norm, basically since SD3.
In my experience, flow matching doesn't really improve quality, but sampling in fewer steps works better than with epsilon prediction.
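For concreteness, a small numpy sketch of the two regression targets, assuming the rectified-flow convention x_t = (1 - t) * x0 + t * noise. This is my illustration of the standard setups, not any specific codebase; conventions vary across papers.

```python
import numpy as np

# Sketch (hypothetical helper) of the two training targets: epsilon
# prediction regresses the noise, flow matching / v-loss regresses the
# velocity dx_t/dt, under x_t = (1 - t) * x0 + t * noise.

def eps_and_fm_targets(x0, noise, t):
    x_t = (1 - t) * x0 + t * noise
    eps_target = noise          # epsilon prediction: regress the noise
    v_target = noise - x0       # velocity: d/dt [(1 - t) x0 + t * eps]
    return x_t, eps_target, v_target
```

Both are trained with a plain MSE against the network output; only the regression target (and hence the sampler) changes.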

5 months ago
Don't drop your samples! Coherence-aware training benefits Conditional diffusion Conditional diffusion models are powerful generative models that can leverage various types of conditional information, such as class labels, segmentation masks, or text captions. However, in many rea...

Thanks for the pointer! We were doing something similar in "Don't drop your samples" (arxiv.org/abs/2405.20324)

MIRO is quite different in the sense that we focus on improving pretraining (not finetuning). Also, we explore the advantages of having multiple rewards to push the Pareto frontier.

5 months ago

Yes, thanks for pointing it out, will try to clarify

5 months ago

Check out our new work: MIRO

No more post-training alignment!
We integrate human alignment right from the start, during pretraining!

Results:
✨ 19x faster convergence ⚡
✨ 370x less compute 💻

🔗 Explore the project: nicolas-dufour.github.io/miro/

5 months ago

Image generation becomes much more energy efficient. 👍

5 months ago

I'm super happy about Nicolas' latest work, probably the magnum opus of his PhD.

Read the thread for all the great details.
The main conclusion I draw from this work is that better pretraining, in particular by conditioning on better data, allows us to train SOTA models at a fraction of the cost.

5 months ago

Work with @lucasdegeorge.bsky.social @arrijitghosh.bsky.social @vickykalogeiton.bsky.social and @davidpicard.bsky.social.

This will be the last work of my PhD, as I will be defending on the 26th of November!

5 months ago
MIRO: Multi-Reward Conditioning for Efficient Text-to-Image Generation Train once, align many rewards. MIRO achieves 19× faster convergence and 370× less compute than FLUX while reaching a GenEval score of 75. Controllable trade-offs at inference time.

MIRO demonstrates that aligning T2I models during pretraining is not only viable but superior: it's faster, more compute-efficient, and provides fine-grained, interpretable control.

Project page for all the details: nicolas-dufour.github.io/miro

5 months ago

The explicit reward conditioning allows for flexible trade-offs, like optimizing for GenEval by reducing the aesthetic weight in the prompt. We can also isolate the look of a specific reward, or interpolate rewards via multi-reward classifier-free guidance.
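A rough numpy sketch of what such multi-reward guidance can look like, based on my reading of the post: extend standard classifier-free guidance with one conditional branch per reward, each with its own weight. `denoise` is a placeholder for the model; MIRO's exact formulation may differ.

```python
import numpy as np

# Hypothetical multi-reward classifier-free guidance sketch:
#   guided = uncond + sum_k w_k * (cond_k - uncond)
# Each reward condition gets its own guidance weight, so the weights
# trade off rewards (e.g. aesthetics vs. prompt alignment) at inference.

def multi_reward_cfg(denoise, x_t, reward_conds, weights):
    uncond = denoise(x_t, None)           # unconditional prediction
    guided = uncond.copy()
    for cond, w in zip(reward_conds, weights):
        guided = guided + w * (denoise(x_t, cond) - uncond)
    return guided
```

Setting one weight to zero isolates the remaining rewards; varying the weights interpolates between them.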

5 months ago

MIRO excels on challenging compositional tasks (GenEval here).

The multi-reward conditioning fosters better understanding of complex spatial relationships and object interactions.

5 months ago

Despite being a compact model (0.36B parameters), MIRO achieves state-of-the-art results:

A GenEval score of 75, outperforming the 12B FLUX-dev (67) at 370x less inference cost.
Conditioning on rich reward signals is a highly effective way to get large-model capabilities in a compact form!

5 months ago

MIRO dramatically improves sample efficiency for test-time scaling.

On PickScore, MIRO needs just 4 samples to match the baseline's 128 samples (a 32x efficiency gain).
For ImageReward, it's a 16x efficiency gain.

This demonstrates superior inference-time efficiency for high-quality generation.
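The test-time scaling being compared is best-of-n sampling, which is simple to state. A hedged sketch, with placeholder `generate` and `reward` functions (not MIRO's actual API): draw n samples, score each with a reward model, keep the best.

```python
import numpy as np

# Best-of-n test-time scaling sketch: sample n candidates, score each
# with a reward model, return the argmax. Matching with 4 samples what a
# baseline needs 128 samples for is the 32x efficiency gain cited above.

def best_of_n(generate, reward, n, rng):
    samples = [generate(rng) for _ in range(n)]
    scores = [reward(s) for s in samples]
    return samples[int(np.argmax(scores))]
```

The cost of this procedure is n full generations plus n reward evaluations, which is why needing fewer samples translates directly into inference savings.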

5 months ago

Traditional single-objective optimization often leads to reward hacking. MIRO's multi-dimensional conditioning naturally prevents this by requiring the model to balance multiple objectives simultaneously. This produces balanced, robust performance across all metrics, unlike single-reward optimization.

5 months ago