
Posts by Tony S.F.

Terry Rockafellar's optimization book, Francis Clarke's book on nonsmooth optimization, and Michel Coste's lecture notes on o-minimal geometry. Special mention: Philip Isola's computer vision book and François Fleuret's little book of deep learning.

2 days ago
On the Role of Batch Size in Stochastic Conditional Gradient Methods We study the role of batch size in stochastic conditional gradient methods under a $μ$-Kurdyka-Łojasiewicz ($μ$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e....

A new paper about how to scale your LLM training as you increase the token budget, based on convergence theory! Lots of empirical experiments validating the assumptions we make. arxiv.org/abs/2603.21191

4 weeks ago

they should add reaction emojis to openreview

1 month ago

looks similar to Saclay this morning

1 month ago

in my experience convex is rarely, if ever, used in day-to-day life. most people who aren't mathematicians seem unsure of the difference between concave and convex to begin with. not apples to apples imo, since increasing is used by everyone pretty regularly, with an agreed-upon meaning.

1 month ago

The point is that for some conferences (NeurIPS, ICML) reviews are published for rejected papers but not for withdrawn papers; I thought this might be the case. I see from your reasoning why it cannot be the explanation.

1 month ago

Does the conference use openreview? Maybe they are avoiding having the bad reviews published by withdrawing?

1 month ago

So when you're doing muon with weight decay to train nanoGPT, you're using frank-wolfe to train a frank-wolfe machine

3 months ago
mHC-lite: You Don't Need 20 Sinkhorn-Knopp Iterations Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks...

easy come easy go? arxiv.org/abs/2601.05732

3 months ago
In the Future All Food Will Be Cooked in a Microwave, and if You Can’t Deal With That Then You Need to Get Out of the Kitchen Update 8/8/2025 – I wrote this the day before a certain post by a popular developer services company. I’ve seen some comments this is a rebuttal – it wasn’t meant to be! But…

I missed this post but it is pure gold. www.colincornaby.me/2025/08/in-t...

4 months ago

Our results are for any algorithm that fits the stochastic conditional gradient framework, which includes Muon notably but also normalized SGD, sign SGD, and others (e.g., greedy coordinate descent, low-rank stuff).
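To make that concrete, here's a toy sketch (function names and shapes are mine, not from the paper): each of these methods is the same conditional gradient-style update, just with a different linear minimization oracle (LMO) over a different norm ball.

```python
import numpy as np

def lmo_l2(g, radius=1.0):
    # Normalized SGD: LMO over the Euclidean ball, argmin_{||s|| <= r} <g, s>
    return -radius * g / (np.linalg.norm(g) + 1e-12)

def lmo_linf(g, radius=1.0):
    # Sign SGD: LMO over the l-infinity ball
    return -radius * np.sign(g)

def lmo_spectral(G, radius=1.0):
    # Muon-style update: LMO over the spectral-norm ball is -r * U V^T
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -radius * U @ Vt

def conditional_gradient_step(x, grad, lmo, lr=0.1):
    # Generic step: move toward the direction the LMO returns
    return x + lr * lmo(grad)
```

Swapping the LMO swaps the algorithm; the convergence analysis only sees the common structure.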

5 months ago

Yep, none of this is affecting the loss - these regularizers are being added to the computation of the update to your parameters to better model the loss geometry, but they do not affect the loss you want to minimize (ignoring weight decay, which *does* transform unconstrained->constrained).
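A toy sketch of what I mean (the objective and names here are made up for illustration): the regularizer enters only the computation of the update, while the loss being minimized is untouched; decoupled weight decay instead shrinks the iterate itself.

```python
import numpy as np

def loss(x):
    # The objective we actually want to minimize: no regularizer in here
    return 0.5 * np.sum(x ** 2 - np.cos(x))

def grad(x):
    return x + 0.5 * np.sin(x)

def step_with_update_regularizer(x, lr=0.1, eps=1e-8):
    # Normalizing the update models the loss geometry,
    # but loss(x) itself is unchanged
    g = grad(x)
    return x - lr * g / (np.linalg.norm(g) + eps)

def step_with_decoupled_weight_decay(x, lr=0.1, wd=0.01):
    # Decoupled weight decay shrinks the iterate directly, which is what
    # effectively turns the unconstrained problem into a constrained one
    g = grad(x)
    return (1 - lr * wd) * x - lr * g
```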

5 months ago

if we ignore the fact that muon is doing adam on some parameters and just focus on the spectral update (that's what you compute with Newton-Schulz), then it's a special case of Scion (which means you constrain the update to be in the spectral ball, blue in the picture).
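For concreteness, a toy version of that orthogonalization step (this is the plain cubic Newton-Schulz iteration, not Muon's tuned coefficients):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=15):
    # Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, which drives
    # every singular value toward 1, converging to U V^T where G = U S V^T.
    X = G / np.linalg.norm(G)  # Frobenius normalization keeps spectral norm <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

The output is the LMO direction over the spectral ball (up to sign and radius), computed without an explicit SVD.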

5 months ago

I heard that it's easier to get an h100 on Jean Zay than an a100, kind of funny. The hour multiplier for consumption (i.e. one h100 hour costs 4 credits) should take into account demand.

5 months ago

Come check out our #ICCV2025 poster for "Multi-modal Identity Extraction" at Exhibit Hall I, #73.

5 months ago

www.arxiv.org/abs/2508.09628

more evidence that frank-wolfe is all you need

5 months ago

you can improve your collaborators' writing clarity by being too dumb to fill in the gaps of what they've written, and arguing it must be wrong until they write it clearly enough that even you can understand.

6 months ago

Not all DC algorithms, I should say, but CCCP is equivalent to Frank-Wolfe: proceedings.neurips.cc/paper_files/...
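The special case I have in mind (as I recall it; the paper's general statement goes through the epigraph of $g$):

```latex
% DC program: minimize g(x) - h(x), with g, h convex.
% CCCP linearizes the concave part:
x_{k+1} \in \operatorname*{argmin}_x \; g(x) - \langle \nabla h(x_k), x \rangle
% When g = \iota_C is the indicator of a compact convex set C, this subproblem
% is exactly the Frank-Wolfe LMO for f = -h, taken with step size 1:
x_{k+1} \in \operatorname*{argmin}_{x \in C} \; \langle \nabla (-h)(x_k), x \rangle
```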

6 months ago

Yeah, in this case it does change the step size (and therefore the dynamics) even if one assumption implies the other (this is what my collaborators told me when we were first writing our paper). I look forward to learning more about what these guys have done and how much of a difference it makes.

6 months ago

I started to read this paper arxiv.org/abs/2510.17503 and I thought huh the analysis is so much like Frank-Wolfe, then I remembered that Frank-Wolfe and DC algorithms are dual. Probably, a Frank-Wolfe god like Jaggi knows that but it's not mentioned in the paper; I must be missing something simple.

6 months ago

In our (L0, L1)-smoothness work I kept lamenting that (L0, L1)-smoothness on a compact set (like in Frank-Wolfe) implies L-smoothness, so it's kind of a pointless assumption. But, if you did the math to derive the short-step, it would give a new, slightly tweaked step size. These guys did exactly that.
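The little computation behind that lament, sketched out:

```latex
% (L_0, L_1)-smoothness: \|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|.
% On a compact set C (e.g. the Frank-Wolfe constraint set), continuity gives
% G := \max_{x \in C} \|\nabla f(x)\| < \infty, hence
\|\nabla^2 f(x)\| \le L_0 + L_1 G =: L \quad \text{for all } x \in C,
% i.e. f is plain L-smooth on C. Still, plugging L_0 + L_1 \|\nabla f(x_k)\|
% into the short-step formula yields a slightly different, adaptive step size.
```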

6 months ago

Have you ever written a paper, seen a small variation you could easily cover with your analysis, but not done it? Even though you know that if someone else did it right after, you would be upset you didn't include it? It happened to me again today! arxiv.org/abs/2510.16468

6 months ago

Frankenstein by Shelley

6 months ago

reminds me of "everyone steals, but i have taste!"

6 months ago

Abbas Khademi, Antonio Silveti-Falls
Adaptive Conditional Gradient Descent
https://arxiv.org/abs/2510.11440

6 months ago

Straight to the top of the "to read" list: arxiv.org/pdf/2510.09034

6 months ago

Now accepted at #NeurIPS2025 :)

6 months ago

In conditional gradient sliding you use the conditional gradient algorithm to "chase" the projected Nesterov algorithm: instead of computing the projection exactly, you do some conditional gradient steps to approximate it. I wonder if you can do the same with FISTA/the accelerated proximal point algorithm?
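A toy sketch of that inner loop (the simplex example is mine): approximate the projection, i.e. the minimizer of 0.5||x - y||^2 over C, with a few Frank-Wolfe steps, each needing only an LMO rather than an exact projection.

```python
import numpy as np

def lmo_simplex(g):
    # Linear minimization over the probability simplex returns a vertex
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

def approx_projection_fw(y, steps=100):
    # Approximate proj_C(y) = argmin_x 0.5 ||x - y||^2 over the simplex C
    # using Frank-Wolfe; only the LMO is ever called.
    x = np.full_like(y, 1.0 / len(y))
    for k in range(steps):
        g = x - y                    # gradient of 0.5 ||x - y||^2
        s = lmo_simplex(g)
        gamma = 2.0 / (k + 2.0)      # standard Frank-Wolfe step size
        x = (1 - gamma) * x + gamma * s
    return x
```

More inner steps buy a better approximation of the projection, which is exactly the trade-off conditional gradient sliding manages.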

6 months ago
Little’s Law and Conference Reviewing: the Queueing Perspective TL;DR: This is the queueing model perspective of the “paper pool” conference reviewing model with math and numbers based on Little’s Law. Think of it as supplementary material to the post on David’s b...

Your wish is granted (by Sebastian Pokutta) www.pokutta.com/blog/littles...
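For anyone who hasn't seen it, the law itself is one line (numbers below are made up for illustration, not from the post):

```latex
% Little's Law: average number in system = arrival rate x average time in system
L = \lambda W
% e.g. if a venue receives \lambda = 500 submissions per week and a paper spends
% W = 6 weeks in review on average, the pool holds
L = 500 \times 6 = 3000 \text{ papers in review at any moment.}
```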

6 months ago

nerd sniped by the bayesian learning rule again and still unsatisfied... ok, so you can explain a lot of DL optimization algorithms with certain approximations of various posteriors but that's kind of kicking the can down the road - the question becomes: why those approximations instead of others?

7 months ago