This very nice paper provides some useful pushback against PRH.
To me science is like a damped pendulum, where we need to swing back and forth a few times before converging on truth.
So don't worry PRH fans, I'll be trying to swing us back out of the cave again soon!
Posts by Phillip Isola
I often have that feeling. Weather can trigger it. A foggy morning can take me right back to childhood.
Yeah we compare to that in Table 4, and also to just pass@50.
Those can work pretty well, and people have shown that before, but sampling/selecting in weight space is still quite a bit higher.
Fundamentally I think perturbing inputs/activations/outputs should also work.
We take the answer with the most votes (so it’s plurality voting really).
Most numbers get zero votes.
The outputs are categorical in all these LLM benchmark tasks, or they are integers.
You can take the output string and extract the numerical/categorical answer using a regex.
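In code, that extract-then-vote step might look something like this (a minimal sketch; the regex and helper names are mine, not from the paper):

```python
import re
from collections import Counter

def extract_answer(output: str):
    """Grab the last integer in a model's output string; this regex is a
    simplified stand-in for whatever extraction convention a benchmark uses."""
    matches = re.findall(r"-?\d+", output)
    return matches[-1] if matches else None

def plurality_vote(outputs):
    """Return the extracted answer with the most votes across ensemble members."""
    answers = [a for a in map(extract_answer, outputs) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

print(plurality_vote(["The answer is 42.", "I get 42", "Final answer: 7"]))  # prints 42
```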
We want to extend to structured output next!
Yeah those are great and we discuss them in the related works.
The big difference is that those posts/papers were on sequential random search (very similar to ES).
Agree, just using the term "scaling law" in what's become the common usage, to refer to an empirical trend relating scale to some measure :)
Maybe we should switch to "Effect of model scale" or something... but I think "scaling law" is actually clearer at this point?
I feel similarly, and use LLMs a ton, except I also think it’s true that they kind of suck you in like a sugary treat, and may leave you sick after. It’s hard not to overuse them. This seems to be true even when you are aware of this. So I respect the choice just not to use them, for now.
Bagging is related but I wouldn’t say it is bagging. There’s no bootstrap in the traditional sense. Those are nice connections though.
Yeah I think that could be really interesting!
Actually we were looking at that back around 2020, where the idea was that with a good enough architecture maybe random guessing (RG) works. The low-rank paper even included RG in some experiments (Fig. 5).
But turns out pretraining effect is much bigger.
Fun project with Yulu Gan, who was bold enough to try an idea that's not supposed to work. Also builds on his previous work with other colleagues in arxiv.org/abs/2509.24372.
For me this is a “broken clock is right twice a day” project. If you have an idea you like, but it's a bad idea for the current time, sometimes you can just stick with it until the world changes and ends up in a regime where your idea begins to work.
8/
RandOpt has a neat property: training is fully parallel.
This means that on infinite parallel compute, wall-clock training time ≈ one inference pass (e.g., seconds).
A cost is that test-time requires K passes (but distillation can help).
7/
This has implications for post-training, and we probe these with a simple method we call RandOpt: sample N parameter perturbations at random, select the top K, and ensemble predictions via majority vote.
This performs similarly to GRPO, ES, etc on standard benchmarks.
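In code, the RandOpt recipe could be sketched roughly like this (function names, hyperparameter values, and the toy demo are illustrative, not the paper's):

```python
import numpy as np

def randopt_predict(base_params, score_fn, predict_fn, x,
                    n=200, k=5, sigma=0.1, seed=0):
    """Sketch of RandOpt: sample n Gaussian perturbations of the pretrained
    weights, keep the top k by training score, and ensemble their
    predictions on input x by plurality vote."""
    rng = np.random.default_rng(seed)
    candidates = [base_params + sigma * rng.standard_normal(base_params.shape)
                  for _ in range(n)]
    top_k = sorted(candidates, key=score_fn, reverse=True)[:k]
    preds = [predict_fn(p, x) for p in top_k]
    vals, counts = np.unique(preds, return_counts=True)
    return vals[np.argmax(counts)]  # plurality vote over the k ensemble members

# Toy demo: the "task" rewards weights near [1, 1]; prediction is a sign test
base = np.zeros(2)
score = lambda p: -np.sum((p - 1.0) ** 2)
predict = lambda p, x: int(p @ x > 0)
result = randopt_predict(base, score, predict, x=np.ones(2))
```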
6/
We call this the *thicket* regime. In this regime, Gaussian weight perturbations have a high chance of improving performance on downstream tasks.
The chance of hitting a “solution” increases with increasing model scale.
So you can’t train a network from scratch with random guessing around the initialization.
*But* things change after pretraining:
Pretraining draws the weights to a region where the surrounding parameter space is in fact abundant with task-improving solutions.
4/
The only downside is it’s hopelessly inefficient, right?
Because parameter space is very high dimensional and good solutions are a tiny fraction of the space. They are like needles in a haystack.
3/
Random guessing (e.g. www.bioinf.jku.at/publications...) is my favorite optimization algorithm.
It works like this: guess a bunch of random parameter vectors, pick the one that works best on the training data.
It’s simple, parallel, few hyperparams, no issue with local minima, …
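A minimal sketch of that loop (the toy objective and names are mine):

```python
import numpy as np

def random_guess(score_fn, dim, n_guesses=1000, scale=1.0, seed=0):
    """Random guessing as an optimizer: sample a bunch of candidate
    parameter vectors, return the one that scores best on the training
    data. Every guess is independent, so this is trivially parallel."""
    rng = np.random.default_rng(seed)
    guesses = scale * rng.standard_normal((n_guesses, dim))
    scores = np.array([score_fn(g) for g in guesses])
    return guesses[np.argmax(scores)]

# Toy objective: negative squared distance to a target parameter vector
target = np.array([0.5, -1.0, 2.0])
best = random_guess(lambda p: -np.sum((p - target) ** 2), dim=3)
```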
2/
Sharing “Neural Thickets”. We find:
In large models, the neighborhood around pretrained weights can become dense with task-improving solutions.
In this regime, post-training can be easy; even random guessing works.
Paper: arxiv.org/abs/2603.12228
Web: thickets.mit.edu
1/
Absolutely. I find myself sometimes longing for the quiet days when we were making no progress, and then I'm like, wait a second, in those days I dreamed about this!
I try to keep in mind Peter Medawar's statement about scientists: "It is, after all, their professional business to solve problems, not merely to grapple with them."
If LLMs make solving problems easier, and remove some grappling, then that's something I think I have to embrace.
I think of PhD as primarily about learning, but yes it's learning by doing. To graduate you need to generate some new knowledge, but that's more like the proof that you are done, rather than the main goal of the process.
I agree. For example, I support bans on certain kinds of military and surveillance use. I'm generally pro regulation of this tech.
Oh maybe this is better put like this:
"Ask not if AI is good or bad, ask what you can do to make it better"
The AI discourse sometimes seems to center on "Is AI good or is it bad?"
I find this framing unproductive. AI is not a fixed thing.
I would prefer to ask "How might we use this technology for good, and mitigate the bad?"
What a shame if the best use we can come up with is no use at all.
I agree, I don't think it's a small difference, or rather this isn't some finicky hyperparameter. Different metrics make different claims about the kinds of structure that are converging vs not.
I think it's actually more than topology that is converging, and also includes local-ish geometry...
The reason was that we were seeing noisy trends with CKA. We didn't realize the extent of the bias but had lots of misgivings about it as a metric. Local similarity (mknn) showed clearer trends and made more sense to us. I think the ARH paper does a nice job justifying this more clearly.
This isn't quite right: the PRH paper used the same measure of local similarity, mknn, as the new paper advocates (in fact we introduced it). The main paper results were entirely using mknn. In the appendix we did compare to CKA.
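For readers unfamiliar with mknn, here's a rough sketch of the idea of a mutual k-nearest-neighbor alignment score; the exact distance metric and normalization used in the papers may differ, so treat the details as assumptions:

```python
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Local-similarity sketch: for each of n inputs represented in two
    spaces A and B, compare its k-NN set in A with its k-NN set in B,
    and average the fractional overlap (1.0 = identical neighborhoods)."""
    def knn_sets(feats):
        # pairwise squared Euclidean distances between the n representations
        d = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
        np.fill_diagonal(d, np.inf)            # exclude self-matches
        return np.argsort(d, axis=1)[:, :k]    # indices of k nearest neighbors
    nn_a, nn_b = knn_sets(feats_a), knn_sets(feats_b)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 8))
print(mutual_knn_alignment(feats, feats, k=10))  # a space vs. itself: prints 1.0
```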
But I agree, lots to refine in how we do rep similarity analysis!
Today we present a new framework for measuring human-like general intelligence in machines: studying how, and how well, they play and learn to play all conceivable human games, compared to humans. We then propose the AI Gamestore, a way to sample from popular human games to evaluate AI models.