Having cross-entropy as a default is great, and is really nice for unlabeled data since it all tends to fall out pretty nicely w.r.t. the learning process, but it is inherently (and necessarily) a much more expensive way to learn a target distribution of values.
Posts by Fern
Having (a good set of) crowdsourced values for a KL divergence would reduce this variance a bit, and also would give a better value to measure against, due to not being as noisy (in both bias _and_ variance -- a bit of a messy combo to deal with).
When aiming for 94% accuracy (~6% error rate), this means that ~9% of the remaining errors come from "bad" labels, from a cross-entropy perspective.
This is quite a lot! And it's partly what made testing speedrun results more difficult.
Variance can be a problem in testing models, and it lengthens the iterative research cycle since you need to run more experiments to get a clean signal.
One paper that covered this, arxiv.org/abs/2103.14749, estimated the CIFAR-10 test set label error rate at about 0.54%.
A multiple choice question with an apparently incorrect answer chosen as the correct answer. Img source: @whybyfire.bsky.social
This is a classic example of _why_ choose-one-of-n datasets need large-scale, crowd-sourced statistics and should use the KL divergence instead of cross-entropy.
Reviewers will be more biased than a crowd; a single-reviewer label is a high-variance, high-bias estimator, and that can harm research.
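To make the distinction concrete, here's a minimal sketch (all values hypothetical) of the difference between scoring against a one-hot label with cross-entropy and scoring against a crowd-sourced soft label with KL divergence -- the KL goes to zero when the model matches the crowd distribution, while the one-hot cross-entropy penalizes it anyway:

```python
import math

def cross_entropy(pred, target):
    """Cross-entropy of predicted probabilities against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def kl_divergence(target, pred):
    """KL(target || pred) -- exactly zero when pred matches target."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

# Model prediction over 4 answer choices (made-up numbers).
pred = [0.70, 0.20, 0.05, 0.05]

# One-hot label: even a prediction matching the crowd consensus is penalized.
one_hot = [1.0, 0.0, 0.0, 0.0]

# Crowd-sourced soft label: 70% of annotators picked A, 20% picked B, etc.
crowd = [0.70, 0.20, 0.05, 0.05]

print(cross_entropy(pred, one_hot))  # -log(0.70), nonzero despite a sane pred
print(kl_divergence(crowd, pred))    # 0.0 -- pred matches the crowd exactly
```

The point being: with soft targets, a model that has learned the true answer distribution is no longer "punished" for the irreducible label disagreement.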
Transcript of Hard Fork ep 111: Yeah. And I could talk for an hour about transformers and why they are so important. But I think it's important to say that they were inspired by the alien language in the film Arrival, which had just recently come out. And a group of researchers at Google, one researcher in particular, who was part of that original team, was inspired by watching Arrival and seeing that the aliens in the movie had this language which represented entire sentences with a single symbol. And they thought, hey, what if we did that inside of a neural network? So rather than processing all of the inputs that you would give to one of these systems one word at a time, you could have this thing called an attention mechanism, which paid attention to all of it simultaneously. That would allow you to process much more information much faster. And that insight sparked the creation of the transformer, which led to all the stuff we see in AI today.
Did you know that attention across the whole input span was inspired by the time-negating alien language in Arrival? Crazy anecdote from the latest Hard Fork podcast (by @kevinroose.com and @caseynewton.bsky.social). HT nwbrownboi on Threads for the lead.
i love science
Okay, that is definitely way too aggressive. Hopefully it's not like that long-term -- my hope is that pushback against overzealous moderation will change their stance on things like this. It should not be an auto-ban.
Yeah that's why he checks it twice, gotta be something like an approximate Radix sort followed by an insertion sort, I'd guess. P efficient maybe?
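For the curious, a toy sketch (purely illustrative, names invented) of that two-pass scheme: a coarse radix-style bucketing pass that leaves the list approximately sorted, finished off with an insertion sort, which runs fast on nearly-sorted input -- hence checking it twice:

```python
def approx_radix_then_insertion(names):
    # Pass 1: coarse radix-style bucketing on the first letter only --
    # cheap, and leaves the list approximately sorted.
    buckets = {}
    for name in names:
        buckets.setdefault(name[0].lower(), []).append(name)
    nearly = [n for key in sorted(buckets) for n in buckets[key]]

    # Pass 2 ("checking it twice"): insertion sort, near-linear time
    # when the input is already almost in order.
    for i in range(1, len(nearly)):
        item = nearly[i]
        j = i - 1
        while j >= 0 and nearly[j].lower() > item.lower():
            nearly[j + 1] = nearly[j]
            j -= 1
        nearly[j + 1] = item
    return nearly

print(approx_radix_then_insertion(["Rudolph", "dasher", "Dancer", "Comet", "cupid"]))
```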
it's crazy to me that RoPE's issue with BF16 wasn't noticed earlier.
For a reasonable N of 2048, these are the computed frequencies prior to cos(x) & sin(x) for fp32 above and bf16 below.
Given how short the period is of simple trig functions, this difference is catastrophic for large values.
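A rough illustration of why: bf16 keeps only ~8 mantissa bits, so a large position-times-frequency product rounds off by whole radians, and with sin/cos having period 2π, the rotation ends up somewhere else entirely. This sketch simulates bf16 rounding in pure Python by truncating the fp32 bit pattern (an approximation of the real hardware behavior, not the actual kernel):

```python
import math
import struct

def to_bf16(x):
    """Round a float to bfloat16 precision by keeping the top 16 bits of the
    fp32 representation (round-half-up, close enough for illustration)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# RoPE-style angle: theta = position * inv_freq, then fed into cos/sin.
inv_freq = 1.0  # the highest-frequency channel
p = 2047        # a late position in an N=2048 context

theta_fp32 = p * inv_freq
theta_bf16 = to_bf16(to_bf16(p) * to_bf16(inv_freq))

# 2047 is not representable in bf16 -- it rounds to 2048, a full radian off.
print(theta_fp32, theta_bf16)
print(math.cos(theta_fp32), math.cos(theta_bf16))
```

An error of a radian or more in the angle means the resulting rotation is essentially uncorrelated with the intended one, which is why computing the frequencies in fp32 (and only casting afterwards) matters.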
Having flexattention w/ DDP is really nice
Also the foreach norm bug is apparently a bother to a few people
story of my life
Just added FSDP2 support for MARS and Muon!
Thanks for 100 followers, y'all! Happened so fast and can't wait to put out more research on here! 😊❤️
(And one more thing -- if this was the other site, I'd place a note saying to come and follow me on here. But we're already here, woo! Congratulations us. ❤️
Feel free to drop me a message and say hi, I'd love to chat! ❤️👍)
My time is funded by a combination of personal consulting/contracting work I take, as well as the financial support of others to enable this kind of open source work.
If you'd like to help sponsor my time, check out at github.com/sponsors/tys... (or feel free to drop me a DM, would love to chat!)
Finally, if you'd like to help see more open-source research work like this, consider sponsoring my time!
Funding for time as well as Colab compute for this work was generously supported by my supporters on Patreon (@jamorton_, @go2carter.bsky.social, @sroeker, @baberb.bsky.social, @chhillee.bsky.social, and @haitchai), as well as the generous support of @algomancer.bsky.social.
And this isn't all of the research from the compute they provided -- I've got more in the pipeline to come! Keep an eye out. ;)
First, my sincerest thanks to @leonardoai.bsky.social, with the help of
@ethansmith2000.com, for generously providing the H100s that supported this research and enabled this release. Y'all rock, thanks so much! <3
Thanks to FlexAttention (thanks @chhillee.bsky.social and folks), this was very straightforward to implement via attention masking.
Great to be able to port some of that work to this speedrun and see it fly! <3 :)
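The masking logic itself is simple to state as a predicate. In FlexAttention it would be expressed as a `mask_mod` (which also takes batch and head indices), but the core rule -- causal plus a sliding window -- is just this (a pure-Python sketch, names my own):

```python
def sliding_window_causal(q_idx, kv_idx, window):
    """True where query position q_idx may attend to key position kv_idx:
    no looking ahead (causal), and no looking back further than `window`."""
    return kv_idx <= q_idx and (q_idx - kv_idx) < window

# With window=3, position 5 may attend to positions 3, 4, 5 -- not 2, not 6.
print([kv for kv in range(8) if sliding_window_causal(5, kv, 3)])
```

FlexAttention compiles a predicate like this into a block-sparse attention kernel, which is what makes changing the window size over training so cheap to implement.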
Fun fact! This is actually a spiritual port of the sequence_length warmup originally implemented in hlb_gpt early last year. However, this was extremely hard to do until now due to the nature of how torch.compile worked.
Lowering the Adam betas from 0.9 to 0.8 to be a bit more nimble, shortening the momentum warmup accordingly, and increasing the number of cooldown steps for the network.
Some of the other changes include hyperparameter tweaks to accommodate the ever-shorter learning schedules (1750 steps now vs 3000 two records ago!).
...This means we need an extra step for traditional comparisons, but also shortens the dev loop and biases us towards favoring longer-context learning)
(Now, a note on scoring -- this is because the modded-nanoGPT speedrun scores the model by the lowest possible validation loss as long as it is causal, vs a fixed attention context length...
As a result of the better match between context length during training and the network's learning dynamics, we can train a bit faster! With the increased context length, we are able to remove 125 steps, making this network 6.7% more data efficient than before.
(This also means that for this record, not only is the number of steps lower -- the average step is faster as well!)
The previous record used a sliding window size of 1024, whereas this record starts at a window size of 64 and linearly anneals it to 1792 over the course of training. This means we can spend more time learning longer context lengths where it matters -- at the end of training!
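The anneal described above can be sketched as a simple schedule function (the linear shape and endpoints are from the record; the exact rounding, and whether the real implementation snaps to block-size multiples, are assumptions on my part):

```python
def window_size(step, total_steps, start=64, end=1792):
    """Linearly anneal the sliding-window size from `start` to `end`
    over the course of training."""
    frac = min(step / total_steps, 1.0)
    return round(start + frac * (end - start))

total = 1750  # step count mentioned for this record
print(window_size(0, total), window_size(total // 2, total), window_size(total, total))
```

Since FlexAttention takes the window as part of its mask, the schedule just feeds a new window value into the mask each step -- most of training runs with a small, cheap window, and the full 1792-token window is only paid for near the end, where it matters.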
This record modifies the previous FlexAttention record by adding a warmup to the sliding window size over the course of training. Since our network can't necessarily utilize long-term connections at the start of training, this yields a pretty good boost to training speed!