Having cross-entropy as a default is great, and is really nice for unlabeled data since it all tends to fall out pretty nicely w.r.t. the learning process, but it is inherently (and necessarily) a much more expensive way to learn a target distribution of values.
Posts by Fern
Having (a good set of) crowdsourced values for a KL divergence would reduce this variance a bit, and also would give a better value to measure against, due to not being as noisy (in both bias _and_ variance -- a bit of a messy combo to deal with).
When aiming for 94% accuracy (~6% error rate), this means that ~9% of the remaining errors come from "bad" labels, from a cross-entropy perspective.
This is quite a lot! And it's partly what made testing speedrun results more difficult.
Variance can be a problem in testing models, and it lengthens the iterative research cycle since you need to run more experiments to get a clean signal.
One paper that covered this, arxiv.org/abs/2103.14749, estimated the CIFAR-10 test set label error rate at about 0.54%.
A multiple choice question with an apparently incorrect answer chosen as the correct answer. Img source: @whybyfire.bsky.social
This is a classic example of _why_ choose-one-of-n datasets need large-scale, crowd-sourced statistics and should use the KL divergence instead of cross-entropy.
Reviewers will be more biased than a crowd; a single-reviewer label is a high-variance, high-bias estimator, and that can harm research.
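To make the distinction concrete, here's a minimal sketch (all values hypothetical) of the difference between scoring against a one-hot label with cross-entropy and scoring against a crowd-sourced soft label with KL divergence -- the KL goes to zero when the model matches the crowd distribution, while the one-hot cross-entropy penalizes it anyway:

```python
import math

def cross_entropy(pred, target):
    """Cross-entropy of predicted probabilities against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def kl_divergence(target, pred):
    """KL(target || pred) -- exactly zero when pred matches target."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)

# Model prediction over 4 answer choices (made-up numbers).
pred = [0.70, 0.20, 0.05, 0.05]

# One-hot label: even a prediction matching the crowd consensus is penalized.
one_hot = [1.0, 0.0, 0.0, 0.0]

# Crowd-sourced soft label: 70% of annotators picked A, 20% picked B, etc.
crowd = [0.70, 0.20, 0.05, 0.05]

print(cross_entropy(pred, one_hot))  # -log(0.70), nonzero despite a sane pred
print(kl_divergence(crowd, pred))    # 0.0 -- pred matches the crowd exactly
```

The point being: with soft targets, a model that has learned the true answer distribution is no longer "punished" for the irreducible label disagreement.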
Transcript of Hard Fork ep 111: Yeah. And I could talk for an hour about transformers and why they are so important. But I think it's important to say that they were inspired by the alien language in the film Arrival, which had just recently come out. And a group of researchers at Google, one researcher in particular, who was part of that original team, was inspired by watching Arrival and seeing that the aliens in the movie had this language which represented entire sentences with a single symbol. And they thought, hey, what if we did that inside of a neural network? So rather than processing all of the inputs that you would give to one of these systems one word at a time, you could have this thing called an attention mechanism, which paid attention to all of it simultaneously. That would allow you to process much more information much faster. And that insight sparked the creation of the transformer, which led to all the stuff we see in AI today.
Did you know that attention across the whole input span was inspired by the time-negating alien language in Arrival? Crazy anecdote from the latest Hard Fork podcast (by @kevinroose.com and @caseynewton.bsky.social). HT nwbrownboi on Threads for the lead.
i love science
Okay, that is definitely way too aggressive. Hopefully it's not like that long-term -- my hope is that pushback against overzealous moderation will change their stance on things like this. It should not be an auto-ban.
Yeah that's why he checks it twice, gotta be something like an approximate Radix sort followed by an insertion sort, I'd guess. P efficient maybe?
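For the curious, a toy sketch (purely illustrative, names invented) of that two-pass scheme: a coarse radix-style bucketing pass that leaves the list approximately sorted, finished off with an insertion sort, which runs fast on nearly-sorted input -- hence checking it twice:

```python
def approx_radix_then_insertion(names):
    # Pass 1: coarse radix-style bucketing on the first letter only --
    # cheap, and leaves the list approximately sorted.
    buckets = {}
    for name in names:
        buckets.setdefault(name[0].lower(), []).append(name)
    nearly = [n for key in sorted(buckets) for n in buckets[key]]

    # Pass 2 ("checking it twice"): insertion sort, near-linear time
    # when the input is already almost in order.
    for i in range(1, len(nearly)):
        item = nearly[i]
        j = i - 1
        while j >= 0 and nearly[j].lower() > item.lower():
            nearly[j + 1] = nearly[j]
            j -= 1
        nearly[j + 1] = item
    return nearly

print(approx_radix_then_insertion(["Rudolph", "dasher", "Dancer", "Comet", "cupid"]))
```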
it's crazy to me that RoPE's issue with BF16 wasn't noticed earlier.
For a reasonable N of 2048, these are the computed frequencies prior to cos(x) & sin(x) for fp32 above and bf16 below.
Given how short the period is of simple trig functions, this difference is catastrophic for large values.
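A rough illustration of why: bf16 keeps only ~8 mantissa bits, so a large position-times-frequency product rounds off by whole radians, and with sin/cos having period 2π, the rotation ends up somewhere else entirely. This sketch simulates bf16 rounding in pure Python by truncating the fp32 bit pattern (an approximation of the real hardware behavior, not the actual kernel):

```python
import math
import struct

def to_bf16(x):
    """Round a float to bfloat16 precision by keeping the top 16 bits of the
    fp32 representation (round-half-up, close enough for illustration)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# RoPE-style angle: theta = position * inv_freq, then fed into cos/sin.
inv_freq = 1.0  # the highest-frequency channel
p = 2047        # a late position in an N=2048 context

theta_fp32 = p * inv_freq
theta_bf16 = to_bf16(to_bf16(p) * to_bf16(inv_freq))

# 2047 is not representable in bf16 -- it rounds to 2048, a full radian off.
print(theta_fp32, theta_bf16)
print(math.cos(theta_fp32), math.cos(theta_bf16))
```

An error of a radian or more in the angle means the resulting rotation is essentially uncorrelated with the intended one, which is why computing the frequencies in fp32 (and only casting afterwards) matters.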
Having flexattention w/ DDP is really nice
Also the foreach norm bug is apparently a bother to a few people
story of my life
Just added FSDP2 support for MARS and Muon!
Thanks for 100 followers, y'all! Happened so fast and can't wait to put out more research on here! 😊❤️
(And one more thing -- if this was the other site, I'd place a note saying to come and follow me on here. But we're already here, woo! Congratulations us. ❤️
Feel free to drop me a message and say hi, I'd love to chat! ❤️👍)
My time is funded by a combination of personal consulting/contracting work I take, as well as the financial support of others to enable this kind of open source work.
If you'd like to help sponsor my time, check out at github.com/sponsors/tys... (or feel free to drop me a DM, would love to chat!)
Finally, if you'd like to help see more open-source research work like this, consider sponsoring my time!
Funding for time as well as Colab compute for this work was generously supported by my supporters on Patreon (@jamorton_, @go2carter.bsky.social, @sroeker, @baberb.bsky.social, @chhillee.bsky.social, and @haitchai), as well as the generous support of @algomancer.bsky.social.
And this isn't all of the research from the compute they provided -- I've got more in the pipeline to come! Keep an eye out. ;)
First, my sincerest thanks to @leonardoai.bsky.social, with the help of
@ethansmith2000.com, for generously providing the H100s that supported this research and enabled this release. Y'all rock, thanks so much! <3
Thanks to FlexAttention (thanks @chhillee.bsky.social and folks), this was very straightforward to implement via attention masking.
Great to be able to port some of that work to this speedrun and see it fly! <3 :)
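The masking logic itself is simple to state as a predicate. In FlexAttention it would be expressed as a `mask_mod` (which also takes batch and head indices), but the core rule -- causal plus a sliding window -- is just this (a pure-Python sketch, names my own):

```python
def sliding_window_causal(q_idx, kv_idx, window):
    """True where query position q_idx may attend to key position kv_idx:
    no looking ahead (causal), and no looking back further than `window`."""
    return kv_idx <= q_idx and (q_idx - kv_idx) < window

# With window=3, position 5 may attend to positions 3, 4, 5 -- not 2, not 6.
print([kv for kv in range(8) if sliding_window_causal(5, kv, 3)])
```

FlexAttention compiles a predicate like this into a block-sparse attention kernel, which is what makes changing the window size over training so cheap to implement.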
Fun fact! This is actually a spiritual port of the sequence_length warmup originally implemented in hlb_gpt early last year. However, this was extremely hard to do until now due to the nature of how torch.compile worked.
Lowering the Adam betas from 0.9 to 0.8 to be a bit more nimble, shortening the momentum warmup accordingly, and increasing the number of cooldown steps for the network.
Some of the other changes include hyperparameter tweaks to accommodate the ever-shorter learning schedules (1750 steps now vs 3000 two records ago!).
...This means we need an extra step for traditional comparisons, but also shortens the dev loop and biases us towards favoring longer-context learning)
(Now, a note on scoring -- this is because the modded-nanoGPT speedrun scores the model by the lowest possible validation loss as long as it is causal, vs a fixed attention context length...
As a result of the better match between context length during training and the network's learning dynamics, we can train a bit faster! With the increased context length, we are able to remove 125 steps, making this network 6.7% more data efficient than before.
(This also means that for this record, not only is the number of steps lower -- the average step is faster as well!)
The previous record used a sliding window size of 1024, whereas this record starts at a window size of 64 and linearly anneals it to 1792 over the course of training. This means we can spend more time learning longer context lengths where it matters -- at the end of training!
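The anneal described above can be sketched as a simple schedule function (the linear shape and endpoints are from the record; the exact rounding, and whether the real implementation snaps to block-size multiples, are assumptions on my part):

```python
def window_size(step, total_steps, start=64, end=1792):
    """Linearly anneal the sliding-window size from `start` to `end`
    over the course of training."""
    frac = min(step / total_steps, 1.0)
    return round(start + frac * (end - start))

total = 1750  # step count mentioned for this record
print(window_size(0, total), window_size(total // 2, total), window_size(total, total))
```

Since FlexAttention takes the window as part of its mask, the schedule just feeds a new window value into the mask each step -- most of training runs with a small, cheap window, and the full 1792-token window is only paid for near the end, where it matters.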
This record modifies the previous FlexAttention record by adding a warmup to the sliding window size over the course of training. Since our network can't necessarily utilize long-term connections at the start of training, this yields a pretty good boost to training speed!