Posts by Giles Thomas
After spending two months trying out different interventions on my GPT-2-style model, it was time to try stacking them up. Interesting results!
www.gilesthomas.com/2026/04/llm-...
I wanted to see whether my results when testing interventions to my GPT-2-style training loop were signal or noise. The results were promising!
www.gilesthomas.com/2026/04/llm-...
My final intervention test, in which I discover that there is such a thing as a free lunch, and it's called AMP.
www.gilesthomas.com/2026/04/llm-...
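(The "free lunch" here is PyTorch's automatic mixed precision. A minimal sketch of an AMP training step, with a stand-in model rather than the actual GPT-2 code from the post:)

```python
import torch
import torch.nn as nn

# Sketch of a mixed-precision (AMP) training step. Under autocast, matmul-heavy
# ops run in bfloat16 while the parameters and optimizer state stay in float32.
# (With float16 instead of bfloat16 you would also wrap the backward/step in a
# torch.amp.GradScaler to guard against gradient underflow.)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(64, 64).to(device)            # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(8, 64, device=device)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).square().mean()             # toy loss, just for shape

loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)
```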
Weight tying, by contrast with weight decay, was actually really easy! It didn't help, though :-(
www.gilesthomas.com/2026/03/llm-...
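(Weight tying really is just pointing the output head at the token-embedding matrix so both use one shared parameter; a minimal sketch with illustrative module names, not the original code:)

```python
import torch
import torch.nn as nn

# Weight tying: the output projection that produces vocab logits reuses the
# token-embedding matrix, so the model stores one (vocab_size, d_model) table
# instead of two, and gradients from both roles flow into the same tensor.
vocab_size, d_model = 1000, 64
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

lm_head.weight = tok_emb.weight   # share the same Parameter object

x = torch.randint(0, vocab_size, (2, 8))   # (batch, seq) of token ids
logits = lm_head(tok_emb(x))               # (batch, seq, vocab_size)
```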
Weight decay is conceptually simpler than I worried it might be, but still pretty fiddly to get right...
www.gilesthomas.com/2026/03/llm-...
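(The fiddly part is usually deciding which parameters get decayed. The common GPT-2-style recipe is to decay the weight matrices but not biases or norm gains; roughly, with a toy stand-in model:)

```python
import torch
import torch.nn as nn

# Common AdamW recipe: apply weight decay only to 2D+ weight matrices;
# biases and LayerNorm gains (1D tensors) are left undecayed.
model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 64))

decay, no_decay = [], []
for name, p in model.named_parameters():
    (decay if p.dim() >= 2 else no_decay).append(p)

opt = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,
)
```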
Learning rates for LLMs turned out to be a deep topic, but making some tweaks certainly seems to help my base model train: www.gilesthomas.com/2026/03/llm-...
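(The usual tweak is linear warmup followed by cosine decay; a schedule like that is only a few lines. The numbers below are illustrative, not the ones from the post:)

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=100, total=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup            # linear warmup
    progress = (step - warmup) / (total - warmup)      # 0..1 over decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In the training loop you'd set `param_group["lr"] = lr_at(step)` on the optimizer each iteration.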
Now this was a surprise. QKV bias is not meant to be useful -- but with my GPT-2 small model, it looks like it is! www.gilesthomas.com/2026/02/llm-...
Does removing dropout improve our baseline model's test loss? Yes, absolutely, and much more than gradient clipping did.
www.gilesthomas.com/2026/02/llm-...
First "intervention" test: does adding gradient clipping improve our baseline model by lessening the loss spikes during training? It does, but it turned out to be more of a rabbit hole than I expected.
www.gilesthomas.com/2026/02/llm-...
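(The clipping itself is one call after `backward()`; the rabbit hole is everything around it. A minimal sketch with a placeholder model and threshold:)

```python
import torch
import torch.nn as nn

# Gradient clipping: after backward(), rescale the global gradient norm to at
# most max_norm before the optimizer step, which tames loss spikes caused by
# occasional huge gradients.
model = nn.Linear(32, 32)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

loss = model(torch.randn(4, 32)).square().mean()
loss.backward()

# Returns the *pre-clip* global norm, handy for logging spike frequency.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the norm just to show the effect: it is now capped at max_norm.
clipped_norm = torch.norm(
    torch.stack([p.grad.detach().norm() for p in model.parameters()])
)

opt.step()
opt.zero_grad(set_to_none=True)
```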
Back to my LLM from scratch series. I want to train the *best* GPT-2-style model that I can locally in two days, and there are various levers to pull. Working out which ones work means I need a baseline for comparison.
www.gilesthomas.com/2026/02/llm-...
I wanted to get a custom LLM up onto the @hf.co Hub, and couldn't find an in-depth tutorial. Here's the one I wish I'd found before I got started: www.gilesthomas.com/2026/01/cust...
I thought it would be good to share the base models I've been training on Hugging Face, and now they are :-)
www.gilesthomas.com/2026/01/llm-...
I wanted to dig into why the results I got on instruction fine-tuning for each of my models didn't seem to match up well with the loss on the test set. Got some interesting results: www.gilesthomas.com/2026/01/2026...
Having trained a GPT-2 scale base model from scratch in 48 hours locally, I wanted to see if I could do the same faster and at a reasonable cost in the cloud. I could! www.gilesthomas.com/2026/01/llm-...
I managed to train my own base model from scratch on an RTX 3090! Very detailed notes here: www.gilesthomas.com/2025/12/llm-...