Posts by Giles Thomas

Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud
Now that I've tried a number of interventions into my model and training run, and some of them seem to improve the model, how do we stack them together, and what are the results?

After spending two months trying out different interventions on my GPT-2-style model, it was time to try stacking them up. Interesting results!

www.gilesthomas.com/2026/04/llm-...

2 days ago
Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?
How much of the variation in my training runs is real signal, and how much is random noise? I trained seven more models to find out.

I wanted to see whether my results when testing interventions to my GPT-2-style training loop were signal or noise. The results were promising!

www.gilesthomas.com/2026/04/llm-...

4 days ago
Writing an LLM from scratch, part 32h -- Interventions: full fat float32
I've been using PyTorch's Automatic Mixed Precision (AMP) and lower-precision matrix multiplication logic for my training runs so far, for larger batch sizes and faster training. Does doing that lead ...

My final intervention test, in which I discover that there is such a thing as a free lunch, and it's called AMP.

www.gilesthomas.com/2026/04/llm-...

1 week ago
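As an aside on why mixed precision needs care: float16 flushes very small gradients to zero, which is exactly what AMP's loss scaling compensates for. A stdlib-only toy illustration of that underflow (my own sketch using `struct`'s half-precision codec, not PyTorch's actual AMP machinery):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8            # a tiny gradient, below float16's smallest subnormal
assert to_fp16(grad) == 0.0          # underflows to zero in fp16

scale = 2.0 ** 16      # a GradScaler-style loss scale
scaled = to_fp16(grad * scale)       # now representable in fp16
assert scaled > 0.0
unscaled = scaled / scale            # unscale in float32: gradient survives
assert abs(unscaled - grad) / grad < 0.01
```

Scaling the loss (and hence all gradients) up before the backward pass, then dividing the scale back out in float32 before the optimizer step, keeps those tiny gradients from vanishing.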
Writing an LLM from scratch, part 32g -- Interventions: weight tying
Weight tying is apparently not used in modern LLMs, and intuitively would worsen performance in general. Does it?

Weight tying, by contrast with weight decay, was actually really easy! It didn't help, though :-(

www.gilesthomas.com/2026/03/llm-...

2 weeks ago
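Weight tying just means the token-embedding matrix and the output-projection matrix are the same tensor, as in the original GPT-2. A dependency-free sketch of the idea (illustrative names and numbers, not the post's code):

```python
# One matrix serves both as the embedding table (row t = vector for
# token t) and as the output head (logit for token t = hidden . W[t]).
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # vocab=3, d_model=2

def embed(token_id):
    return W[token_id]

def logits(hidden):
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]

# Because both functions read the same object, a weight update
# changes the embedding AND the output head together.
W[0][0] = 0.9
assert embed(0)[0] == 0.9
assert abs(logits([1.0, 0.0])[0] - 0.9) < 1e-12
```

In PyTorch the tie is usually a single assignment along the lines of `model.lm_head.weight = model.tok_emb.weight`, so both modules share one parameter.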
Writing an LLM from scratch, part 32f -- Interventions: weight decay
What is weight decay, and what is the right value to set it to in order to get the best possible training run for our model?

Weight decay is conceptually simpler than I worried it might be, but still pretty fiddly to get right...

www.gilesthomas.com/2026/03/llm-...

2 weeks ago
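For reference, the decoupled form of weight decay (the AdamW formulation) shrinks each weight toward zero by `lr * wd * w` on every step, separately from the gradient step. A minimal sketch with plain SGD standing in for the Adam part (my own illustration, not the post's code):

```python
def decayed_step(w, grad, lr=0.1, weight_decay=0.1):
    """One decoupled-weight-decay update, AdamW-style: the decay term
    is applied directly to the weight, not folded into the gradient."""
    return w - lr * grad - lr * weight_decay * w

# With zero gradient, the weight still shrinks a little each step:
w = decayed_step(1.0, grad=0.0)  # 1.0 - 0.1 * 0.1 * 1.0
assert abs(w - 0.99) < 1e-9
```

The "fiddly" part the post alludes to is choosing `weight_decay` and deciding which parameters get it; common practice exempts biases and LayerNorm weights.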
Writing an LLM from scratch, part 32e -- Interventions: the learning rate
The learning rate is an essential hyperparameter for training models with gradient descent, but what actual values should you use, and what does it mean to schedule it?

Learning rates for LLMs turned out to be a deep topic, but making some tweaks certainly seems to help my base model train: www.gilesthomas.com/2026/03/llm-...

1 month ago
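The schedule most GPT-style training recipes use is linear warmup followed by cosine decay. A small sketch of that shape (the constants here are illustrative, not the post's actual values):

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=100, total=1000):
    """Linear warmup to max_lr over `warmup` steps, then cosine
    decay down to min_lr by step `total`."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

assert lr_at(99) == 6e-4                 # end of warmup hits the peak
assert abs(lr_at(1000) - 6e-5) < 1e-9    # fully decayed to the floor
```

"Scheduling" the learning rate just means evaluating a function like this each step and handing the result to the optimizer instead of using one fixed value.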
Writing an LLM from scratch, part 32d -- Interventions: adding attention bias
Bias terms for the query, key, and value attention projections are apparently no longer used because they don't help. Let's check that they really don't for our model!

Now this was a surprise. QKV bias is not meant to be useful -- but with my GPT-2 small model, it looks like it is! www.gilesthomas.com/2026/02/llm-...

2 months ago
Writing an LLM from scratch, part 32c -- Interventions: removing dropout
Does removing dropout improve our baseline model's test loss? Yes, absolutely, and even more than gradient clipping did.

Does removing dropout improve our baseline model's test loss? Yes, absolutely, and much more than gradient clipping did.

www.gilesthomas.com/2026/02/llm-...

2 months ago
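The dropout being removed here is standard inverted dropout: zero each activation with probability p, and scale survivors by 1/(1-p) so the expected value is unchanged. A dependency-free sketch (my own illustration):

```python
import random

def dropout(xs, p, rng):
    """Inverted dropout: kept activations are scaled by 1/(1-p)
    so the expected output matches the input."""
    if p == 0.0:
        return list(xs)
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

rng = random.Random(0)

# p=0 is a no-op -- equivalent to removing dropout entirely:
assert dropout([1.0, 2.0], p=0.0, rng=rng) == [1.0, 2.0]

# With p=0.1, the mean over many activations stays close to the input:
out = dropout([1.0] * 10_000, p=0.1, rng=rng)
assert abs(sum(out) / len(out) - 1.0) < 0.05
```

Removing it (or setting p=0) takes away the regularization but also stops throwing information away during each forward pass, which is plausibly why it helps at this model and dataset scale.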
Writing an LLM from scratch, part 32b -- Interventions: gradient clipping
Does adding gradient clipping improve our baseline model by lessening the loss spikes during training? It does, but it turned out to be more of a rabbit hole than I expected.

First "intervention" test: does adding gradient clipping improve our baseline model by lessening the loss spikes during training? It does, but it turned out to be more of a rabbit hole than I expected.

www.gilesthomas.com/2026/02/llm-...

2 months ago
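Gradient clipping by global norm (what `torch.nn.utils.clip_grad_norm_` does) rescales all gradients together whenever their combined norm exceeds a threshold, which is what tames loss spikes. A dependency-free sketch of the same computation, flattened to one list of gradient values:

```python
import math

def clip_grad_norm(grads, max_norm):
    """If the global norm of all gradients exceeds max_norm,
    rescale every gradient by max_norm / total_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]

# A spiky gradient with norm 5.0 gets scaled down to norm 1.0...
clipped = clip_grad_norm([3.0, 4.0], max_norm=1.0)
assert abs(math.sqrt(sum(g * g for g in clipped)) - 1.0) < 1e-9

# ...while a well-behaved gradient passes through untouched.
assert clip_grad_norm([0.3, 0.4], max_norm=1.0) == [0.3, 0.4]
```

Note that the direction of the gradient is preserved; only its magnitude is capped.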
Writing an LLM from scratch, part 32a -- Interventions: training a baseline model
I want to try a bunch of interventions like gradient clipping and removing dropout to see if my models get better. I need a baseline train without them so that I can be sure of the results.

Back to my LLM from scratch series. I want to train the *best* GPT-2-style model that I can locally in two days, and there are various levers to pull. Working out which ones work means I need a baseline for comparison.

www.gilesthomas.com/2026/02/llm-...

2 months ago
Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)
A worked example of packaging a from-scratch GPT-2-style model for the Hugging Face Hub so it loads via from_pretrained, runs with pipeline, and trains with Trainer -- with notes on tokeniser gotchas.

I wanted to get a custom LLM up onto the @hf.co Hub, and couldn't find an in-depth tutorial. Here's the one I wish I'd found before I got started: www.gilesthomas.com/2026/01/cust...

2 months ago
Writing an LLM from scratch, part 31 -- the models are now on Hugging Face
I've trained seven models using the GPT-2 architecture: let's share them on Hugging Face!

I thought it would be good to share the base models I've been training on Hugging Face, and now they are :-)

www.gilesthomas.com/2026/01/llm-...

2 months ago
Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results
I was unhappy with the LLM-as-a-judge instruction fine-tuning results I got when comparing my various base models. Could I make them any better?

I wanted to dig into why the results I got on instruction fine-tuning for each of my models didn't seem to match up well with the loss on the test set. Got some interesting results: www.gilesthomas.com/2026/01/2026...

3 months ago
Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud
Having trained a base model from scratch on my own machine over 48 hours, I wanted to make it faster by training with multiple GPUs in the cloud.

Having trained a GPT-2 scale base model from scratch in 48 hours locally, I wanted to see if I could do the same faster and at a reasonable cost in the cloud. I could! www.gilesthomas.com/2026/01/llm-...

3 months ago
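The core trick behind DistributedDataParallel can be seen without any GPUs: each replica computes gradients on its own shard of the batch, and an all-reduce averages them so every replica takes an identical optimizer step. A toy sketch with plain Python standing in for `torch.distributed` (illustrative one-parameter model, not the post's code):

```python
def grad_mse(w, xs, ys):
    """d/dw of the mean squared error of the 1-parameter model y = w*x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Single-GPU: one gradient over the whole batch.
full = grad_mse(w, xs, ys)

# "Two GPUs": each replica sees half the batch, then an all-reduce
# (here just plain averaging) combines the shard gradients.
g0 = grad_mse(w, xs[:2], ys[:2])
g1 = grad_mse(w, xs[2:], ys[2:])
assert abs((g0 + g1) / 2 - full) < 1e-12
```

With equal-sized shards, the averaged shard gradients exactly match the full-batch gradient, which is why data parallelism gives the same update as one big batch.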
Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090
I felt like it should be possible to train a GPT-2-small-level model from scratch on my own hardware using modern tools and open datasets. It was!

I managed to train my own base model from scratch on an RTX 3090! Very detailed notes here: www.gilesthomas.com/2025/12/llm-...

4 months ago
Writing an LLM from scratch, part 27 -- what's left, and what's next?
Having finished the main body of 'Build an LLM (from scratch)', it's time to think about what I need to do to treat this project as fully done.

So, what's left to do in my series on building an LLM from scratch? And what follow-up series should I work on? Some musings: www.gilesthomas.com/2025/11/llm-...

5 months ago