
Posts by Sebastian Raschka (rasbt)

Preview: Components of a Coding Agent
How coding agents use tools, memory, and repo context to make LLMs work better in practice

Components of a coding agent: a little write-up on the building blocks behind coding agents, from repo context and tool use to memory and delegation
Link: magazine.sebastianraschka.com/p/components...
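To make those building blocks a bit more concrete, here is a minimal, hypothetical sketch of the tool-use loop at the core of such an agent. Everything here (the tool functions, the query_llm helper, the JSON action format) is a placeholder for illustration, not code from the article:

# Minimal, hypothetical coding-agent loop (illustration only; not from the article)
import json
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    # Tool: return the contents of a repo file (repo context)
    return Path(path).read_text()

def run_tests(cmd: str = "pytest -q") -> str:
    # Tool: run the test suite and return its output (feedback for the agent)
    result = subprocess.run(cmd.split(), capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "run_tests": run_tests}

def agent_loop(task: str, query_llm, max_steps: int = 10) -> str:
    # Short-term memory is just the growing message history
    memory = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_llm(memory)  # placeholder: returns a JSON action string
        action = json.loads(reply)
        if action["type"] == "final_answer":
            return action["content"]
        # Otherwise the model asked for a tool call; run it and feed the result back
        tool_output = TOOLS[action["tool"]](**action.get("args", {}))
        memory.append({"role": "tool", "content": tool_output})
    return "Stopped after max_steps without a final answer."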

1 week ago 29 6 2 1

Just finished the last chapter + all appendices last week :). It's currently going through the publisher's layout stages and will hopefully come out in a few weeks (it will be a color print this time!)

3 weeks ago 2 0 0 0

In that case you might also like the comparison feature I added :)

3 weeks ago 2 2 0 0
Preview: LLM Architecture Gallery
A gallery that collects architecture figures from The Big LLM Architecture Comparison and related articles, with fact sheets and links back to the original sections.

I put together a visual LLM Architecture Gallery that collects (~50) recent open-weight model designs in one place.

Architecture diagrams, config links, tech reports, explainers... you name it!

Hopefully useful as a reference & learning resource:
sebastianraschka.com/llm-architec...

3 weeks ago 76 9 4 1
State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490 (YouTube video by Lex Fridman)

Recorded a podcast, think it’s pretty good and comprehensive, hope you like it ;) youtu.be/EV7WhVT270Q?...

2 months ago 41 4 2 1

Been a while since I did an LLM architecture post. Just stumbled upon the Arcee AI Trinity Large release and technical report from yesterday and couldn't resist :)

Also added a new section to my LLM architecture comparison article with more details: magazine.sebastianraschka.com/i/168650848/20

2 months ago 43 4 1 1

Been pretty heads-down finishing Chapter 6 on implementing RLVR via GRPO. Just finished, and it might be my favorite chapter so far.

Code notebook: github.com/rasbt/reason...

(And it should be added to the early access soon.)

The next chapter adds stability and performance improvements to GRPO.
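(Not the book's code, just a rough sketch for context: the core of GRPO is the group-relative advantage, i.e., normalizing each sampled completion's reward against the mean and standard deviation of its group. With RLVR, the rewards typically come from a verifier, e.g., 1.0 if the answer checks out and 0.0 otherwise.)

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: shape (num_prompts, group_size), one scalar reward per sampled completion
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 completions each, binary verifier rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))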

3 months ago 41 4 2 0

For the past month or so, I've been slowly working through this book by @sebastianraschka.com, which builds a GPT model from scratch both theoretically and practically. Highly recommended!

Ironically, I'm writing much more code by hand as a result.

3 months ago 17 2 2 0

Ha, thanks for the kind compliment!

3 months ago 1 0 0 0

Ha, thanks! Happy new year to you as well!

3 months ago 2 0 0 0

Thanks! Is /r/machinelearning still weekend-only unless it's an arXiv article?

3 months ago 0 0 1 0
Preview: The State of LLMs 2025: Progress, Progress, and Predictions
A 2025 review of large language models, from DeepSeek R1 and RLVR to inference-time scaling, benchmarks, architectures, and predictions for 2026.

Uploaded my State of LLMs 2025 report for this year:
magazine.sebastianraschka.com/p/state-of-l...

I planned to just write a brief overview, but yeah, it was an eventful year, so it was impossible to keep it below 7,000 words :D.

3 months ago 89 23 4 3

This is an opinion. That's why I prefaced my post with "I think of it as this"

3 months ago 0 0 0 0

One of the underrated papers this year:
"Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)

(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
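For anyone skimming, the comparison boils down to these two loops; a toy sketch (mine, not from the paper) of stepping the optimizer on every small batch vs. accumulating gradients to emulate a larger batch:

import torch

# Toy model and data, just to illustrate the two training loops being compared
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

# (a) Small-batch training: one optimizer step per small batch
for x, y in batches:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# (b) Gradient accumulation: emulate a 4x larger batch by stepping every 4 batches
accum_steps = 4
opt.zero_grad()
for i, (x, y) in enumerate(batches):
    (loss_fn(model(x), y) / accum_steps).backward()  # scale so gradients average
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()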

3 months ago 70 9 0 1

I agree. I was thinking of “faster” because it frees up time by letting it handle boilerplate stuff. And I was thinking of “better” as in using it to find issues that were accidentally overlooked.

3 months ago 1 0 0 0

Yeah. My point was that LLMs are good amplifiers, but they are not the only tool one should use and learn from.

3 months ago 3 0 0 0

It's a cycle: Coding manually, reading resources written by experts, looking at high-quality projects built by experts, getting advice from experts, and repeat...

3 months ago 2 0 1 0

I think of it as this: LLMs lower the barrier of entry, and they make coders (beginners and experts) more productive.
It's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results.

3 months ago 31 3 4 3
Preview: Understanding Large Language Models
A Cross-Section of the Most Relevant Literature To Get Up to Speed

I discuss the more historical building blocks here if you are interested (going back to "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Neural Networks" 1991 by Schmidhuber): magazine.sebastianraschka.com/p/understand...

3 months ago 4 0 0 0

Yes, yes. This is not a complete history.
I assume you are specifically referring to the first line, “202x…”? I merely wanted to say that the focus in the early 2020s was more on pre-training than on anything else at the time. (I think the term LLM wasn’t coined until the 175B GPT-3 model came out.)

3 months ago 2 0 2 0

The LLM eras:

202x Pre-training (foundation)
2022 RLHF + PPO
2023 LoRA SFT
2024 Mid-Training
2025 RLVR + GRPO
2026 Inference-time scaling?
2027 Continual learning?

3 months ago 35 3 1 0

Actually, I didn’t change any of the earlier sections; I just appended the new sections to the article.
Re your LLM idea, though, I could see it as a benchmark for agentic LLMs, to see whether they can extract the correct architecture info from the code bases.

4 months ago 1 0 0 0

Just updated the Big LLM Architecture Comparison article...
...it grew quite a bit since the initial version in July 2025, more than doubled!
magazine.sebastianraschka.com/p/the-big-ll...

4 months ago 78 13 1 0

Based on the naming resemblance, if I had to guess, DeepSeekMoE was motivated by DeepSpeed-MoE (arxiv.org/abs/2201.05596, 14 Jan 2022).

4 months ago 0 0 0 0

Tbh if it took them a month to write and release the paper, the DeepSeekMoE team probably also had the model ready in December.
Or in other words, I don't think they trained the model in just a month with all the ablation studies in that paper.

4 months ago 0 0 2 0

They don't have a reasoning model yet, so it is a bit unfair to compare, but since you asked:

4 months ago 0 0 0 0

I think Google originally came up with MoE, and DeepSeek and Mixtral adopted it independently of each other.

E.g., looking at arXiv, the Mixtral report came out on 8 Jan 2024 (arxiv.org/abs/2401.04088), and DeepSeekMoE around the same time, on 11 Jan 2024 (arxiv.org/abs/2401.06066).

4 months ago 1 0 1 0

Good catch, yes that should have been 70% not 40%. Thanks!

4 months ago 0 0 0 0

Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA?

Just went through the config files; the only difference I could see is that Mistral 3 Large used 2x fewer experts but made each expert 2x larger.
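In case anyone wants to reproduce the comparison, here is a rough sketch of diffing the MoE-related fields of two Hugging Face config.json files. The key names below are the usual DeepSeek-V3-style ones and the file paths are placeholders, so adjust both to the actual repos:

import json

# Placeholder paths; download each model's config.json from its Hugging Face repo first
CONFIGS = {"deepseek_v3": "deepseek_v3_config.json",
           "mistral_3_large": "mistral_3_large_config.json"}

# Assumed DeepSeek-V3-style MoE/MLA keys (not verified against both repos)
KEYS = ["n_routed_experts", "num_experts_per_tok", "moe_intermediate_size",
        "kv_lora_rank", "q_lora_rank", "hidden_size", "num_hidden_layers"]

configs = {name: json.load(open(path)) for name, path in CONFIGS.items()}
for key in KEYS:
    values = {name: cfg.get(key) for name, cfg in configs.items()}
    if len(set(values.values())) > 1:  # print only the fields that differ
        print(f"{key}: {values}")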

4 months ago 33 0 2 0

Yes, good point. I must have accidentally moved the text boxes to the wrong position. Someone mentioned that on the forum last week, and it's fixed now (the next time the MEAP is updated, the figures will be automatically replaced). Thanks for mentioning it!

4 months ago 1 0 1 0