Components of a coding agent: a little write-up on the building blocks behind coding agents, from repo context and tool use to memory and delegation
Link: magazine.sebastianraschka.com/p/components...
Posts by Sebastian Raschka (rasbt)
just finished the last chapter + all appendices last week :). It's currently going through the publisher's layout stage and will hopefully come out in a few weeks (it will be a color print this time!)
In that case you might also like the comparison feature I added :)
I put together a visual LLM Architecture Gallery that collects (~50) recent open-weight model designs in one place.
Architecture diagrams, config links, tech reports, explainers... you name it!
Hopefully useful as a reference & learning resource:
sebastianraschka.com/llm-architec...
Recorded a podcast, think it’s pretty good and comprehensive, hope you like it ;) youtu.be/EV7WhVT270Q?...
Been a while since I did an LLM architecture post. Just stumbled upon the Arcee AI Trinity Large release and technical report from yesterday and couldn't resist :)
Also added a new section to my LLM architecture comparison article with more details: magazine.sebastianraschka.com/i/168650848/20
Been pretty heads-down finishing Chapter 6 on implementing RLVR via GRPO. Just finished, and it might be my favorite chapter so far.
Code notebook: github.com/rasbt/reason...
(And it should be added to the early access soon.)
The next chapter adds stability and performance improvements to GRPO.
For the past month or so, I've been slowly working through this book by @sebastianraschka.com, which builds a GPT model from scratch, both in theory and in code. Highly recommended!
Ironically, I'm writing much more code by hand as a result
Ha, thanks for the kind compliment!
Ha, thanks! Happy new year to you as well!
Thanks! Is /r/machinelearning still weekend-only unless it's an arXiv article?
Uploaded my State of LLMs 2025 report for this year:
magazine.sebastianraschka.com/p/state-of-l...
I planned to just write a brief overview, but yeah, it was an eventful year so it was impossible to keep it below 7000 words :D.
This is an opinion. That's why I prefaced my post with "I think of it as this"
One of the underrated papers this year:
"Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful" (arxiv.org/abs/2507.07101)
(I can confirm this holds for RLVR, too! I have some experiments to share soon.)
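The paper's "wasteful" claim rests on a simple identity worth seeing concretely: for a mean-style loss, averaging gradients accumulated over micro-batches reproduces the one-shot full-batch gradient exactly, so accumulation buys nothing over simply taking more small steps. A minimal NumPy sketch (my own toy linear-regression setup, not from the paper):

```python
import numpy as np

# Hypothetical toy setup for illustration: linear model, mean squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # full batch of 32 examples, 4 features
y = rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# One big-batch gradient ...
g_full = grad(X, y, w)

# ... equals the average of gradients accumulated over 4 micro-batches of 8,
# so the accumulated update matches the full-batch update exactly.
g_accum = np.mean(
    [grad(X[i:i + 8], y[i:i + 8], w) for i in range(0, 32, 8)], axis=0
)

assert np.allclose(g_full, g_accum)
```

Since the two updates are identical, the only question left is whether the small batches themselves hurt, and the paper's answer is that vanilla SGD with small batches works fine.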
I agree. I was thinking of “faster” because it frees time when letting it do boilerplate stuff. And I was thinking of “better” as in using it to find issues that were accidentally overlooked.
Yeah. My point was that LLMs are good amplifiers, but they are not the only tool one should use and learn from.
It's a cycle: Coding manually, reading resources written by experts, looking at high-quality projects built by experts, getting advice from experts, and repeat...
I think of it as this: LLMs lower the barrier of entry, and they make coders (beginners and experts) more productive.
It's still worth investing in becoming an expert, because then you will get even more out of LLMs and will be able to deliver even better results.
I discuss the more historical building blocks here if you are interested (going back to "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Neural Networks" 1991 by Schmidhuber): magazine.sebastianraschka.com/p/understand...
Yes yes. This is not a complete history.
I assume you are specifically referring to the first line “202x…”? I merely wanted to say that the focus in the early 2020s was more on pre-training than anything else then. (I think the term LLM wasn’t coined until the 175B GPT-3 model came out).
The LLM eras:
202x Pre-training (foundation)
2022 RLHF + PPO
2023 LoRA SFT
2024 Mid-Training
2025 RLVR + GRPO
2026 Inference-time scaling?
2027 Continual learning?
Actually I didn’t change any of the earlier sections but just appended the new sections to the article.
Re your LLM idea: I could see it working as a benchmark for agentic LLMs, though, to see whether they can extract the correct architecture info from the codebases.
Just updated the Big LLM Architecture Comparison article...
...it has grown quite a bit since the initial version in July 2025, more than doubling in length!
magazine.sebastianraschka.com/p/the-big-ll...
Based on the naming resemblance, if I had to guess, DeepSeekMoE was motivated by DeepSpeed-MoE (arxiv.org/abs/2201.05596) 14 Jan 2022
Tbh if it took them a month to write and release the paper, the DeepSeekMoE team probably also had the model ready in December.
Or in other words, I don't think they trained the model in just a month with all the ablation studies in that paper.
They don't have a reasoning model yet, so it's a bit unfair to compare, but since you asked:
I think Google originally came up with MoE, and DeepSeek and Mixtral adopted it independently of each other.
E.g., looking at arXiv, the Mixtral report came out on 8 Jan 2024 (arxiv.org/abs/2401.04088), and DeepSeekMoE around the same time, on 11 Jan 2024 (arxiv.org/abs/2401.06066)
Good catch, yes that should have been 70% not 40%. Thanks!
Hold on a sec, Mistral 3 Large uses the DeepSeek V3 architecture, including MLA?
Just went through the config files; the only difference I could see is that Mistral 3 Large uses 2x fewer experts but makes each expert 2x larger.
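Back-of-envelope, that tradeoff keeps the total expert parameter count unchanged, since expert parameters scale linearly in both the number of experts and the per-expert intermediate size. A quick sketch with illustrative numbers (the expert sizes and DeepSeek-V3-like hidden dimension here are assumptions for the arithmetic, not read from the actual configs):

```python
def moe_params(n_experts, hidden, intermediate):
    # SwiGLU-style expert: gate, up, and down projections,
    # each of size hidden x intermediate
    return n_experts * 3 * hidden * intermediate

hidden = 7168  # hypothetical hidden size, for illustration only

# Baseline: many small experts
a = moe_params(n_experts=256, hidden=hidden, intermediate=2048)
# Variant: half as many experts, each twice as large
b = moe_params(n_experts=128, hidden=hidden, intermediate=4096)

assert a == b  # total expert parameter count is identical
```

So the two configs differ in routing granularity and expert capacity, not in total MoE parameter budget.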
Yes, good point. I must have accidentally moved the text boxes to the wrong position. Someone mentioned that on the forum last week, and it's fixed now (the next time the MEAP is updated, the figures will be replaced automatically). Thanks for mentioning it.