Let’s sanity-check DeepSeek’s claim to have trained on 2048 GPUs for under 2 months, at a cost of $5.6M. It sort of checks out and sort of doesn't.
The v3 model is an MoE with 37B (out of 671B) active parameters. Let's compare it to the cost of a 34B dense model. 🧵
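For a rough check of the headline number, here is the back-of-envelope arithmetic as a Python sketch. The ~57-day run length and the $2/GPU-hour H800 rental rate are assumptions chosen for illustration, not figures stated in this thread:

```python
# Back-of-envelope check of the $5.6M claim (a sketch; the 57-day
# wall clock and $2/GPU-hour H800 rate are assumed, not sourced).
gpus = 2048
days = 57                    # "under 2 months"
gpu_hours = gpus * days * 24 # total GPU-hours for the run
rate = 2.0                   # assumed rental $/GPU-hour for an H800
cost = gpu_hours * rate

print(f"{gpu_hours / 1e6:.2f}M GPU-hours -> ${cost / 1e6:.2f}M")
# 2.80M GPU-hours -> $5.60M, consistent with the claimed $5.6M
```

So under those assumptions the GPU-hour math lines up almost exactly; the "sort of doesn't" part is what that figure leaves out (ablations, failed runs, and the cluster itself).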