Congrats, Kyle! Well deserved.
Posts by Quentin Anthony
Available at github.com/Quentin-Anth...
Contributions are welcome! I'll be slowly tackling roadmap items myself in my off time.
I basically want functional pseudocode that students and self-learners can quickly run and play around with. How does latency increase with message size? How do collective algorithms differ? What’s the effect of warmup? Find out for yourself!
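To illustrate the kind of experiment meant here, below is a minimal latency-measurement sketch. It is not real MPI (and not nanoMPI's API): two threads echoing a payload through a pair of queues stand in for two ranks, purely to show the methodology — warmup iterations, many timed round trips, median of the samples.

```python
import queue
import threading
import time


def ping_pong_latency(msg_bytes, iters=200, warmup=20):
    """Estimate one-way latency for a message bouncing between two "ranks".

    Two threads + queues simulate two MPI ranks; the peer echoes a *copy*
    of the payload so larger messages genuinely cost more to move.
    """
    to_peer, from_peer = queue.Queue(), queue.Queue()

    def peer():
        # Echo each message back until the stop sentinel (None) arrives.
        while True:
            m = to_peer.get()
            if m is None:
                break
            from_peer.put(bytes(m))  # copy forces a bandwidth-dependent cost

    t = threading.Thread(target=peer)
    t.start()

    msg = bytes(msg_bytes)
    samples = []
    for i in range(warmup + iters):
        t0 = time.perf_counter()
        to_peer.put(msg)
        from_peer.get()
        dt = time.perf_counter() - t0
        # Discard warmup iterations: the first round trips pay one-time
        # costs (thread wakeup, cold caches) that skew the measurement.
        if i >= warmup:
            samples.append(dt)

    to_peer.put(None)
    t.join()
    samples.sort()
    return samples[len(samples) // 2] / 2  # half the median round trip


if __name__ == "__main__":
    for size in (1, 1024, 1 << 20):
        print(f"{size:>8} B: {ping_pong_latency(size) * 1e6:.1f} us")
```

Even in this toy setup you can see the classic latency/bandwidth split: tiny messages are dominated by per-message overhead, while the 1 MiB payload's time grows with the copy cost.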
nanoMPI’s design is based on OpenMPI. OpenMPI’s shortcoming is that it’s built for a different purpose (modularity and performance), so it’s harder to get quick answers and results from it compared to nanoMPI’s purpose (clarity and easy installation).
I consider nanoMPI to be a companion piece to conceptual MPI material (e.g. you read a description and see a visual of a ring allreduce, but what does this actually look like in code?)
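In that spirit, here is a sketch of what a ring allreduce looks like in code. This is a plain-Python simulation, not nanoMPI's actual implementation: all P ranks live in one process as lists, and each "send" is a snapshot copy so every step behaves as if the exchanges happened simultaneously.

```python
def ring_allreduce(buffers):
    """Simulate a ring allreduce (sum) across P ranks.

    `buffers` is a list of P equal-length lists, one per rank. Returns the
    per-rank results; every rank ends up with the elementwise sum.
    """
    P = len(buffers)
    n = len(buffers[0])
    assert n % P == 0, "for simplicity, length must split into P equal chunks"
    chunk = n // P
    bufs = [list(b) for b in buffers]  # copy so callers' buffers survive

    def idx(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) % P to
    # rank (r + 1) % P, which adds it into its own copy. After P-1 steps,
    # rank r holds the fully reduced chunk (r + 1) % P.
    for step in range(P - 1):
        outgoing = []  # snapshot all sends first: exchanges are simultaneous
        for r in range(P):
            c = (r - step) % P
            outgoing.append((c, [bufs[r][i] for i in idx(c)]))
        for r in range(P):
            c, data = outgoing[(r - 1) % P]  # receive from left neighbor
            for k, i in enumerate(idx(c)):
                bufs[r][i] += data[k]

    # Phase 2: allgather. In step s, rank r forwards its freshest reduced
    # chunk, (r + 1 - s) % P, and the receiver overwrites (no reduction).
    for step in range(P - 1):
        outgoing = []
        for r in range(P):
            c = (r + 1 - step) % P
            outgoing.append((c, [bufs[r][i] for i in idx(c)]))
        for r in range(P):
            c, data = outgoing[(r - 1) % P]
            for k, i in enumerate(idx(c)):
                bufs[r][i] = data[k]

    return bufs
```

Each rank sends 2(P-1) chunks of size n/P, which is why the ring algorithm's bandwidth cost stays nearly constant as you add ranks — exactly the kind of property you can poke at interactively once the algorithm is a few dozen lines of plain code.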
nanoMPI serves the dual purpose of:
Providing a minimal implementation for HPC education
Testing distributed code on offline, local machines (I just wanna code on my laptop on a plane, not a remote HPC system)
Inspired by “minimal implementation” projects in AI such as
@karpathy.bsky.social’s nanoGPT, I worked to bring this concept to the HPC world!
I’ve built a minimal implementation of an MPI library called nanoMPI, which focuses on clarity, simplicity, and easy installation.
We are the first to demonstrate higher training kernel throughput (both transformers and SSM hybrids) on AMD MI300X compared to H100!
- rocm.blogs.amd.com/ecosystems-a...
- www.zyphra.com/post/trainin...
C R A C K E D
We dropped the Zamba2 and Zyda-2 tech reports on arxiv!
- Zamba2 models of size 1.2B, 2.7B, 7.4B
- Zyda-2 5T token dataset
- We discuss more specifics on model arch, training process, dataset creation, etc
Links:
- Zamba2: arxiv.org/abs/2411.15242
- Zyda-2: arxiv.org/abs/2411.06068
I keep coming back to interstellar: youtu.be/YF1eYbfbH5k?...