
Posts by Dhruv Rawat

4. Low-Latency Interconnect: Reducing interconnect latency through directions such as:

- high-connectivity topologies (tree, dragonfly) that require fewer hops
- in-network acceleration of the communication collectives (broadcast, all-reduce) used by LLMs
- AI chip optimization
- co-designing reliability and the interconnect
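The collective these directions attack can be illustrated with a toy ring all-reduce (a sketch; the node count, vector sizes, and ring schedule below are illustrative, not from the post):

```python
# Toy simulation of ring all-reduce. A ring needs 2*(N-1) serialized
# communication steps; lower-diameter topologies and in-network
# (switch-side) reduction both shrink that critical path.

def ring_all_reduce(vectors):
    """Return a copy of `vectors` in which every node holds the elementwise sum."""
    n = len(vectors)
    assert len(vectors[0]) % n == 0, "vector length must divide evenly by node count"
    chunk = len(vectors[0]) // n
    data = [list(v) for v in vectors]

    # Phase 1: reduce-scatter. At step s, node i forwards chunk (i - s) % n
    # to its ring neighbor, which accumulates it. After n-1 steps, node i
    # owns the fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i - s) % n, (i + 1) % n
            for j in range(c * chunk, (c + 1) * chunk):
                data[dst][j] += data[i][j]

    # Phase 2: all-gather. The fully reduced chunks circulate around the
    # ring, overwriting stale partial sums, for another n-1 steps.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - s) % n, (i + 1) % n
            data[dst][c * chunk:(c + 1) * chunk] = data[i][c * chunk:(c + 1) * chunk]

    return data

print(ring_all_reduce([[0] * 4, [1] * 4, [2] * 4, [3] * 4]))
# every node ends with [6, 6, 6, 6]
```

The arithmetic is fixed; what topology choice and in-network offload change is the number of serialized hops each chunk traverses.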

2 months ago

3. 3D Memory-Logic Stacking: Stacking memory and logic dies vertically with through-silicon vias (TSVs) to create a wide, dense memory interface that delivers high bandwidth at low power. This is a form of PNM.
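A back-of-envelope calculation shows why a wide stacked interface wins (a sketch; the bus widths and transfer rates below are generic assumed figures, not numbers from the post):

```python
def peak_bandwidth_gbs(bus_width_bits, transfer_rate_gtps):
    """Peak bandwidth in GB/s = (bits per transfer) * (GT/s) / 8 bits per byte."""
    return bus_width_bits * transfer_rate_gtps / 8

# A conventional off-package DRAM channel: narrow bus, driven fast per pin.
narrow = peak_bandwidth_gbs(64, 6.4)      # 51.2 GB/s
# A stacked, TSV-dense interface: 1024-bit bus at the same per-pin rate.
wide = peak_bandwidth_gbs(1024, 6.4)      # 819.2 GB/s

print(narrow, wide)  # 51.2 819.2 — a 16x bandwidth gain from width alone
```

The short vertical TSV wires also cost far less energy per bit than board-level traces, which is where the "at low power" part of the claim comes from.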

2 months ago

2. Processing-Near-Memory (PNM): Moving computation closer to where data is stored (but on separate dies) to overcome bandwidth limitations. The authors clearly distinguish this from Processing-in-Memory (PIM), in which the processor and memory share the same die.
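The bandwidth limitation PNM targets shows up in a roofline-style estimate of the Decode phase (illustrative numbers; the model size and fp16 dtype are assumptions, not from the post):

```python
# Per-token Decode at batch size 1 reads every weight once but performs
# only ~2 FLOPs per weight (one multiply-add), so arithmetic intensity
# is tiny and memory bandwidth, not compute, sets the token rate.
params = 70e9                 # assumed 70B-parameter model
bytes_moved = params * 2      # fp16: 2 bytes per weight, read once per token
flops = params * 2            # one fused multiply-add per weight

intensity = flops / bytes_moved
print(intensity)              # 1.0 FLOP/byte — far below the hundreds of
                              # FLOPs/byte an accelerator needs to stay
                              # compute-bound, which is why moving compute
                              # nearer to memory pays off for Decode
```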

2 months ago

1. High Bandwidth Flash (HBF): Developing flash storage that offers 10x the capacity of HBM while maintaining comparable bandwidth. This enables new capabilities for LLM inference: 10x weight memory, 10x context memory, a smaller inference system, and greater resource capacity.
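The capacity side of that claim is easy to sanity-check (a sketch; the per-stack capacity and stack count are illustrative assumptions, not figures from the post or the paper):

```python
hbm_stack_gb = 24                    # assumed capacity of one HBM stack
hbf_stack_gb = 10 * hbm_stack_gb     # the claimed 10x capacity per stack
stacks = 8                           # assumed stacks on one accelerator package

hbm_total = stacks * hbm_stack_gb    # 192 GB of HBM
hbf_total = stacks * hbf_stack_gb    # 1920 GB of HBF

# At 2 bytes per fp16 weight, billions of parameters that fit per package:
print(hbm_total / 2)                 # 96.0  -> ~96B parameters on HBM
print(hbf_total / 2)                 # 960.0 -> ~960B parameters on HBF
```

The same 10x headroom applies to KV-cache (context) memory, which is why the post lists both weight and context memory as beneficiaries.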

2 months ago
Challenges and Research Directions for Large Language Model Inference Hardware
"Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI tr..."

David Patterson & Xiaoyu Ma (Google) have written a new paper arguing that LLM inference is not just a compute problem but poses its own distinct hardware challenges, and proposing four promising research directions.

arxiv.org/abs/2601.05047

2 months ago