The results show that AQPIM achieves a drastic reduction in GPU-CPU communication, which can account for 90–98.5% of decoding latency, together with a 3.4× speedup over a state-of-the-art PIM approach thanks to the aggressively reduced KV cache size.
Posts by Underfox
AQPIM transforms the general matrix-vector multiplication in attention into a sequence of efficient lookups and summations that operate directly on the compressed data, eliminating dequantization and allowing the use of existing simple FP16 MAC units.
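The lookup-and-sum trick comes from product quantization: once an activation vector is stored as a handful of centroid codes, a dot product reduces to table lookups plus a summation. A minimal sketch of that mechanism, with random codebooks standing in for AQPIM's actual online clustering (the sizes and names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 64, 8, 16          # vector dim, subspaces, centroids per subspace
d = D // M

# Toy per-subspace codebooks (AQPIM would learn these online on the PIM side).
codebooks = rng.normal(size=(M, K, d))

def pq_encode(x):
    """Assign each subvector of x to its nearest centroid -> M small codes."""
    codes = np.empty(M, dtype=np.int64)
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        codes[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
    return codes

def pq_dot(q, codes):
    """q . x_hat computed as M table lookups + a summation, no dequantization."""
    # Lookup table: dot of each query subvector with every centroid.
    table = np.einsum('mkd,md->mk', codebooks, q.reshape(M, d))
    return table[np.arange(M), codes].sum()

x = rng.normal(size=D)
q = rng.normal(size=D)
codes = pq_encode(x)          # compressed activation: just M integer codes
x_hat = np.concatenate([codebooks[m, codes[m]] for m in range(M)])
# Lookup-sum equals the exact dot product with the reconstructed vector.
assert np.isclose(pq_dot(q, codes), q @ x_hat)
```

The key point is that `x` is never reconstructed during the dot product; only the small per-subspace tables are touched, which is what makes the operation cheap on simple MAC units.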
This paper proposes AQPIM, a novel PIM-aware activation quantization framework based on product quantization and optimized for modern LLMs, which enables practical, high-fidelity online clustering-based quantization by leveraging PIM's massive internal bandwidth.
arxiv.org/pdf/2604.18137
The results with three LLMs at eleven different sizes on three GPU platforms with CXL-expanded memory show that HybridGen outperforms six state-of-the-art KV cache management methods by 1.41x–3.2x on average while maintaining superior accuracy.
HybridGen decouples attention logits from the attention pipeline and enables the CPU to proactively compute the next layer’s attention using the current layer’s input, leveraging similarities between the inputs of consecutive transformer layers.
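The premise behind that overlap is that inputs to consecutive transformer layers are highly correlated, so logits computed speculatively from the previous layer's input land close to the exact ones. A toy sketch of that premise only (synthetic data; this is not HybridGen's actual pipeline or its correction mechanism):

```python
import numpy as np

rng = np.random.default_rng(1)
T, dk = 8, 16                    # toy sequence length and head dim

def attn_logits(x, Wq, Wk):
    q, k = x @ Wq, x @ Wk
    return (q @ k.T) / np.sqrt(dk)

Wq, Wk = rng.normal(size=(2, dk, dk)) / np.sqrt(dk)

x_l  = rng.normal(size=(T, dk))                 # input to layer l
x_l1 = x_l + 0.05 * rng.normal(size=(T, dk))    # layer l+1 input: highly similar

# CPU speculatively computes layer-(l+1) logits from the layer-l input...
spec  = attn_logits(x_l,  Wq, Wk)
# ...while the exact logits would use the true layer-(l+1) input.
exact = attn_logits(x_l1, Wq, Wk)

rel_err = np.linalg.norm(spec - exact) / np.linalg.norm(exact)
print(f"relative logit error: {rel_err:.3f}")
```

With a 5% perturbation between layer inputs, the speculative logits stay within a small relative error, which is the slack the CPU-side proactive computation exploits.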
This paper proposes HybridGen, an efficient hybrid attention framework for long-context LLM inference, enabling CPU–GPU collaborative attention on systems with CXL memory.
arxiv.org/pdf/2604.18529
Under equivalent performance targets in the decode setting, it further delivers up to 1.93x and 2.72x higher power efficiency over the baseline NPU and H100, respectively.
The experimental results show that, under the same power budget for agentic workloads, MemExplorer achieves up to 2.3x higher energy efficiency than the baseline NPU and 3.23x higher than the H100 in the prefill-only setting.
MemExplorer unifies software optimizations, including dataflow strategy and storage scheduling, with NPU compute and memory configurations into a single co-design space, enabling joint optimization across the NPU architecture, its memory hierarchy, and SW execution strategies.
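A joint design space like this can be pictured as one search over hardware and software knobs under a shared cost model. The sketch below is a deliberately tiny stand-in: the knobs, the analytical latency/power model, and every constant are invented for illustration and bear no relation to MemExplorer's actual models.

```python
import itertools

# Toy co-design space (all numbers illustrative, not from the paper).
DATAFLOWS = {"weight-stationary": 0.8, "output-stationary": 1.0}  # reuse factor
SRAM_KB   = [256, 512, 1024]
HBM_GBs   = [400, 800]

def evaluate(dataflow, sram, bw):
    """Toy analytical model: bigger SRAM -> more reuse -> less HBM traffic."""
    reuse   = DATAFLOWS[dataflow] * (sram / 256) ** 0.5
    traffic = 1e3 / reuse                     # GB moved per token (toy)
    latency = traffic / bw                    # memory-bound decode (toy)
    power   = 50 + 0.05 * sram / 1024 * 100 + 0.02 * bw   # W (toy)
    return latency, power

# Jointly pick dataflow + memory config: minimize latency under a power budget.
best = min(
    ((df, s, b) for df, s, b in itertools.product(DATAFLOWS, SRAM_KB, HBM_GBs)
     if evaluate(df, s, b)[1] <= 100),        # 100 W budget
    key=lambda p: evaluate(*p)[0],
)
print("best design point under budget:", best)
```

The point of the single search space is that a software choice (dataflow) and a hardware choice (SRAM size, bandwidth) trade off against each other through the same cost model, so neither can be optimized in isolation.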
This paper presents MemExplorer, a new memory-system synthesizer for heterogeneous NPU systems that captures the intrinsic relationship between compute and multi-level memory hierarchies in LLM inference NPUs.
arxiv.org/pdf/2604.16007
"When new fact acquisition is required, self-distillation constrains output-distribution drift and reduces forgetting from ∼15% to ∼3% without sacrificing factual plasticity."
"This method reframes SFT-induced hallucinations as factual forgetting arising from continual learning dynamics. In this way, when new fact acquisition is undesirable, selectively freezing FFN parameters suppresses factual plasticity while preserving task learning."
This paper proposes a self-distillation-based supervised fine-tuning method that facilitates effective factual learning while minimizing hallucinations.
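The self-distillation constraint can be sketched as an SFT loss with an added KL term pulling the fine-tuned distribution back toward the frozen pre-SFT model. This is a minimal sketch of that general recipe; the exact loss form, the weighting `lam`, and the toy logits are assumptions, not the paper's formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sft_self_distill_loss(logits, ref_logits, target, lam=0.5):
    """Cross-entropy on the new fact + KL to the frozen pre-SFT model.

    The KL term penalizes output-distribution drift, the mechanism the
    paper credits for reducing forgetting; lam is an illustrative weight.
    """
    p, p_ref = softmax(logits), softmax(ref_logits)
    ce = -np.log(p[target])                                # learn the new fact
    kl = np.sum(p_ref * (np.log(p_ref) - np.log(p)))       # constrain drift
    return ce + lam * kl

logits     = np.array([2.0, 0.5, -1.0])   # current model, toy vocab of 3
ref_logits = np.array([1.8, 0.6, -0.9])   # frozen pre-fine-tuning model
loss = sft_self_distill_loss(logits, ref_logits, target=0)
print(f"loss = {loss:.4f}")
```

Setting `lam=0` recovers plain SFT; increasing it trades factual plasticity for retention, matching the forgetting-versus-learning tension the posts describe.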
arxiv.org/pdf/2604.15574
It further proposes a frequency-aware densification strategy that utilizes a decomposed frequency-energy map to clone primitives within a specific frequency range.
This paper introduces neural Gabor splatting, a new representation that equips each Gaussian primitive with a lightweight MLP, enabling a single primitive to capture complex, view-dependent local color patterns.
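The "Gabor" in the name refers to a Gaussian envelope modulated by a sinusoidal carrier, which, unlike a plain Gaussian, can represent oscillatory detail at a chosen frequency inside one primitive. A 1D analytic sketch of that idea (the paper's primitives use a per-Gaussian MLP, not this closed form):

```python
import numpy as np

def gabor(x, mu=0.0, sigma=1.0, freq=2.0, phase=0.0):
    """A Gabor atom: Gaussian envelope times a sinusoidal carrier.

    The carrier lets a single primitive encode local oscillation at
    frequency `freq`; freq=0 degenerates to an ordinary Gaussian.
    """
    envelope = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    carrier = np.cos(2 * np.pi * freq * (x - mu) + phase)
    return envelope * carrier

x = np.linspace(-3, 3, 601)
plain_gaussian = gabor(x, freq=0.0)   # no carrier -> plain Gaussian
g = gabor(x, freq=2.0)
# The Gabor atom changes sign inside its envelope; a Gaussian never does.
assert plain_gaussian.min() >= 0.0 and g.min() < 0.0
```

The frequency parameter is also what makes the frequency-aware densification natural: primitives can be cloned into the specific band where the decomposed frequency-energy map shows unexplained energy.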
arxiv.org/pdf/2604.15941
Deployed across two exascale supercomputers, the proposed architecture attains a peak performance of 1.2/1.0 EFLOPS (24%/35.5% of theoretical peak) in single precision at over 90% parallel efficiency, compressing the training of billion-parameter uMLIPs from weeks to hours.
This paper presents MatRIS-MoE, an invariant, attention-based uMLIP designed for highly efficient atomistic modeling at exascale, together with Janus, a pioneering high-dimensional distributed training framework for uMLIPs with hardware-aware optimizations.
arxiv.org/pdf/2604.15821
The resulting model enables the unprecedented evaluation of 1.1 billion structures in 50 seconds, accelerating materials discovery by enabling rapid exploration of vast chemical design spaces.
The researchers jointly train on 16 open first-principles datasets (over 544 million structures spanning more than 85 elements) using a multi-task architecture with a separate head per dataset and a scalable ADIOS2/DDStore data pipeline.
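The multi-task setup, a shared backbone with one small output head per dataset, can be sketched in a few lines. Here a tanh MLP stands in for the graph foundation model, and the dataset names and sizes are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
D_IN, D_HID = 32, 64
DATASETS = ["dataset_a", "dataset_b", "dataset_c"]   # hypothetical names

# Shared backbone weights + one dataset-specific output head each.
W_shared = rng.normal(size=(D_IN, D_HID)) / np.sqrt(D_IN)
heads = {name: rng.normal(size=(D_HID, 1)) / np.sqrt(D_HID) for name in DATASETS}

def predict(x, dataset):
    """Shared representation, then the head for the sample's source dataset."""
    h = np.tanh(x @ W_shared)            # stand-in for the shared GNN encoder
    return (h @ heads[dataset]).squeeze(-1)

batch = rng.normal(size=(4, D_IN))       # 4 toy structure embeddings
energies = predict(batch, "dataset_b")
assert energies.shape == (4,)
```

Per-dataset heads let each source keep its own level of theory and label conventions while the backbone learns one shared representation across all 544+ million structures.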
This paper presents the first exascale multi-task workflow for atomistic graph foundation models, trained on more than 544 million structures from 16 datasets using 16,384 GPUs on the Frontier supercomputer.
arxiv.org/pdf/2604.15380
"At larger batch sizes, cooperative weight tiling increases L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), reducing HBM traffic by up to 37% and delivering 1.27–1.30x speedup over a chiplet-unaware megakernel baseline."
"On AMD Instinct MI350 with Qwen3-8B, Fleet achieves 1.3–1.5x lower decode latency than vLLM at batch sizes 1–8 through persistent kernel execution and per-chiplet scheduling."
In Fleet, an application workload is partitioned into tasks with explicit dependencies between them. Tasks are scoped to different levels of the memory hierarchy:
1 - Wavefront-tasks operate within a single wavefront using registers and LDS;
2 - CU-tasks occupy a single compute unit, communicating through the L1/LDS hierarchy;
3 - Chiplet-tasks span all workers on a chiplet, coordinating data access through the XCD-local L2 cache;
4 - Device-tasks, each composed of eight Chiplet-tasks, collectively execute one operator across the entire device.
By design, wavefront-tasks, CU-tasks, and device-tasks roughly correspond to the familiar CUDA/HIP concepts of wavefronts, workgroups, and grids.
The key new abstraction is the Chiplet-task, which lets the programmer express behavior scoped to chiplet boundaries and the corresponding L2 memory hierarchy. In this work, CU-tasks and Chiplet-tasks are used for compute operations.
Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse.
AMD researchers have proposed Fleet, a multi-level task model that maps megakernel computation to memory scopes, introducing a new abstraction that binds work and data to a chiplet and enables coordination through its shared L2 cache.
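The four-level task hierarchy can be modeled as tasks bound to memory scopes, where a device-task decomposes into eight chiplet-tasks. This is a structural sketch only; the class names, fields, and worker counts (4 wavefronts/CU, 38 CUs/chiplet) are illustrative assumptions, not Fleet's actual API or MI350 configuration:

```python
from dataclasses import dataclass, field

# Memory scopes from innermost to outermost, mirroring the Fleet hierarchy.
SCOPES = ["wavefront", "cu", "chiplet", "device"]

@dataclass
class Task:
    name: str
    scope: str                      # memory level the task is bound to
    deps: list = field(default_factory=list)

    def workers(self, waves_per_cu=4, cus_per_chiplet=38, chiplets=8):
        """How many wavefronts cooperate on this task (toy counts)."""
        per_scope = {
            "wavefront": 1,
            "cu": waves_per_cu,
            "chiplet": waves_per_cu * cus_per_chiplet,
            "device": waves_per_cu * cus_per_chiplet * chiplets,
        }
        return per_scope[self.scope]

# A device-task for one operator decomposes into eight chiplet-tasks,
# each coordinating its workers through the XCD-local L2.
chiplet_tasks = [Task(f"gemm_xcd{i}", "chiplet") for i in range(8)]
gemm = Task("gemm", "device", deps=chiplet_tasks)
assert gemm.workers() == sum(t.workers() for t in gemm.deps)
```

The useful property this models is that a chiplet-task's worker set exactly covers one L2 cache's clients, so cooperative weight tiling within a chiplet-task is also cooperative reuse of that L2.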
arxiv.org/pdf/2604.15379
These results, together with the structural scalability, confirm the strong potential of VCF for reliable high-density FeRAM integration.
This device also achieves robust retention (> 10⁴ s at 85 °C) and strong disturb immunity, with 2Pr > 80 μC/cm² under a V/3 scheme after 10⁶ disturb pulses.