The results show that AQPIM achieves a drastic reduction in GPU-CPU communication, which can account for 90–98.5% of decoding latency, together with a 3.4× speedup over a state-of-the-art PIM approach thanks to the aggressively reduced KV cache size.
Posts by Underfox
AQPIM transforms the general matrix-vector multiplication in attention into a sequence of efficient lookups and summations that operate directly on the compressed data, eliminating dequantization and allowing the use of existing simple FP16 MAC units.
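The lookup-and-sum trick comes from product quantization: once an activation vector is stored as a handful of centroid codes, a dot product reduces to table lookups plus a summation. A minimal sketch of that mechanism, with random codebooks standing in for AQPIM's actual online clustering (the sizes and names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 64, 8, 16          # vector dim, subspaces, centroids per subspace
d = D // M

# Toy per-subspace codebooks (AQPIM would learn these online on the PIM side).
codebooks = rng.normal(size=(M, K, d))

def pq_encode(x):
    """Assign each subvector of x to its nearest centroid -> M small codes."""
    codes = np.empty(M, dtype=np.int64)
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        codes[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
    return codes

def pq_dot(q, codes):
    """q . x_hat computed as M table lookups + a summation, no dequantization."""
    # Lookup table: dot of each query subvector with every centroid.
    table = np.einsum('mkd,md->mk', codebooks, q.reshape(M, d))
    return table[np.arange(M), codes].sum()

x = rng.normal(size=D)
q = rng.normal(size=D)
codes = pq_encode(x)          # compressed activation: just M integer codes
x_hat = np.concatenate([codebooks[m, codes[m]] for m in range(M)])
# Lookup-sum equals the exact dot product with the reconstructed vector.
assert np.isclose(pq_dot(q, codes), q @ x_hat)
```

The key point is that `x` is never reconstructed during the dot product; only the small per-subspace tables are touched, which is what makes the operation cheap on simple MAC units.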
This paper proposes AQPIM, a novel PIM-aware activation quantization framework based on product quantization and optimized for modern LLMs, which enables practical, high-fidelity online clustering-based quantization by leveraging PIM's massive internal bandwidth.
arxiv.org/pdf/2604.18137
The results with three LLMs at eleven different sizes on three GPU platforms with CXL-expanded memory show that HybridGen outperforms six state-of-the-art KV cache management methods by 1.41x–3.2x on average while maintaining superior accuracy.
HybridGen decouples attention logits from the attention pipeline and enables the CPU to proactively compute the next layer’s attention using the current layer’s input, leveraging similarities between the inputs of consecutive transformer layers.
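The premise behind that overlap is that inputs to consecutive transformer layers are highly correlated, so logits computed speculatively from the previous layer's input land close to the exact ones. A toy sketch of that premise only (synthetic data; this is not HybridGen's actual pipeline or its correction mechanism):

```python
import numpy as np

rng = np.random.default_rng(1)
T, dk = 8, 16                    # toy sequence length and head dim

def attn_logits(x, Wq, Wk):
    q, k = x @ Wq, x @ Wk
    return (q @ k.T) / np.sqrt(dk)

Wq, Wk = rng.normal(size=(2, dk, dk)) / np.sqrt(dk)

x_l  = rng.normal(size=(T, dk))                 # input to layer l
x_l1 = x_l + 0.05 * rng.normal(size=(T, dk))    # layer l+1 input: highly similar

# CPU speculatively computes layer-(l+1) logits from the layer-l input...
spec  = attn_logits(x_l,  Wq, Wk)
# ...while the exact logits would use the true layer-(l+1) input.
exact = attn_logits(x_l1, Wq, Wk)

rel_err = np.linalg.norm(spec - exact) / np.linalg.norm(exact)
print(f"relative logit error: {rel_err:.3f}")
```

With a 5% perturbation between layer inputs, the speculative logits stay within a small relative error, which is the slack the CPU-side proactive computation exploits.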
This paper proposes HybridGen, an efficient hybrid attention framework for long-context LLM inference, enabling CPU–GPU collaborative attention on systems with CXL memory.
arxiv.org/pdf/2604.18529
Under equivalent performance targets in the decode setting, it further delivers up to 1.93x and 2.72x higher power efficiency over the baseline NPU and H100, respectively.
The experimental results show that, under the same power budget for agentic workloads, MemExplorer achieves up to 2.3x higher energy efficiency than the baseline NPU and 3.23x higher than the H100 in the prefill-only setting.
MemExplorer unifies software optimizations, including dataflow strategy and storage scheduling, with NPU compute and memory configurations into a single co-design space, enabling joint optimization across the NPU architecture, its memory hierarchy, and SW execution strategies.
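A joint design space like this can be pictured as one search over hardware and software knobs under a shared cost model. The sketch below is a deliberately tiny stand-in: the knobs, the analytical latency/power model, and every constant are invented for illustration and bear no relation to MemExplorer's actual models.

```python
import itertools

# Toy co-design space (all numbers illustrative, not from the paper).
DATAFLOWS = {"weight-stationary": 0.8, "output-stationary": 1.0}  # reuse factor
SRAM_KB   = [256, 512, 1024]
HBM_GBs   = [400, 800]

def evaluate(dataflow, sram, bw):
    """Toy analytical model: bigger SRAM -> more reuse -> less HBM traffic."""
    reuse   = DATAFLOWS[dataflow] * (sram / 256) ** 0.5
    traffic = 1e3 / reuse                     # GB moved per token (toy)
    latency = traffic / bw                    # memory-bound decode (toy)
    power   = 50 + 0.05 * sram / 1024 * 100 + 0.02 * bw   # W (toy)
    return latency, power

# Jointly pick dataflow + memory config: minimize latency under a power budget.
best = min(
    ((df, s, b) for df, s, b in itertools.product(DATAFLOWS, SRAM_KB, HBM_GBs)
     if evaluate(df, s, b)[1] <= 100),        # 100 W budget
    key=lambda p: evaluate(*p)[0],
)
print("best design point under budget:", best)
```

The point of the single search space is that a software choice (dataflow) and a hardware choice (SRAM size, bandwidth) trade off against each other through the same cost model, so neither can be optimized in isolation.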
This paper presents MemExplorer, a new memory-system synthesizer for heterogeneous NPU systems that captures the intrinsic relationship between compute and multi-level memory hierarchies in LLM inference NPUs.
arxiv.org/pdf/2604.16007
"When new fact acquisition is required, self-distillation constrains output-distribution drift and reduces forgetting from ∼15% to ∼3% without sacrificing factual plasticity."
"This method reframes SFT-induced hallucinations as factual forgetting arising from continual learning dynamics. In this way, when new fact acquisition is undesirable, selectively freezing FFN parameters suppresses factual plasticity while preserving task learning."
This paper proposes a self-distillation-based supervised fine-tuning method that facilitates effective factual learning while minimizing hallucinations.
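The self-distillation constraint can be sketched as an SFT loss with an added KL term pulling the fine-tuned distribution back toward the frozen pre-SFT model. This is a minimal sketch of that general recipe; the exact loss form, the weighting `lam`, and the toy logits are assumptions, not the paper's formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sft_self_distill_loss(logits, ref_logits, target, lam=0.5):
    """Cross-entropy on the new fact + KL to the frozen pre-SFT model.

    The KL term penalizes output-distribution drift, the mechanism the
    paper credits for reducing forgetting; lam is an illustrative weight.
    """
    p, p_ref = softmax(logits), softmax(ref_logits)
    ce = -np.log(p[target])                                # learn the new fact
    kl = np.sum(p_ref * (np.log(p_ref) - np.log(p)))       # constrain drift
    return ce + lam * kl

logits     = np.array([2.0, 0.5, -1.0])   # current model, toy vocab of 3
ref_logits = np.array([1.8, 0.6, -0.9])   # frozen pre-fine-tuning model
loss = sft_self_distill_loss(logits, ref_logits, target=0)
print(f"loss = {loss:.4f}")
```

Setting `lam=0` recovers plain SFT; increasing it trades factual plasticity for retention, matching the forgetting-versus-learning tension the posts describe.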
arxiv.org/pdf/2604.15574
It further proposes a frequency-aware densification strategy that utilizes a decomposed frequency-energy map to clone primitives within a specific frequency range.
This paper introduces neural Gabor splatting, a new representation that equips each Gaussian primitive with a lightweight MLP, enabling a single primitive to capture complex, view-dependent local color patterns.
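The "Gabor" in the name refers to a Gaussian envelope modulated by a sinusoidal carrier, which, unlike a plain Gaussian, can represent oscillatory detail at a chosen frequency inside one primitive. A 1D analytic sketch of that idea (the paper's primitives use a per-Gaussian MLP, not this closed form):

```python
import numpy as np

def gabor(x, mu=0.0, sigma=1.0, freq=2.0, phase=0.0):
    """A Gabor atom: Gaussian envelope times a sinusoidal carrier.

    The carrier lets a single primitive encode local oscillation at
    frequency `freq`; freq=0 degenerates to an ordinary Gaussian.
    """
    envelope = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    carrier = np.cos(2 * np.pi * freq * (x - mu) + phase)
    return envelope * carrier

x = np.linspace(-3, 3, 601)
plain_gaussian = gabor(x, freq=0.0)   # no carrier -> plain Gaussian
g = gabor(x, freq=2.0)
# The Gabor atom changes sign inside its envelope; a Gaussian never does.
assert plain_gaussian.min() >= 0.0 and g.min() < 0.0
```

The frequency parameter is also what makes the frequency-aware densification natural: primitives can be cloned into the specific band where the decomposed frequency-energy map shows unexplained energy.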
arxiv.org/pdf/2604.15941
Deployed across two exascale supercomputers, the proposed architecture attains a peak performance of 1.2/1.0 EFLOPS (24%/35.5% of theoretical peak) in single precision at over 90% parallel efficiency, compressing the training of billion-parameter uMLIPs from weeks to hours.
This paper presents MatRIS-MoE, an invariant, attention-based uMLIP designed for highly efficient atomistic modeling at exascale, together with Janus, a pioneering high-dimensional distributed training framework for uMLIPs with hardware-aware optimizations.
arxiv.org/pdf/2604.15821
The resulting model enables the unprecedented evaluation of 1.1 billion structures in 50 seconds, accelerating materials discovery by enabling rapid exploration of vast chemical design spaces.
The researchers jointly train on 16 open first-principles datasets (over 544 million structures spanning more than 85 elements) using a multi-task architecture with a separate head per dataset and a scalable ADIOS2/DDStore data pipeline.
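The multi-task setup, a shared backbone with one small output head per dataset, can be sketched in a few lines. Here a tanh MLP stands in for the graph foundation model, and the dataset names and sizes are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
D_IN, D_HID = 32, 64
DATASETS = ["dataset_a", "dataset_b", "dataset_c"]   # hypothetical names

# Shared backbone weights + one dataset-specific output head each.
W_shared = rng.normal(size=(D_IN, D_HID)) / np.sqrt(D_IN)
heads = {name: rng.normal(size=(D_HID, 1)) / np.sqrt(D_HID) for name in DATASETS}

def predict(x, dataset):
    """Shared representation, then the head for the sample's source dataset."""
    h = np.tanh(x @ W_shared)            # stand-in for the shared GNN encoder
    return (h @ heads[dataset]).squeeze(-1)

batch = rng.normal(size=(4, D_IN))       # 4 toy structure embeddings
energies = predict(batch, "dataset_b")
assert energies.shape == (4,)
```

Per-dataset heads let each source keep its own level of theory and label conventions while the backbone learns one shared representation across all 544+ million structures.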
This paper presents the first exascale multi-task workflow for atomistic graph foundation models, trained on more than 544 million structures from 16 datasets using 16,384 GPUs on the Frontier supercomputer.
arxiv.org/pdf/2604.15380
"At larger batch sizes, cooperative weight tiling increases L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), reducing HBM traffic by up to 37% and delivering 1.27–1.30x speedup over a chiplet-unaware megakernel baseline."
"On AMD Instinct MI350 with Qwen3-8B, Fleet achieves 1.3–1.5x lower decode latency than vLLM at batch sizes 1–8 through persistent kernel execution and per-chiplet scheduling."
In Fleet, an application workload is partitioned into tasks with explicit dependencies between them. Tasks are scoped to different levels of the memory hierarchy:
1 - Wavefront-tasks operate within a single wavefront using registers and LDS;
2 - CU-tasks occupy a single compute unit, communicating through the L1/LDS hierarchy;
3 - Chiplet-tasks span all workers on a chiplet, coordinating data access through the XCD-local L2 cache;
4 - Device-tasks, each composed of eight Chiplet-tasks, collectively execute one operator across the entire device.
By design, wavefront-tasks, CU-tasks, and device-tasks roughly correspond to the familiar CUDA/HIP concepts of wavefronts, workgroups, and grids.
The key new abstraction is the Chiplet-task, which lets the programmer express behavior scoped to chiplet boundaries and the corresponding L2 memory hierarchy. In this work, CU-tasks and Chiplet-tasks are used for compute operations.
Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse.
AMD researchers have proposed Fleet, a multi-level task model that maps megakernel computation to memory scopes, introducing a new abstraction that binds work and data to a chiplet and enables coordination through its shared L2 cache.
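The four-level task hierarchy can be modeled as tasks bound to memory scopes, where a device-task decomposes into eight chiplet-tasks. This is a structural sketch only; the class names, fields, and worker counts (4 wavefronts/CU, 38 CUs/chiplet) are illustrative assumptions, not Fleet's actual API or MI350 configuration:

```python
from dataclasses import dataclass, field

# Memory scopes from innermost to outermost, mirroring the Fleet hierarchy.
SCOPES = ["wavefront", "cu", "chiplet", "device"]

@dataclass
class Task:
    name: str
    scope: str                      # memory level the task is bound to
    deps: list = field(default_factory=list)

    def workers(self, waves_per_cu=4, cus_per_chiplet=38, chiplets=8):
        """How many wavefronts cooperate on this task (toy counts)."""
        per_scope = {
            "wavefront": 1,
            "cu": waves_per_cu,
            "chiplet": waves_per_cu * cus_per_chiplet,
            "device": waves_per_cu * cus_per_chiplet * chiplets,
        }
        return per_scope[self.scope]

# A device-task for one operator decomposes into eight chiplet-tasks,
# each coordinating its workers through the XCD-local L2.
chiplet_tasks = [Task(f"gemm_xcd{i}", "chiplet") for i in range(8)]
gemm = Task("gemm", "device", deps=chiplet_tasks)
assert gemm.workers() == sum(t.workers() for t in gemm.deps)
```

The useful property this models is that a chiplet-task's worker set exactly covers one L2 cache's clients, so cooperative weight tiling within a chiplet-task is also cooperative reuse of that L2.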
arxiv.org/pdf/2604.15379
These results, together with the structural scalability, confirm the strong potential of VCF for reliable high-density FeRAM integration.
This device also achieves robust retention (> 10⁴ s at 85 °C) and strong disturb immunity, with 2Pr > 80 μC/cm² under a V/3 scheme after 10⁶ disturb pulses.