
Posts by HGPU group

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stor…

#CUDA #LLM #Package

hgpu.org?p=30722

1 week ago
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training rec…

#CUDA #Triton

hgpu.org?p=30721

1 week ago
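The evaluation-driven evolutionary loop that frameworks like Kernel-Smith describe can be illustrated with a toy sketch. Everything here is hypothetical: a real system would compile and time candidate kernels, whereas this stand-in fitness function simply scores a tile-size parameter analytically.

```python
import random

def fitness(tile):
    # Hypothetical stand-in for "compile the kernel with this tile size
    # and measure throughput"; it peaks at tile = 128.
    return -abs(tile - 128)

def evolve(generations=50, pop_size=8, seed=0):
    rng = random.Random(seed)
    # Start from a few plausible tile sizes.
    pop = [rng.choice([16, 32, 64, 256, 512]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                        # keep the best half
        children = [max(8, p + rng.choice([-16, 16])) for p in parents]
        pop = parents + children                              # elitism + mutation
    return max(pop, key=fitness)

best = evolve()
```

Because the parents always survive, the best fitness never regresses; mutation alone is enough to drift the population toward the optimum over a few dozen generations.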
CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between…

#CUDA #LLM #MachineLearning #ML

hgpu.org?p=30720

1 week ago
Agentic Code Optimization via Compiler-LLM Cooperation

Generating performant executables from high-level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series o…

#LLM #CodeGeneration #Package

hgpu.org?p=30719

1 week ago
DVM: Real-Time Kernel Generation for Dynamic AI Models

Dynamism is common in AI computation, e.g., dynamic tensor shapes and dynamic control flows in models. Due to long compilation times, existing runtime compilation damages the model effic…

#LLM #CodeGeneration #AI #Package

hgpu.org?p=30718

1 week ago
DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch refer…

#Triton #CUDA #LLM

hgpu.org?p=30706

3 weeks ago
High-level Programming of Vulkan-based GPUs Through OpenMP

Modern applications often involve complex, structured or data-parallel computations on large datasets. Traditionally, GPUs have served as the primary accelerators for such tasks, mostly through com…

#OpenMP #Vulkan

hgpu.org?p=30705

3 weeks ago
Mixed-precision numerics in scientific applications: survey and perspectives

The explosive demand for artificial intelligence (AI) workloads has led to a significant increase in silicon area dedicated to lower-precision computations on recent high-performance computing hard…

#GPU #MixedPrecision #Review

hgpu.org?p=30704

3 weeks ago
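The core idea surveyed here, storing data in a low precision while accumulating in a higher one, can be demonstrated without any GPU. The sketch below is not taken from the paper; it is a minimal NumPy experiment showing why the accumulator's precision matters even when the inputs are already fp16.

```python
import numpy as np

def accumulate(values, acc_dtype):
    # Sum fp16 inputs with an accumulator of the given precision,
    # rounding the running sum back to acc_dtype after every add.
    s = acc_dtype(0)
    for v in values:
        s = acc_dtype(s + v)
    return float(s)

# 2048 copies of 0.1 stored in fp16. The true sum of the *stored*
# values is 2048 * float16(0.1) = 204.75 exactly.
xs = np.full(2048, 0.1, dtype=np.float16)
exact = float(np.sum(xs.astype(np.float64)))

err16 = abs(accumulate(xs, np.float16) - exact)  # fp16 accumulator drifts badly
err32 = abs(accumulate(xs, np.float32) - exact)  # fp32 accumulator stays accurate
```

Once the fp16 running sum exceeds a few hundred, its spacing between representable values is larger than each 0.1 addend, so the adds round away substantial mass; the fp32 accumulator avoids this while the data stay in fp16.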
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agen…

#CUDA #Triton #Package

hgpu.org?p=30703

3 weeks ago
Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

Memory access errors remain one of the most pervasive bugs in GPU programming. Existing GPU sanitizers such as compute-sanitizer detect memory access errors by instrumenting every memory instructio…

#Triton #ROCm #DeepLearning #Package

hgpu.org?p=30696

4 weeks ago
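The general mechanism, instrumenting each memory access with a bounds check that carries diagnostic context, can be sketched in plain Python. This is an illustration of the idea only, not Triton-Sanitizer's actual API; all names are invented.

```python
class CheckedBuffer:
    """Wraps a flat buffer so every load/store is bounds-checked."""

    def __init__(self, data, name="buf"):
        self.data = list(data)
        self.name = name

    def _check(self, i, op):
        if not 0 <= i < len(self.data):
            # A real sanitizer would also report the kernel, source line,
            # and thread/program id that issued the access.
            raise IndexError(
                f"out-of-bounds {op}: {self.name}[{i}], size {len(self.data)}"
            )

    def load(self, i):
        self._check(i, "read")
        return self.data[i]

    def store(self, i, v):
        self._check(i, "write")
        self.data[i] = v
```

The trade-off the abstract alludes to is visible even here: guarding every access costs a check per load/store, which is why sanitizer overhead and the richness of the diagnostic context are the interesting axes.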
MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In …

#CUDA #LLM #CodeGeneration

hgpu.org?p=30695

4 weeks ago
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity t…

#CUDA #Triton #Benchmarking #Package

hgpu.org?p=30694

4 weeks ago
Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

We present a cross-architecture evaluation of production LLM inference on AMD Instinct MI325X GPUs, benchmarking four models spanning 235B to 1 trillion parameters across three architectural famili…

#AMD #LLM #Benchmarking

hgpu.org?p=30693

4 weeks ago
LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs

We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g., 3B to 32B parameters, on affordable commodity GPUs. These devices are characterized by low mem…

#CUDA #LLM #Package

hgpu.org?p=30692

4 weeks ago
Hunting CUDA Bugs at Scale with cuFuzz

GPUs play an increasingly important role in modern software. However, the heterogeneous host-device execution model and expanding software stacks make GPU programs prone to memory-safety and concur…

#CUDA #Package

hgpu.org?p=30681

1 month ago
True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity

Low-precision neural network training has emerged as a promising direction for reducing computational costs and democratizing access to deep learning research. However, existing 4-bit quantization …

#Precision #CNN #Package

hgpu.org?p=30680

1 month ago
KernelFoundry: Hardware-aware evolutionary GPU kernel optimization

Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel …

#CUDA #SYCL #LLM

hgpu.org?p=30679

1 month ago
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU

Fine-tuning Large Language Models (LLMs) has become essential for domain adaptation, but its memory demands exceed the capabilities of most GPUs. To address this challenge and democrati…

#Triton #NVIDIA #AMD #LLM

hgpu.org?p=30678

1 month ago
KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing L…

#CUDA #LLM #Performance #Package

hgpu.org?p=30665

1 month ago
Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, cu…

#CUDA #Package

hgpu.org?p=30664

1 month ago
AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU

Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. Unlike traditional chat applica…

#CUDA #LLM

hgpu.org?p=30663

1 month ago
EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery

The increasing adoption of Large Language Models (LLMs) has enabled AI scientists to perform complex end-to-end scientific discovery tasks requiring coordination of specialized roles, including ide…

#LLM #AI #Package

hgpu.org?p=30662

1 month ago
Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Quantization addresses the high resource demands of large language models (LLMs) by alleviating memory pressure and bandwidth congestion and providing significantly scaled compute power with a tole…

#LLM #FP4 #NVFP4 #MXFP4 #Precision #AMD #NVIDIA

hgpu.org?p=30661

1 month ago
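Block-wise quantization of the kind analyzed here (both NVFP4 and MXFP4 share one scale per small block of elements) can be emulated in NumPy. The sketch below uses a symmetric 15-level integer grid in place of the real E2M1 floating-point grid, so it illustrates the block-wise scaling mechanism, not the exact FP4 formats.

```python
import numpy as np

def fake_quant_blockwise(x, block=32):
    # One shared scale per block of `block` elements
    # (MXFP4 uses 32-element blocks, NVFP4 uses 16-element blocks).
    assert x.size % block == 0
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # map block amax -> level 7
    scale = np.where(scale == 0, 1.0, scale)                 # avoid divide-by-zero
    q = np.clip(np.round(blocks / scale), -7, 7)             # 15-level symmetric grid
    return (q * scale).reshape(x.shape), scale.ravel()

rng = np.random.default_rng(0)
x = rng.standard_normal(256).astype(np.float32)
deq, scales = fake_quant_blockwise(x)
max_err = float(np.max(np.abs(x - deq)))
```

With round-to-nearest and the scale chosen from the block maximum, the per-element error is bounded by half that block's scale, which is exactly why a single outlier inside a block inflates the error of all its neighbors, the sensitivity effect this paper dissects.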
CONCUR: Benchmarking LLMs for Concurrent Code Generation

Leveraging Large Language Models (LLMs) for code generation has increasingly emerged as a common practice in the domain of software engineering. Relevant benchmarks have been established to evaluat…

#CodeGeneration #LLM #Package

hgpu.org?p=30644

1 month ago
RepoLaunch: Automating Build & Test Pipeline of Code Repositories on ANY Language and ANY Platform

Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We intro…

#LLM #Package

hgpu.org?p=30643

1 month ago
Ray Tracing using HIP

In this technical report, we introduce the basics of ray tracing and explain how to accelerate the computation of the rendering algorithm in HIP. We also show how to use a HIP ray tracing framework…

#HIP #AMD #Raytracing #Rendering #Package

hgpu.org?p=30642

1 month ago
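The central primitive of any such renderer, intersecting a ray with a sphere by solving a quadratic, is independent of HIP and can be sketched in a few lines (plain Python here for readability; the report itself targets HIP C++, and this function is illustrative rather than taken from it).

```python
import math

def ray_sphere(origin, direction, center, radius):
    """Nearest positive hit distance t along a normalized ray, or None.

    Solves |o + t*d - c|^2 = r^2, i.e. t^2 + 2*b*t + c0 = 0 with
    b = (o - c) . d and c0 = |o - c|^2 - r^2.
    """
    oc = [origin[i] - center[i] for i in range(3)]
    b = sum(oc[i] * direction[i] for i in range(3))
    c0 = sum(v * v for v in oc) - radius * radius
    disc = b * b - c0
    if disc < 0:
        return None                       # ray misses the sphere
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):      # try the nearer root first
        if t > 1e-6:                      # epsilon rejects self-intersection
            return t
    return None
```

A GPU version runs this per pixel; the algorithmic content is identical, only the mapping of rays to threads changes.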
Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent

The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experiment…

#Chemistry #LLM #Catalyst

hgpu.org?p=30641

1 month ago
Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native…

#CUDA #LLM #Hopper #FP4 #Precision #Package

hgpu.org?p=30640

1 month ago
CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

Recent studies have demonstrated the potential of Large Language Models (LLMs) in generating GPU kernels. Current benchmarks focus on the translation of high-level languages into CUDA, overlooking …

#CUDA #LLM #Benchmarking #Package

hgpu.org?p=30630

1 month ago
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning

Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side setti…

#CUDA #CodeGeneration #LLM

hgpu.org?p=30629

1 month ago