MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
#CUDA #LLM #Package
hgpu.org?p=30722
Posts by HGPU group
CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
#CUDA #LLM #MachineLearning #ML
hgpu.org?p=30720
Agentic Code Optimization via Compiler-LLM Cooperation
#LLM #CodeGeneration #Package
hgpu.org?p=30719
DVM: Real-Time Kernel Generation for Dynamic AI Models
#LLM #CodeGeneration #AI #Package
hgpu.org?p=30718
DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation
#Triton #CUDA #LLM
hgpu.org?p=30706
Mixed-precision numerics in scientific applications: survey and perspectives
#GPU #MixedPrecision #Review
hgpu.org?p=30704
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
#CUDA #Triton #Package
hgpu.org?p=30703
Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context
#Triton #ROCm #DeepLearning #Package
hgpu.org?p=30696
MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
#CUDA #LLM #CodeGeneration
hgpu.org?p=30695
SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
#CUDA #Triton #Benchmarking #Package
hgpu.org?p=30694
Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study
#AMD #LLM #Benchmarking
hgpu.org?p=30693
True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity
#Precision #CNN #Package
hgpu.org?p=30680
KernelFoundry: Hardware-aware evolutionary GPU kernel optimization
#CUDA #SYCL #LLM
hgpu.org?p=30679
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
#Triton #NVIDIA #AMD #LLM
hgpu.org?p=30678
KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
#CUDA #LLM #Performance #Package
hgpu.org?p=30665
AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU
#CUDA #LLM
hgpu.org?p=30663
EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
#LLM #AI #Package
hgpu.org?p=30662
Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4
#LLM #FP4 #NVFP4 #MXFP4 #Precision #AMD #NVIDIA
hgpu.org?p=30661
CONCUR: Benchmarking LLMs for Concurrent Code Generation
#CodeGeneration #LLM #Package
hgpu.org?p=30644
RepoLaunch: Automating Build & Test Pipeline of Code Repositories on ANY Language and ANY Platform
#LLM #Package
hgpu.org?p=30643
Ray Tracing using HIP
#HIP #AMD #Raytracing #Rendering #Package
hgpu.org?p=30642
Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent
#Chemistry #LLM #Catalyst
hgpu.org?p=30641
Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
#CUDA #LLM #Hopper #FP4 #Precision #Package
hgpu.org?p=30640
CUDABench: Benchmarking LLMs for Text-to-CUDA Generation
#CUDA #LLM #Benchmarking #Package
hgpu.org?p=30630
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
#CUDA #CodeGeneration #LLM
hgpu.org?p=30629