[Scale ML] Lucas Wilkinson: Machete: a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs
Speaker: Lucas Wilkinson (Neural Magic)
Host: Scale ML
Topic: Machete: a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs
Date: Wednesday, December 4
Time: 3:00 PM (ET)
Zoom: https://mit.zoom.us/j/91697262920 (password: mitmlscale)
Abstract
Weight-only quantization (w4a16) offers significant GPU memory reduction for LLM inference while accelerating memory-bound linear layers during generation. Realizing these benefits requires specialized mixed-input GEMM kernels that dequantize weights and perform matrix multiplication on the fly. This talk introduces Machete, a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs. After reviewing modern GPU architecture fundamentals, we'll explore optimizations specific to mixed-input GEMMs, particularly weight pre-shuffling. We'll demonstrate how Machete leverages CUTLASS and CuTe layouts to exploit new Hopper GPU features—namely the tensor memory accelerator (TMA) and WGMMA instructions—achieving superior w4a16 model performance compared to Marlin kernels in high-load serving scenarios.
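For readers unfamiliar with the operation the talk centers on, the snippet below is a minimal NumPy reference sketch (not Machete itself) of what a w4a16 mixed-input GEMM computes: 4-bit weights packed two per byte are unpacked, rescaled with group-wise scales and zero points, and multiplied with 16-bit activations. The packing layout, group size, and zero-point handling here are illustrative assumptions; an optimized kernel fuses these steps on the fly and never materializes the full dequantized weight matrix.

```python
import numpy as np

def dequant_w4a16_ref(packed_w, scales, zeros, group_size=128):
    """Reference dequantization of 4-bit weights (two nibbles per byte).

    packed_w: (K // 2, N) uint8, each byte holds two 4-bit weights along K
    scales:   (K // group_size, N) fp16 group-wise scales
    zeros:    (K // group_size, N) fp16 group-wise zero points
    Returns a (K, N) fp16 weight matrix. This layout is an illustrative
    assumption, not Machete's actual pre-shuffled format.
    """
    lo = (packed_w & 0x0F).astype(np.float16)
    hi = (packed_w >> 4).astype(np.float16)
    w = np.empty((packed_w.shape[0] * 2, packed_w.shape[1]), dtype=np.float16)
    w[0::2] = lo
    w[1::2] = hi
    # Broadcast each group's scale/zero point over its group_size rows of K.
    s = np.repeat(scales, group_size, axis=0)
    z = np.repeat(zeros, group_size, axis=0)
    return (w - z) * s

def mixed_input_gemm_ref(x, packed_w, scales, zeros, group_size=128):
    """y = x @ dequant(W): the operation a mixed-input kernel performs fused."""
    w = dequant_w4a16_ref(packed_w, scales, zeros, group_size)
    return x.astype(np.float32) @ w.astype(np.float32)

# Tiny smoke test with random data.
K, N, M, g = 256, 64, 8, 128
packed = np.random.randint(0, 256, size=(K // 2, N), dtype=np.uint8)
scales = (np.random.rand(K // g, N) * 0.1).astype(np.float16)
zeros = np.full((K // g, N), 8.0, dtype=np.float16)
x = np.random.randn(M, K).astype(np.float16)
print(mixed_input_gemm_ref(x, packed, scales, zeros, g).shape)  # (8, 64)
```

During generation the activation matrix x is short and wide (small M, large K), so the layer is bound by reading the weights; shrinking them to 4 bits and dequantizing in-kernel is what makes this memory-bound GEMM faster.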
Bio
Lucas Wilkinson is a Principal HPC Engineer at Neural Magic, specializing in the development of novel kernels for quantized and sparse matrix multiplication. He holds an MSc in Computer Science from the University of Toronto, where he focused on efficient sparse neural network inference. His areas of interest include maximizing hardware utilization, sparse linear algebra, and code generation.