[Scale ML] Lucas Wilkinson: Machete: a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs
Speaker: Lucas Wilkinson (Neural Magic)
Host: Scale ML
Topic: Machete: a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs
Date: Wednesday, December 4
Time: 3:00 PM (ET)
Zoom: https://mit.zoom.us/j/91697262920 (password: mitmlscale)
Abstract
Weight-only quantization (w4a16) offers significant GPU memory reduction for LLM inference while accelerating memory-bound linear layers during generation. Realizing these benefits requires specialized mixed-input GEMM kernels that dequantize weights and perform matrix multiplication on the fly. This talk introduces Machete, a cutting-edge mixed-input GEMM GPU kernel targeting NVIDIA Hopper GPUs. After reviewing modern GPU architecture fundamentals, we'll explore optimizations specific to mixed-input GEMMs, particularly weight pre-shuffling. We'll demonstrate how Machete leverages CUTLASS and CuTe layouts to exploit new Hopper GPU features—namely the tensor memory accelerator (TMA) and WGMMA instructions—achieving superior w4a16 model performance compared to Marlin kernels in high-load serving scenarios.
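For readers unfamiliar with the operation the talk centers on, the snippet below is a minimal NumPy reference sketch (not Machete itself) of what a w4a16 mixed-input GEMM computes: 4-bit weights packed two per byte are unpacked, rescaled with group-wise scales and zero points, and multiplied with 16-bit activations. The packing layout, group size, and zero-point handling here are illustrative assumptions; an optimized kernel fuses these steps on the fly and never materializes the full dequantized weight matrix.

```python
import numpy as np

def dequant_w4a16_ref(packed_w, scales, zeros, group_size=128):
    """Reference dequantization of 4-bit weights (two nibbles per byte).

    packed_w: (K // 2, N) uint8, each byte holds two 4-bit weights along K
    scales:   (K // group_size, N) fp16 group-wise scales
    zeros:    (K // group_size, N) fp16 group-wise zero points
    Returns a (K, N) fp16 weight matrix. This layout is an illustrative
    assumption, not Machete's actual pre-shuffled format.
    """
    lo = (packed_w & 0x0F).astype(np.float16)
    hi = (packed_w >> 4).astype(np.float16)
    w = np.empty((packed_w.shape[0] * 2, packed_w.shape[1]), dtype=np.float16)
    w[0::2] = lo
    w[1::2] = hi
    # Broadcast each group's scale/zero point over its group_size rows of K.
    s = np.repeat(scales, group_size, axis=0)
    z = np.repeat(zeros, group_size, axis=0)
    return (w - z) * s

def mixed_input_gemm_ref(x, packed_w, scales, zeros, group_size=128):
    """y = x @ dequant(W): the operation a mixed-input kernel performs fused."""
    w = dequant_w4a16_ref(packed_w, scales, zeros, group_size)
    return x.astype(np.float32) @ w.astype(np.float32)

# Tiny smoke test with random data.
K, N, M, g = 256, 64, 8, 128
packed = np.random.randint(0, 256, size=(K // 2, N), dtype=np.uint8)
scales = (np.random.rand(K // g, N) * 0.1).astype(np.float16)
zeros = np.full((K // g, N), 8.0, dtype=np.float16)
x = np.random.randn(M, K).astype(np.float16)
print(mixed_input_gemm_ref(x, packed, scales, zeros, g).shape)  # (8, 64)
```

During generation the activation matrix x is short and wide (small M, large K), so the layer is bound by reading the weights; shrinking them to 4 bits and dequantizing in-kernel is what makes this memory-bound GEMM faster.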
Bio
Lucas Wilkinson is a Principal HPC Engineer at Neural Magic, specializing in the development of novel kernels for quantized and sparse matrix multiplication. He holds an MSc in Computer Science from the University of Toronto, where he focused on efficient sparse neural network inference. His areas of interest include maximizing hardware utilization, sparse linear algebra, and code generation.