🚀 Introducing Machete: A New Mixed-Input GEMM Kernel for NVIDIA Hopper GPUs 🚀
We’re excited to unveil Machete, a major step forward in high-performance LLM inference. By focusing on w4a16 mixed-input quantization (4-bit weights, 16-bit activations), Machete cuts weight memory usage by ~4x, making deployments significantly more efficient in memory-bound regimes. While compute-bound performance remains in line with FP16, Machete truly excels at saving memory bandwidth for GPTQ-style models. 🧠
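To see where the ~4x comes from, here's a back-of-the-envelope sketch (illustrative only, not from the Machete codebase; it counts raw weight bits and ignores quantization scales/zero-points):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB, ignoring scales/zero-points."""
    return n_params * bits_per_weight / 8 / 1e9

params_70b = 70e9  # a Llama-3.1-70B-sized model, for illustration
fp16_gb = weight_memory_gb(params_70b, 16)  # 140.0 GB
w4_gb = weight_memory_gb(params_70b, 4)     # 35.0 GB
print(fp16_gb, w4_gb, fp16_gb / w4_gb)      # 140.0 35.0 4.0
```

The activations stay in 16-bit, which is why the savings show up in weight storage and memory bandwidth rather than in compute.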
Key highlights of Machete:
- Built on CUTLASS 3.x, using wgmma tensor core instructions to lift the compute-bound ceiling that older mma-based kernels hit on Hopper.
- Weight pre-shuffling for faster shared memory loads and reduced bottlenecks in large-scale LLMs.
- 128-bit shared memory loads for high throughput and further reduced latency.
- Optimized upconversion routines that efficiently convert 4-bit elements to 16-bit, maximizing tensor core utilization.
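To make the pack/upconvert idea concrete, here's a toy pure-Python sketch (illustrative only; Machete's real routines are hand-optimized CUDA, and its pre-shuffled memory layout is more involved than this). Weights are stored two 4-bit values per byte, then expanded back to full-width integers, standing in for the 16-bit values fed to the tensor cores:

```python
def pack_4bit(values):
    """Pack pairs of unsigned 4-bit values (0..15) into single bytes."""
    assert len(values) % 2 == 0
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))

def unpack_to_16bit(packed):
    """Upconvert: expand each 4-bit nibble back to a full-width integer."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)         # low nibble
        out.append((byte >> 4) & 0xF)  # high nibble
    return out

weights = [1, 7, 15, 0, 3, 12]
packed = pack_4bit(weights)            # 3 bytes instead of 12 bytes in fp16
print(unpack_to_16bit(packed))         # [1, 7, 15, 0, 3, 12]
```

In the real kernel this unpacking happens on the fly in registers, right before the wgmma multiply, which is why the upconversion path has to be fast enough to keep the tensor cores busy.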
With Machete, we’ve achieved 29% higher input and 32% higher output token throughput on Llama 3.1 70B, with a time-to-first-token (TTFT) of <250 ms on a single H100 GPU.
And that’s not all... On a 4xH100 setup, Machete delivers a 42% throughput speedup on Llama 3.1 405B—with more optimizations on the way, including w4a8 FP8 support, AWQ, QQQ, and better low-batch-size performance.
🎉 A huge shoutout to Lucas Wilkinson for leading the development of Machete and the team at NVIDIA AI for continual support! Special thanks to 3Blue1Brown and the Manim community for the amazing animations that helped visualize these optimizations.
Read the full blog here: https://lnkd.in/ggKYbmKR
#AI #LLMs #GPUs #NVIDIA #vLLM #DeepLearning #MachineLearning