Speeding up Mixtral by 4x with FireAttention 🔥

FireAttention is a custom CUDA kernel from Fireworks AI that is optimized for Multi-Query Attention models. Benchmarked on the popular Mixtral-8x7B, their new FP8 variant delivers ~4x faster inference compared to vLLM FP16. In 2024, we will probably see more kernel- and even chip-level advancements, which typically take longer to realize than the innovations we've seen last year. The caveat for the Open Source community is that lower-level software and hardware are also more defensible and will thus often remain proprietary. Nevertheless, I'm excited to see what this year will have in store for us.

[Blog] https://lnkd.in/e4f__A7U
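To make the FP8 idea concrete, here is a minimal sketch of per-tensor FP8 (e4m3) quantization, assuming PyTorch 2.1+ with torch.float8_e4m3fn support. The helper names and shapes are illustrative only; this is not Fireworks' FireAttention kernel, just a demonstration of the memory/precision trade-off that FP8 brings.

```python
# Illustrative per-tensor FP8 (e4m3) quantization sketch, assuming PyTorch >= 2.1.
# Not Fireworks' implementation; shapes and helper names are made up for the example.
import torch

def quantize_fp8(x: torch.Tensor):
    # e4m3 has a representable max of ~448; pick a per-tensor scale so the
    # largest value in x maps near that limit.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().max() / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Cast back up before doing arithmetic, since most ops are not defined on FP8 tensors.
    return x_fp8.to(torch.float16) * scale

x = torch.randn(4096, 4096, dtype=torch.float16)
x_fp8, scale = quantize_fp8(x)
x_rec = dequantize_fp8(x_fp8, scale)

# FP8 halves memory and bandwidth vs FP16 at the cost of a small quantization error:
# that is the speed/accuracy balance discussed in the post and comments.
print(f"mean abs error: {(x - x_rec).abs().mean().item():.5f}")
print(f"FP16 bytes: {x.nelement() * x.element_size()}, FP8 bytes: {x_fp8.nelement() * x_fp8.element_size()}")
```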
But a customized CUDA framework/driver is not a barrier to open-sourcing, as vGPU is transparent to the guest OS (at least for NVIDIA); it's just a business-driven decision. This limitation would apply to, e.g., Linux Containers/LXC, not to VMs.
4x sounds fantastic. We may see Linux kernels optimized for ML models soon.
Techniques like FP8 quantization that balance speed and accuracy will be key as models continue to grow in scale. Curious what applications might benefit most from these gains.