Pascal Biese’s Post


Daily AI highlights for 60k+ experts 📲🤗 AI/ML Engineer

Speeding up Mixtral by 4x with FireAttention 🔥 FireAttention is a custom CUDA kernel from Fireworks AI, optimized for Multi-Query Attention models. Benchmarked on the popular Mixtral-8x7B, their new FP8 variant delivers ~4x faster inference compared to vLLM FP16. In 2024, we will probably see more kernel- and even chip-level advancements, which typically take longer to realize than the innovations we saw last year. The caveat for the open-source community is that lower-level software and hardware are also more defensible and will thus often remain proprietary. Nevertheless, I'm excited to see what this year has in store for us. [Blog] https://lnkd.in/e4f__A7U
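
For a feel of what FP8 does to attention inputs, here is a minimal, illustrative NumPy sketch. This is not FireAttention or Fireworks' actual kernel (those run natively in hardware FP8 on the GPU); it only emulates the value rounding of the E4M3 format to show where the accuracy side of the speed/accuracy trade-off comes from. All names, shapes, and the helper functions (quantize_fp8_sim, dequantize) are made up for the example.

```python
# Illustrative sketch only: emulate per-tensor FP8 (E4M3) quantization of
# attention inputs in plain NumPy. Real FP8 kernels store 8-bit values and
# run the matmuls on FP8 tensor cores; here we only model the rounding loss.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


def quantize_fp8_sim(x: np.ndarray):
    """Scale a tensor into the E4M3 range and round to a coarse mantissa grid."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    x_scaled = x / scale
    mant_bits = 3  # E4M3 carries roughly 3 mantissa bits
    sign = np.sign(x_scaled)
    mag = np.abs(x_scaled) + 1e-12
    exp = np.floor(np.log2(mag))
    mant = np.round(mag / 2**exp * 2**mant_bits) / 2**mant_bits
    return sign * mant * 2**exp, scale


def dequantize(xq: np.ndarray, scale: float) -> np.ndarray:
    return xq * scale


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64)).astype(np.float32)  # hypothetical query block
k = rng.standard_normal((4, 64)).astype(np.float32)  # hypothetical key block

qq, qs = quantize_fp8_sim(q)
kq, ks = quantize_fp8_sim(k)

# Compare attention scores computed from FP16 inputs vs. simulated FP8 inputs.
scores_fp16 = (q.astype(np.float16) @ k.astype(np.float16).T).astype(np.float32)
scores_fp8 = dequantize(qq, qs) @ dequantize(kq, ks).T

print("max abs error vs FP16:", np.abs(scores_fp16 - scores_fp8).max())
```

The speedup itself does not come from anything in this Python snippet; it comes from halving memory traffic and running matmuls in 8-bit on the hardware, which is why this kind of gain needs a custom kernel rather than a drop-in change.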

Daniel Svonava

Vector Compute @ Superlinked | xYouTube

10mo

Techniques like FP8 quantization that balance speed and accuracy will be key as models continue to grow in scale. Curious which applications might benefit most from these gains.

Petr Kazar

CTO / Chief Architect at ABIS Czech ⚡️ Interested in AI research

10mo

But a customized CUDA framework/driver is not a barrier to open-sourcing, since vGPU is transparent to the guest OS (at least for NVIDIA); keeping it proprietary is simply a business-driven decision. Such a limitation would apply to e.g. Linux Containers / LXC, not to VMs.

Berk Gökden

AI Engineer | Data Engineer | Machine Learning Engineer | Leading Data-Driven Solutions for Optimal Business Outcomes

10mo

4x sounds fantastic. We may see Linux kernels optimized for ML models soon.
