Speeding up Mixtral by 4x with FireAttention 🔥

FireAttention is a custom CUDA kernel from Fireworks AI that is optimized for Multi-Query Attention models. Benchmarked on the popular Mixtral-8x7B, their new FP8 variant delivers ~4x faster inference compared to vLLM FP16. In 2024, we will probably see more kernel- and even chip-level advancements, which typically take longer to realize than the innovations we've seen last year. The caveat for the Open Source community is that lower-level software and hardware are also more defensible and will thus often remain proprietary. Nevertheless, I'm excited to see what this year will have in store for us.

[Blog] https://lnkd.in/e4f__A7U
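To make the FP8 idea concrete, here is a minimal sketch of per-tensor FP8 (e4m3) quantization, assuming PyTorch 2.1+ with torch.float8_e4m3fn support. The helper names and shapes are illustrative only; this is not Fireworks' FireAttention kernel, just a demonstration of the memory/precision trade-off that FP8 brings.

```python
# Illustrative per-tensor FP8 (e4m3) quantization sketch, assuming PyTorch >= 2.1.
# Not Fireworks' implementation; shapes and helper names are made up for the example.
import torch

def quantize_fp8(x: torch.Tensor):
    # e4m3 has a representable max of ~448; pick a per-tensor scale so the
    # largest value in x maps near that limit.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().max() / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Cast back up before doing arithmetic, since most ops are not defined on FP8 tensors.
    return x_fp8.to(torch.float16) * scale

x = torch.randn(4096, 4096, dtype=torch.float16)
x_fp8, scale = quantize_fp8(x)
x_rec = dequantize_fp8(x_fp8, scale)

# FP8 halves memory and bandwidth vs FP16 at the cost of a small quantization error:
# that is the speed/accuracy balance discussed in the post and comments.
print(f"mean abs error: {(x - x_rec).abs().mean().item():.5f}")
print(f"FP16 bytes: {x.nelement() * x.element_size()}, FP8 bytes: {x_fp8.nelement() * x_fp8.element_size()}")
```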
But a customized CUDA framework/driver is not a barrier to open-sourcing, as vGPU is transparent to the guest OS (at least for NVIDIA); it's just a business-driven decision. This limitation would apply to, e.g., Linux Containers/LXC, not to VMs.
4x sounds fantastic. We may see Linux kernels optimized for ML models soon.
Techniques like FP8 quantization that balance speed and accuracy will be key as models continue to grow in scale. Curious what applications might benefit most from these gains.