What precision format do you use for LLM serving? 🤔 LLMs have billions of parameters, which means billions of numbers that need to be stored, read, and processed every time the model runs. FP16 has been a common default format, but it's increasingly common to serve LLMs using FP8, and for good reason: FP8 can massively improve inference speed and decrease operational costs, with less output quality degradation compared to other techniques. 💡 Learn more about FP8 quantization in Philip Kiely's article: https://lnkd.in/eKvQzsni Tell us: what precision formats do you use for your models? 🧮
We’ve recently contributed FP8 support to vLLM in collaboration with Neural Magic -- with this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation! A common concern with FP8 is whether users will experience accuracy degradation. To address this, Neural Magic has produced many checkpoints for key models with >99% accuracy preservation across a wide range of benchmarks (https://lnkd.in/gTimN5dZ), including:
- Llama3-70b
- Mixtral 8x7b
- Llama3-8b
You can easily try this out on vLLM, and read more about the feature here -- https://lnkd.in/gzKJqerB
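For anyone who wants to kick the tires, here's a minimal sketch of serving one of these FP8 checkpoints through vLLM's offline API. The model id and sampling settings are illustrative, and it assumes a recent vLLM build with FP8 support plus FP8-capable hardware (e.g. Hopper GPUs):

# Minimal sketch: loading an FP8-quantized checkpoint with vLLM's offline API.
# Assumes a recent vLLM release with FP8 support; the model id below is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8", quantization="fp8")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain FP8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)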
When it comes to efficiently serving LLMs, we often hear about quantization — INT8 quantization in particular. Turns out FP8 often has advantages over INT8 in two dimensions: a) model output quality, and b) inference performance. https://lnkd.in/gVMeMYm7
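To make the output-quality point concrete, here's a rough sketch of the round-trip error you might see quantizing a weight tensor with a few outliers to INT8 (one absmax scale per tensor) versus casting it to FP8 E4M3. It assumes PyTorch 2.1+ for the torch.float8_e4m3fn dtype, and it's a toy comparison; real schemes add per-channel scales and calibration.

# Rough sketch comparing round-trip error of INT8 (symmetric, per-tensor) vs FP8 (E4M3).
# Assumes PyTorch >= 2.1, which ships the torch.float8_e4m3fn dtype; numbers are illustrative.
import torch

torch.manual_seed(0)
w = torch.randn(4096) * 0.05
w[:8] = torch.tensor([2.0, -1.5, 1.2, -2.2, 1.8, -1.1, 2.5, -1.9])  # a few outliers

# INT8: one scale for the whole tensor, so outliers stretch the grid for every weight.
scale = w.abs().max() / 127
w_int8 = torch.clamp((w / scale).round(), -128, 127) * scale

# FP8 E4M3: floating point, so spacing shrinks near zero where most weights live.
w_fp8 = w.to(torch.float8_e4m3fn).to(torch.float32)

print("int8 mean abs error:", (w - w_int8).abs().mean().item())
print("fp8  mean abs error:", (w - w_fp8).abs().mean().item())

Because FP8 is a floating-point format, a handful of outliers doesn't stretch the quantization grid for everything else the way a single INT8 scale does, which is one intuition for the quality advantage.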
It's time for the next Hugging Face accelerate release, and there is a LOT of ground to cover! From new optimizer support to FP8 fixes to DataLoader improvements and more, let's dig in:
* We've added support for the schedule-free optimizer released by Meta earlier this month, as well as the new LOMO optimizer! For schedule-free, no changes are needed. For LOMO, just pass in the learning rate during backward()
* We've reduced the vRAM needed when using FP8/TransformersEngine by autocasting the model in BF16 (as usual when doing BF16 mixed precision) to keep gradients small; before, it was in full FP32. Note that the original model weights will still be kept in FP32 for stability reasons, as normal
* A fantastic new piece of documentation has been added laying out the core differences between DeepSpeed and FSDP in a quick, easy-to-digest manner, including an answer to a key point of confusion: why we autocast during mixed precision. See the comments to find that link
* A new slew of distributed examples have been added thanks to Marc Sun. These don't use PiPPy or big model inference, just raw DDP, to show you how to leverage accelerate in an eval scenario. Check them out in the examples/inference folder on the repo!
* We've added support for MoE models in DeepSpeed and let you pass in `auto` for gradient clipping in your config
* Prepared DataLoaders can now leverage a new `non_blocking` argument to help increase training speed by reducing cudaStreamSynchronize calls. Just set `non_blocking=True` in your `DataLoaderConfiguration` (and for the best results, use pin_memory=True); see the sketch after this list
* And PLENTY of bug fixes! Give it a whirl today: `pip install accelerate -U`
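Here's a minimal sketch of the `non_blocking` change mentioned above. It assumes an accelerate version whose DataLoaderConfiguration exposes non_blocking; the dataset and batch size are placeholders:

# Sketch: enabling non-blocking host-to-device copies for prepared DataLoaders.
# Assumes an accelerate release that exposes non_blocking on DataLoaderConfiguration.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

dl_config = DataLoaderConfiguration(non_blocking=True)
accelerator = Accelerator(dataloader_config=dl_config)

dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
# pin_memory=True pairs with non-blocking copies for the best overlap.
loader = DataLoader(dataset, batch_size=64, pin_memory=True)
loader = accelerator.prepare(loader)

for xb, yb in loader:
    pass  # batches arrive on the accelerator device via non-blocking copies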
This is a fantastic (if somewhat unsurprising) result: using LLMs together in an ensemble outperforms single models. Paper below: https://lnkd.in/ghSWhyGn
Tech revolution 🚀 unleashed 🌟 with NVIDIA's LLM optimization 💡 & FP8 quantization 📈. Embrace #AI, and boost productivity. Post by: NVIDIA AI Follow me for the latest AI advancements and tools that power productivity and stay ahead in the smart technology curve. #AI #NVIDIA #LLM #Mistral #TensorRT #GPU #TechRevolution
High performance #generativeAI checklist:
✅ LLM: #Mistral
✅ Optimization: #TensorRT-LLM
✅ Quantization: FP8
A GPU-backed notebook awaits your input #LLMs. ✨ Click to quantize & code: https://lnkd.in/gU4uSGKm
I am glad to share my latest blog post on quantization. In an era of extravagance, where models casually cross 100B parameters, devour over 500GB of GPU memory, and cost millions of dollars for a single training run, quantization comes in as the prudent accountant: it keeps models from indulging in excessive memory consumption while minimizing any loss in model quality. In this blog post, we aim to demystify this potent mathematical framework using intuitive explanations, relatable examples, and accessible language. We also delve into the fancy jargon and the ugly math that come along with quantization, just deeply enough to let readers navigate research papers and the documentation of quantization libraries. The objective is to make these esoteric concepts more approachable and less daunting. So buckle up for the journey as we learn how to take mammoth ML models and pare them down to preserve only the essential. https://lnkd.in/eUrNEHVu #quantization #hpc #machinelearning #int8
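As a teaser for the kind of arithmetic involved, here's a toy absmax (symmetric) INT8 quantize/dequantize step in NumPy. The numbers are illustrative and not taken from the blog; real schemes layer on per-channel scales, zero-points, and calibration:

# Toy sketch of absmax (symmetric) INT8 quantization: one scale maps floats onto [-127, 127].
# Illustrative only; real schemes add per-channel scales, zero-points, calibration, etc.
import numpy as np

w = np.array([0.12, -0.53, 0.91, -0.08, 0.44], dtype=np.float32)

scale = np.abs(w).max() / 127          # 0.91 / 127 ≈ 0.00717
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale   # dequantize for use in matmuls

print("int8 codes:", q)                # e.g. [ 17 -74 127 -11  61]
print("max abs error:", np.abs(w - w_hat).max())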
Knowing floating-point precision is essential to understanding how quantization works in ML and why you shouldn't quantize a model hastily. The higher the precision, the more memory is needed to store and fine-tune the weights, so reducing precision without losing information becomes the real challenge. There are various techniques for quantizing ML models, but understanding why you need quantization and how much information you lose when going from FP32 to FP16 is essential. For example, 7.567856 represented as a 32-bit float keeps all of those digits, but quantized to 16 bits the nearest representable value is roughly 7.566, meaning we lose the rest of the fractional precision. Doing this for billions of weights can lead to severe performance degradation.
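You can reproduce that example in a couple of lines of NumPy; the printed values noted in the comments are approximate:

# Reproducing the example above: what 7.567856 becomes after an FP32 -> FP16 cast.
import numpy as np

x32 = np.float32(7.567856)
x16 = np.float16(x32)

print(x32)                      # 7.567856
print(x16)                      # ~7.566 (nearest representable FP16 value is 7.56640625)
print(float(x32) - float(x16))  # the information lost by the cast, ~0.0015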
Since most people like to use vLLM, this might be an incredible speed-up! Notes from the blog:
> The Fireworks FP16 Mixtral model implementation is superior to the one from vLLM
> The Fireworks FP8 implementation significantly improves over the already quite efficient Fireworks FP16 implementation
> Because FP8 shrinks model size 2x, it allows for more efficient deployment
> Combined with memory bandwidth and FLOPs speed-ups, this results in a 2x improvement in effective requests/second
Overall, the Fireworks FireAttention FP8 implementation sits at one of the best points on the accuracy/performance trade-off curve for LLM serving. #llms #mistral
FireAttention — Serving Open Source Models 4x faster than vLLM by quantizing with ~no tradeoffs (blog.fireworks.ai)
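A quick back-of-the-envelope on the "FP8 shrinks model size 2x" point, using an approximate parameter count for Mixtral 8x7B (roughly 46.7B; treat the figure as illustrative):

# Back-of-the-envelope weight memory for Mixtral 8x7B (~46.7B params; approximate figure).
# Weights only; real deployments also need memory for KV cache, activations, and overhead.
params = 46.7e9

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights")
# FP16: ~87 GiB of weights
# FP8:  ~43 GiB of weights

Halving the weight footprint frees HBM for KV cache and larger batches, which, combined with the bandwidth and FLOPs speed-ups the post describes, is where the 2x effective requests/second comes from.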
In machine learning jargon, FP32 is called full precision (4 bytes), while BF16 and FP16 are referred to as half precision (2 bytes). On top of that, the INT8 data type is an 8-bit representation that can store 2^8 = 256 different values (in [0, 255] for unsigned integers, or [-128, 127] for signed integers).
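If you want to verify those sizes and ranges yourself, a short snippet does it (PyTorch used here because NumPy has no BF16 type):

# Quick check of the sizes and ranges mentioned above.
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} {info.bits // 8} bytes, max ~ {info.max:.3g}")

iinfo = torch.iinfo(torch.int8)
print(f"torch.int8      1 byte,  range [{iinfo.min}, {iinfo.max}]  ({2**8} values)")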