# Llama 3.1 405B, API, Quantization, and Model Size
Performance measurements of Llama 3.1 405B, orchestrated from OpenRouter, one of the leading LLM aggregation platforms. Here are my two cents on the model:
- It's amazing to see how quickly almost all providers have added support for the model. Open source makes software and model co-development much easier. In our case, it took only a minimal Python code change to support it (a matter of minutes).
- Llama 3.1 405B is indeed a hard model to make profitable. It takes half a machine, or a whole machine, to run, so its cost is significant and its speed is still so-so. Most providers keep it around 30 tokens/s to make economic sense (see the measurement sketch after this list). In comparison, 70B models can go north of 150 tokens/s.
- You will still be able to break even, but that depends on good optimization and good workload saturation. To our VC friends: for a pure API service at this price tag, kindly don't expect an 80% profit margin like conventional SaaS.
- In addition to top performance optimization, the Lepton AI API makes conscious trade-offs among the many parameters (speed, price, concurrency, cost) to make sure the service is sustainable.
- Quantization is going to be standard. Folks, forget about FP16; Int8/FP8 is the way to go. If you still feel uncomfortable, remember that back in the day AI frameworks worried about precision and still support FP64. Have you ever used FP64 in your neural nets?
- Quantization needs care, though. Gone are the days when one scale is enough for the whole tensor. You'll need channel-wise or group-wise quantization to make sure quality does not degrade (see the quantization sketch after this list).
- My bold prediction is that 405B adoption will still be limited by speed and price constraints. But I am not too worried, as I expect at least another 4x efficiency improvement over the next year or so.
- I am looking forward to testing out Mistral Large 123B! Our Tuna engine supports it out of the box, although to honor the research license, we'll refrain from hosting a public API. If you are interested, let us know.
- Andrej Karpathy has an awesome tweet about small models FTW, and I totally agree. In vertical applications you probably don't need models that big: 70B is normally good enough, and in many cases 8B works really well with finetuning!
- It's great that Llama 3.1 allows (and in some ways recommends) finetuning your own model.
- I also want to give a shout-out to the vLLM project. We have our own engine, but vLLM is simply great, and our platform supports it too (a quick usage sketch follows the list).
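For the curious, here is roughly how the speed numbers above can be measured. This is a minimal sketch against OpenRouter's OpenAI-compatible endpoint, not our actual benchmarking harness; the model id and the use of chunk count as a rough token proxy are assumptions for illustration.

```python
# Minimal throughput sketch against OpenRouter's OpenAI-compatible API.
# Assumptions: the `openai` v1 SDK, an OPENROUTER_API_KEY env var, and the
# model id below; chunk count is only a rough proxy for token count.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

start = time.time()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",  # assumed OpenRouter model id
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.time()  # time to first token
        chunks += 1
elapsed = time.time() - (first_token_at or start)
print(f"~{chunks / elapsed:.1f} tokens/s over {chunks} chunks")
```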
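On the per-channel point, here is a tiny numpy sketch of why a single per-tensor scale degrades when channel magnitudes differ, which is common in LLM weights. The shapes and the symmetric int8 scheme are illustrative assumptions, not a description of any particular engine.

```python
# Per-tensor vs. per-channel int8 quantization error, in plain numpy.
# Illustrative only: real engines also handle zero-points, activations,
# and group-wise scales along the input dimension.
import numpy as np

rng = np.random.default_rng(0)
# Weights whose channels have very different magnitudes.
w = rng.normal(size=(4, 256)) * np.array([[0.01], [0.1], [1.0], [10.0]])

def quantize(w, scale):
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale  # dequantize back to float to measure error

# One scale for the whole tensor: the largest channel dominates.
per_tensor = quantize(w, np.abs(w).max() / 127.0)
# One scale per output channel (row): each channel uses the full int8 range.
per_channel = quantize(w, np.abs(w).max(axis=1, keepdims=True) / 127.0)

print("per-tensor  RMSE:", np.sqrt(np.mean((w - per_tensor) ** 2)))
print("per-channel RMSE:", np.sqrt(np.mean((w - per_channel) ** 2)))
```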
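And on vLLM, a quick offline-inference sketch. The model id and the fp8 flag follow vLLM's public docs as I understand them; this is a usage example, not a description of our own engine.

```python
# Quick vLLM offline-inference sketch. Assumes `pip install vllm`, a GPU,
# and access to the gated Llama 3.1 weights on Hugging Face; the fp8
# quantization flag follows vLLM's documented option names.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # 8B fits a single GPU
    quantization="fp8",  # ties into the FP8 point above
)
params = SamplingParams(temperature=0.7, max_tokens=128)
for out in llm.generate(["Why do small models win in vertical apps?"], params):
    print(out.outputs[0].text)
```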
Last but not least, a public API is one thing, but feel free to reach out to us for enterprise / dedicated deployments. We believe that AI is awesome beyond APIs, and we are building a full AI cloud to serve your end-to-end needs.