Lepton AI’s Post


1,787 followers

2mo

Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership

2mo

People often ask how a price like $2.8 per million tokens for Llama 405B can still be profitable at Lepton AI while being super fast. We've even been asked by a leading GPU provider! So, I figured we should share some technical analysis. This information could benefit the community. We've taken these statistics and analysis for granted, but they might not be obvious to everyone.

1. Big batches: Each request receives an output of ~30 tokens/second. Batching (processing multiple requests simultaneously) significantly improves total throughput, often to 10x or more that of a single request. GPUs are more efficient with larger batches.
2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity.
3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input is many times longer than the output (3x to 10x). This increases the total number of tokens processed, which explains why input and output are often billed separately.
4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower bit widths, increasing speed further. For example, the new Nvidia Blackwell GPU supports 4-bit floats (fp4). The memory saved by quantization also allows the even bigger batches from point 1, making serving more economical.
5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model. Smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach.
6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests.
7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks: some are better for prefilling, others for decoding. There are many optimization opportunities here.

This is not a complete list. We integrate these methods (and a growing number of others) in our runtime to ensure profitability with reasonable traffic.

Lepton is created by experts who have developed key AI software over the past decade (Caffe, ONNX, PyTorch) alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs, but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform. We love the open-source and open-access community. What AI technical explanation would you like to hear next?
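To make points 1 and 3 concrete, here is a rough back-of-the-envelope sketch of how batching and input tokens multiply billable throughput. The ~30 tokens/second per request, the 3x input-to-output ratio, and the $2.8 per million tokens come from the post; the batch size of 32, the assumption that input and output tokens are billed at the same price, and the function name are illustrative assumptions only, and no GPU cost is modeled.

```python
def hourly_revenue_usd(batch_size, output_tps_per_request,
                       input_to_output_ratio, price_per_million_tokens):
    """Billable revenue per hour for one deployment (illustrative only)."""
    # Output tokens generated per second across all requests in the batch.
    output_tps = batch_size * output_tps_per_request
    # Each output token is accompanied by `input_to_output_ratio` input tokens.
    total_tps = output_tps * (1 + input_to_output_ratio)
    tokens_per_hour = total_tps * 3600
    return tokens_per_hour / 1_000_000 * price_per_million_tokens


# One request at a time, counting output tokens only.
single = hourly_revenue_usd(1, 30, 0, 2.8)

# Assumed batch of 32 concurrent requests, with inputs 3x longer than outputs.
batched = hourly_revenue_usd(32, 30, 3, 2.8)

print(f"single request: ${single:.2f}/hour")
print(f"batched:        ${batched:.2f}/hour")
print(f"multiplier:     {batched / single:.0f}x")
```

Under these assumptions the same hardware goes from roughly $0.30 to roughly $38.71 of billable tokens per hour. Whether that covers the machine depends on the actual GPU cost and traffic, which this sketch deliberately leaves out.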


More Relevant Posts

Lepton AI

1,787 followers
1mo
Navigating the complexities of the GPU market can be daunting. We understand the need for a clear and comprehensive guide to make informed decisions. To learn more about how we help our customers run H100: https://lnkd.in/gDPkKytc

The Missing Guide to the H100 GPU Market

blog.lepton.ai
Lepton AI

1,787 followers
1mo
We're open-sourcing GPUd, an AI-native GPU management utility that reduces GPU cluster unavailability by 4x. Built by experts from Meta, Alibaba, and Uber, GPUd automates monitoring, diagnostics, and issue identification for GPUs—ideal for cloud or on-premise, at any scale. To learn more about what GPUd is, the blog post is available at: https://lnkd.in/gAxG-ec8
Lepton AI

1,787 followers
2mo
"The only reason to give a speech is to change the world."

We are bringing native, real-time voice generation to all open-source LLMs. Instead of separate, duct-taped modules, we built one single engine that delivers both the text and audio streams, with a time to first audio of around 300 milliseconds or lower. More specifically, our engine:
- Integrates seamlessly with any major open-source LLM, including Llama3.1-8B, 70B, and 405B
- Outpaces traditional text-to-speech workflows with up to 10x faster TTFA (time to first audio)
- Delivers smooth, customizable dialogue, minimizing pauses and interruptions
- Offers fully customizable voice profiles

To learn more about how we did it, the blog post is at https://lnkd.in/g5c4F48M

We are currently opening this up to existing Lepton AI customers and beta users, and will make it generally available soon. Shoot us an email at info@lepton.ai if you would like early access! Here's how it works:
Lepton AI

1,787 followers
2mo
Yangqing Jia

Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership
2mo

# Llama3 405B, API, Quantization, and Model Size

Performance measurements of Llama3.1 405B, orchestrated from OpenRouter, one of the leading LLM aggregation platforms. Here are my couple of cents on the model:
- It's amazing to see the quick support of the model by almost all providers. Open source makes software and model co-development much easier. In our case, it took a minimal Python code change to support it (like, minutes).
- Llama 3.1 405B is indeed a hard model to make profitable. It takes half a machine or a whole machine to run, so its cost is significant and its speed is still so-so. Most providers keep it around 30 tokens/s (see pic) to make economic sense. In comparison, 70B models can go north of 150 tokens/s.
- You will still be able to break even. Of course, this depends on good optimization and good workload saturation. To our VC friends: for a pure API service at this price tag, kindly do not expect an 80% profit margin like conventional SaaS though.
- In addition to top performance optimization, the Lepton AI API consciously balances the many parameters (speed, price, concurrency, cost) to make sure that it is sustainable.
- Quantization is going to be a standard. Folks, forget about FP16. Int8/FP8 is the way to go. If you still feel uncomfortable, let me tell you that back in the day AI frameworks worried about precision, and they still support FP64. Have you ever used FP64 in your neural nets?
- Quantization needs care. Gone are the days when one scale is enough for the whole tensor. You'll need to do channel-wise / group-wise quantization to make sure things do not degrade.
- My bold prediction is that 405B adoption will still be limited by the speed and price constraints. But I am not much worried, as I expect at least another 4x efficiency improvement over the next year or so.
- I am looking forward to testing out Mistral Large 123B! Our Tuna engine supports it out of the box, although to honor the research license, we'll refrain from hosting a public API. If you are interested, let us know.
- Andrej Karpathy has an awesome tweet about small models FTW. I totally agree. In vertical applications you probably don't need models that big. 70B is normally good enough, and in many cases 8B is really good with finetuning!
- It's great that Llama 3.1 allows (and in some way recommends) finetuning your own model.
- I also want to give a shout-out to the vLLM project. We have our own engine but vLLM is simply great. Our platform supports it too.

Last but not least, the public API is one thing, but feel free to reach out to us for enterprise / dedicated deployments. We believe that AI is awesome beyond APIs, and we build a full AI cloud to serve your end-to-end needs.
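The point about one scale no longer being enough for a whole tensor can be illustrated with a small sketch. This is a toy, pure-Python example of symmetric int8 quantization, not how Lepton's engine does it: the weight values, the function names, and the choice of treating each row as a "channel" are all hypothetical and chosen only to show the effect.

```python
def quantize_int8(values, scale):
    """Symmetric int8 quantization: round(v / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def max_abs_error(weights, scales):
    """Worst reconstruction error when row i is quantized with scales[i]."""
    worst = 0.0
    for row, scale in zip(weights, scales):
        restored = dequantize(quantize_int8(row, scale), scale)
        worst = max(worst, max(abs(a - b) for a, b in zip(row, restored)))
    return worst


# Hypothetical weights: one channel with large values, one with tiny values.
weights = [
    [12.0, -8.0, 10.0, -12.7],
    [0.02, -0.01, 0.03, -0.015],
]

# Per-tensor: a single scale derived from the global maximum.
global_scale = max(abs(v) for row in weights for v in row) / 127
per_tensor_error = max_abs_error(weights, [global_scale] * len(weights))

# Per-channel: each row gets a scale derived from its own maximum.
channel_scales = [max(abs(v) for v in row) / 127 for row in weights]
per_channel_error = max_abs_error(weights, channel_scales)

print("per-tensor  max error:", per_tensor_error)
print("per-channel max error:", per_channel_error)
```

With a single global scale, every value in the small-magnitude channel rounds to zero, so that channel's information is lost entirely. Giving each channel its own scale keeps the error proportional to the channel's own range. Group-wise quantization applies the same idea to smaller blocks within a channel.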
Lepton AI

1,787 followers
2mo
Yangqing Jia

Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership
2mo

Memory Matters for LLM

While everyone is rushing to provide serverless Llama3-405b model endpoints, I want to talk about one key choice that matters a lot, especially for dedicated enterprise deployments when traffic is not very high: memory.
- The normal deployment of a model the size of 405b takes 8xH100 GPUs with a total of 640G memory. You'll quantize the weights to int8 or fp8, leaving about 230G memory for KV cache and others. Doable with care.
- If you need to do fine-tuning (full fine-tuning, or LoRA, or Medusa), memory size is going to be stressful. Your choices are probably (1) do quantized training with careful control of scale, or (2) go distributed. Both require extra care.
- AMD MI-300 is a particularly interesting card for this scenario, as each card has 192G memory. 4 cards with a total of 768G memory will very comfortably host the model, as well as giving you a good amount of remaining memory for KV caching / prompt caching and other tricks.
- Attached is a screenshot showing our runtime ("tuna") running the 405b model on 4xMI300 out of the box at Lepton AI. Speed is good.
- We know there are a lot of claims out there saying one is faster than the other, but based on our experience, with reasonable quantization, continuous batching, chunked decoding and other known optimization techniques, MI300 and H100 exhibit on-par performance.
- We haven't thoroughly tested some of the optimization techniques, such as Medusa, on the 405b models. So it is hard to say for sure which GPU takes the lead.
- The upcoming Blackwell GPUs will have 192G memory as well, so we are definitely seeing appetite for larger models.
- Large memory definitely gives you the opportunity to do more within one box: 1.536TB memory per machine means you can do almost whatever you want with 405b-sized models: fine-tune them, serve multiple models at once, hot swap LoRAs, etc.

It's exciting times for models, and also exciting times for infra.

(This is a re-post of my twitter post here: https://lnkd.in/gj7s5xET)
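The memory figures in the post follow from simple arithmetic, sketched below. The sketch assumes decimal gigabytes, 80G per H100 (640G / 8), and weights that occupy exactly one byte per parameter at int8/fp8 and two bytes at fp16. It ignores activations and runtime overhead, and the helper names are hypothetical.

```python
def total_memory_gb(num_gpus, gb_per_gpu):
    """Aggregate GPU memory of one machine, in GB."""
    return num_gpus * gb_per_gpu


def weight_memory_gb(params_billions, bytes_per_param):
    """Approximate weight footprint: 1B parameters at 1 byte each is ~1 GB."""
    return params_billions * bytes_per_param


weights_8bit = weight_memory_gb(405, 1)   # int8 / fp8
weights_fp16 = weight_memory_gb(405, 2)   # fp16, for comparison

h100_total = total_memory_gb(8, 80)       # 8xH100
mi300_total = total_memory_gb(4, 192)     # 4xMI300
mi300_full = total_memory_gb(8, 192)      # 8 cards at 192G each

print("8xH100 total:", h100_total, "GB; free after 8-bit weights:",
      h100_total - weights_8bit, "GB")
print("4xMI300 total:", mi300_total, "GB; free after 8-bit weights:",
      mi300_total - weights_8bit, "GB")
print("fp16 weights fit on 8xH100:", weights_fp16 <= h100_total)
print("8x192G machine total:", mi300_full, "GB")
```

The 235 GB left on 8xH100 is in line with the "about 230G" quoted above once some overhead is accounted for, and the fp16 comparison shows why 8-bit weights are a prerequisite for fitting the model on a single 8xH100 machine.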
Lepton AI

1,787 followers
3mo
Song Han

Assoc. Prof. @MIT, distinguished scientist @NVIDIA, co-founder of DeePhi (now part of AMD) and OmniML(now part of NVIDIA). PhD @Stanford. Efficient AI Computing.
4mo

DistriFusion: multi-GPU parallel diffusion model acceleration at CVPR poster #232, highlight poster
Lepton AI

1,787 followers
7mo
🚀 Excited to introduce the future of machine learning on Lepton AI! We've teamed up with Google's latest marvel, Gemma, to bring you an API that's as powerful as it is user-friendly. 🤘 #MachineLearning #AI #Gemma Start exploring with Gemma today at https://lnkd.in/gjk8WqTp

Gemma 7b | Lepton AI Playground

lepton.ai
