Lepton AI’s Post

View organization page for Lepton AI, graphic

1,787 followers

View profile for Yangqing Jia, graphic

Founder @ Lepton AI | Berkeley alumni | Cloud & Open-source AI leadership

People often ask why prices like $2.8/m token for Llama 405B, while being super fast, are still profitable at Lepton AI. We've even been asked by a leading GPU provider! So, I figured we should share some technical analysis. This information could benefit the community. We've taken these statistics and analysis for granted, but they might not be obvious to everyone. 1. Big batches: Each request receives an output of ~30 tokens/second. Batching (grouping multiple requests simultaneously) significantly improves total throughput, often 10x or higher than a single request. GPUs are more efficient with larger batches. 2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity. 3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input length is many times larger than the output (3x to 10x). This increases the total number of tokens processed, explaining why there is often separate billing for input and output. 4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower bit numbers, increasing speed further. For example, the new Nvidia Blackwell GPU supports 4-bit floats (fp4). Quantization also saves memory, allowing even bigger batches from point 1, making it more economic. 5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model. Smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach. 6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests. 7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks—some are better for prefilling, others for decoding. There are many optimization opportunities here. This is not a complete list. We integrate these methods (and a growing number of more) in our runtime to ensure profitability with reasonable traffic. Lepton is created by experts who have developed key AI software over the past decade - Caffe, onnx, pytorch - alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs, but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform. We love the open-source and open-access community. What AI technical explanation would you like to hear next?

To view or add a comment, sign in

Explore topics