Lepton AI’s Post



Yangqing Jia

Founder @ Lepton AI | Berkeley alumnus | Cloud & Open-source AI leadership

Memory Matters for LLMs

While everyone is rushing to provide serverless Llama3-405b model endpoints, I want to talk about one key choice that matters a lot, especially for dedicated enterprise deployments where traffic is not very high: memory.

- The normal deployment of a model the size of 405b takes 8xH100 GPUs with a total of 640G memory. You'll quantize the weights to int8 or fp8, leaving about 230G of memory for the KV cache and other state. Doable with care (the first sketch after the post works through the arithmetic).
- If you need to do fine-tuning (full fine-tuning, LoRA, or Medusa), memory size is going to be stressful. Your choices are probably to (1) do quantized training with careful control of scale, or (2) go distributed; both require extra care (the second sketch below shows why the numbers get tight).
- AMD MI300 is a particularly interesting card for this scenario, as each card has 192G of memory. Four cards with a total of 768G of memory will very comfortably host the model, while leaving a good amount of remaining memory for KV caching, prompt caching, and other tricks.
- Attached is a screenshot showing our runtime ("tuna") running the 405b model on 4xMI300 out of the box at Lepton AI. Speed is good.
- We know there are a lot of claims out there saying one is faster than the other, but in our experience, with reasonable quantization, continuous batching, chunked decoding, and other known optimization techniques, MI300 and H100 exhibit on-par performance.
- We haven't thoroughly tested some of the optimization techniques, such as Medusa, on the 405b models, so it is hard to say for sure which GPU takes the lead.
- The upcoming Blackwell GPUs will have 192G of memory as well, so we are definitely seeing appetite for larger models.
- Large memory gives you the opportunity to do more within one box: 1.536TB of memory per machine means you can do almost whatever you want with 405b-sized models: fine-tune them, serve multiple models at once, hot-swap LoRAs, etc.

Exciting times for models, and exciting times for infra. (This is a re-post of my Twitter post here: https://lnkd.in/gj7s5xET )
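To make the serving bullet concrete, here is a minimal back-of-envelope sketch of the memory budget. It assumes 1-byte (int8/fp8) weights, an fp8 KV cache, and Llama3-405b's published shape (126 layers, 8 KV heads of head dim 128, thanks to GQA); the resulting figures are rough estimates, not measurements from the post.

```python
# Back-of-envelope serving memory budget for a 405b-parameter model.
# Assumptions: 1 byte/param weights (int8/fp8), fp8 KV cache, and the
# published Llama3-405b shape: 126 layers, 8 KV heads (GQA), head dim 128.

GB = 1e9
params = 405e9
weight_bytes = 1                      # int8/fp8 quantized weights

layers, kv_heads, head_dim = 126, 8, 128
kv_dtype_bytes = 1                    # fp8 KV cache; use 2 for fp16

# Bytes of KV cache per token (K and V, at every layer).
kv_per_token = 2 * layers * kv_heads * head_dim * kv_dtype_bytes

def budget(name: str, total_gb: float) -> None:
    weights = params * weight_bytes
    free = total_gb * GB - weights    # activations, CUDA graphs, etc. also
                                      # eat into this in a real deployment
    print(f"{name}: weights ~{weights / GB:.0f}G, free ~{free / GB:.0f}G, "
          f"room for ~{free / kv_per_token / 1e6:.2f}M KV-cache tokens")

budget("8xH100 (640G) ", 640)
budget("4xMI300 (768G)", 768)
```

The ~235G left over on 8xH100 matches the "about 230G" figure above; the extra ~130G on 4xMI300 is what buys the headroom for prompt caching and similar tricks.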

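And for the fine-tuning bullet: the usual rule of thumb is that full training with Adam in mixed precision costs roughly 16 bytes per parameter, which is why a single box cannot fully fine-tune 405b, while LoRA's small adapters fit easily. The rank, target projections, and byte counts below are illustrative assumptions, not a recipe from the post.

```python
# Why full fine-tuning of 405b needs a distributed setup, while LoRA fits
# in one large-memory box. Rule-of-thumb costs; exact numbers vary by stack.

GB = 1e9
params = 405e9

# Full fine-tuning, Adam + mixed precision: ~2 (fp16 weights) + 2 (grads)
# + 12 (fp32 master weights and two optimizer moments) = 16 bytes/param.
full_ft = params * 16
print(f"full fine-tune: ~{full_ft / GB:,.0f}G (vs. 768G in a 4xMI300 box)")

# LoRA: freeze the quantized base model, train rank-r adapters only.
# Assumed: rank 16 on the q/v projections of all 126 layers.
layers, hidden, kv_dim, rank = 126, 16384, 8 * 128, 16
lora_params = layers * (rank * (hidden + hidden)     # q_proj: hidden -> hidden
                        + rank * (hidden + kv_dim))  # v_proj: hidden -> kv_dim
adapter_state = lora_params * 16                     # adapters trained w/ Adam
frozen_base = params * 1                             # int8/fp8 frozen weights
print(f"LoRA: ~{lora_params / 1e6:.0f}M trainable params, "
      f"~{(frozen_base + adapter_state) / GB:.0f}G weights+optimizer "
      f"(activations for the batch come on top)")
```

Even with the base weights quantized, full fine-tuning lands an order of magnitude over a single box's memory, which is exactly the "(1) quantized training or (2) go distributed" choice above.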
[Image: screenshot of Lepton AI's "tuna" runtime serving the 405b model on 4xMI300]

