Baseten is doing some amazing work on optimizing inference with TensorRT-LLM and NVIDIA H100 GPUs. Check out their platform today to see how much you can save by taking advantage of these NVIDIA AI technologies.
I can't wait to see the improvements once FP8 is leveraged!
Some Highlights:
The NVIDIA Hopper architecture introduces new features, such as fourth-generation Tensor Cores with FP8 support, that expand the H100's inference capabilities.
NVIDIA's TensorRT, an inference optimization SDK, is crucial for getting the best ML inference performance out of large language models and models like Stable Diffusion XL.
Benchmarks for Mistral 7B (Mistral AI) and Stable Diffusion XL (Stability AI) show significant performance gains (18%-45%) when using TensorRT and TensorRT-LLM for model inference on H100 GPUs.
TensorRT optimizes inference by compiling model-specific engines that tune individual model layers and the underlying CUDA kernels, fully leveraging the H100's new hardware features.
Great work Pankaj Gupta, Amir Haghighat, Philip Kiely, and the rest of the Baseten team!
Farshad Saberi Movahed, PhD, Nick Comly, Julien Demouth, Daman Oberoi, Michael (Zhikui) Wang
Launching today 🎉
Double your throughput or halve your latency for Mistral AI, Stability AI + others?
Do both at ~20% lower cost with NVIDIA H100s on Baseten.
Using NVIDIA TensorRT, Baseten has unlocked the full potential of the H100 architecture for lightning-fast inference. With TensorRT, customers get 2-3x higher inference throughput at ~20-40% lower cost than on an equivalent A100.
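Those two numbers compound: higher throughput per GPU-hour can more than offset a higher hourly price. A quick back-of-envelope sketch in Python, using made-up hourly prices and token rates (illustrative placeholders only, not Baseten's or NVIDIA's actual figures):

```python
# Back-of-envelope cost-per-token comparison for the claim above.
# All prices and throughput numbers below are hypothetical placeholders.

def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly GPU price and sustained throughput into $ per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

# Hypothetical: A100 at $2.00/hr sustaining 1,000 tokens/s
a100 = cost_per_million_tokens(2.00, 1_000)

# Hypothetical: H100 at $3.50/hr sustaining 2,500 tokens/s (2.5x throughput)
h100 = cost_per_million_tokens(3.50, 2_500)

savings = 1 - h100 / a100
print(f"A100: ${a100:.2f}/1M tok, H100: ${h100:.2f}/1M tok, savings: {savings:.0%}")
# → A100: $0.56/1M tok, H100: $0.39/1M tok, savings: 30%
```

With these hypothetical numbers, 2.5x the throughput at a 75% higher hourly price still lands at roughly 30% lower cost per token, consistent with the ~20-40% range quoted above.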
https://lnkd.in/dBqJfeUY
Unlocking the full power of NVIDIA H100 GPUs for ML inference with TensorRT