Thanks to CoreWeave for the seamless integration of our solution on CoreWeave infrastructure! It demonstrates how quickly GPU infrastructure can adapt to new innovations.
Our case study with Cerebrium is now live on our website! Together with Decart AI, Cerebrium set out to see whether Llama 2 70B could be served at $0.50 per million tokens while keeping latency low. Hitting that target is only possible with highly performant, cost-effective infrastructure: Cerebrium's serverless GPU platform lets companies scale from 0 to 10,000 requests in seconds, which translates to large cost savings compared to other platforms.

Decart built an #LLM inference engine from scratch in C++ and NVIDIA #CUDA. By leveraging NVIDIA #H100 GPUs and newer versions of #CUTLASS, they achieved the same cost per token for Llama 2 70B on H100s as they had on #A100 GPUs.

Take a look at our blog post to see how Decart and Cerebrium used CoreWeave infrastructure with NVIDIA hardware and software to increase throughput and decrease latency across the board. #Falcon180B and #Llama2 70B benchmarks are included in the blog as well! https://hubs.la/Q029BMrw0 #LLM #NVIDIA #H100 #GPU
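For intuition, the $0.50 per million tokens target can be translated into the per-GPU throughput an inference engine must sustain. The sketch below does that back-of-the-envelope arithmetic; the H100 hourly price it uses is a hypothetical assumption for illustration, not a figure from the case study.

```python
def required_throughput(cost_per_million_tokens: float, gpu_cost_per_hour: float) -> float:
    """Tokens/sec a single GPU must sustain to hit the target serving cost."""
    cost_per_token = cost_per_million_tokens / 1_000_000
    tokens_per_hour = gpu_cost_per_hour / cost_per_token
    return tokens_per_hour / 3600  # convert tokens/hour to tokens/sec

# Assumed (hypothetical) H100 price of $4.25/hr against the $0.50/M-token target:
tps = required_throughput(0.50, 4.25)
print(f"{tps:.0f} tokens/sec")  # roughly 2361 tokens/sec
```

The takeaway: the cheaper the target price or the pricier the GPU, the higher the sustained tokens/sec the engine must deliver, which is why kernel-level work with CUDA and CUTLASS matters for hitting a cost target like this.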