Check out our latest work on LLM compression and efficient training & deployment!
Excited to share our latest preprint detailing our team's recent work at LinkedIn: https://lnkd.in/dWHTuKJm

Our focus has been on training and deploying efficient Large Language Models (LLMs) across a range of predictive and generative applications. Through techniques such as knowledge distillation, model compression via pruning and quantization, and CUDA kernel optimization, we've developed and deployed small language models that largely preserve the quality of larger foundation models while delivering significantly higher inference throughput and lower latency. Notably, we've achieved over a 20x reduction in model size with minimal impact on model quality.

In the paper, we detail our approach to model compression and efficiency, sharing practical insights gained along the way and covering both the methodology and the practice of efficient LLM deployment. In particular, we demonstrate the power of model pruning via combinatorial optimization, adding to the growing list of real-world applications of discrete optimization.

Read more about our work:
Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications: https://lnkd.in/dWHTuKJm
Structured pruning with OSSCAR: https://lnkd.in/d8emmFQM
Model quantization with QuantEase: https://lnkd.in/dZna796n
360Brew: A foundation model for personalized recommendation: https://lnkd.in/dUXydhaZ

Kudos to our amazing team, and especially Aman Gupta, Yun Dai, Qingquan Song, and Ata Fatahi, who made this work possible!