Graphcore’s Post


Accelerate your AI intelligence with our research team's regular digest of the most consequential new papers. https://lnkd.in/geY4dkAA

TriForce, QuaRot, Mixture-of-Depths: Papers of the Month (Apr 2024)

graphcore.ai

The most advanced MoE models need block-sparsity optimization for grouped GEMM, which is exactly where the IPU has been intensively optimized. MoE sparsity has become the leading choice of sparsity for pre-training, thanks to its economic advantages and quality comparable to dense models. Two months ago NVIDIA picked up a grouped-GEMM implementation and integrated it into Megatron by fusing the top-k and gating-score functions. In the most sophisticated MoE training setups, the distinct experts (8~162) can be distributed across the data-parallel group to form an expert-data-parallel (EDP) scheme. DeepSeek-V2 has now shown that low-rank attention compression, together with more routed experts (selecting 6 of 160) and 2 shared experts, can significantly reduce the KV cache during the prefill stage while maintaining strong performance. Three months ago, researchers also showed that tokens of specific genres are more likely to be routed to particular experts, which enables branch prediction in subsequent layers once the first layer's routing is known. With all of these techniques combined, MoE sparsity is creating, and will continue to create, the most advanced inference experiences.
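To make the routing pattern above concrete, here is a minimal PyTorch sketch of top-k gating with DeepSeek-V2-style shared plus routed experts. It is an illustrative sketch only: the class name, layer sizes, and the per-expert Python loop are assumptions for readability, not Megatron's fused top-k/gating kernels or an actual grouped-GEMM implementation.

```python
# Minimal sketch of top-k MoE routing with shared + routed experts.
# Illustrative only; a production kernel would batch tokens per expert
# and run one grouped GEMM instead of the Python loop below.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Top-k routed experts plus always-on shared experts (toy example)."""

    def __init__(self, d_model=512, d_ff=1024, n_routed=160, n_shared=2, top_k=6):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against each routed expert.
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.routed = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_routed)
        )
        # Shared experts process every token and bypass the router.
        self.shared = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_shared)
        )

    def forward(self, x):  # x: [num_tokens, d_model]
        gate = F.softmax(self.router(x), dim=-1)
        topk_score, topk_idx = gate.topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for expert in self.shared:          # shared experts see all tokens
            out = out + expert(x)

        # Dispatch each token to its top-k routed experts, weighted by the
        # gating score. Equivalent arithmetic to a grouped GEMM, just slow.
        for k in range(self.top_k):
            for eid in topk_idx[:, k].unique():
                sel = topk_idx[:, k] == eid
                out[sel] = out[sel] + topk_score[sel, k, None] * self.routed[int(eid)](x[sel])
        return out


if __name__ == "__main__":
    layer = SparseMoE(d_model=64, d_ff=128, n_routed=16, n_shared=2, top_k=6)
    tokens = torch.randn(8, 64)
    print(layer(tokens).shape)  # torch.Size([8, 64])
```

In a real system the inner loop is replaced by sorting tokens by expert id and issuing a single grouped GEMM over the per-expert batches, with the top-k and gating-score computation fused into the same kernel, which is the Megatron integration the comment refers to.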
