Epoch AI’s Post

Can AI scaling continue through 2030? Our new report examines whether constraints on power, chip manufacturing, training data, or data center latencies might hinder AI growth. Our analysis suggests that AI scaling can likely continue its current trend through 2030.

Training state-of-the-art AI models requires a massive amount of computation, growing by 4x every year. If this trend continues, we will see training runs 10,000x larger than GPT-4 by 2030. Achieving such a scale-up would require immense resources. For each key bottleneck (power, chips, data, and latencies), we project the potential scale using semiconductor foundries' expansion plans, electricity providers' forecasts, industry data, and our own research.

🔌 Power: Meta's Llama 3.1 405B training used 16,000 H100 GPUs, consuming about 30 MW. By 2030, the largest training runs could demand 5 GW of power, accounting for efficiency gains and increased training durations. A data center with >1 GW capacity would be unprecedented, but is in line with stated industry plans. Distributed training runs spanning multiple US states could go further still, doubling or even 10x-ing the power a single campus could muster.

💾 Chip production: 16,000 H100s is far from the tens of millions of chips needed to scale 10,000x beyond GPT-4. While GPU production is constrained by advanced packaging and high-bandwidth memory, foundries like TSMC are on track to expand their capacity and meet this demand. Planned scale-ups and efficiency gains could allow 100M H100-equivalent GPUs to be dedicated to a 9e29 FLOP training run by 2030, accounting for the fact that GPUs will be split between several labs and between training and inference. This figure could be much higher if most of TSMC's top wafers went to AI.

📚 Training data: All indexed web text could already support training runs several thousand times larger than today's. By 2030, this stock of data may have grown enough for a 10,000x scale-up. Multimodal data (image, video, audio) could expand AI training scale by ~10x. Synthetic data shows promise in domains like coding and math, but risks model collapse; it could enable multiple orders of magnitude more scaling, though at increased compute cost.

⏳ Latency: As models grow, they need more sequential operations per training example, which limits the size of training runs. Increasing the batch size helps, but with diminishing returns. On modern hardware, these latency constraints would cap training runs at around 1e32 FLOP. Exceeding this would require new network designs or lower-latency hardware.

Despite these significant bottlenecks, our estimates suggest they won't materially slow the growth rate of AI training runs. We could therefore see another major scale-up by 2030, comparable to the jump from GPT-2 to GPT-4.

You can learn more about each bottleneck, and our assumptions, by reading the full report here: https://lnkd.in/dKAnUYGJ
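[Editor's note] A minimal back-of-envelope sketch in Python of how the headline numbers above fit together. It uses only figures quoted in the post; treating GPT-4 as a 2023-era training run, and reusing the Llama 3.1 per-GPU power draw for a hypothetical 5 GW run, are assumptions made here for illustration, not figures from the report.

```python
# Back-of-envelope sketch of the scaling arithmetic quoted in the post.
# Assumption: GPT-4-scale runs are treated as 2023-era for the growth calculation.

GROWTH_PER_YEAR = 4          # training compute growth factor per year (from the post)
YEARS = 2030 - 2023          # years from GPT-4-era training runs to 2030 (assumed baseline year)

scaleup = GROWTH_PER_YEAR ** YEARS
print(f"Compute scale-up at 4x/year over {YEARS} years: ~{scaleup:,.0f}x")
# -> ~16,000x, i.e. on the order of the "10,000x larger than GPT-4" claim

# Power: Llama 3.1 405B used 16,000 H100 GPUs drawing ~30 MW in total (from the post).
H100_COUNT = 16_000
CLUSTER_POWER_MW = 30
kw_per_gpu = CLUSTER_POWER_MW * 1_000 / H100_COUNT
print(f"Implied power per H100 incl. overhead: ~{kw_per_gpu:.1f} kW")
# -> ~1.9 kW per GPU

# A hypothetical 5 GW training run using GPUs with today's per-GPU power draw:
RUN_POWER_GW = 5
gpus_at_5gw = RUN_POWER_GW * 1e6 / kw_per_gpu
print(f"H100-class GPUs supportable at 5 GW: ~{gpus_at_5gw:,.0f}")
# -> ~2.7M GPUs at today's efficiency
```

On these inputs, a 5 GW budget supports roughly 2.7M H100-class GPUs at today's power draw, so reaching a ~10,000x compute scale-up within that budget depends on the efficiency gains and longer training durations the post mentions.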
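[Editor's note] To show the shape of the latency argument, here is a toy calculation with entirely hypothetical inputs: the training duration, per-step latency floor, critical batch size, and parameter count below are made-up illustrative values, and the ~6·N·D training-FLOP rule is a standard transformer approximation rather than something stated in the post. This is not the report's model; it only illustrates why a floor on sequential step time, combined with a cap on useful batch size, bounds total training compute.

```python
# Toy illustration of the latency bound (NOT the report's model).
# All numbers below are hypothetical and chosen only for illustration.

TRAIN_SECONDS = 300 * 86_400    # assume a ~300-day training run
MIN_STEP_LATENCY_S = 0.5        # assumed floor on forward+backward wall-clock time per step
MAX_BATCH_TOKENS = 3e8          # assumed "critical batch size" in tokens,
                                # beyond which larger batches stop helping
PARAMS = 1e15                   # assumed 2030-scale parameter count

max_steps = TRAIN_SECONDS / MIN_STEP_LATENCY_S    # sequential steps that fit in the run
max_tokens = max_steps * MAX_BATCH_TOKENS         # tokens the run can process
max_flop = 6 * PARAMS * max_tokens                # standard ~6*N*D training-FLOP approximation

print(f"Max optimizer steps: {max_steps:.2e}")
print(f"Max tokens processed: {max_tokens:.2e}")
print(f"Implied ceiling on training compute: {max_flop:.2e} FLOP")
# With these illustrative inputs the ceiling lands around 1e32 FLOP,
# the same order of magnitude as the bound quoted in the post.
```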

I would be surprised if we can continue to "scale" the methods that pass as AI today for another 2-3 years. The methods are inelegant and profoundly inefficient, and substituting massive statistical modeling for reasoning will never work, as shown by the increasingly more-than-anecdotal amusing results. A more likely scenario is that some clever AI researchers who are not hypnotized by throwing computation at a problem, rather than designing a solution, will come up with a much more comprehensive approach. From a mathematics perspective, throwing more and more computation at a problem is rarely (read: never) the optimal solution; finding an efficient algorithm is orders of magnitude more important.

Nils Ulltveit-Moe

Associate Professor at University of Agder

That is assuming that no other disruptive AI technology takes over from deep-neural-network-based training, such as #LiteralLabs #Tsetlinmachine, which is vastly more energy- and computationally efficient than neural networks.

Nice!
