Meta intros two GPU training clusters for Llama 3
The Facebook parent company said the training clusters are part of its plans to grow its infrastructure and obtain 350,000 Nvidia H100 GPUs by the end of the year.
Meta on Tuesday introduced two 24K-GPU clusters that it is using to train its forthcoming Llama 3 LLM.
The two clusters were built using Grand Teton, Meta's in-house-designed open GPU hardware platform, and Open Rack, the tech giant's power and rack architecture. The clusters also run PyTorch, the open source machine learning framework used for applications such as computer vision and natural language processing.
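For readers unfamiliar with that stack, here is a minimal sketch of the data-parallel pattern PyTorch supports for training across many GPUs. It is illustrative only; Meta has not published its Llama 3 training code, and the toy model below stands in for an LLM:

```python
# Illustrative only -- not Meta's training code. One process drives each
# GPU; DistributedDataParallel averages gradients across all of them
# after every backward pass. Launch with, e.g.:
#   torchrun --nproc_per_node=8 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE and MASTER_ADDR for each process.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # A toy linear layer stands in for a large language model.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        batch = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(batch).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same pattern scales from a single eight-GPU server to clusters of thousands of nodes; what changes at Meta's scale is the network fabric and scheduling around it, not the basic training loop.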
Facebook's parent company also revealed plans to grow its infrastructure to include 350,000 Nvidia H100 GPUs by the end of the year. The GPUs cost about $40,000 each, putting the planned acquisition at roughly $14 billion.
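The arithmetic behind that figure, as a quick sanity check (assuming every one of the 350,000 GPUs is bought at the cited $40,000 unit price, which ignores volume discounts and hardware Meta already owns):

```python
# Back-of-envelope check of the reported outlay. Assumes a flat
# $40,000 per GPU -- a street-price estimate, not Meta's actual cost.
gpu_count = 350_000
unit_price_usd = 40_000

total_usd = gpu_count * unit_price_usd
print(f"${total_usd / 1e9:.0f} billion")  # -> $14 billion
```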
The move comes a month after Facebook CEO Mark Zuckerberg posted a Reel on Instagram, also owned by Meta, saying Meta's roadmap for AI requires "massive compute infrastructure."
Looking to innovate
Meta is trying to embody the cutting edge in AI infrastructure, said R "Ray" Wang, CEO of Constellation Research.
"Meta is competing to be in an open AI ecosystem of insight value exchange," he said.
He said the social media giant is trying to share signals and insights, in contrast with other vendors that focus on providing enterprises with compute power.
Even as Meta builds up its own compute and software infrastructure, it gives away its Llama family of open source LLMs. The company is betting that organizations will use the generative AI technology to power products and services built on insights about consumer behavior shared across the largest social networks, all owned by Meta.
"Everybody else wants to provide you the infrastructure so that you can run AI," Wang continued. "Meta wants to provide you with the tools so you can use AI. They want to be able to use the output of all of that."
Providing the tools also benefits Meta by letting it gather more and better insights to drive the next advancements in AI technology, he said.
The new GPU clusters build on Meta's history of building AI superclusters, Gartner analyst Chirag Dekate noted.
In January 2022, Meta introduced its AI Research Supercluster (RSC), a supercomputer to help AI researchers build new and better AI models.
The two GPU clusters are a natural progression, Dekate said.
"They're building these large GPU clusters to help them develop next-generation extreme-scale GenAI models," he said.
Not only will the GPU clusters help Meta build generative AI models like Llama 3 -- which is not yet generally available but is expected in July -- but they will also help the technology vendor as it likely develops multimodal models for the augmented reality and virtual reality products it builds, Dekate added.
While the GPU clusters do not directly affect enterprises, they will interest those looking to build on open source innovations.
"With the innovations that Meta builds, especially in the Llama portfolio, using this newest AI supercluster will be a core foundation that enterprises can also benefit from," Dekate said.
Challenges for Meta and enterprises
In addition to the financial cost, Meta's commitment to obtaining 350,000 Nvidia H100 GPUs means it will need hundreds of megawatts of power capacity to run its cluster environments.
"These are going to be energy hogs," Dekate said. "They're going to be resource intensive."
Not only is power capacity a potential limitation, but so is supply, Wang said.
"Everybody is making these big announcements like they're buying this many GPUs from Nvidia," he said. "They're limited by production capacity. There's already a shortage."
For enterprises, the challenge is not to mistake plans like these from tech giants like Meta as a requirement to create their own GPU farms, Dekate said.
"For most enterprise-grade generative AI, you can actually do with a lot less," he said. "You can actually start with existing CPU infrastructures to solve small-scale models and enterprise [problems]."
However, for a tech giant like Meta, the massive GPU clusters and supercomputers let it contribute to a broader open source ecosystem, he added.
Esther Ajao is a TechTarget Editorial news writer and podcast host covering artificial intelligence software and systems.