Moving Generative AI Past Transformers
for Efficient Language Models with Lower Compute Needs

The development of LLMs and multimodal AI models (like Gemini) has led to the belief that GPUs are foundational to the growth of AI. While GPUs are important, they are a stepping stone rather than the ultimate currency of AI progress.

Here at Google, the birthplace of the transformer, Google DeepMind published research on a new architecture that can train LLMs on far fewer tokens and, therefore, with fewer GPUs. You can read the paper: "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models"

In the research, Griffin matched the performance of Llama-2 despite being trained on over 6 times fewer tokens. Griffin combines gated linear recurrences with local attention, achieving performance comparable to transformers on certain tasks while requiring fewer tokens for training. This translates to reduced computational needs and less reliance on GPUs.
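
To make the recurrence idea concrete, here is a minimal sketch of a gated linear recurrence in Python/NumPy. It is a simplified, hypothetical illustration rather than the exact RG-LRU layer from the Griffin paper (whose gating formula differs and which is interleaved with local-attention blocks); the point is that the per-step state has a fixed size, no matter how long the sequence grows.

```python
import numpy as np

def gated_linear_recurrence(x, w_a, w_i):
    """Run a simplified gated linear recurrence over a sequence.

    x:   (seq_len, dim) input activations
    w_a: (dim, dim) weights producing the forget gate a_t
    w_i: (dim, dim) weights producing the input gate i_t
    """
    seq_len, dim = x.shape
    h = np.zeros(dim)            # fixed-size state, independent of seq_len
    outputs = np.empty_like(x)
    for t in range(seq_len):
        a = 1.0 / (1.0 + np.exp(-x[t] @ w_a))   # forget gate in (0, 1)
        i = 1.0 / (1.0 + np.exp(-x[t] @ w_i))   # input gate in (0, 1)
        # Element-wise linear update: no attention over all past positions,
        # so each step costs O(dim) compute and O(dim) memory.
        h = a * h + (1.0 - a) * (i * x[t])
        outputs[t] = h
    return outputs

# Example: a 16-step sequence with a 4-dimensional state.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
y = gated_linear_recurrence(x, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
print(y.shape)  # (16, 4)
```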

Griffin performs better than transformers when evaluated on sequences longer than those seen during training, and can also efficiently learn copying and retrieval tasks from its training data.

Building on Griffin's foundation, we introduced RecurrentGemma, a state-of-the-art open model that exemplifies the ability of RNNs to deliver high performance while training on fewer tokens. You can read the paper: "RecurrentGemma: Moving Past Transformers for Efficient Open Language Models"

RecurrentGemma combines linear recurrences with local attention to achieve excellent performance on language tasks. It has a fixed-size state, which reduces memory use and enables efficient inference on long sequences; this is in contrast to transformers, whose memory usage grows with sequence length. RecurrentGemma-2B also achieves performance similar to Gemma-2B while being trained on fewer tokens.
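
As a rough illustration of that contrast, the back-of-the-envelope sketch below compares a transformer's KV cache, which grows linearly with sequence length, against a Griffin-style fixed recurrence state plus a bounded local-attention window. The model dimensions are made-up placeholders, not RecurrentGemma's published configuration.

```python
def kv_cache_bytes(seq_len, layers, heads, head_dim, bytes_per_value=2):
    # Transformer inference caches keys and values for every past token,
    # at every layer and head, so memory grows with seq_len.
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

def recurrent_state_bytes(layers, state_dim, window, heads, head_dim, bytes_per_value=2):
    # A recurrence-plus-local-attention model keeps a fixed-size state per layer
    # and keys/values only for a bounded local window, independent of seq_len.
    recurrence = layers * state_dim
    local_kv = 2 * layers * heads * head_dim * window
    return (recurrence + local_kv) * bytes_per_value

# Placeholder shape: 26 layers, 8 heads of width 256, 2,560-dim state, 2,048-token window.
for seq_len in (2_048, 32_768, 262_144):
    print(
        f"{seq_len:>7} tokens | "
        f"KV cache: {kv_cache_bytes(seq_len, 26, 8, 256) / 2**20:8.1f} MiB | "
        f"recurrent state: {recurrent_state_bytes(26, 2560, 2048, 8, 256) / 2**20:6.1f} MiB"
    )
```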

You can download RecurrentGemma on Hugging Face and deploy it on Vertex AI.
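
If you want to experiment locally before deploying, a minimal Hugging Face Transformers snippet along these lines should work. Treat it as a sketch: the checkpoint id "google/recurrentgemma-2b" and the exact library version requirements are assumptions to confirm on the model card.

```python
# pip install -U transformers accelerate torch
# Assumes the checkpoint id "google/recurrentgemma-2b" (verify on the model card)
# and a transformers release recent enough to include RecurrentGemma support.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/recurrentgemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to fit on a single accelerator
    device_map="auto",
)

prompt = "Recurrent architectures handle long sequences by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```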

Just as Moore's Law predicted the exponential growth in transistor density that made integrated circuits smaller yet more powerful, Griffin lays the groundwork for future LLM architectures that require fewer tokens and less computational power. This opens up exciting possibilities for building more capable models at lower cost, without being constrained by compute availability.
