AI researchers run AI chatbots at a lightbulb-esque 13 watts with no performance loss — stripping matrix multiplication from LLMs yields massive gains

LED lightbulbs, which usually consume about 10 watts of power apiece. (Image credit: Getty Images)

A research paper from UC Santa Cruz, along with an accompanying writeup, describes how AI researchers found a way to run modern, billion-parameter-scale LLMs on just 13 watts of power. That's about the same as a 100W-equivalent LED bulb, but more importantly, it's about 50 times more efficient than the 700W needed by data center GPUs like the Nvidia H100 and H200, never mind the upcoming Blackwell B200, which can use up to 1200W per GPU.

The work was done using custom FPGA hardware, but the researchers clarify that most of their efficiency gains can be applied through open-source software and tweaking of existing setups. Most of the gains come from the removal of matrix multiplication (MatMul) from the LLM training and inference processes.

How was MatMul removed from a neural network while maintaining the same performance and accuracy? The researchers combined two methods. First, they converted the numeric system to a "ternary" one using only -1, 0, and 1, which makes computation possible with summing rather than multiplying numbers. Second, they introduced time-based computation, giving the network an effective "memory" that lets it perform even faster while running fewer operations.
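The paper pairs this with custom hardware, but the ternary half of the idea is easy to sketch in plain Python. Below is a minimal illustration, assuming a single dense layer whose weights have already been quantized to -1, 0, and 1; the function name and shapes are made up for the example, and the time-based "memory" part of the method is not shown.

```python
import numpy as np

def ternary_matvec(x, w_ternary):
    """Compute the equivalent of x @ w_ternary without any multiplications.

    x         : (in_features,) activation vector
    w_ternary : (in_features, out_features) weight matrix restricted to {-1, 0, +1}

    Because every weight is -1, 0, or +1, each output element is just a signed
    sum of selected activations, so no multiply instruction is ever needed.
    """
    out = np.zeros(w_ternary.shape[1], dtype=x.dtype)
    for j in range(w_ternary.shape[1]):
        plus = x[w_ternary[:, j] == 1].sum()    # add activations where the weight is +1
        minus = x[w_ternary[:, j] == -1].sum()  # subtract activations where the weight is -1
        out[j] = plus - minus                   # weights of 0 contribute nothing
    return out

# Sanity check against an ordinary matrix product
rng = np.random.default_rng(0)
x = rng.standard_normal(8).astype(np.float32)
w = rng.integers(-1, 2, size=(8, 4)).astype(np.float32)  # random ternary weights
assert np.allclose(ternary_matvec(x, w), x @ w)
```

The point of the sketch is that every output is built from additions and subtractions alone, the kind of operation that adders in an FPGA or ASIC handle far more cheaply than full multipliers.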

The mainstream model that the researchers used as a reference point is Meta's LLaMa LLM. The endeavor was inspired by a Microsoft paper on using ternary numbers in neural networks, though Microsoft did not go as far as removing matrix multiplication or open-sourcing their model like the UC Santa Cruz researchers did.

It boils down to an optimization problem. Rui-Jie Zhu, one of the graduate students working on the paper, says, "We replaced the expensive operation with cheaper operations." Whether the approach can be universally applied to AI and LLM solutions remains to be seen, but if viable it has the potential to radically alter the AI landscape.

We've witnessed a seemingly insatiable appetite for power from leading AI companies over the past year. This research suggests that much of the race to be first has relied on inefficient processing methods. We've heard warnings from reputable figures like Arm's CEO that, if AI power demands continue to increase at current rates, AI could consume a quarter of the United States' power by 2030. Cutting power use to 1/50 of the current amount would represent a massive improvement.

Here's hoping Meta, OpenAI, Google, Nvidia, and all the other major players will find ways to leverage this open-source breakthrough. Faster and far more efficient processing of AI workloads would bring us closer to human brain levels of functionality — a brain gets by with approximately 0.3 kWh of energy per day by some estimates, or 1/56 of what an Nvidia H100 requires. Of course, many LLMs require tens of thousands of such GPUs and months of training, so our gray matter isn't quite outdated just yet.
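As a rough sanity check of that 1/56 figure, assuming the H100 draws its full 700W continuously:

```python
h100_kwh_per_day = 0.700 * 24                 # 700 W for 24 hours = 16.8 kWh
brain_kwh_per_day = 0.3                       # rough estimate cited above
print(brain_kwh_per_day / h100_kwh_per_day)   # ~0.0179, i.e. roughly 1/56
```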

Christopher Harper
Contributing Writer

Christopher Harper has been a successful freelance tech writer specializing in PC hardware and gaming since 2015, and ghostwrote for various B2B clients in High School before that. Outside of work, Christopher is best known to friends and rivals as an active competitive player in various eSports (particularly fighting games and arena shooters) and a purveyor of music ranging from Jimi Hendrix to Killer Mike to the Sonic Adventure 2 soundtrack.

  • rluker5
    Excellent.
    Reply
  • abufrejoval
    Cool, now all existing SoC NPU implementations are also architecturally outdated before vendors could ever prove why you would want Co-Pilot on your personal computer in the first place.
    Reply
  • hotaru251
    This is great from an efficiency standpoint, but I do wonder if Nvidia will try to take this and just scale it to use more power for more performance (since they long ago stopped caring about power usage).

    Hopefully it doesn't scale like that.
    Reply
  • salgado18
    abufrejoval said:
    Cool, now all existing SoC NPU implementations are also architecturally outdated before vendors could ever prove why you would want Co-Pilot on your personal computer in the first place.
    From the news:
    The work was done using custom FPGA hardware, but the researchers clarify that most of their efficiency gains can be applied through open-source software and tweaking of existing setups
    Which is very good news.
    Reply
  • DS426
    Yep, hilarious, as NPUs exist specifically for matrix math and now that might not be needed as a dedicated slice of silicon.

    No one has a crystal ball, but big tech has treated us mere peasants as complete fools for not going all in as fast as possible on "AI." This is what you get. I actually hope this new technique doesn't scale much above 50% on existing AI-purpose-built GPUs, as I feel like that crisp smack in the face is fully due. I'm also not saying it doesn't have its place, and progress is good when it's responsible, but it hasn't been responsible since the news blew up over ChatGPT; when did big business become such big risk takers??

    Oh and the best part: AI didn't figure out this optimization problem like all those hopefuls anticipated -- it was us with gray matter.
    Reply
  • DS426
    OPEN SOURCE? Oh no, don't mention that part to nVidia. :ROFLMAO:
    Reply
  • frogr
    How much energy is required to complete the computation? Watts x seconds = joules.
    The comparison in watts is only valid if both methods take the same amount of time.
    Reply
  • abufrejoval
    salgado18 said:
    From the news:

    Which is very good news.
    Not sure what you mean... did you read the publication, too?

    Because AFAIK these NPUs on the current SoCs aren't in fact FPGAs, but really GPGPU units that are optimized for MatMul at 4-16 bits of precision with distinct variants of fixed- or floating-point weights. And this new implementation eliminates that to a very large degree, or at least in many of those model layers, thus eliminating the benefits of those NPUs.

    At the same time, their new operations can't be efficiently implemented through some type of emulation on a CPU or even a normal GPU; they require FPGAs or an entirely new ASIC or IP block.

    If you look at the chip diagrams for this emerging wave of SoCs and see the rather large chunks being dedicated to NPUs these days, having those turn into effectively dark silicon before they even come to market isn't good news for the vendors, who've been banging on about the need for AI PCs for some time now.

    This can't be implemented or fixed in hardware that is currently being manufactured, if that's what you understood from the article.

    New hardware implementing this would be relatively cheap to make and operate, with vastly higher efficiencies, but it's new hardware that needs to fit somewhere. And while it should be easy to put on the equivalent of an Intel/Movidius Neural Stick or an M.2 card, that's either clumsy or hard to fit in emerging ultrabooks.

    It's good news for someone like me, who really doesn't want to pay for an NPU (I don't want any Co-Pilot on my PCs), because currently manufactured chips might come down in price quickly.

    But vendors won't be happy.
    Reply
  • bit_user
    DS426 said:
    it hasn't been responsible since the news blew up over ChatGPT; when did big business become such big risk takers??
    I think everyone was worried that AI would be game-changing, like the Internet was. Many companies that failed to predict or respond to the way their business was affected by the internet are no longer with us. And many of the big tech companies today are those which got their start during the internet boom (Google, Facebook, Amazon).

    abufrejoval said:
    But vendors won't be happy.
    I have yet to read the paper, but I'd be cautious about overreacting. They talk specifically about language models, so it might not apply to other sorts of models, like those used for computer vision or image generation.
    Reply
  • bit_user
    frogr said:
    How much energy is required to complete the computation? Watts x seconds = joules.
    The comparison in watts is only valid if both methods take the same amount of time
    The magazine article about it claims "> 50 times more efficient than typical hardware."
    https://news.ucsc.edu/2024/06/matmul-free-llm.html
    Another noteworthy quote from that article:
    "On standard GPUs, the researchers saw that their neural network achieved about 10 times less memory consumption and operated about 25 percent faster than other models. Reducing the amount of memory needed to run a powerful large language model could provide a path forward to enabling the algorithms to run at full capacity on devices with smaller memory like smartphones."
    This could be great for iGPUs and NPUs, which tend to have much less available memory bandwidth than dGPUs.

    Also:
    "With further development, the researchers believe they can further optimize the technology for even more energy efficiency."
    So, I guess the upshot is: sell Nvidia, buy AMD (Xilinx)? Intel is probably going to kick themselves for spinning off Altera.
    Reply