LLM inference is mostly memory-bound. A 12-channel Epyc Genoa with 4800MT/s DDR...

hedgehog · 2024-04-10T20:35:25 1712781325

I'm curious how the newer consumer Ryzens might fare. With LPDDR5X they have >100 GB/s memory bandwidth, and the GPUs have been improved quite a bit (16 TFLOPS FP16 nominal in the 780M). There are likely all kinds of software problems, but setting that aside, the perf/$ and perf/watt might be decent.

cjbprime · 2024-04-11T04:49:20 1712810960

Consumer Ryzens only have two-channel memory controllers. Running two dual-rank (double-sided) DIMMs per channel, which you would need to get enough RAM for LLMs, drops the memory bandwidth dramatically -- almost all the way back down to DDR4 speeds.

timschmidt · 2024-04-11T07:03:12 1712818992

Yup. Strix Halo will change this, with a 256-bit (4-channel) memory bus that both CPU and GPU can access. However, it is likely to be available only in laptop designs, probably with soldered-down RAM to reduce timing and board space issues. So it won't be easy to get enough memory for large LLMs with either. But it should be faster than previous models for LLM work.

hedgehog · 2024-04-11T20:36:21 1712867781

For consumer Ryzen to pencil out, it would require a cluster of APU-equipped machines with the model striped across them. Given, say, 16 GB of model per machine and 60 GB/s actual memory bandwidth at $500 each, it's favorable vs A100s if the software is workable (my guess is it's not today, due to AMD's spotty support). This is for inference; training would probably be too slow due to interconnect overhead.
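A rough way to put numbers on that, as a back-of-the-envelope sketch rather than a benchmark. It assumes decoding is purely memory-bound (every weight is read once per token), that the model is split evenly by layers across machines, and that interconnect cost is zero. The function name and the 64 GB model size in the usage note are hypothetical; the 16 GB, 60 GB/s and $500 defaults are the figures from the comment above.

```python
import math


def stripe_estimate(model_gb, gb_per_machine=16.0, bw_gbs=60.0, price_usd=500.0):
    """Memory-bound estimate for a model striped across identical machines.

    Assumptions (not measurements): weights are read once per token,
    the split is even, and interconnect overhead is ignored.
    """
    machines = math.ceil(model_gb / gb_per_machine)
    share_gb = model_gb / machines  # weights held by each machine

    # One request walks the machines in sequence, so it still has to
    # stream the whole model at a single machine's bandwidth.
    single_stream_tok_s = bw_gbs / model_gb

    # With enough concurrent requests to keep every stage busy, the
    # pipeline emits one token per stage time.
    pipelined_tok_s = bw_gbs / share_gb

    return {
        "machines": machines,
        "cost_usd": machines * price_usd,
        "single_stream_tok_s": single_stream_tok_s,
        "pipelined_tok_s": pipelined_tok_s,
    }
```

For a hypothetical 64 GB model, `stripe_estimate(64.0)` gives 4 machines at $2000 total, a bit under 1 token/s for a single request, and 3.75 tokens/s in aggregate once the pipeline is full. The comparison against A100s depends on prices and bandwidths the thread does not state, so those are left out.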

sliken · 2024-04-11T03:40:09 1712806809

If Epycs are too pricey, there's the Threadripper Pro with 8 channels, the AMD Siena/8000 series with 6 channels, and Threadripper with 4 channels.
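To compare those channel counts, theoretical peak bandwidth is just channels x transfer rate x bytes per transfer. A minimal sketch, assuming a 64-bit (8-byte) data path per channel and, purely for illustration, the same 4800 MT/s DDR5 on every platform; the speeds each platform actually supports differ, and the function name is made up here.

```python
def peak_bandwidth_gbs(channels, mts=4800, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s.

    channels * mega-transfers per second * bytes per transfer gives MB/s;
    dividing by 1000 converts to GB/s. Real sustained bandwidth is lower.
    """
    return channels * mts * bytes_per_transfer / 1000


# Channel counts mentioned in this thread; 4800 MT/s is an assumed
# common speed, not a spec for each part.
platforms = {
    "Epyc Genoa": 12,
    "Threadripper Pro": 8,
    "Siena": 6,
    "Threadripper": 4,
    "Consumer Ryzen": 2,
}

for name, channels in platforms.items():
    print(f"{name}: {peak_bandwidth_gbs(channels):.1f} GB/s")
```

At that assumed speed, the 12-channel part works out to 460.8 GB/s and a 2-channel desktop part to 76.8 GB/s.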

hmottestad · 2024-04-12T10:26:53 1712917613

That's interesting. It's about the same speed as the M3 Max then.

Have you tested it yourself?

Manabu-eo · 2024-04-18T14:09:39 1713449379

Nope, but this guy has a similar build: https://www.reddit.com/r/LocalLLaMA/comments/1bt8kc9/compari...

It seems to reach only a little above half the theoretical speed, and to scale only up to 32 threads for some reason. That might be a temporary software limitation or something more fundamental.
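For reference, the "theoretical speed" here is the memory-bound ceiling: tokens per second can't exceed bandwidth divided by the bytes of weights read per token. A small sketch of that arithmetic; the 40 GB model size is a hypothetical stand-in, 460.8 GB/s is the 12 x 4800 MT/s x 8 byte peak for the Genoa system, and the 0.5 efficiency is only a rounding of "a little above half".

```python
def tokens_per_second(bandwidth_gbs, model_gb, efficiency=1.0):
    """Upper bound on decode speed when every weight is read once per token.

    efficiency is the fraction of peak bandwidth actually achieved
    (1.0 gives the theoretical ceiling).
    """
    return bandwidth_gbs * efficiency / model_gb


ceiling = tokens_per_second(460.8, 40.0)        # theoretical limit
observed = tokens_per_second(460.8, 40.0, 0.5)  # roughly half of peak
print(f"ceiling {ceiling:.2f} tok/s, at 50% efficiency {observed:.2f} tok/s")
```

With those assumed inputs the ceiling is about 11.5 tokens/s and the half-efficiency figure about 5.8 tokens/s.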

timschmidt · 2024-04-12T10:48:27 1712918907

Should be at least twice the speed of the M3 Max, as the M3 CPU and GPU each get only about half the memory bandwidth available to the package. The M3 Max can't take full advantage of its memory bandwidth unless CPU, GPU, and NPU are all working at the same time.

hmottestad · 2024-04-12T16:45:01 1712940301

I tried looking for some info on this but could only find the M1 Max review over at AnandTech, which managed to push 200 GB/s when using multiple CPU cores. I couldn’t really find any numbers for just the GPU that seemed realistic.

Do you have a source for the GPU only having access to half the bandwidth of the memory?