
4-bit quants should require about 85GB of VRAM, so this will fit nicely on 4x 24GB consumer GPUs, with some left over for the KV cache.
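Back-of-the-envelope, assuming ~141B total parameters for the 8x22B and ~4.8 effective bits per weight for a Q4_K_M-style quant (both numbers are assumptions, not from the release):

    # Rough size of the quantized weights (figures above are assumptions)
    params = 141e9          # ~141B total parameters assumed for an 8x22B MoE
    bits_per_weight = 4.83  # typical effective bpw of a Q4_K_M-style quant
    print(params * bits_per_weight / 8 / 1e9)  # ~85 GB, before KV cache and overhead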



4-bit should take up less than that; there are quite a few parameters shared between the experts.

But unless you’re running bs=1 it will be painful vs 8x GPU as you’re almost certain to be activating most/all of the experts in a batch.


I've found the 2 bit quant of Mixtral 8x7B is usable for some purposes with an 8GB GPU. I'm curious how this new model will work in similar cheap 8-16GB GPU configurations.


16GB will be way too small unfortunately — this has over 3x the param count, so at best you're looking at a 24GB card with extreme 2-bit quantization.

Really though if you're just looking to run models personally and not finetune (which requires monstrous amounts of VRAM), Macs are the way to go for this kind of mega model: Macs have unified memory between the GPU and CPU, and you can buy them with a lot of RAM. It'll be cheaper than trying to buy enough GPU VRAM. A Mac Studio with 192GB unified RAM is under $6k — two A6000s will run you over $9k and still only give you 96GB VRAM (and God help you if you try to build the equivalent system out of 4090s or A100s/H100s).
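Purely as back-of-the-envelope cost per GB, using the prices quoted above (street prices vary):

    # Rough $/GB of (V)RAM from the prices mentioned above
    print(6000 / 192)  # Mac Studio 192GB: ~$31 per GB of unified memory
    print(9000 / 96)   # 2x A6000:         ~$94 per GB of VRAM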

Or just rent the GPU time as needed from cloud providers like RunPod, although that may or may not be what you're looking for.


Reasonably priced Epyc systems with up to 12 memory channels and support for several TB of system memory are now available. Used datacenter hardware is even less expensive. They are on par with the memory bandwidth available to any one of the CPU, GPU, or NPU in the highest end Macs, but capable of driving MUCH more memory. And much simpler to run Linux or Windows on.


I would be very curious to see pricing on Epyc systems with terabytes of RAM that cost less than $6k including the RAM...


Well, the motherboard and CPU can be had for $1450. As they're built around standard cases, power supplies, and storage, many folks like me will have those already - far less costly than buying the same from Apple if you don't. Spend what you want on RAM; unlike with Apple, you can upgrade it any time.

Can't reuse my old parts on a brand new Mac, or upgrade it later if I find I need more. Lock-in is rough.

https://www.ebay.com/itm/315029731825?itmmeta=01HV561YV4AJG5...


Note that this is a "QS" CPU, very likely a B0-stepping ES according to posts elsewhere. A new one of those is around $3k USD alone. The 16-core can be had for around $1200, however, and the board for $780.

12x32 = 384 GB of RAM seems to be about $1400 right now. Going for less capacity doesn't save that much, unlike the insanely marked-up Apple memory. And then you need the CPU heatsink for $130.



I'm only seeing the errata for the B1 stepping there, not for the B0 stepping that those "QS" chips are.


Good catch! I'd missed that. Still, experiences of folks in the level1tech forums are positive: https://forum.level1techs.com/t/genoa-9654-qs-experiences/19...


Do you have any feel for the performance compared to the M3 Max?


LLM inference is mostly memory bound. A 12-channel Epyc Genoa with 4800 MT/s DDR5 RAM clocks in at 460.8 GB/s. That's more than the M3 Max's 400 GB/s, and only part of that is accessible to the CPU.
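The arithmetic, plus a very rough bs=1 decode ceiling (the ~39B active-parameter figure is an assumption):

    # Theoretical DDR5 bandwidth: channels x transfer rate (MT/s) x 8 bytes per transfer
    channels, mt_per_s = 12, 4800
    bw_gb_s = channels * mt_per_s * 8 / 1000
    print(bw_gb_s)  # 460.8 GB/s

    # Rough bs=1 decode ceiling = bandwidth / bytes of weights read per token.
    # For an MoE only the active experts are read; ~39B active params at ~4.8 bpw assumed.
    active_gb = 39e9 * 4.83 / 8 / 1e9  # ~23.5 GB touched per token
    print(bw_gb_s / active_gb)         # ~20 tokens/s, ignoring KV cache and overhead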

And in the Epyc system you can plug in much more memory for when you need capacity, and PCIe GPUs for when a smaller amount of faster memory will do.

Threadripper PRO is only 8-channel, but with memory overclocking it might reach numbers similar to those too.


I'm curious how the newer consumer Ryzens might fare. With LPDDR5X they have >100 GB/s memory bandwidth and the GPUs have been improved quite a bit (16 TFLOPS FP16 nominal in the 780M). There are likely all kinds of software problems but setting that aside the perf/$ and perf/watt might be decent.


Consumer Ryzens only have two-channel memory controllers. Two dual-rank (double sided) DIMMs per channel, which you would need to use to get enough RAM for LLMs, drops the memory bandwidth dramatically -- almost all the way back down to DDR4 speeds.


Yup. Strix Halo will change this, with a 256bit memory bus (4 channel) which CPU and GPU have access to. However it is only likely to be available in laptop designs and probably with soldered-down RAM to reduce timing and board space issues. So it won't be easy to get enough memory for large LLMs with either. But it should be faster than previous models for LLM work.
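For scale, a rough peak-bandwidth figure assuming LPDDR5X-8000 (the data rate is my assumption; final SKUs may differ):

    # Theoretical peak for a 256-bit bus at an assumed LPDDR5X-8000 data rate
    bus_bytes = 256 / 8             # 32 bytes per transfer
    print(bus_bytes * 8000 / 1000)  # 256 GB/s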


For consumer Ryzen to pencil out it would require a cluster of APU-equipped machines with the model striped across them. Given say 16GB of model per machine and 60GBps actual memory bandwidth @ $500 it's favorable vs A100s if the software is workable (which my guess is it's not today due to AMD's spotty support). This is for inference, training probably would be too slow due to interconnect overhead.


If Epycs are too pricey, there's Threadripper Pro with 8 channels, the AMD Siena/8000 series with 6 channels, and Threadripper with 4 channels.


That's interesting. It's about the same speed as the M3 Max then.

Have you tested it yourself?


Nope, but this guy has a similar build: https://www.reddit.com/r/LocalLLaMA/comments/1bt8kc9/compari...

It seems to reach only a little above half the theoretical speed, and scale only up to 32 threads for some reason. Might be a temporary software limitation or something more fundamental.


It should be at least twice the speed of the M3 Max, as the M3's CPU and GPU each only get about half of the memory bandwidth available to the package. The M3 Max can't take full advantage of its memory bandwidth unless the CPU, GPU, and NPU are all working at the same time.


I tried looking for some info on this but could only find the M1 Max review over at AnandTech, where multiple CPU cores together managed to push 200 GB/s; I couldn't really get any numbers for just the GPU that seemed realistic.

Do you have a source for the GPU only having access to half the bandwidth of the memory?


You can QLoRA decent models on 24GB VRAM. There are also optimised kernels like Unsloth's that are really VRAM efficient and good for hobbyists.
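For reference, a minimal QLoRA-style sketch with plain Hugging Face transformers + peft + bitsandbytes rather than Unsloth's kernels; the model name and LoRA hyperparameters are placeholders:

    # Minimal QLoRA sketch: 4-bit NF4 base model + LoRA adapters (placeholder model/params)
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",  # a model that actually fits in 24GB for QLoRA
        quantization_config=bnb,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the LoRA adapters are trained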


Yes, but I still don't think you'll be able to run Mixtral 8x22b with 16GB VRAM, or QLoRA it, even with Unsloth. It's much bigger than the original Mixtral.


AFAIK, 2-bit quant leads to too much loss of performance, such that you're better off using a different smaller model altogether. See here:

https://www.reddit.com/r/LocalLLaMA/comments/18ituzh/mixtral...


Wouldn't expect that to work at all.


Ollama (which wraps llama.cpp) supports splitting a model across devices so you get some acceleration even on models too big to fit entirely in GPU memory.
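A rough sketch of the same idea via llama-cpp-python, which exposes llama.cpp's partial offload; the GGUF path and layer count are placeholders:

    # Partial GPU offload: layers that don't fit in VRAM stay on the CPU
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mixtral-8x22b-q4_k_m.gguf",  # placeholder path to a GGUF file
        n_gpu_layers=20,  # offload as many layers as fit in VRAM; the rest run on CPU
        n_ctx=4096,
    )
    out = llm("Q: What is a mixture-of-experts model? A:", max_tokens=64)
    print(out["choices"][0]["text"])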



