
262 GB is not exactly small. But yes, it seems they're all getting them out the door in case they end up being worse than llama-3, in which case it'd be too embarrassing to release later.



Since it's an MoE model, it only needs to load a few of the 8 sub-models into VRAM to answer a query. So it may look large, but I think a quantized model will easily fit on a Mac with 64GB of memory, and with a few fewer bits it might even fit into 32GB.

I think it might be the end for 24GB 4090 cards though :(
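A rough back-of-envelope sketch of whether that holds, taking the ~262 GB figure quoted upthread as the 16-bit weight size. The 75% "usable memory" fraction is an assumption, not an exact macOS limit, and real GGUF quants mix bit widths and add overhead, so treat these as approximations:

  # Back-of-envelope: scale the ~262 GB fp16 weight size quoted upthread down
  # by bits per weight, and compare against an assumed usable fraction of a
  # Mac's unified memory. Real quant formats mix bit widths and add overhead,
  # so these are rough lower bounds, not exact file sizes.

  FP16_SIZE_GB = 262
  FP16_BITS = 16
  USABLE_FRACTION = 0.75   # assumption: leave ~25% of RAM for the OS and KV cache

  def quantized_weights_gb(bits_per_weight: float) -> float:
      return FP16_SIZE_GB * bits_per_weight / FP16_BITS

  for ram_gb in (32, 64, 128):
      budget = ram_gb * USABLE_FRACTION
      fitting = [b for b in (8, 6, 5, 4, 3, 2) if quantized_weights_gb(b) <= budget]
      best = (f"{max(fitting)}-bit (~{quantized_weights_gb(max(fitting)):.0f} GB)"
              if fitting else "nothing")
      print(f"{ram_gb:>3} GB machine: highest quant that fits is {best}")

Under those assumptions, a 64GB machine only clears the bar at very aggressive (2-3 bit) quantization, and 128GB is where 4-5 bit quants become comfortable, which lines up with the replies below.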


MoE models don't, in practice, selectively load experts on activation (and if a runtime could be designed to do that, it would make them perform worse, since the experts activated may differ from token to token, so you'd be doing a whole lot of churn swapping portions of the model into and out of VRAM). But they do less computation per token for their size than a monolithic model, so you can often get tolerable performance on CPU, or split between GPU/CPU at a ratio that would work poorly with a similarly-sized monolithic model.

But, still, it's going to need 262GB for the weights plus a variable amount based on context without quantization, and 66GB+ at 4-bit quantization.
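Two quick checks on those numbers, with Python and some assumed figures: the active-vs-resident parameter split (why compute per token is low even though every weight stays in memory), and the size of the "variable amount based on context", i.e. the KV cache. The 141B-total / 5B-shared split and the Mixtral-8x7B-style attention config (32 layers, 8 KV heads, head dim 128) are stand-in assumptions for illustration, not official specs for the new model:

  # 1) Compute vs. memory: all 8 experts stay resident, but only 2 run per token.
  TOTAL_PARAMS_B  = 141   # assumed total parameter count, in billions
  SHARED_PARAMS_B = 5     # assumed attention/embedding params used by every token
  N_EXPERTS, TOP_K = 8, 2

  expert_params_b = TOTAL_PARAMS_B - SHARED_PARAMS_B
  active_params_b = SHARED_PARAMS_B + expert_params_b * TOP_K / N_EXPERTS
  print(f"resident: ~{TOTAL_PARAMS_B} B params, used per token: ~{active_params_b:.0f} B "
        f"({active_params_b / TOTAL_PARAMS_B:.0%})")

  # 2) The "variable amount based on context" is mostly the KV cache:
  #    2 (K and V) x layers x KV heads x head_dim x tokens x bytes per element.
  #    Config values here are Mixtral 8x7B's, used as a stand-in; the new
  #    model is larger on most of these axes.
  N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

  def kv_cache_gib(context_tokens: int) -> float:
      return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * context_tokens * BYTES_FP16 / 2**30

  for ctx in (4_096, 32_768):
      print(f"{ctx:>6}-token context -> ~{kv_cache_gib(ctx):.1f} GiB of fp16 KV cache")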


Unless something has changed, it needs to load all 8 experts at the same time. During inference it performs like a 2x base model.

Mixtral 8x7B at 5-bit takes up over 30GB on my M3 Max. That's over 90GB for this one at the same quantization. Realistically you probably need a 128GB machine to run this with good results.


A 4-bit quant of the new one would still be about 70GB, so yeah. Gonna need a lot more RAM.
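A quick sanity check of the sizes in this sub-thread, assuming weight size scales roughly linearly with parameter count at a fixed bit width. The ~47B (Mixtral 8x7B) and ~141B (new model) totals are assumed, rounded figures, and real quantized files carry a few GB of extra overhead:

  PARAMS_8X7B_B = 47    # assumed total params, billions
  PARAMS_NEW_B  = 141   # assumed total params, billions

  def weights_gb(params_billion: float, bits: float) -> float:
      # billions of params x (bits / 8) bytes per param = GB of weights
      return params_billion * bits / 8

  print(f"8x7B  @ 5-bit: ~{weights_gb(PARAMS_8X7B_B, 5):.0f} GB (close to the ~30 GB measured above)")
  print(f"8x22B @ 5-bit: ~{weights_gb(PARAMS_NEW_B, 5):.0f} GB")
  print(f"8x22B @ 4-bit: ~{weights_gb(PARAMS_NEW_B, 4):.0f} GB (the ~70 GB figure above)")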


The 8x is misleading; there are 8 sets of expert weights per layer. If it is similar to the previous MoE Mistral models, two experts get activated per token per layer. This reduces the amount of compute and memory bandwidth you need for inference, but it doesn't reduce the amount of memory you need, since you can't load experts into GPU memory on demand without a performance impact.
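A minimal sketch of that top-2 routing in plain NumPy, with made-up dimensions and untrained weights; real implementations batch tokens by expert and use gated (SwiGLU-style) experts rather than the toy ReLU FFN here:

  import numpy as np

  # Per token, a router scores all 8 experts, only the top 2 run, and their
  # outputs are mixed by softmax-normalized router scores. All 8 expert
  # weight matrices still have to be resident in memory.

  rng = np.random.default_rng(0)
  D_MODEL, D_FF, N_EXPERTS, TOP_K = 64, 256, 8, 2

  router = rng.normal(size=(D_MODEL, N_EXPERTS))
  experts = [(rng.normal(size=(D_MODEL, D_FF)) * 0.1,   # up-projection
              rng.normal(size=(D_FF, D_MODEL)) * 0.1)   # down-projection
             for _ in range(N_EXPERTS)]

  def moe_layer(x: np.ndarray) -> np.ndarray:
      """Apply the MoE feed-forward block to one token's activation x (d_model,)."""
      logits = x @ router                               # score all experts
      top = np.argsort(logits)[-TOP_K:]                 # pick the 2 best
      weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
      out = np.zeros_like(x)
      for w, idx in zip(weights, top):
          w_up, w_down = experts[idx]
          out += w * (np.maximum(x @ w_up, 0.0) @ w_down)  # tiny ReLU FFN expert
      return out

  token = rng.normal(size=D_MODEL)
  print("experts chosen for this token:", sorted(np.argsort(token @ router)[-TOP_K:]))
  print("output shape:", moe_layer(token).shape)

Because the argsort result changes from token to token, streaming only the two chosen experts from disk would mean reloading different expert weights nearly every token, which is the churn described above.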


I think you're being an optimist here. I can barely run Mixtral 8x7B on my M2 Pro 32GB Mac, but I am grateful to be able to run it at all.


Which quantization level are you using?


Q2, so not so great. I usually run other models. I would be embarrassed to tell you how long my “ollama list” is.



