
What's the easiest way to run this assuming that you have the weights and the hardware? Even if it's offloading half of the model to RAM, what tool do you use to load this? Ollama? Llama.cpp? Or just import it with some Python library?

Also, what's the best way to benchmark a model to compare it with others? Are there any tools to use off-the-shelf to do that?




I think the llamafile[0] system works the best. The binary works on the command line or launches a mini webserver. Llamafile offers builds of Mixtral-8x7B-Instruct, so presumably they'll package this one up as well (probably in a quantized format).

You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as-is with llamafile?

[0] https://github.com/Mozilla-Ocho/llamafile
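
For what it's worth, a running llamafile serves a local llama.cpp webserver with an OpenAI-compatible endpoint (port 8080 by default, unless that's been changed), so you can script against it. A rough Python sketch, assuming those defaults:

    # Rough sketch: query a running llamafile server from Python.
    # Assumes the default port (8080) and the OpenAI-compatible route;
    # check the llamafile --help output if your setup differs.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",  # llamafile serves whatever weights it was started with
            "messages": [{"role": "user", "content": "Explain mixture-of-experts briefly."}],
            "temperature": 0.7,
        },
        timeout=300,
    )
    print(resp.json()["choices"][0]["message"]["content"])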


llamafile author here. I'm downloading Mixtral 8x22b right now. I can't say for certain it'll work until I try it, but let's keep our fingers crossed! If not, we'll be shipping a release as soon as possible that gets it working.

My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.
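
To put the memory math in perspective, here's a back-of-envelope sketch (the ~141B total parameter count for 8x22b and the effective bits-per-weight figures are rough assumptions, not measurements):

    # Back-of-envelope weight-memory estimate; ignores KV cache and runtime overhead.
    # Parameter counts and effective bits/weight are rough assumptions.
    def weight_gb(params_billion, bits_per_weight):
        """Approximate size of the quantized weights in GB."""
        return params_billion * bits_per_weight / 8

    for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
        print(f"8x22b at {name}: ~{weight_gb(141, bpw):.0f} GB")
    # For comparison, 8x7b (~47B params) at Q5_K_M comes out around 33 GB,
    # in the same ballpark as the 31GB figure above.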


Correct me if I'm wrong, but in the tests I've run, the matmul optimizations only have an effect if there's no other BLAS acceleration. If one can at least offload the KV cache to cuBLAS or run with OpenBLAS, it's not really used, right? At least I didn't see any speedup with that config when comparing that PR to the main llama.cpp branch.


The code that launches my code (see ggml_compute_forward_mul_mat) comes after CLBLAST, Accelerate, and OpenBLAS; those take precedence. So if you're not seeing any speedup from enabling them, it's probably because tinyBLAS has reached parity with those BLAS libraries. It's obviously nowhere near as fast as cuBLAS, but maybe PCIe memory transfer overhead explains it. It also depends on various other factors, like quantization type: for example, the BLAS libraries don't support formats like Q4_0, and tinyBLAS does.
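
If it helps, the precedence boils down to something like this (a Python sketch for illustration only, not the actual ggml C code in ggml_compute_forward_mul_mat):

    # Illustration of the dispatch order described above; not real ggml code.
    def pick_matmul_backend(compiled_backends):
        # External BLAS backends take precedence when they were compiled in.
        for backend in ("clblast", "accelerate", "openblas"):
            if backend in compiled_backends:
                return backend
        # Otherwise tinyBLAS handles the matmul; it also covers quantized
        # formats (e.g. Q4_0) that the external BLAS libraries don't.
        return "tinyblas"

    print(pick_matmul_backend({"openblas"}))  # -> openblas
    print(pick_matmul_backend(set()))         # -> tinyblas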


+1 on llamafile. You can point it to a custom model.


The easiest is to use vLLM (https://github.com/vllm-project/vllm) to run it on a couple of A100s, and you can benchmark it with EleutherAI's lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness).
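
Roughly like this; the repo name, GPU count, and lm-eval arguments below are assumptions, so adjust for your hardware and library versions:

    # Inference with vLLM (tensor-parallel across 4 GPUs as an example).
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mixtral-8x22B-v0.1", tensor_parallel_size=4)
    out = llm.generate(
        ["Explain mixture-of-experts in one paragraph."],
        SamplingParams(temperature=0.7, max_tokens=256),
    )
    print(out[0].outputs[0].text)

    # Benchmarking with lm-evaluation-harness; normally run as a separate job,
    # since it loads the model itself.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args="pretrained=mistralai/Mixtral-8x22B-v0.1,tensor_parallel_size=4",
        tasks=["hellaswag", "arc_challenge"],
    )
    print(results["results"])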


In that regard, it’s even easier to use a Mac Studio with sufficient RAM and llama.cpp, or even PyTorch, for inference.
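
If you go that route, the llama.cpp Python bindings make it a few lines. A sketch, where the GGUF path is a placeholder and n_gpu_layers=-1 assumes a Metal build:

    # Sketch using llama-cpp-python (pip install llama-cpp-python, built with
    # Metal support on Apple Silicon). The model path is a placeholder for
    # whichever GGUF quantization you downloaded.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mixtral-8x22b.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=-1,  # offload all layers to Metal
        n_ctx=4096,
    )
    out = llm("Q: What is a mixture-of-experts model?\nA:", max_tokens=200)
    print(out["choices"][0]["text"])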


LM Studio is a great way to test out LLMs on my MacBook: https://lmstudio.ai/

Really easy to search huggingface for new models to test directly in the app.
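
It also has a local server mode that speaks the OpenAI API, so you can script against whatever model you've loaded in the app. A sketch, assuming the default address (http://localhost:1234/v1):

    # Sketch: talking to LM Studio's local server with the OpenAI client.
    # Assumes the default address; check the Server tab in the app if it differs.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally
    resp = client.chat.completions.create(
        model="local-model",  # routed to whatever model LM Studio has loaded
        messages=[{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
    )
    print(resp.choices[0].message.content)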


Make sure you get the prompt template set correctly; the defaults are wrong for a lot of models.


Could you explain how to do this properly? I've been having problems with the app and am wondering if this is why.


Look at the HuggingFace page for the model you are using (the original model page, not the page for the GGUF conversion, if you're using one). It will explain the chat format you need to use.
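
If the model ships a chat template on the Hub, you can also let the tokenizer format the prompt for you instead of hand-writing it. A sketch, using Mixtral 8x7B Instruct as the example model (a base model with no instruct tune won't have a template at all):

    # Sketch: apply the model's own chat template via transformers.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
    messages = [{"role": "user", "content": "What is the capital of France?"}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)  # shows the exact [INST] ... [/INST] wrapping this model expects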


There is a user called TheBloke on Hugging Face; they release pre-quantized models pretty soon after the full-size weights drop. Just watch their page and pray you can fit the 4-bit quant in your GPU.

I’m sure they are already working on it.
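
Once a quantized GGUF is up, pulling down a single file is one call. The repo and filename below are the existing 8x7B ones as an example; swap in whichever repo picks up 8x22b:

    # Sketch: download one pre-quantized GGUF file from the Hub.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",   # example repo
        filename="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",    # example 4-bit quant
    )
    print(path)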


TheBloke stopped uploading in January. There are others that have stepped up though.


Oh really? Who else should I be looking at?

That person is a hero, super bummed!


TheBloke's grant ran out.


I think the 4-bit quant for this is supposed to be over 70GB, so it definitely still needs heavy hardware.


Fucking hell, my A6000 is shy of that and I can’t reasonably justify picking up a second.





