
What's the easiest way to run this assuming that you have the weights and the hardware? Even if it's offloading half of the model to RAM, what tool do you use to load this? Ollama? Llama.cpp? Or just import it with some Python library?

Also, what's the best way to benchmark a model to compare it with others? Are there any tools to use off-the-shelf to do that?




I think the llamafile[0] system works the best. The binary works on the command line or launches a mini webserver. Llamafile offers builds of Mixtral-8x7B-Instruct, so presumably they'll package this one up as well (probably in a quantized format).

You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as-is with llamafile?

[0] https://github.com/Mozilla-Ocho/llamafile
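
For what it's worth, a running llamafile serves a local llama.cpp webserver with an OpenAI-compatible endpoint (port 8080 by default, unless that's been changed), so you can script against it. A rough Python sketch, assuming those defaults:

    # Rough sketch: query a running llamafile server from Python.
    # Assumes the default port (8080) and the OpenAI-compatible route;
    # check the llamafile --help output if your setup differs.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",  # llamafile serves whatever weights it was started with
            "messages": [{"role": "user", "content": "Explain mixture-of-experts briefly."}],
            "temperature": 0.7,
        },
        timeout=300,
    )
    print(resp.json()["choices"][0]["message"]["content"])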


llamafile author here. I'm downloading Mixtral 8x22b right now. I can't say for certain it'll work until I try it, but let's keep our fingers crossed! If not, we'll be shipping a release as soon as possible that gets it working.

My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.
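
To put the memory math in perspective, here's a back-of-envelope sketch (the ~141B total parameter count for 8x22b and the effective bits-per-weight figures are rough assumptions, not measurements):

    # Back-of-envelope weight-memory estimate; ignores KV cache and runtime overhead.
    # Parameter counts and effective bits/weight are rough assumptions.
    def weight_gb(params_billion, bits_per_weight):
        """Approximate size of the quantized weights in GB."""
        return params_billion * bits_per_weight / 8

    for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
        print(f"8x22b at {name}: ~{weight_gb(141, bpw):.0f} GB")
    # For comparison, 8x7b (~47B params) at Q5_K_M comes out around 33 GB,
    # in the same ballpark as the 31GB figure above.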


Correct me if I'm wrong, but in the tests I've run, the matmul optimizations only have an effect if there's no other BLAS acceleration. If one can at least offload the KV cache to cuBLAS or run with OpenBLAS, it's not really used, right? At least I didn't see any speedup with that config when comparing that PR to the main llama.cpp branch.


The code that launches my code (see ggml_compute_forward_mul_mat) comes after CLBLAST, Accelerate, and OpenBLAS; those take precedence. So if you're not seeing any speedup from enabling them, it's probably because tinyBLAS has reached parity with those BLAS libraries. It's obviously nowhere near as fast as cuBLAS, but maybe PCIe memory transfer overhead explains it. It also depends on various other factors, like quantization type: for example, the BLAS libraries don't support formats like Q4_0, and tinyBLAS does.
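
If it helps, the precedence boils down to something like this (a Python sketch for illustration only, not the actual ggml C code in ggml_compute_forward_mul_mat):

    # Illustration of the dispatch order described above; not real ggml code.
    def pick_matmul_backend(compiled_backends):
        # External BLAS backends take precedence when they were compiled in.
        for backend in ("clblast", "accelerate", "openblas"):
            if backend in compiled_backends:
                return backend
        # Otherwise tinyBLAS handles the matmul; it also covers quantized
        # formats (e.g. Q4_0) that the external BLAS libraries don't.
        return "tinyblas"

    print(pick_matmul_backend({"openblas"}))  # -> openblas
    print(pick_matmul_backend(set()))         # -> tinyblas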


+1 on llamafile. You can point it to a custom model.


The easiest is to use vLLM (https://github.com/vllm-project/vllm) to run it on a couple of A100s, and you can benchmark it with EleutherAI's lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness).
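
Roughly like this; the repo name, GPU count, and lm-eval arguments below are assumptions, so adjust for your hardware and library versions:

    # Inference with vLLM (tensor-parallel across 4 GPUs as an example).
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mixtral-8x22B-v0.1", tensor_parallel_size=4)
    out = llm.generate(
        ["Explain mixture-of-experts in one paragraph."],
        SamplingParams(temperature=0.7, max_tokens=256),
    )
    print(out[0].outputs[0].text)

    # Benchmarking with lm-evaluation-harness; normally run as a separate job,
    # since it loads the model itself.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args="pretrained=mistralai/Mixtral-8x22B-v0.1,tensor_parallel_size=4",
        tasks=["hellaswag", "arc_challenge"],
    )
    print(results["results"])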


In that regard, it’s even easier to use a Mac Studio with sufficient RAM and llama.cpp, or even PyTorch, for inference.
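
If you go that route, the llama.cpp Python bindings make it a few lines. A sketch, where the GGUF path is a placeholder and n_gpu_layers=-1 assumes a Metal build:

    # Sketch using llama-cpp-python (pip install llama-cpp-python, built with
    # Metal support on Apple Silicon). The model path is a placeholder for
    # whichever GGUF quantization you downloaded.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mixtral-8x22b.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=-1,  # offload all layers to Metal
        n_ctx=4096,
    )
    out = llm("Q: What is a mixture-of-experts model?\nA:", max_tokens=200)
    print(out["choices"][0]["text"])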


LM Studio is a great way to test out LLMs on my MacBook: https://lmstudio.ai/

Really easy to search huggingface for new models to test directly in the app.
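
It also has a local server mode that speaks the OpenAI API, so you can script against whatever model you've loaded in the app. A sketch, assuming the default address (http://localhost:1234/v1):

    # Sketch: talking to LM Studio's local server with the OpenAI client.
    # Assumes the default address; check the Server tab in the app if it differs.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally
    resp = client.chat.completions.create(
        model="local-model",  # routed to whatever model LM Studio has loaded
        messages=[{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
    )
    print(resp.choices[0].message.content)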


Make sure you get the prompt template set correctly; the defaults are wrong for a lot of models.


Could you explain how to do this properly? I've been having problems with the app and am wondering if this is why.


Look at the HuggingFace page for the model you are using (the original model page, not the page for the GGUF conversion, if you're using one). It will explain the chat format you need to use.
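
If the model ships a chat template on the Hub, you can also let the tokenizer format the prompt for you instead of hand-writing it. A sketch, using Mixtral 8x7B Instruct as the example model (a base model with no instruct tune won't have a template at all):

    # Sketch: apply the model's own chat template via transformers.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
    messages = [{"role": "user", "content": "What is the capital of France?"}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)  # shows the exact [INST] ... [/INST] wrapping this model expects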


There is a user called TheBloke on Hugging Face; they release pre-quantized models pretty soon after the full-size weights drop. Just watch their page and pray you can fit the 4-bit quant in your GPU.

I’m sure they are already working on it.
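
Once a quantized GGUF is up, pulling down a single file is one call. The repo and filename below are the existing 8x7B ones as an example; swap in whichever repo picks up 8x22b:

    # Sketch: download one pre-quantized GGUF file from the Hub.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",   # example repo
        filename="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",    # example 4-bit quant
    )
    print(path)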


TheBloke stopped uploading in January. There are others that have stepped up though.


Oh really? Who else should I be looking at?

That person is a hero, super bummed!


TheBloke's grant ran out.


I think the 4-bit quant for this is supposed to be over 70GB, so it definitely still needs heavy hardware.


Fucking hell, my A6000 is shy of that and I can’t reasonably justify picking up a second.





