
I’ve been testing a lot of LLMs on my MacBook, and I would say that none of them come close to GPT-4. Many are as good as GPT-3, though. There are also a lot of models that are fine-tuned for specific tasks.

Language support is one big thing that is missing from open models. I’ve only found one model that can do anything useful with Norwegian, which has never been an issue with GPT-4.




Which ones have you tested? There were some huge ones released recently.


Samantha, llama 2 pubmed, marcoroni, openchat, fashiongpt, falcon 180B, deepseek llm chat, orca 2, orca 2 alpaca uncensored, meditron, tigerbot, mixtral instruct, wizardcoder, gemma, nous hermes 2 solar, yarn solar 64k, nous hermes 2 yi, nous hermes 2 mixtral, nous hermes llama 2, starcoder2, hermes 2 pro mistral, norskgpt mistral and norskgpt llama.

Nous Hermes 2 Solar is the best model for Norwegian that I've tried so far. It's much better than NorskGPT Mistral/Llama. I actually got it to make fairly decent summaries of news articles, though it wouldn't follow stricter instructions like producing 5 keywords in a JSON list. It kept producing more than 5 keywords, and if I doubled down on the restriction on the number of keywords, it would start messing up the JSON.
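
For context, this is roughly the kind of request I mean. A minimal sketch, assuming llama-cpp-python and the ChatML prompt template the Nous Hermes 2 models use; the GGUF filename and the exact prompt wording are placeholders, not the ones I actually used:

    # Sketch only: llama-cpp-python, ChatML template (Nous Hermes 2), hypothetical filename.
    from llama_cpp import Llama

    llm = Llama(model_path="nous-hermes-2-solar-10.7b.Q4_K_M.gguf", n_ctx=4096)

    article = open("article.txt", encoding="utf-8").read()
    prompt = (
        "<|im_start|>user\n"
        "Summarize the article below in Norwegian, then give exactly 5 keywords "
        "as a JSON list of strings.\n\n"
        f"{article}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

    out = llm(prompt, max_tokens=512, temperature=0.2)
    # This is where it falls over: the list often has more than 5 items,
    # and pushing back on the count tends to break the JSON entirely.
    print(out["choices"][0]["text"])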

The best competitor to GPT-4 was Falcon 180B, but it's still terrible compared to GPT-4. Mixtral is my new favourite though; it's faster than Falcon and generally as good or better. Still, I would pick GPT-4 over Mixtral any day of the week, since it's leagues ahead.

Tigerbot has a very interesting trait. It tends to disagree when you try to convince it that it's wrong.

I haven't been able to test out the new Mixtral 8x22B or Command R+. These are the next ones on my list!


Just tested out Command R+ with some niche SHACL constraint questions and it performs considerably worse than GPT-4. Might be a bit better than GPT-3.5 though, which is actually pretty amazing.


You need to use their beginning- and end-of-turn token scheme and set the repetition penalty to 1 to get good quality out of Command R+.
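
A minimal sketch of what that looks like, assuming llama-cpp-python and Cohere's turn tokens for Command R+; the GGUF filename is a placeholder, and special-token handling can differ between backends:

    # Sketch only: llama-cpp-python, Cohere turn-token template, hypothetical filename.
    from llama_cpp import Llama

    llm = Llama(model_path="command-r-plus.Q4_K_M.gguf", n_ctx=8192)

    # Command R+ expects its turn tokens around the user message
    # (the BOS token is normally added by the backend itself).
    prompt = (
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
        "Write a SHACL NodeShape that requires ex:Person to have exactly one ex:name."
        "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

    # A repetition penalty of 1.0 disables it, which is what seems to help here.
    out = llm(prompt, max_tokens=300, repeat_penalty=1.0)
    print(out["choices"][0]["text"])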





