The truth is, scaling laws are very real from, say, 7B to 70B parameters.
Yes, Apple has made great efforts recently toward locally run AI agents and generative features, whether it's hardware like the Neural Engine on every Apple chip, RAM management https://lnkd.in/esqb_WTn, the mlx framework https://lnkd.in/e2V9FC4y, or really impressive open LLM work like MM1 https://lnkd.in/eSg6Bfzq
But they probably still can't get a model to run locally on Apple devices while delivering high TPS (tokens per second), low latency, and reliable output quality. And since so many people have gotten used to a GPT-4-class model experience, the bar is really high. Rushing into it would be a risky move on Apple's part.
Using mlx, I can run a 4-bit quantized Mistral 7B model locally on my MBA, and it's really fast. But I observed factual hallucinations in basically every inference. Is that really useful in any real use case? Probably not.
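For reference, this is roughly what that local setup looks like: a minimal sketch assuming the mlx-lm package is installed and using a community 4-bit quantization of Mistral 7B (the exact repo name is illustrative, not a recommendation).

```python
# Minimal sketch: running a 4-bit quantized Mistral 7B locally with mlx-lm.
# Assumes `pip install mlx-lm`; the model repo name below is illustrative,
# any 4-bit mlx-community quantization should work the same way.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Who wrote the paper that introduced the Transformer architecture?",
    max_tokens=128,
    verbose=True,  # prints generation stats, handy for eyeballing local TPS
)
print(response)
```

It really does fly on Apple silicon; the problem is what comes out, not how fast it comes out.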
I think this explains the reported collaboration between Apple and Google. They already have a strong relationship, since Google pays Apple $18B annually to be the default search engine on Apple devices.
On a related side note, people working with open-source models suddenly find their hands tied when it comes to Grok-1. How do you finetune a 314B-parameter MoE model? What are the real use cases for an open model this big, even after finetuning? These are very real questions now.
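To make that concrete, here is a rough back-of-envelope under my own assumptions (314B total parameters; roughly 16 bytes/param for a full finetune with Adam in mixed precision; roughly 0.5 bytes/param to hold 4-bit base weights in a QLoRA-style setup, ignoring adapters and activations):

```python
# Rough back-of-envelope for Grok-1 finetuning memory, under stated assumptions.
PARAMS = 314e9  # Grok-1 total parameter count

full_ft_bytes = PARAMS * 16      # weights + grads + Adam states, mixed precision
qlora_base_bytes = PARAMS * 0.5  # frozen 4-bit base weights only

print(f"Full finetune: ~{full_ft_bytes / 1e12:.1f} TB")         # ~5.0 TB
print(f"4-bit base weights alone: ~{qlora_base_bytes / 1e9:.0f} GB")  # ~157 GB
```

Even the cheapest option is beyond any single consumer machine, and a full finetune means multiple multi-GPU nodes before you even count activations.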
Being GPU-poor is clamping down on the reality of open-model development and local model deployment.