Solid assessment, Lars. I'd add a few more LLM "providers" (or I'd call them "layers").

Regarding the (B) layer, we're also seeing inference run at the edge in addition to the more centralized cloud providers. By this I mean CDN networks (Fastly/Cloudflare) that can run inference at an edge node to lower latency to the end client. That said, shaving off a few milliseconds of network latency is pretty marginal given that compute time is the largest bottleneck on response latency. There are other advantages inference at the edge could provide, though, like caching responses that might be similar, etc.

Regarding the (C) layer, I think that's gonna expand a fair amount into basically an "embedded LLM" layer. Llama etc. needs a pretty beefy machine to perform well. Seems like there will be a future where IoT devices have smaller specialized models embedded for certain niche tasks, and for more compute-heavy tasks they cascade up a chain: first to the (B) layer, and failing that to a SOTA (A) layer (rough sketch of that cascade below).

Lastly, there's also the possibility of an "on-premise" layer to get inference closer to the end client/IoT device while still having beefier compute. But that only makes sense if bandwidth is the bottleneck (i.e. video, not text).
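Here's a rough TypeScript sketch of what that cascade could look like, assuming each tier exposes an OpenAI-compatible chat endpoint. The tier names, URLs, and model names are all made up for illustration, not real defaults:

```ts
// Hypothetical cascade: try the embedded/local model first, then an edge or
// open-source cloud provider (B), then a SOTA integrated provider (A).
type Tier = {
  name: string;
  baseURL: string; // assumes an OpenAI-compatible endpoint for simplicity
  model: string;
  apiKey?: string;
};

const tiers: Tier[] = [
  { name: "embedded", baseURL: "http://localhost:11434/v1", model: "small-niche-model" },
  { name: "edge/oss-cloud", baseURL: "https://example-edge-provider/v1", model: "llama-class-model", apiKey: process.env.EDGE_KEY },
  { name: "sota", baseURL: "https://api.openai.com/v1", model: "gpt-4", apiKey: process.env.OPENAI_API_KEY },
];

async function cascade(prompt: string): Promise<string> {
  for (const tier of tiers) {
    try {
      const res = await fetch(`${tier.baseURL}/chat/completions`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          ...(tier.apiKey ? { Authorization: `Bearer ${tier.apiKey}` } : {}),
        },
        body: JSON.stringify({
          model: tier.model,
          messages: [{ role: "user", content: prompt }],
        }),
      });
      if (!res.ok) throw new Error(`${tier.name} returned ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    } catch (err) {
      // Escalate to the next (beefier) tier on failure.
      console.warn(`Tier "${tier.name}" failed, escalating:`, err);
    }
  }
  throw new Error("All tiers failed");
}
```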
Here's how I think about the software stack for LLM inference, from a JS/TS dev point of view. There are 6 levels that build on one another:

1) The model: the actual model that is executed at inference time. Sometimes it's the provider's own models (e.g. GPT-4 et al. for OpenAI), sometimes you can choose yourself (download different GGUF files and run them with llama.cpp). When I say model, I put fine-tunes, base models, and LoRAs all in the same bucket for this post - it's the weights that are used to infer the next token.

2) The model execution engine (model backend): the model needs to run in some runtime environment to process inputs and produce tokens. Some providers have their own engines for their own models (OpenAI, AnthropicAI), others let you run open-source models in the cloud (e.g. FireworksAI), and then there are engines you can use locally (llama.cpp). The engine needs to support the architecture of the model. Some providers wrap existing open-source engines, e.g. ollama uses llama.cpp.

3) The API: the models are mostly exposed through REST APIs. With llama.cpp, you can use bindings. With WebLLM, you can run in the browser.

4) The client library: various options here. Many providers standardize on the OpenAI client library these days, but others have their own libs (e.g. mistral, google, anthropic, ollama). With llama.cpp you can use bindings in various languages, including JS (node bindings), or clients for the llama.cpp server. (The first sketch below shows the same client library pointed at different providers.)

5) The orchestration framework: handles how you integrate LLMs into apps, e.g. for chat, retrieval-augmented generation (in combination with vector stores and embeddings), agents, etc. llama_index and LangChainAI are examples of orchestration frameworks (see the orchestration sketch below).

6) UI integration: most JavaScript apps are client/server apps with a web frontend. It's important to move information from the server (where the API keys are) to the client, ideally with streaming. The Vercel AI SDK is an example of a UI integration library for AI (see the streaming sketch below).

This means there are 3 types of LLM providers:

A) Integrated providers (such as OpenAI, GoogleAI, Anthropic): they train and host their own proprietary models, have their own execution engines and their own API, and provide client libraries to work with their models.

B) Open-source cloud providers (such as Fireworks, Anyscale, TogetherAI): they host open-source models (and often your own models) and provide a standardized API (often OpenAI-compatible).

C) Local model providers (such as llama.cpp, Ollama, WebLLM): you download and run the model on your machine. Some have their own client (e.g. Ollama).

Right now the orchestration frameworks and the UI integration are separate from the backend LLM provider stack.

Do you agree? How do you see these components evolve?
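To make the A/B/C split concrete, here's a minimal sketch using the openai npm client, which accepts a baseURL so it can point at any OpenAI-compatible API. The Fireworks base URL and the model names are from memory and should be treated as placeholders:

```ts
import OpenAI from "openai";

// (A) Integrated provider: proprietary model, engine, API, and client in one stack.
const openaiClient = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// (B) Open-source cloud provider exposing an OpenAI-compatible API.
const fireworks = new OpenAI({
  apiKey: process.env.FIREWORKS_API_KEY,
  baseURL: "https://api.fireworks.ai/inference/v1",
});

// (C) Local provider: Ollama serves an OpenAI-compatible endpoint on localhost.
const ollama = new OpenAI({
  apiKey: "not-needed-locally",
  baseURL: "http://localhost:11434/v1",
});

async function ask(client: OpenAI, model: string, prompt: string) {
  const completion = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return completion.choices[0].message.content;
}

// Same client library (level 4), three provider types:
await ask(openaiClient, "gpt-4", "Hello from layer A");
await ask(fireworks, "accounts/fireworks/models/llama-v2-7b-chat", "Hello from layer B");
await ask(ollama, "llama2", "Hello from layer C");
```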
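For level 5, a minimal LangChainJS sketch in the chained prompt → model → parser style. The exact package layout and option names differ across LangChain versions, so take this as an approximation rather than the canonical API:

```ts
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

// Orchestration: compose a prompt, a model, and an output parser
// instead of calling the provider API directly.
const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You answer questions about the JS/TS LLM stack."],
  ["human", "{question}"],
]);

const model = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 });
const chain = prompt.pipe(model).pipe(new StringOutputParser());

const answer = await chain.invoke({
  question: "Where does the orchestration layer sit relative to the client library?",
});
console.log(answer);
```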
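And for level 6, a sketch of the server → client streaming hand-off with the Vercel AI SDK in a Next.js route handler. The SDK's API has changed across versions (OpenAIStream/StreamingTextResponse is the older style), so check the current docs before copying this:

```ts
// app/api/chat/route.ts (Next.js App Router) -- the API key stays on the server.
import OpenAI from "openai";
import { OpenAIStream, StreamingTextResponse } from "ai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(req: Request) {
  const { messages } = await req.json();

  const response = await openai.chat.completions.create({
    model: "gpt-4",
    stream: true,
    messages,
  });

  // Convert the token stream into a streaming HTTP response for the browser.
  return new StreamingTextResponse(OpenAIStream(response));
}
```

On the client, the SDK's useChat hook (from 'ai/react') consumes that stream and manages the message state, which is the UI-integration piece of the stack.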