We've spoken to >100 teams building with language models. These are our key insights!

Issue 1: Getting “high quality” and “consistent” outputs is hard
- a small error rate at each step of an LLM chain compounds into a large error rate for the pipeline as a whole (the LLM butterfly effect) - see the sketch after the post
- prompt engineering is a continuous task to improve the quality and consistency of outputs to fit the use case

Issue 2: Managing prompts in Git, Notion, Google Sheets, or Markdown sucks
- prompt management slows iteration cycles, since updating a prompt version means a full CI/CD rebuild
- non-technical users struggle to use Git to update prompts

Issue 3: It’s hard to detect where LLM pipelines or agent workflows break
- root causes can be diverse
- few solutions can flex between a "micro" view of an individual log and a "macro" view of how the system performs in aggregate

Issue 4: Evaluations are underused
- everyone is "vibes testing" and most teams have no defined evals
- LLM evaluations on every query or response are being used in regulated markets like EdTech and FinTech

Issue 5: Data and fine-tuning
- a lot of teams are tracking logs, but almost no one is fine-tuning
- people are not algorithmically optimising prompts for their use case

The best prompters of language models are language models themselves. But no one we talked to is actively using these techniques (a rough sketch follows below).

DM me to find out how we can help.

#llm #languagemodel #openai #ai
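A back-of-the-envelope illustration of the compounding-error point from Issue 1 (the 2% per-step error rate and the chain lengths are assumed for illustration, not numbers from the teams we spoke to): if each step in a chain fails independently at rate e, a pipeline of n steps fails at rate 1 - (1 - e)^n.

```python
# Illustrative only: how a small per-step error rate compounds across an LLM chain.
# The 2% error rate and the chain lengths are assumed numbers, not survey data.

def pipeline_error_rate(per_step_error: float, num_steps: int) -> float:
    """Probability that at least one step in the chain fails,
    assuming each step fails independently."""
    per_step_success = 1.0 - per_step_error
    return 1.0 - per_step_success ** num_steps

if __name__ == "__main__":
    for steps in (1, 3, 5, 10):
        rate = pipeline_error_rate(per_step_error=0.02, num_steps=steps)
        print(f"{steps:>2} steps at 2% per-step error -> {rate:.1%} pipeline error rate")
    # Output: 2.0%, 5.9%, 9.6%, 18.3% - a "small" step error quickly becomes
    # a pipeline error rate users will notice.
```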
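And a minimal sketch of the closing idea, "the best prompters of language models are language models themselves": feed the current prompt and a handful of failing cases back to a model and ask it for a revised prompt. The `call_llm` helper and the meta-prompt wording are placeholders, not Outerop's implementation or any specific provider's API.

```python
# Sketch of LLM-driven prompt optimisation: ask a model to rewrite a prompt based on
# cases where the current prompt produced bad outputs. `call_llm` is a placeholder
# for whatever completion API you use; the meta-prompt wording is illustrative.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your completion API here")

def optimise_prompt(current_prompt: str, failures: list[dict]) -> str:
    """Return a candidate revision of `current_prompt`, informed by failing cases.
    Each failure dict holds the input, the bad output, and the expected output."""
    failure_report = "\n".join(
        f"- input: {f['input']}\n  got: {f['output']}\n  expected: {f['expected']}"
        for f in failures
    )
    meta_prompt = (
        "You are improving a prompt used in a production LLM pipeline.\n\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"Cases where it failed:\n{failure_report}\n\n"
        "Rewrite the prompt so these cases succeed without breaking its original intent. "
        "Return only the revised prompt."
    )
    return call_llm(meta_prompt)
```

In practice each candidate prompt would be scored against a defined eval set (the evals from Issue 4) and only kept if it improves the score.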
Outerop
Technology, Information and Internet
Build reliable, high-quality LLM products. Stop "prompt engineering" and start creating self-optimising LLM pipelines
About us
- Website: www.outerop.com
- Industry: Technology, Information and Internet
- Company size: 2-10 employees
- Type: Privately Held