Datasets and evals are the veggies you need to eat 🥦
Many developers building RAG applications focus on optimizing prompts and models instead of retrieval quality. But in practice, most issues like hallucinations and refusals stem from poor context quality, not a bad system prompt or the model's inherent reasoning abilities.
Here's why you should start by creating good evals and datasets for testing retrieval quality:
✅ Solid datasets reveal where your system fails to fetch relevant context so you can drive targeted improvements.
✅ You can't improve what you don't measure. Good metrics track progress over time and help guide improvements.
✅ Improving retrieval is usually faster than fine-tuning models and often leads to bigger gains (at least initially).
Here's how you can do this:
1️⃣ Create question/context pairs that mimic real queries: Collect real-world queries from users and use LLMs to generate synthetic questions based on your context chunks. Even 20-40 questions can be a good starting point (a minimal sketch follows this list).
2️⃣ Use deterministic metrics: Implement deterministic evaluators (cosine similarity, context precision/recall) before moving on to LLM-as-a-judge metrics like Context Relevance (see the metrics sketch below).
3️⃣ Focus on getting relevant contexts into your top-k retrievals: Experiment with embedding models, chunk sizes, and indexing methods. Implement hybrid retrieval combining dense and sparse methods, use re-ranking models, and explore query augmentation techniques (see the hybrid-retrieval sketch below).
4️⃣ Implement systematic testing and continuous monitoring: Use tools like HoneyHive to automate evaluations, run A/B tests, monitor queries in production, and measure progress. This creates a virtuous cycle of testing, monitoring, and targeted improvement (see the regression-gate sketch below).
5️⃣ Iterate and refine: Regularly update your dataset with new queries, retrain embedding models on your evolving corpus, and tune your evaluator prompt if you're using LLM-as-a-judge (see the judge-prompt sketch below).
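Here's a minimal sketch of step 1️⃣ — generating synthetic questions from your existing chunks. It assumes the OpenAI Python SDK and a `chunks` list you already have; the model name and prompt are placeholders you'd adapt to your own stack:

```python
# Sketch: generate synthetic eval questions from context chunks.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set;
# swap in whatever LLM client you actually use.
from openai import OpenAI

client = OpenAI()

def generate_question(chunk: str) -> str:
    """Ask the LLM for one realistic user question answerable from this chunk."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Write one realistic user question that can be answered "
                        "using ONLY the provided passage. Return just the question."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content.strip()

# chunks = [...]  # your pre-chunked corpus
# dataset = [{"question": generate_question(c), "context": c} for c in chunks[:40]]
```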
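For step 2️⃣, the deterministic metrics need nothing more than numpy. This sketch assumes you already have an embedding function and, for each eval question, the IDs of the chunk(s) that should have been retrieved:

```python
# Sketch: deterministic retrieval metrics with plain numpy.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between a query embedding and a chunk embedding."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def context_precision_recall(retrieved_ids: list[str],
                             relevant_ids: set[str]) -> tuple[float, float]:
    """Precision = share of retrieved chunks that are relevant;
    recall = share of relevant chunks that were retrieved."""
    hits = len([i for i in retrieved_ids if i in relevant_ids])
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example for one eval item: 3 chunks retrieved, 1 of them is the gold chunk.
# p, r = context_precision_recall(["c12", "c7", "c3"], {"c7"})  # -> (0.33, 1.0)
```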
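And here's one way to sketch the hybrid retrieval idea from step 3️⃣: fuse a dense (embedding) ranking with a sparse (BM25/keyword) ranking using reciprocal rank fusion. The two input rankings are placeholders for whatever your vector index and keyword search return:

```python
# Sketch: hybrid retrieval via reciprocal rank fusion (RRF).
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each chunk ID by the sum of 1 / (k + rank) across the input rankings."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# dense_ranking  = ["c7", "c3", "c12"]   # best-first, from your vector index
# sparse_ranking = ["c3", "c9", "c7"]    # best-first, from BM25 / keyword search
# fused_top_k = reciprocal_rank_fusion([dense_ranking, sparse_ranking])[:5]
```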
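The regression-gate sketch for step 4️⃣ below is tool-agnostic — it is not the HoneyHive API, just an illustration of the "run the eval set, block regressions" loop that platforms like HoneyHive automate (along with A/B tests and production monitoring). It reuses `context_precision_recall` from the metrics sketch above, and the dataset item shape is assumed:

```python
# Sketch: a simple retrieval regression gate (not tied to any specific tool).
BASELINE_RECALL = 0.80  # placeholder: recall from your last accepted run

def run_eval(dataset, retrieve) -> float:
    """Average context recall over the eval set; `retrieve(q)` returns chunk IDs."""
    recalls = []
    for item in dataset:  # each item: {"question": str, "relevant_ids": set[str]}
        retrieved = retrieve(item["question"])
        _, recall = context_precision_recall(retrieved, item["relevant_ids"])
        recalls.append(recall)
    return sum(recalls) / len(recalls)

# current = run_eval(dataset, retrieve=my_retriever)
# assert current >= BASELINE_RECALL, f"Retrieval regressed: {current:.2f}"
```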
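Finally, the judge-prompt sketch for the LLM-as-a-judge metric mentioned in steps 2️⃣ and 5️⃣. It reuses the OpenAI client from the first sketch; the prompt text is the part you'd version and keep tuning:

```python
# Sketch: an LLM-as-a-judge Context Relevance evaluator.
JUDGE_PROMPT = """You are grading a retrieval system.
Question: {question}
Retrieved context: {context}
On a scale of 1-5, how relevant is the retrieved context to answering the question?
Reply with only the number."""

def judge_context_relevance(question: str, context: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, context=context)}],
    )
    # Assumes the model follows the instruction and returns a bare 1-5 score.
    return int(response.choices[0].message.content.strip())
```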
Remember: quality context is everything. Evals help you systematically improve context quality, which is the key to solving downstream issues like hallucinations. By getting the right context, you're setting up your entire application for success – from retrieval all the way to the final output.