Every successful AI engineering / product development team we work with reaches a similar conclusion: One of the best ways to make generative AI products work well in production is simply to look at lots of row-level data, then iterate.
But doing that work can feel daunting and expensive. It doesn't have to be!
We wrote up some thoughts on best practices for data labeling & review for teams running AI products in production -- plus some highlights that show how Freeplay helps those teams work faster together.
tl;dr on what we're talking about as a basic workflow (more in the blog post, link in the comments):
1. Look at lots of data.
2. Label it.
3. Create or tune evals (to catch issues automatically the next time).
4. Curate interesting data into datasets (for testing & fine-tuning).
5. Iterate based on what you learn.
If you're already down this path and looking for better ways to do the work, some of the key ways Freeplay helps:
👥 Build multi-player data labeling workflows: Teams of analysts can build custom queues using our Live Filters feature, and define their own labeling criteria -- all in the same observability tool used by engineers and PMs.
🏷️ Label data and improve eval quality at the same time: As part of normal reviews, people can correct or confirm auto-eval values — which then turn into benchmark datasets for improving eval quality.
💽 Build datasets in the same place that you use them: By having everything in one place, it’s easy to launch new tests or experiments and trust you’ve got the most up-to-date data.
🧪 Fully integrated experimentation: See something out of place and think you know a fix? Open up your datasets in our playground, make an update, run a test with your full eval suite, and drop a link to PMs or engineers to check out the results -- all without having to touch or deploy code.
Blog post + demo video in the comments! Let us know what you think. 👇