HoneyHive is the leading observability and evaluation platform for AI applications. From development to deployment, we make it easy for teams to debug, evaluate, and monitor AI applications and ship Generative AI products with confidence.
HoneyHive’s founding team brings AI expertise from Microsoft, Amazon, and JP Morgan, where they were involved with some of the earliest Generative AI projects. The company is based in New York and San Francisco.
We just released a new cookbook at HoneyHive for evaluating text-to-SQL applications. This matters because turning natural language into database queries is something many companies I talk to need, but it's tricky to get right.
The challenge with these systems is they can fail in multiple ways: misunderstanding what the user wants, creating broken SQL code, or returning the wrong results. In this cookbook, we test how different AI models (GPT-4o, Gemini 2.0 Flash, and Claude 3.7 Sonnet) handle these challenges.
Our methodology is straightforward:
- We use real NBA data in DuckDB as our testing ground
- We give each AI model clear instructions to generate clean SQL
- We check three simple things: Does the SQL syntax work? Does it run without errors? Does it return the right results?
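To make those checks concrete, here's a rough sketch of what they can look like in Python against DuckDB. This is an illustration, not the cookbook's exact code: the `player_stats` table, its columns, and the order-insensitive result comparison are assumptions.

```python
import duckdb

def evaluate_sql(generated_sql: str, expected_rows, conn) -> dict:
    """Run the three checks: valid syntax, executes without errors,
    returns the expected results. (Sketch; the cookbook's metrics may differ.)"""
    checks = {"valid_syntax": False, "executes": False, "correct_results": False}

    # 1. Does the SQL parse (and bind)? EXPLAIN fails on syntax errors.
    try:
        conn.execute(f"EXPLAIN {generated_sql}")
        checks["valid_syntax"] = True
    except duckdb.Error:
        return checks

    # 2. Does it run without errors against the NBA tables?
    try:
        rows = conn.execute(generated_sql).fetchall()
        checks["executes"] = True
    except duckdb.Error:
        return checks

    # 3. Does it return the right results? Compare to a reference answer, ignoring row order.
    checks["correct_results"] = sorted(map(tuple, rows)) == sorted(map(tuple, expected_rows))
    return checks

# Toy usage with a hypothetical table (the cookbook uses a full NBA dataset).
conn = duckdb.connect(":memory:")
conn.execute("CREATE TABLE player_stats (player VARCHAR, career_points INTEGER)")
conn.execute("INSERT INTO player_stats VALUES ('LeBron James', 40474), ('Kevin Durant', 28924)")
print(evaluate_sql(
    "SELECT player FROM player_stats ORDER BY career_points DESC LIMIT 1",
    [("LeBron James",)],
    conn,
))
```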
What's fascinating is seeing how differently each model performs on these basic checks.
We've made the full cookbook available for teams working on text-to-SQL applications. Link in comments if you're interested!
Special thanks to DuckDB + MotherDuck for providing such an easy-to-use database for this project!
Evals are all the rage these days, but most people are doing them wrong ❌
- You're using pointless templated metrics that don't actually measure what matters to your users or business context
- Your test cases look nothing like real-world user queries
- You’re treating your evals like deterministic unit and integration tests in traditional software
I gave a talk at the AI Engineer Summit this week on what’s worked for our customers and what hasn’t. Thanks Shawn swyx W for the opportunity!
Full talk on YouTube: https://lnkd.in/eVVTEByD
Would you trust an LLM or Agent to run a nuclear power plant? Now multiply that by thousands of AI systems running critical infrastructure. Really makes you think about the type of tools we need to build to harness the full power of AI.
For AI to handle mission-critical systems, Dhruv Singh (HoneyHive) argues we need Six Sigma reliability, the standard for oil pipelines and airlines: roughly 3.4 defects per million runs, requiring validation against 800,000 examples per deployment. It's definitely doable with humans. It'll only take you $240k+ and 8k hours per deployment...
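For a rough sense of where those figures come from, here's the back-of-the-envelope math. The review throughput and hourly cost below are assumptions, picked only because they reproduce the numbers quoted above:

```python
# Back-of-the-envelope math behind the quoted figures. The review rate and
# hourly cost are assumptions chosen to be consistent with the post.
examples_per_deployment = 800_000
examples_reviewed_per_hour = 100    # assumed human review throughput
cost_per_reviewer_hour = 30         # assumed fully loaded reviewer cost, USD

hours = examples_per_deployment / examples_reviewed_per_hour    # 8,000 hours
cost = hours * cost_per_reviewer_hour                           # $240,000

print(f"{hours:,.0f} hours and ${cost:,.0f} per deployment")
```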
His argument: if you're releasing AI into production, it needs to be near-perfect. And humans can't scale to meet these validation demands. AI evaluating AI is not just necessary, it's critical.
Dhruv is one of the smartest minds working on GenAI monitoring & evaluation. Link to his talk at Data Council '24 in comments.
p.s. The brightest minds in data & AI will be attending Data Council '25 in Oakland, April 22-24. Come learn from industry experts and rub elbows with engineers and founders who speak your language.
New cookbook: Evaluating frontier LLMs on mathematical reasoning
We evaluated OpenAI’s o3-mini against leading models on the William Lowell Putnam Mathematical Competition, one of the world's toughest competitive math tests, and the results are striking.
🥇 o3-mini: 102 (8.5 avg / problem)
🥈 o1: 83 (6.92 avg / problem)
🥉 o1-mini: 58 (4.83 avg / problem)
🏅 gpt-4o: 49 (4.1 avg / problem)
(for context, a human score of 60/120 typically lands you a top 100 rank)
In just 9 months, we've seen LLMs progress from "barely handles calculus" to constructing rigorous proofs that challenge IMO gold medalists.
In this cookbook, we demonstrate how to reproduce these results and further evaluate reasoning models like DeepSeek AI’s R1 on this benchmark yourself. Link in comments 👇
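If you want a feel for the approach before opening the cookbook, here's a minimal sketch of the grading loop: an LLM grader scores each candidate solution against a reference solution on Putnam's 0-10 scale, and the per-problem scores are summed over the 12 problems. This is an illustration only; the grader model, prompt, and data format are assumptions, not the cookbook's exact setup.

```python
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = (
    "You are grading a Putnam solution on the official 0-10 scale. "
    "Compare the candidate solution to the reference solution, check the rigor "
    "of each step, and reply with a single integer from 0 to 10."
)

def grade_solution(problem: str, reference: str, candidate: str) -> int:
    """Score one candidate proof with a grader model (assumes the grader
    replies with just the integer)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # grader model choice is an assumption
        temperature=0,
        messages=[
            {"role": "system", "content": GRADER_PROMPT},
            {"role": "user", "content": (
                f"Problem:\n{problem}\n\n"
                f"Reference solution:\n{reference}\n\n"
                f"Candidate solution:\n{candidate}"
            )},
        ],
    )
    return int(response.choices[0].message.content.strip())

def putnam_total(problems: list[dict]) -> int:
    """Total over the 12 problems (10 points each, 120 max). `problems` is
    assumed to hold {"problem", "reference", "candidate"} entries."""
    return sum(grade_solution(p["problem"], p["reference"], p["candidate"]) for p in problems)
```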
How do you align your Evals Agent/Judge LLM with Human Judgement so it can steer your Product Agent towards desired human outcomes?
(aka, make your Agent do what your users actually want)
Pro-tip: It’s all about the reasoning traces
Direct the Judge LLM calls your Evals Agent makes to output their reasoning for each scoring decision, using various thought-generation techniques (the simplest being Zero-shot Chain of Thought via "let's think step by step").
>Sidebar: Want to learn how to generate advanced reasoning traces? Check the comments
Use the reasoning traces to engineer your Judge LLM prompt until it's aligned with your human reviewers' judgment.
Ok that’s a little bit… “how to draw an owl.” - https://lnkd.in/gVdFS_SR
If you want to discuss the exercise to do ^ in detail, shoot me a DM.
The basic flow: first, you need a human-reviewed golden dataset.
Here's the guide for how to build one: https://lnkd.in/gjqyfhKg
Then you iterate on the Judge LLM prompt until the Judge LLM outputs scores that align w/ the human scores on this dataset (e.g., using OpenPipe Criteria or other tools).
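To make the loop concrete, here's a minimal sketch of both pieces: a Judge LLM call that emits its reasoning trace alongside a score, and an agreement check against the human-scored golden dataset. The 1-5 rubric, the JSON format, and the dataset fields are my assumptions for illustration, not OpenPipe's or HoneyHive's code.

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are evaluating whether the agent's response resolved the user's request.\n"
    "Let's think step by step: write out your reasoning first, then give a score.\n"
    'Respond in JSON: {"reasoning": "<step-by-step reasoning>", "score": <integer 1-5>}'
)

def judge(user_request: str, agent_response: str) -> dict:
    """Judge call that returns its reasoning trace alongside the score, so you
    can read the trace and refine JUDGE_PROMPT where it disagrees with humans."""
    result = client.chat.completions.create(
        model="gpt-4o",  # judge model choice is an assumption
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"User request:\n{user_request}\n\nAgent response:\n{agent_response}"},
        ],
    )
    return json.loads(result.choices[0].message.content)

def judge_human_agreement(golden_dataset: list[dict]) -> float:
    """Fraction of golden examples where the judge's score matches the human
    score. `golden_dataset` is assumed to hold {"user_request",
    "agent_response", "human_score"} entries."""
    matches = sum(
        judge(ex["user_request"], ex["agent_response"])["score"] == ex["human_score"]
        for ex in golden_dataset
    )
    return matches / len(golden_dataset)
```

Exact-match agreement is the simplest alignment metric; correlation or Cohen's kappa are common alternatives once you move beyond a small rubric.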
Or as Dhruv Singh, Co-Founder & CTO of HoneyHive, succinctly put it: "the human and the eval system riff with each other until it's all aligned" :)
Check this clip from my convo w/ Dhruv for more insight!
Want to know how "Extremely sophisticated agent teams" build actual production-grade Agent evals?
Dhruv Singh at HoneyHive has worked with multiple AI Engineering teams building agents at the bleeding edge. The formula he spells out:
Agentic Evals == “Simple Check Evals” + “Trajectory Evals”
When evaluating each LLM step in your Agent workflow, you'll potentially have multiple evals running on each step to decompose the complexity.
For example, perhaps you want your "friendly weather bot agent" evals to check sentiment (was this response positive and pleasant?), as well as whether the output was grounded in retrieved facts (does the claim made by the LLM in the output reference a "fact" that was RAG'd into the prompt?).
When using a Judge LLM to evaluate these properties, don’t evaluate both in one large Judge LLM prompt.
Make 2 separate calls to your Judge LLM in parallel, then combine the results into your ultimate pass/fail for the evaluated LLM step/task.
Dhruv calls these “Simple Checks.” Basically you’re evaluating the outcome of a single LLM Agent step (or “turn” if you will).
These “Simple Checks” are constantly running on your individual LLM steps.
They are an important way to measure quality and diagnose quality issues at the most granular level.
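Here's a minimal sketch of that pattern for the weather-bot example: two narrowly scoped judge calls run in parallel, then combined into one pass/fail for the step. The prompts, the PASS/FAIL convention, and the judge model are my assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def ask_judge(instruction: str, output: str, context: str) -> bool:
    """One narrowly scoped 'Simple Check'. Returns True on PASS."""
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model choice is an assumption
        temperature=0,
        messages=[
            {"role": "system", "content": instruction + " Answer with exactly PASS or FAIL."},
            {"role": "user", "content": f"Retrieved context:\n{context}\n\nAgent output:\n{output}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")

SENTIMENT_CHECK = "Judge whether the weather bot's response is positive and pleasant in tone."
GROUNDING_CHECK = ("Judge whether every factual claim in the response is supported by a fact "
                   "in the retrieved context (i.e., grounded in what was RAG'd into the prompt).")

def simple_checks(output: str, context: str) -> bool:
    # Two separate judge calls in parallel, combined into one pass/fail for the step.
    with ThreadPoolExecutor() as pool:
        sentiment = pool.submit(ask_judge, SENTIMENT_CHECK, output, context)
        grounding = pool.submit(ask_judge, GROUNDING_CHECK, output, context)
        return sentiment.result() and grounding.result()
```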
> As a sidebar, DM me if you’re looking to take your Judge LLM performance to the next level, OpenPipe’s Criteria workflow makes it really easy to dramatically improve the performance of your Judge LLM prompt.
And for many one-off LLM tasks, that’s good enough.
But Agents are not a one-off LLM task. They are a branching chain of LLM calls where pass/fail isn't ultimately determined by whether each individual step passed, but by whether some higher-order intended result succeeded, such as customer satisfaction (did the end user thumbs-up or thumbs-down the ultimate Agent workflow result?).
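A trajectory eval, by contrast, looks at the whole run. Here's a rough sketch (reusing the judge client from the snippet above): the full trace is judged against the user's higher-order goal rather than any single step. The trace format and PASS/FAIL convention are, again, my assumptions.

```python
def trajectory_eval(user_goal: str, steps: list[dict]) -> bool:
    """Judge the entire agent trajectory against the higher-order goal rather
    than any single step. `steps` is assumed to be the ordered list of
    LLM/tool steps, each with a "name" and an "output" field."""
    transcript = "\n\n".join(
        f"Step {i + 1} ({step['name']}):\n{step['output']}" for i, step in enumerate(steps)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model choice is an assumption
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "You are judging an entire agent run. Decide whether the user's goal was "
                "ultimately achieved, regardless of how individual steps looked. "
                "Answer with exactly PASS or FAIL."
            )},
            {"role": "user", "content": f"User goal:\n{user_goal}\n\nTrajectory:\n{transcript}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```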
There are parallels to Total Quality Management (TQM), made famous by Toyota in the '60s and '70s, when they used the process to build higher-quality cars than US domestic manufacturers did.
TQM refers to “Local Optimization” (optimizing each step), and “Global Optimization” (optimizing the comprehensive system for the ultimate intended result).
> Seriously folks, OpenPipe Criteria is a game changer for Local Optimization
Local Optimization (“simple check evals”) can diagnose issues at a single step. But Local Optimization doesn’t matter at the end of the day if the ultimate system result fails.
OOPS! I hit the character limit, continuing in comment section for insight on Global Optimization.
HoneyHive is growing! 🐝
2024 was a monumental year for us. I've been blown away seeing how much our team has managed to ship in a single year and how our customers are using HoneyHive, and we're just getting started.
As we prepare for an even bigger 2025, we're looking for passionate engineers to help us move faster.
Current open roles include:
Software Engineer, Product (NYC/SF): https://lnkd.in/eCypiqNe
Software Engineer, Systems (NYC/SF): https://lnkd.in/ei2KF33x
Developer Relations Engineer (SF): https://lnkd.in/eJZ65PRD
If you're curious about AI engineering and want to shape how teams deploy AI in production, Dhruv and I would love to chat!
Check out our open roles here: https://lnkd.in/euYqTQZX
Our CTO Dhruv Singh recently chatted with Reid Mayo from OpenPipe about all things evals and the role of simulations in building reliable AI agents.
Check out the full podcast below 👇
Reid Mayo, Founding AI Engineer @ OpenPipe (YC23) | The End-to-End LLM Fine-tuning Platform for Developers:
Curious how bleeding-edge AI Engineering teams build sophisticated AI Agents that actually WORK? 🤔
So was I!
Which is why I asked Dhruv Singh, Co-Founder of HoneyHive, to sit down with me for an hour to discuss how he's seeing teams do it in the real-world.
Spoiler alert! It's all about the evals ⚖️
Fundamentally it begins with -- and continuously integrates -- agentic evals.
(sidebar: already have evals set up? DM me to learn how OpenPipe can leverage evals data to dramatically improve your model performance via Reinforcement Learning from AI Feedback 👨🔬 🤖 🚀 )
It's not really possible to build complex agents without a strong eval strategy. If you can't automate the evaluation of quality or success in your agents, it becomes near-impossible to keep them meaningfully on track. When multi-turn systems misstep, garbage out becomes garbage in, and the system spirals into collapse.
Join Dhruv and me as we discuss and explore topics like:
1) Eval Driven Development (and why it's mission-critical for success, not a 2nd class citizen)
2) Different types of evals (starting with the basics of LLM call/task evaluation and layering on complexity to higher-order "Trajectory Evaluations" using Simulations and "Eval Agents")
3) Aligning Trajectory Evaluation outcomes and Eval Agents with Human judgement and desires
4) Additional resources for in-depth learning on this fundamental topic
Want to go deeper? Send me a message! I'm happy to discuss in further detail one-on-one, and also happy to share how OpenPipe can piggyback off your evals harness to boost the performance of your agents w/ RLAIF!
Check out our full convo here: https://lnkd.in/grNUX8nW
Announcing our native integration with Ollama!
You can now monitor any LLMs running locally on your hardware using HoneyHive's OpenTelemetry-native SDK. This helps with:
• Debugging LLM outputs
• Evaluating local models and GPUs
• Monitoring latency, TPS, TTFT, and more
Docs: https://lnkd.in/eTknQ6QB
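Rough sketch of what the setup can look like, assuming the HoneyHiveTracer init pattern and @trace decorator from our Python SDK plus the official ollama client; the project name and model are placeholders, and the docs above have the exact integration steps.

```python
import ollama
from honeyhive import HoneyHiveTracer, trace

# Assumed init pattern; see the docs linked above for the exact setup.
HoneyHiveTracer.init(
    api_key="<HONEYHIVE_API_KEY>",
    project="local-llama-experiments",  # hypothetical project name
)

@trace()  # capture this call as a traced span in HoneyHive
def ask_local_model(question: str) -> str:
    # Runs entirely on local hardware via Ollama; pull the model first (`ollama pull llama3.2`).
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": question}],
    )
    return response["message"]["content"]

print(ask_local_model("Why is the sky blue?"))
```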
Introducing our native integration with Vercel AI SDK!
With a few lines of code, you can now automatically log all LLM executions from Vercel AI SDK using OpenTelemetry.
This helps with:
• Debugging LLM applications and prompts
• Evaluating output quality across a dataset of test cases
• Monitoring how users use your app in production
Learn more: https://lnkd.in/dz4x_p3F