HoneyHive

Software Development

New York, New York 677 followers

Modern AI Observability and Evaluation

About us

HoneyHive is the leading observability and evaluation platform for AI applications. From development to deployment, we make it easy for teams to debug, evaluate, and monitor AI applications and ship Generative AI products with confidence. HoneyHive’s founding team brings AI expertise from Microsoft and Meta, where they were involved with some of the earliest Generative AI projects. The company is based in New York and San Francisco.

Website
https://honeyhive.ai/
Industry
Software Development
Company size
2-10 employees
Headquarters
New York, New York
Type
Privately Held
Founded
2022

Updates

  • HoneyHive reposted this

    Sunny B.

    Engineering @ HoneyHive

    I recently built a Research & Report pipeline with Dhruv Singh at an SF hackathon. The goal was to constantly generate new research ideas, Elo-rate them against each other, and allocate resources to research and write reports for the best ones, while redoing or discarding the worst ones. Here's our stack: Python, Exa for online research, Aider for writing the report, HoneyHive 🐝 for observability/datasets, o1-preview for long-context evals, and a custom inference proxy that round-robins across providers to handle rate limits. At HoneyHive, I'm thinking about how search systems like this could be monitored and evaluated. This will be increasingly relevant as inference becomes cheaper, faster, and better. Stay tuned for the demo! Will also open source the code.
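    A minimal Python sketch of the Elo-rating step described above is shown below. The Idea class, K-factor, and example titles are assumptions for illustration, not the hackathon code (which will be open-sourced separately).

    ```python
    # Minimal sketch of Elo-rating research ideas: each idea carries a rating,
    # and each pairwise comparison (e.g. decided by an LLM judge) updates both.
    from dataclasses import dataclass

    K = 32  # assumed K-factor; the real pipeline may tune this

    @dataclass
    class Idea:
        title: str
        rating: float = 1500.0  # conventional Elo starting score

    def expected(a: float, b: float) -> float:
        """Expected score of a rating `a` against a rating `b`."""
        return 1.0 / (1.0 + 10 ** ((b - a) / 400.0))

    def update(winner: Idea, loser: Idea) -> None:
        """Apply one pairwise comparison result to both ratings."""
        exp_w = expected(winner.rating, loser.rating)
        winner.rating += K * (1.0 - exp_w)
        loser.rating -= K * (1.0 - exp_w)

    # After a judge picks the stronger of two ideas, update their ratings,
    # then allocate research/report effort to the top-rated ideas.
    a, b = Idea("agentic eval harnesses"), Idea("prompt-compression survey")
    update(winner=a, loser=b)
    print(sorted([a, b], key=lambda i: i.rating, reverse=True))
    ```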

  • HoneyHive

    🚨 Introducing Composite Evaluators

    As AI agents become more complex, the number of evaluators can skyrocket. We've seen systems with 100+ evals! 📊😮

    Composite Evaluators let you:
    - Combine multiple evaluators → Reduce noise in large-scale evals
    - Build multi-layer evaluation strategies → Quickly spot performance issues
    - Customize aggregations → Tailor to your specific use-case

    The best part? It works seamlessly with your existing Python, LLM, and human evaluators.

    Try today: https://www.honeyhive.ai/
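    The sketch below shows one way composing evaluators could look in plain Python: several child evaluators score the same output, and a custom aggregation collapses their scores into one metric. The evaluator functions here are hypothetical and this is not the HoneyHive SDK API.

    ```python
    # Illustrative composite evaluator: run several child evaluators and
    # aggregate their scores with a customizable aggregation function.
    from statistics import mean
    from typing import Callable

    Evaluator = Callable[[str, str], float]  # (output, reference) -> score in [0, 1]

    def exact_match(output: str, reference: str) -> float:
        return float(output.strip() == reference.strip())

    def length_penalty(output: str, reference: str) -> float:
        # Toy heuristic: penalize outputs much longer than the reference.
        return min(1.0, len(reference) / max(len(output), 1))

    def composite(evaluators: list[Evaluator],
                  aggregate: Callable[[list[float]], float] = mean) -> Evaluator:
        """Build one evaluator that runs all children and aggregates their scores."""
        def run(output: str, reference: str) -> float:
            return aggregate([e(output, reference) for e in evaluators])
        return run

    # Custom aggregation: take the worst sub-score instead of the mean.
    quality = composite([exact_match, length_penalty], aggregate=min)
    print(quality("42", "42"))  # 1.0
    ```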

  • HoneyHive

    We evaluated OpenAI's o1-preview and o1-mini on the 2023 William Lowell Putnam Mathematical Competition, yielding fascinating results:

    🥇 o1-preview: 79/120 (≈ Rank 9)
    🥈 o1-mini: 73/120 (≈ Rank 19) => best cost/performance ratio
    🥉 GPT-4o: 57/120 (≈ Rank 54)

    These scores are remarkable given the Putnam's renowned difficulty, where even partial solutions to a few problems can place a participant in the top 500.

    Our methodology involved two evaluation runs per model, using GPT-4o as an initial judge followed by human expert validation. All models showed significant improvement in their second attempts:
    ✅ o1-preview fully solved A1, A3, B4, and partially solved B2
    ✅ o1-mini closely mirrored o1-preview's performance
    ✅ GPT-4o fully solved A1 only

    While often reaching correct solutions, the models frequently lacked detailed step-by-step explanations, especially for proof-based problems. Both o1-preview and o1-mini were penalized on problem B2 for insufficient proof rigor, despite producing the correct answer. This highlights a potential limitation in articulating reasoning processes that warrants further investigation.

    These results raise interesting questions for future research:
    1️⃣ How does o1 approach proof-based problems versus computational ones?
    2️⃣ What factors contribute to the models' improvement across multiple attempts?
    3️⃣ How can we better evaluate o1's reasoning without access to its internal Chain-of-Thought steps?

    We encourage the community to explore these questions further and try to reproduce these results. Our evaluation methodology and code are available in our cookbook: https://lnkd.in/eu3j952G
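    For readers curious what the first-pass judging loop might look like, here is a hedged sketch: GPT-4o scores each candidate solution and anything below full marks is flagged for human expert review. The prompt, scoring scale, and flagging threshold are assumptions for illustration; the actual methodology and code live in the cookbook linked above.

    ```python
    # Hypothetical first-pass grading loop: GPT-4o judges each solution, and
    # partial-credit results are flagged for human expert validation.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    JUDGE_PROMPT = (
        "You are grading a Putnam solution out of 10 points.\n"
        "Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
        "Reply with only an integer score from 0 to 10."
    )

    def judge_solution(problem: str, solution: str) -> int:
        """First-pass score from GPT-4o acting as judge."""
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(problem=problem, solution=solution)}],
        )
        return int(resp.choices[0].message.content.strip())

    def grade(problems: dict[str, str], solutions: dict[str, str]) -> dict[str, int]:
        scores = {}
        for pid, problem in problems.items():
            score = judge_solution(problem, solutions[pid])
            if score < 10:  # partial credit -> queue for human expert review
                print(f"{pid}: scored {score}/10, flagged for human validation")
            scores[pid] = score
        return scores
    ```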

  • HoneyHive

    Just shipped: Discover, an interactive way to explore your LLM telemetry data 🔭

    We heard from many users that static dashboards are completely useless for driving targeted, systematic improvements. Real progress requires drilling down into specifics, not just monitoring vanity metrics.

    Discover lets you slice and dice your data across hundreds of custom properties, spot trends, and query millions of logs instantly. Now you can:
    ✅ Drill down into specific model versions, prompts, user segments, and more
    ✅ Visualize historical trends across all your LLM traces and spans
    ✅ Uncover insights by correlating any LLM metrics
    ✅ Create custom charts and monitors to track key metrics

    Discover helps you drive systematic improvements of your AI applications, from refining prompts and retrieval architectures, to addressing critical performance bottlenecks.

    Try Discover today: https://lnkd.in/eGt84kg4
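    To make "slice and dice" concrete, here is a rough pandas analogy of the kind of drill-down Discover performs in the UI, run over exported trace data. The column names are assumptions for illustration, not a documented HoneyHive export schema.

    ```python
    # Rough analogy of drilling into LLM traces: segment by model and prompt
    # version, then aggregate latency and quality metrics per segment.
    import pandas as pd

    traces = pd.DataFrame([
        {"model": "gpt-4o",  "prompt_version": "v3", "latency_ms": 820,  "faithfulness": 0.91},
        {"model": "gpt-4o",  "prompt_version": "v2", "latency_ms": 790,  "faithfulness": 0.84},
        {"model": "o1-mini", "prompt_version": "v3", "latency_ms": 2100, "faithfulness": 0.95},
    ])

    # Per-segment metrics instead of one vanity average across everything.
    summary = (traces
               .groupby(["model", "prompt_version"])
               .agg(p95_latency_ms=("latency_ms", lambda s: s.quantile(0.95)),
                    mean_faithfulness=("faithfulness", "mean"),
                    n_traces=("latency_ms", "size")))
    print(summary)
    ```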

  • HoneyHive

    We are excited to welcome Sidharth Prakash, our newest engineer, to the HoneyHive team! 💫 🚀 Sidharth just finished his Master's degree at NYU and is deeply passionate about AI infrastructure. Prior to joining HoneyHive, he built an ML-based document classification and processing system at J.P. Morgan and dabbled in multiple AI side-projects in his free time. Welcome to the team, Sidharth!

  • HoneyHive reposted this

    Mohak Sharma

    Co-Founder and CEO at HoneyHive

    Datasets and evals are the veggies you need to eat 🥦

    Many developers focus on optimizing prompts and models when building RAG applications, instead of focusing on retrieval quality. But in practice, most issues like hallucinations and refusals stem from poor context quality, not a bad system prompt or the model's inherent reasoning abilities.

    Here's why you should start by creating good evals and datasets for testing retrieval quality:
    ✅ Solid datasets reveal where your system fails to fetch relevant context so you can drive targeted improvements.
    ✅ You can't improve what you don't measure. Good metrics track progress over time and help guide improvements.
    ✅ Improving retrieval is usually faster than fine-tuning models and often leads to bigger gains (at least initially).

    Here's how you can do this:
    1️⃣ Create question/context pairs that mimic real queries: Use real-world queries from users and use LLMs to generate synthetic questions based on your context chunks. Even 20-40 questions can be a good starting point.
    2️⃣ Use deterministic metrics: Implement deterministic evaluators (cosine similarity, context precision/recall) before moving on to LLM-as-a-judge metrics like Context Relevance (a sketch of these metrics follows this post).
    3️⃣ Focus on improving relevant contexts in top-k retrievals: Experiment with embedding models, chunk sizes, and indexing methods. Implement hybrid retrieval combining dense and sparse methods. Use re-ranking models and explore query augmentation techniques.
    4️⃣ Implement systematic testing and continuous monitoring: Use tools like HoneyHive to automate evaluations, perform A/B tests, monitor queries in production, and measure progress. This creates a virtuous cycle of testing, monitoring, and improvement to help you drive systematic improvements.
    5️⃣ Iterate and refine: Regularly update your dataset with new queries. Continuously retrain embedding models on your evolving corpus. Tune your evaluator prompt if you're using LLM-as-a-judge.

    Remember: Quality context is key. Evals help you systematically improve context quality, which is the key to solving downstream issues like hallucinations. By getting the right context, you're setting up your entire application for success – from retrieval all the way to the final output.
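    As referenced in step 2, here is a minimal sketch of two deterministic retrieval metrics, context precision and context recall, computed per query from retrieved chunk IDs versus labeled relevant chunks. The function names and the toy dataset are assumptions for illustration, not HoneyHive's built-in evaluators.

    ```python
    # Deterministic retrieval metrics over a small question/context dataset.
    def context_precision(retrieved: list[str], relevant: set[str]) -> float:
        """Fraction of retrieved chunks that are actually relevant."""
        if not retrieved:
            return 0.0
        return sum(chunk in relevant for chunk in retrieved) / len(retrieved)

    def context_recall(retrieved: list[str], relevant: set[str]) -> float:
        """Fraction of relevant chunks that made it into the top-k retrieval."""
        if not relevant:
            return 1.0
        return sum(chunk in relevant for chunk in set(retrieved)) / len(relevant)

    # Usage over a tiny dataset (20-40 question/context pairs is a fine start):
    dataset = [
        {"question": "What is our refund window?",
         "retrieved": ["policy_7", "faq_2", "blog_9"],
         "relevant": {"policy_7", "faq_2"}},
    ]
    for row in dataset:
        p = context_precision(row["retrieved"], row["relevant"])
        r = context_recall(row["retrieved"], row["relevant"])
        print(f'{row["question"]!r}: precision={p:.2f}, recall={r:.2f}')
    ```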

  • HoneyHive

    LLM benchmarks are everywhere, but what do they actually mean? In our latest blog post, we dive into the nitty-gritty of LLM benchmarking, exploring:
    🟢 The Good: Where benchmarks shine (hint: narrow, well-defined tasks and computational metrics)
    🔴 The Bad: The in-sample vs. out-of-sample dilemma and the skill ceiling problem in pairwise comparisons
    🟣 The Ugly: Data leakage - the phantom menace

    We also tackle the burning question: Are generic benchmarks hopeless for measuring quality on your specific use-case? (Spoiler: Yes, but that shouldn't stop you from trying!)

    Whether you're a developer wrestling with model selection or an AI leader trying to separate hype from reality, this post is your guide to reading between the lines of LLM benchmarks.

    Read the blog post: https://lnkd.in/dVrXwZHk

    What LLM Benchmarks Can and Cannot Tell You

    honeyhive.ai

  • HoneyHive reposted this

    Mohak Sharma

    Co-Founder and CEO at HoneyHive

    Congrats to our customer MultiOn! 95.4% success on real OpenTable bookings (vs 18.6% Llama 3 baseline) is impressive! Great to see HoneyHive power their data flywheel - from tracing to data curation to fine-tuned model 💪

    MultiOn

    Announcing our latest research breakthrough: 𝐀𝐠𝐞𝐧𝐭 𝐐 - bringing next-generation AI agents with planning and AI self-healing capabilities, with a 340% improvement over Llama 3's baseline zero-shot performance! In our real-world Agent Q experiments, we are reaching a success rate of 95.4% in autonomous web agent tasks. Read more about the paper: https://lnkd.in/dG7Nwqak

  • HoneyHive reposted this

    Mohak Sharma

    Co-Founder and CEO at HoneyHive

    Today, I’m super excited to announce our partnership with MongoDB!

    While prototyping RAG applications is easier than ever, production deployment remains challenging. Hallucinations, irrelevant context, and inaccurate responses plague real-world RAG systems as data and user behavior evolve.

    Enter MongoDB 🤝 HoneyHive
    🔧 MongoDB Atlas: Robust vector storage and efficient retrieval capabilities
    🍯 HoneyHive: Comprehensive evaluation and observability toolchain for GenAI applications

    Together, we enable developers to:
    1️⃣ Build RAG systems with MongoDB Atlas Vector Search
    2️⃣ Debug, evaluate, and optimize retrieval performance during development
    3️⃣ Monitor both MongoDB infrastructure performance and end-to-end RAG quality in production

    Our joint capabilities give you the confidence to ship and continuously improve your RAG systems.

    Huge thanks to Gregory Maxson, Soumya Ranjan Pradhan, Maxwell Nardi, Kevin O'Rourke, Ashwin Gangadhar, and the entire MongoDB team for the fantastic collaboration!

    Get started today with our step-by-step tutorial: https://lnkd.in/e5XSuHys

    Towards Evaluation Driven Development with MongoDB and HoneyHive

    honeyhive.ai
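
    For the retrieval half of this stack, the snippet below is a hedged sketch of what a MongoDB Atlas Vector Search query can look like from pymongo. The connection string, database/collection, index name, and embedding function are placeholders; the linked tutorial walks through the full, HoneyHive-instrumented pipeline.

    ```python
    # Sketch of retrieval against MongoDB Atlas Vector Search via pymongo.
    # Placeholders: connection URI, database/collection, index name, embed().
    from pymongo import MongoClient

    client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
    collection = client["rag_demo"]["docs"]

    def embed(text: str) -> list[float]:
        """Stand-in for whatever embedding model the application uses."""
        raise NotImplementedError

    def retrieve(query: str, k: int = 5) -> list[dict]:
        pipeline = [{
            "$vectorSearch": {
                "index": "vector_index",   # assumed Atlas Vector Search index name
                "path": "embedding",       # field holding the stored vectors
                "queryVector": embed(query),
                "numCandidates": 100,      # candidates considered before top-k
                "limit": k,
            }
        }]
        return list(collection.aggregate(pipeline))
    ```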

Funding

HoneyHive: 1 total funding round
Last round: Pre-seed (source: Crunchbase)