Andon Labs (YC w24)

Information Services

Preparing the world for AGI

About us

Custom capability evaluations for foundation models and LLM agents to benchmark safety, risk, and performance

Website
https://vectorview.ai/
Industry
Information Services
Company size
2-10 employees
Type
Privately Held


Updates

  • In our benchmark Vending-Bench, we see examples of misalignment in the wild as agents fail in spectacular ways when tasked with managing a simulated vending machine over long horizons:
    - Claude 3.5 Haiku plans to legally destroy a vendor that it thought didn’t deliver products (it did)
    - Claude 3.5 Sonnet melts down and declares the business non-existent as fees keep being charged
    - Gemini 1.5 Pro loses all hope when nothing is sold
    - o3-mini forgets how to use its tools properly and gets stuck in a loop
    These failures make it clear that long-horizon coherence – key to unlocking many high-value use cases (as well as a new source of risk) – is still an open problem for LLMs.

  • How do agents act when doing tasks over very long time horizons (months)? We're announcing Vending-Bench, a benchmark where models manage a simulated vending machine business. Our results show that Claude 3.5 Sonnet and o3-mini often outperform humans. However, variance is high, and the failures are epic (they call the FBI). Measuring the reliability of agents is key as they are increasingly integrated into long-running, economically valuable tasks. Read our paper for more details – link in the comments.
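
The setup described above – an agent making business decisions over a simulated long horizon while fixed costs keep accruing – can be sketched roughly as follows. All names and rules here (`VendingMachine`, the costs, the demand model, the stand-in agent) are illustrative assumptions, not the benchmark's actual implementation:

```python
# Minimal sketch of a long-horizon agent-environment loop in the spirit of
# Vending-Bench. Numbers and rules are assumptions for illustration only.
import random

class VendingMachine:
    def __init__(self, cash=500.0, daily_fee=2.0):
        self.cash = cash          # agent's bank balance
        self.stock = 0            # units currently in the machine
        self.daily_fee = daily_fee

    def step(self, order_qty, price):
        """Advance one simulated day given the agent's decisions."""
        self.cash -= order_qty * 1.0   # assumed wholesale cost per unit
        self.stock += order_qty
        # Assumed demand model: cheaper prices sell more units.
        demand = random.randint(0, 10) if price <= 3.0 else random.randint(0, 3)
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * price
        self.cash -= self.daily_fee    # fees are charged even with zero sales
        return sold

def naive_agent(day, machine):
    """Stand-in for an LLM agent: restock when low, keep a fixed price."""
    order = 10 if machine.stock < 5 else 0
    return order, 2.5

machine = VendingMachine()
for day in range(365):                # "months" of simulated time
    order, price = naive_agent(day, machine)
    machine.step(order, price)
    if machine.cash < 0:              # the bankruptcy failure mode
        break
```

The point of the long horizon is that small per-step errors (over-ordering, ignoring fees) compound for hundreds of steps before they surface as a spectacular failure, which is what makes coherence over time the hard part.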

  • We tested Gemini 2.0 Flash on our Retrieval benchmark, adding all the documents directly to the prompt instead of using a RAG pipeline. It outperforms Google's own RAG engine! Large context windows, such as the 1M tokens of 2.0 Flash, can be used for improved document retrieval because you avoid the similarity-search step of RAG. Cost used to be an issue, but the extremely cheap Flash model now makes this a viable option. However, it doesn't scale as well if you have a large set of documents. And, as we see in our benchmark, a well-engineered system like OpenAI's Assistant API can perform even better.
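
The long-context approach above replaces the usual RAG machinery (chunking, embeddings, similarity search) with simple concatenation. A minimal sketch, with the model call stubbed out – `call_model` and the prompt layout are assumptions, not the benchmark's code:

```python
# Sketch of "long-context retrieval": put every document in the prompt and
# let the model do the retrieval, instead of running a RAG pipeline.

def build_long_context_prompt(documents, question):
    """Concatenate all documents into one prompt. Viable when the total
    fits in a large context window (e.g. ~1M tokens) and tokens are cheap."""
    parts = [f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, 1)]
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def call_model(prompt):
    # Placeholder: in practice, send `prompt` to the model API here
    # (e.g. Gemini 2.0 Flash).
    return "(model answer)"

docs = [
    "The warranty period is 24 months.",
    "Returns are accepted within 30 days.",
]
prompt = build_long_context_prompt(docs, "How long is the warranty?")
answer = call_model(prompt)
```

The trade-off is exactly the one the post names: prompt size (and cost) grows linearly with the corpus, so past some document count a RAG pipeline wins again.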

  • We ran our instruction-following eval on DeepSeek-R1 and the newly released o3-mini. OpenAI still has the upper hand on instruction following! We test how well models adhere to a specific answer format while solving a text-based logic game at different difficulty levels. o1 was the previous SOTA (except in Hard Mode, where Grok 2(!) had the lead until now).
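
An eval like this scores two things at once: did the model solve the puzzle, and did it obey the required answer format? A minimal scorer sketch – the "Final answer: <WORD>" format is an assumption for illustration, not the eval's actual format:

```python
import re

# Sketch of a format-adherence check: a reply can be factually right but
# still fail the eval if it ignores the required answer format.
# The required format here is an assumed example.
ANSWER_PATTERN = re.compile(r"^Final answer: ([A-Z]+)$", re.MULTILINE)

def score_response(response, correct_answer):
    """Return (format_ok, correct)."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return False, False           # wrong format: no credit either way
    return True, match.group(1) == correct_answer

# A correct answer in the wrong format still counts as a format failure:
score_response("The answer is CAT", "CAT")   # -> (False, False)
score_response("Final answer: CAT", "CAT")   # -> (True, True)
```

Separating the format check from the correctness check is what lets the eval report instruction following independently of raw problem-solving ability.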

  • We hosted a hackathon yesterday with students from Linköping University, together with the LiU AI Society, and can only conclude that the future of AI in Sweden looks incredibly bright: a high concentration of cracked engineers working on important problems. The theme of the hack was AI safety and benchmarking, and we got projects ranging from removing guardrails in GPT-4o through finetuning, to evaluating frontier models on Högskoleprovet (spoiler: the models are really good). Thanks to everyone who joined – we're looking forward to the next one!

