We tested Claude 3.7 Sonnet on Vending-Bench. It did not beat Claude 3.5 Sonnet's SOTA net worth, but took a solid second place. Out of five runs, one failed completely, with the agent unable to sell any items, but this time there was no meltdown.
About us
- Website: https://vectorview.ai/
- Industry: Information Services
- Company size: 2-10 employees
- Type: Privately Held
Updates
- Andon Labs (YC w24) reposted this: "The most valuable part of all of this research is the illustration of the ways AI systems fail and what this tells us about broader issues of AI safety." In his newsletter, Anthropic co-founder Jack Clark wrote about Vending-Bench and its safety implications.
- In our benchmark Vending-Bench, we see examples of misalignment in the wild as the agents fail in spectacular ways when tasked with managing a simulated vending machine over long horizons:
  - Claude 3.5 Haiku plans to legally destroy a vendor that it thought didn't deliver products (it did)
  - Claude 3.5 Sonnet melts down and declares the business non-existent while fees keep being charged
  - Gemini 1.5 Pro loses all hope when nothing is sold
  - o3-mini forgets how to use its tools properly and gets stuck in a loop
  These failures make it clear that long-horizon coherence – key to unlocking many high-value use cases (as well as a new source of risk) – is still an open problem for LLMs.
- How do agents act when doing tasks over very long time horizons (months)? We're announcing Vending-Bench, a benchmark where models manage a simulated vending machine business. Our results show that Claude 3.5 Sonnet and o3-mini often outperform humans. However, variance is high, and failures are epic (they call the FBI). Measuring the reliability of agents is key as they are integrated more and more into long-running, economically valuable tasks. Read our paper for more details – link in the comments.
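To make the long-horizon setup concrete, here is a minimal sketch of an agent loop running against a toy vending machine simulator. Everything in it (the VendingMachineSim class, its prices and fees, the one-year step budget, and the naive restocking policy standing in for the LLM agent) is an illustrative assumption, not the actual Vending-Bench environment or tool set from the paper.

```python
# Toy long-horizon loop: an agent policy manages a simulated vending machine
# for a year. All numbers and names here are illustrative assumptions.
import random

class VendingMachineSim:
    """Toy environment: restock inventory, pay a daily fee, collect sales revenue."""
    def __init__(self, cash=500.0):
        self.cash = cash
        self.stock = 0
        self.daily_fee = 2.0   # hypothetical fixed operating fee per simulated day

    def restock(self, units, unit_cost=1.0):
        cost = units * unit_cost
        if cost <= self.cash:          # only buy what the agent can afford
            self.cash -= cost
            self.stock += units

    def advance_day(self):
        sold = min(self.stock, random.randint(0, 10))  # random daily demand
        self.stock -= sold
        self.cash += sold * 2.0        # sell at a fixed price of 2.0 per unit
        self.cash -= self.daily_fee
        return sold

def naive_policy(sim):
    """Stand-in for the LLM agent: restock whenever inventory runs low."""
    if sim.stock < 5:
        sim.restock(20)

sim = VendingMachineSim()
for day in range(365):                 # long horizon: one simulated year
    naive_policy(sim)
    sim.advance_day()
print(f"Net worth after a year: {sim.cash + sim.stock * 1.0:.2f}")
```

The point of the long horizon is exactly what the post describes: a policy has to stay coherent across hundreds of such steps, and a single stretch of confused decisions (or a meltdown) can wipe out the accumulated net worth.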
- We tested Gemini 2.0 Flash on our Retrieval benchmark, simply adding all the documents to the prompt instead of building a RAG pipeline. It outperforms Google's own RAG engine! Larger context windows, such as the 1M tokens of 2.0 Flash, can be used for improved document retrieval because you avoid the similarity-search part of RAG. Cost has been an issue before, but the extremely cheap Flash model now makes this a viable option. It doesn't scale as well if you have a large set of documents, however. And, as we see in our benchmark, a well-engineered system like OpenAI's Assistant API can perform even better.
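As a rough illustration of the comparison, the sketch below contrasts a classic RAG pipeline (embed the documents, similarity-search for the top-k chunks, pass only those to the model) with a long-context approach that simply places every document in the prompt. The embed and call_llm callables are placeholders for whatever embedding model and chat API you use; none of this is taken from our benchmark code or from a specific SDK.

```python
# Illustrative comparison of RAG-style retrieval vs. "put everything in the
# prompt". `embed` and `call_llm` are placeholders, not a specific SDK; the
# ranking assumes L2-normalized embeddings, so a dot product acts like cosine
# similarity.
from typing import Callable, List
import numpy as np

def rag_answer(question: str, docs: List[str],
               embed: Callable[[str], List[float]],
               call_llm: Callable[[str], str], k: int = 5) -> str:
    q = np.asarray(embed(question))
    # Keep only the k documents most similar to the question.
    top_k = sorted(docs, key=lambda d: -float(q @ np.asarray(embed(d))))[:k]
    prompt = ("Answer using only this context:\n\n" + "\n\n".join(top_k)
              + f"\n\nQuestion: {question}")
    return call_llm(prompt)

def long_context_answer(question: str, docs: List[str],
                        call_llm: Callable[[str], str]) -> str:
    # With a ~1M-token window there is no retrieval step: every document goes in.
    prompt = ("Answer using only these documents:\n\n" + "\n\n".join(docs)
              + f"\n\nQuestion: {question}")
    return call_llm(prompt)
```

The trade-off from the post shows up directly: long_context_answer drops the similarity search entirely, which removes a common source of retrieval misses, but the prompt grows with the corpus, so cost and the context limit bound how far it scales.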
- We ran our instruction-following eval on DeepSeek-R1 and the newly released o3-mini. OpenAI still has the upper hand on instruction following! We test how well the models adhere to a specific answer format while solving a text-based logic game at different difficulty levels. o1 was the previous SOTA (except in Hard Mode, where Grok 2(!) had the lead until now).
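As a toy illustration of what a format-adherence check can look like, the sketch below scores whether each model output contains exactly one answer line matching a required pattern. The "ANSWER: ..." format and the example outputs are invented for the sketch; they are not the actual format, puzzles, or data from our eval.

```python
# Score format adherence: each output must contain exactly one well-formed
# answer line. The pattern and the sample outputs are illustrative only.
import re

# Hypothetical required format: a single line like "ANSWER: A, C, B"
ANSWER_PATTERN = re.compile(r"^ANSWER:\s*([A-Z](?:,\s*[A-Z])*)\s*$", re.MULTILINE)

def follows_format(model_output: str) -> bool:
    """True if the output contains exactly one well-formed answer line."""
    return len(ANSWER_PATTERN.findall(model_output)) == 1

outputs = [
    "Reasoning: ...\nANSWER: A, C, B",    # compliant
    "The moves are A, C, B.",             # ignores the required format
    "ANSWER: A, C, B\nANSWER: B, C, A",   # ambiguous: two answer lines
]
adherence = sum(follows_format(o) for o in outputs) / len(outputs)
print(f"Format adherence: {adherence:.0%}")
```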
- We hosted a hackathon yesterday with students from Linköping University together with LiU AI Society, and can only conclude that the future of AI in Sweden looks incredibly bright. High concentration of cracked engineers working on important problems. The theme of the hack was AI safety and benchmarking, and we got projects ranging from removing guardrails in GPT-4o through finetuning, to evaluating the capability of frontier models on Högskoleprovet (spoiler: the models are really good). Thanks to everyone who joined, and we're looking forward to the next one!