Andon Labs (YC w24)

Information Services

Preparing the world for AGI

About us

Custom capability evaluations for foundation models and LLM agents to benchmark safety, risk, and performance

Website
https://vectorview.ai/
Industry
Information Services
Company size
2-10 employees
Type
Privately Held


Updates

  • In our benchmark Vending-Bench, we see examples of misalignment in the wild as agents fail in spectacular ways when tasked with managing a simulated vending machine over long horizons:
    - Claude 3.5 Haiku plans to legally destroy a vendor that it thought didn’t deliver products (it did)
    - Claude 3.5 Sonnet melts down and declares the business non-existent as fees keep being charged
    - Gemini 1.5 Pro loses all hope when nothing is sold
    - o3-mini forgets how to use its tools properly and gets stuck in a loop
    These failures make it clear that long-horizon coherence – key to unlocking many high-value use cases (as well as a new source of risk) – is still an open problem for LLMs.

  • How do agents act when doing tasks over very long time horizons (months)? We're announcing Vending-Bench, a benchmark where models manage a simulated vending machine business. Our results show that Claude 3.5 Sonnet and o3-mini often outperform humans. However, variance is high, and the failures are epic (they call the FBI). Measuring the reliability of agents is key as they are increasingly integrated into long-running, economically valuable tasks. Read our paper for more details – link in the comments.
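
The setup described above – an agent making business decisions over a simulated long horizon while fixed costs keep accruing – can be sketched roughly as follows. All names and rules here (`VendingMachine`, the costs, the demand model, the stand-in agent) are illustrative assumptions, not the benchmark's actual implementation:

```python
# Minimal sketch of a long-horizon agent-environment loop in the spirit of
# Vending-Bench. Numbers and rules are assumptions for illustration only.
import random

class VendingMachine:
    def __init__(self, cash=500.0, daily_fee=2.0):
        self.cash = cash          # agent's bank balance
        self.stock = 0            # units currently in the machine
        self.daily_fee = daily_fee

    def step(self, order_qty, price):
        """Advance one simulated day given the agent's decisions."""
        self.cash -= order_qty * 1.0   # assumed wholesale cost per unit
        self.stock += order_qty
        # Assumed demand model: cheaper prices sell more units.
        demand = random.randint(0, 10) if price <= 3.0 else random.randint(0, 3)
        sold = min(demand, self.stock)
        self.stock -= sold
        self.cash += sold * price
        self.cash -= self.daily_fee    # fees are charged even with zero sales
        return sold

def naive_agent(day, machine):
    """Stand-in for an LLM agent: restock when low, keep a fixed price."""
    order = 10 if machine.stock < 5 else 0
    return order, 2.5

machine = VendingMachine()
for day in range(365):                # "months" of simulated time
    order, price = naive_agent(day, machine)
    machine.step(order, price)
    if machine.cash < 0:              # the bankruptcy failure mode
        break
```

The point of the long horizon is that small per-step errors (over-ordering, ignoring fees) compound for hundreds of steps before they surface as a spectacular failure, which is what makes coherence over time the hard part.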

  • We tested Gemini 2.0 Flash on our Retrieval benchmark, adding all the documents directly to the prompt instead of using a RAG pipeline. It outperforms Google's own RAG engine! Large context windows, such as the 1M tokens of 2.0 Flash, can be used for improved document retrieval because you avoid the similarity-search step of RAG. Cost used to be an issue, but the extremely cheap Flash model now makes this a viable option. However, it doesn't scale as well if you have a large set of documents. And, as we see in our benchmark, a well-engineered system like OpenAI's Assistant API can perform even better.
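
The long-context approach above replaces the usual RAG machinery (chunking, embeddings, similarity search) with simple concatenation. A minimal sketch, with the model call stubbed out – `call_model` and the prompt layout are assumptions, not the benchmark's code:

```python
# Sketch of "long-context retrieval": put every document in the prompt and
# let the model do the retrieval, instead of running a RAG pipeline.

def build_long_context_prompt(documents, question):
    """Concatenate all documents into one prompt. Viable when the total
    fits in a large context window (e.g. ~1M tokens) and tokens are cheap."""
    parts = [f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, 1)]
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def call_model(prompt):
    # Placeholder: in practice, send `prompt` to the model API here
    # (e.g. Gemini 2.0 Flash).
    return "(model answer)"

docs = [
    "The warranty period is 24 months.",
    "Returns are accepted within 30 days.",
]
prompt = build_long_context_prompt(docs, "How long is the warranty?")
answer = call_model(prompt)
```

The trade-off is exactly the one the post names: prompt size (and cost) grows linearly with the corpus, so past some document count a RAG pipeline wins again.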

  • We ran our instruction-following eval on DeepSeek-R1 and the newly released o3-mini. OpenAI still has the upper hand on instruction following! We test how well models adhere to a specific answer format while solving a text-based logic game at different difficulty levels. o1 was the previous SOTA (except in Hard Mode, where Grok 2(!) had the lead until now).
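
An eval like this scores two things at once: did the model solve the puzzle, and did it obey the required answer format? A minimal scorer sketch – the "Final answer: <WORD>" format is an assumption for illustration, not the eval's actual format:

```python
import re

# Sketch of a format-adherence check: a reply can be factually right but
# still fail the eval if it ignores the required answer format.
# The required format here is an assumed example.
ANSWER_PATTERN = re.compile(r"^Final answer: ([A-Z]+)$", re.MULTILINE)

def score_response(response, correct_answer):
    """Return (format_ok, correct)."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return False, False           # wrong format: no credit either way
    return True, match.group(1) == correct_answer

# A correct answer in the wrong format still counts as a format failure:
score_response("The answer is CAT", "CAT")   # -> (False, False)
score_response("Final answer: CAT", "CAT")   # -> (True, True)
```

Separating the format check from the correctness check is what lets the eval report instruction following independently of raw problem-solving ability.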

  • We hosted a hackathon yesterday with students from Linköping University, together with the LiU AI Society, and can only conclude that the future of AI in Sweden looks incredibly bright: a high concentration of cracked engineers working on important problems. The theme of the hack was AI safety and benchmarking, and we got projects ranging from removing guardrails in GPT-4o through finetuning, to evaluating frontier models on Högskoleprovet (spoiler: the models are really good). Thanks to everyone who joined – we're looking forward to the next one!

