🚀 Personal assistants of the future will be able to operate computers just like humans — by controlling user interfaces. To help make this vision a reality, we are excited to introduce AndroidWorld, a new benchmark for building and evaluating computer control agents. 🌍📱
Why AndroidWorld? Last year, we released AitW (https://lnkd.in/eQJpWzDa), a large-scale dataset for Android control collected from human experts during the pre-LLM era. While creating it, however, we found that evaluating automation agents effectively requires interactive, real-world environments.
AndroidWorld is an open environment that offers:
📝 116 diverse tasks across 20 real-world apps
🎲 Dynamic task instantiation for millions of unique variations
🏆 Durable reward signals for reliable evaluation
Key features of AndroidWorld:
🌐 Access to millions of Android apps and websites
💾 Lightweight footprint (2 GB memory, 8 GB disk)
🔧 Extensible design for easy addition of new tasks and benchmarks
🖥️ Integration with MiniWoB++ web-based tasks
To evaluate AndroidWorld, we built M3A, a new agent optimized for Android. M3A can operate dozens of apps, showcasing the potential of multimodal models like Gemini and GPT-4 to create agents out of the box. Our best-performing agent achieves a 30.6% success rate zero-shot, compared to 80% for humans, demonstrating significant room for improvement.
🎥 Every task in AndroidWorld is dynamically instantiated when it is created, with its parameters controlled by a random seed, similar to MiniWoB++. We found that agent performance is highly sensitive to the choice of seed.
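For readers who want a feel for seed-controlled instantiation, here is a minimal sketch in Python. It is not AndroidWorld's actual API: `TaskInstance`, `instantiate_task`, `success_rate`, and `toy_agent` are hypothetical names, and the messaging task template is invented for illustration. The sketch only shows how one seed can determine a task's parameters, and how a success rate can be averaged over several seeds.

```python
import random
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """One concrete variation of a task template (hypothetical structure)."""
    goal: str
    params: dict


def instantiate_task(seed: int) -> TaskInstance:
    """Build a task instance whose parameters depend only on the seed."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    contact = rng.choice(["Alice", "Bob", "Chen", "Dana"])
    minutes = rng.randint(1, 59)
    goal = f"Send a message to {contact} saying you will arrive in {minutes} minutes."
    return TaskInstance(goal=goal, params={"contact": contact, "minutes": minutes})


def success_rate(agent, seeds) -> float:
    """Fraction of seeded task instances that the agent solves."""
    results = [agent(instantiate_task(seed)) for seed in seeds]
    return sum(results) / len(results)


def toy_agent(task: TaskInstance) -> bool:
    """Stand-in for a real agent: arbitrarily fails when minutes >= 30."""
    return task.params["minutes"] < 30


if __name__ == "__main__":
    # The same seed always reproduces the same task instance.
    print(instantiate_task(7) == instantiate_task(7))
    # Averaging over several seeds smooths out per-seed variation.
    print(success_rate(toy_agent, range(20)))
```

Because a single seed can make a task easier or harder for a given agent, reporting results over several seeds rather than one gives a steadier estimate.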
🔗 Learn more about AndroidWorld and M3A:
📄 Paper: https://lnkd.in/eUUDwD83
💻 GitHub Repo: https://lnkd.in/e2_VWNXg
🌐 Project page: https://lnkd.in/eF7K8Snj
P.S. We also tested a Web Agent on AndroidWorld to explore cross-domain automation. While web agents can solve Android tasks, they significantly lag behind Android-optimized agents like M3A. This highlights the importance of developing domain-specific agents for optimal performance.
cc: Divya Tyam, Sarah Clinckemaillie, YIFAN CHANG, Marybeth Fair, Oriana Riva, Robert Berry, William Bishop, Alice Li, Folawiyo Campbell-Ajala, Wei Li
#AI #MachineLearning #AndroidWorld #Automation #Research #ArtificialIntelligence #Benchmarking #AIResearch #ComputerControlAgents
Product Strategy Consultant for SaaS and API Companies|3x VP PM
I like the term "agent-friendly API" and think that as API providers we should work towards them. As an industry we're currently working out what that means, but I think it's safe to say that APIs will serve mostly agents in the future and will not need to be optimized for human developers. Right now APIs are 100% the latter and none of the former. As someone who's worked on API Design and Developer Experience, it's both scary and exciting - but inevitable.