AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Chris Rawles*1, Yifan Chang†2, Sarah Clinckemaillie†2, Jonathan Waltz2, Gabrielle Lau2, Marybeth Fair2, Robert Berry1, Wei Li1, Will Bishop1, Alice Li1, Folawiyo Campbell-Ajala1, Divya Tyamagundlu2, Daniel Toyama1, Timothy Lillicrap1, Oriana Riva1
1 Google DeepMind
2 Google
*Lead contributor
†Equal contribution
Dataset
Key Features:
- 📝 116 diverse tasks across 20 real-world apps
- 🎲 Dynamic task instantiation for millions of unique variations
- 🏆 Durable reward signals for reliable evaluation
- 🌐 Open environment with access to millions of Android apps and websites
- 💾 Lightweight footprint (2 GB memory, 8 GB disk)
- 🔧 Extensible design to easily add new tasks and benchmarks
- 🖥️ Integration with MiniWoB++ web-based tasks
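To make the dynamic task instantiation and state-based reward features listed above concrete, here is a minimal sketch. It is not the AndroidWorld API: the template, the parameter pools, and the names `TaskInstance`, `instantiate_task`, and `is_successful` are all hypothetical, chosen only to illustrate the idea. The assumption is that a task is a goal template filled with randomly sampled parameters, and that success is judged by inspecting the resulting state rather than the agent's actions.

```python
import random
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """One concrete variation of a task template (hypothetical structure)."""
    goal: str
    params: dict


# Hypothetical template and parameter pools, for illustration only.
TEMPLATE = "Send a text message to {contact} saying '{message}'"
CONTACTS = ["Alice", "Bob", "Chen", "Dara"]
MESSAGES = ["Running late", "See you at noon", "Call me back"]


def instantiate_task(seed: int) -> TaskInstance:
    """Sample task parameters from a seeded RNG so each variation is reproducible."""
    rng = random.Random(seed)
    params = {
        "contact": rng.choice(CONTACTS),
        "message": rng.choice(MESSAGES),
    }
    return TaskInstance(goal=TEMPLATE.format(**params), params=params)


def is_successful(instance: TaskInstance, sent_messages: list) -> bool:
    """Check the end state (messages actually sent), not the steps taken."""
    expected = (instance.params["contact"], instance.params["message"])
    return expected in sent_messages


task = instantiate_task(seed=7)
print(task.goal)
```

Seeding the generator means the same seed always yields the same task variation, so a run can be repeated exactly, while different seeds cover different parameter combinations. With larger parameter pools and free-form values, the number of distinct variations grows multiplicatively.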
Dataset Statistics
The distribution of tags across AndroidWorld tasks
The distribution of the number of steps taken to perform tasks
Comparison to other datasets
Citation
@misc{rawles2024androidworld,
  title={AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents},
  author={Christopher Rawles and Sarah Clinckemaillie and Yifan Chang and Jonathan Waltz and Gabrielle Lau and Marybeth Fair and Alice Li and William Bishop and Wei Li and Folawiyo Campbell-Ajala and Daniel Toyama and Robert Berry and Divya Tyamagundlu and Timothy Lillicrap and Oriana Riva},
  year={2024},
  eprint={2405.14573},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}