A comprehensive collection of datasets, models, and research papers for advancing desktop automation and UI understanding
The first comprehensive, license-permissive evaluation benchmark for desktop computer-use agents. It features 83 applications with dense, high-quality annotations, including bounding boxes, action trajectories, and keyboard inputs (a schema sketch follows below).
A compact AI agent built to achieve state-of-the-art performance on the UI-Vision and OSWorld-G benchmarks through precise, task-oriented supervision on the UI-Vision-Ground dataset.
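To make the annotation format concrete, here is a minimal sketch of how one densely annotated record could be modeled in code. The class and field names are illustrative assumptions, not the published UI-Vision schema.

```python
# Illustrative sketch only: names and fields are assumptions, not the released schema.
from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    # Pixel coordinates of a UI element: top-left corner plus width/height.
    x: int
    y: int
    width: int
    height: int


@dataclass
class ActionStep:
    # One atomic action in a trajectory, e.g. "click", "drag", or "press".
    action_type: str
    target: BoundingBox
    keyboard_input: str = ""  # filled in for keyboard actions


@dataclass
class AnnotatedScreenshot:
    screenshot_path: str
    application: str
    elements: list[BoundingBox] = field(default_factory=list)
    trajectory: list[ActionStep] = field(default_factory=list)


record = AnnotatedScreenshot(
    screenshot_path="screenshots/gimp_0001.png",
    application="GIMP",
    elements=[BoundingBox(x=12, y=48, width=96, height=24)],
    trajectory=[
        ActionStep("click", BoundingBox(12, 48, 96, 24)),
        ActionStep("press", BoundingBox(12, 48, 96, 24), keyboard_input="ctrl+s"),
    ],
)
print(record.trajectory[-1].keyboard_input)  # -> "ctrl+s"
```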
Comprehensive datasets for building the next generation of intelligent computer automation agents
The largest human-annotated dataset for desktop GUI grounding with 3.5 million UI elements from 55,000 screenshots across 87 real-world applications.
A comprehensive action dataset with 10K+ detailed action trajectories, 70K+ atomic actions (Click, Drag, Press), and rich chain-of-thought traces for every user action (a loading sketch follows below).
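Datasets like these can typically be pulled with the Hugging Face `datasets` library. The repository IDs, split names, and column names below are placeholders for illustration; substitute the identifiers from the actual dataset cards.

```python
# Sketch under assumptions: repo IDs, splits, and column names are
# placeholders, not the published Hugging Face identifiers.
from datasets import load_dataset

grounding = load_dataset("your-org/ui-vision-ground", split="train")  # placeholder repo ID
actions = load_dataset("your-org/ui-vision-actions", split="train")   # placeholder repo ID

# Peek at one grounding example: a screenshot paired with element annotations.
print(grounding[0].keys())

# Count one atomic action type, assuming an "action_type" column exists.
clicks = actions.filter(lambda row: row["action_type"] == "click")
print(f"{len(clicks)} click actions")
```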
The complete ecosystem for building powerful computer-use agents
Open-source applications across 6 categories: Education, Browsers, Development, Productivity, Creativity, Entertainment
Complex multi-step workflows with detailed chain-of-thought reasoning for every action
60K screenshots densely annotated with precise bounding boxes and element metadata
High-quality screen recordings with synchronized action annotations and timing data (a replay sketch follows below)
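As one example of how the timing data could be used, here is a minimal sketch that replays action annotations against a recording by mapping timestamps to frame indices. The JSONL layout, field names, and 30 fps rate are assumptions for illustration, not the released recording format.

```python
# Sketch under assumptions: the JSONL layout, field names, and frame rate
# below are illustrative, not the released recording format.
import json


def load_trajectory(path: str) -> list[dict]:
    """Read one action per line, e.g. {"t": 3.2, "action": "click", "thought": "..."}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def frame_index(timestamp_s: float, fps: float = 30.0) -> int:
    """Map an action timestamp (in seconds) to the nearest frame of the recording."""
    return round(timestamp_s * fps)


if __name__ == "__main__":
    for step in load_trajectory("trajectories/example.jsonl"):
        print(f"frame {frame_index(step['t']):>6}  {step['action']:<8}  {step['thought'][:60]}")
```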
Research and projects using UI-Vision resources and building on top of them
Surveys LLM-based GUI agents, tracing the evolution of visual perception and interaction in GUI environments.
Advances GUI agent grounding capabilities through self-evolutionary reinforcement learning, building on visual understanding benchmarks.
Establishes open foundations for computer-use agents, leveraging vision-language models for diverse computer task automation.
Using UI-Vision resources in your research? Share your work with the community.
Interested in collaborating or contributing to our research? We'd love to hear from you.