A comprehensive collection of datasets, models, and research papers for advancing desktop automation and UI understanding
The first comprehensive, license-permissive evaluation benchmark for desktop computer-use agents. It features 83 applications with dense, high-quality annotations, including bounding boxes, action trajectories, and keyboard inputs (a schema sketch follows below).
A compact AI agent built to achieve state-of-the-art performance on the UI-Vision and OSWorld-G benchmarks through precise, task-oriented supervision on the UI-Vision-Ground dataset.
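To make the annotation format concrete, here is a minimal sketch of how one densely annotated record could be modeled in code. The class and field names are illustrative assumptions, not the published UI-Vision schema.

```python
# Illustrative sketch only: names and fields are assumptions, not the released schema.
from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    # Pixel coordinates of a UI element: top-left corner plus width/height.
    x: int
    y: int
    width: int
    height: int


@dataclass
class ActionStep:
    # One atomic action in a trajectory, e.g. "click", "drag", or "press".
    action_type: str
    target: BoundingBox
    keyboard_input: str = ""  # filled in for keyboard actions


@dataclass
class AnnotatedScreenshot:
    screenshot_path: str
    application: str
    elements: list[BoundingBox] = field(default_factory=list)
    trajectory: list[ActionStep] = field(default_factory=list)


record = AnnotatedScreenshot(
    screenshot_path="screenshots/gimp_0001.png",
    application="GIMP",
    elements=[BoundingBox(x=12, y=48, width=96, height=24)],
    trajectory=[
        ActionStep("click", BoundingBox(12, 48, 96, 24)),
        ActionStep("press", BoundingBox(12, 48, 96, 24), keyboard_input="ctrl+s"),
    ],
)
print(record.trajectory[-1].keyboard_input)  # -> "ctrl+s"
```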
Comprehensive datasets for building the next generation of intelligent computer automation agents
The largest human-annotated dataset for desktop GUI grounding with 3.5 million UI elements from 55,000 screenshots across 87 real-world applications.
A comprehensive action dataset with 10K+ detailed action trajectories, 70K+ atomic actions (Click, Drag, Press), and rich chain-of-thought traces for every user action (a loading sketch follows below).
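Datasets like these can typically be pulled with the Hugging Face `datasets` library. The repository IDs, split names, and column names below are placeholders for illustration; substitute the identifiers from the actual dataset cards.

```python
# Sketch under assumptions: repo IDs, splits, and column names are
# placeholders, not the published Hugging Face identifiers.
from datasets import load_dataset

grounding = load_dataset("your-org/ui-vision-ground", split="train")  # placeholder repo ID
actions = load_dataset("your-org/ui-vision-actions", split="train")   # placeholder repo ID

# Peek at one grounding example: a screenshot paired with element annotations.
print(grounding[0].keys())

# Count one atomic action type, assuming an "action_type" column exists.
clicks = actions.filter(lambda row: row["action_type"] == "click")
print(f"{len(clicks)} click actions")
```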
The complete ecosystem for building powerful computer-use agents
Open-source applications across 6 categories: Education, Browsers, Development, Productivity, Creativity, Entertainment
Complex multi-step workflows with detailed chain-of-thought reasoning for every action
60K screenshots densely annotated with precise bounding boxes and element metadata
High-quality screen recordings with synchronized action annotations and timing data (a replay sketch follows below)
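As one example of how the timing data could be used, here is a minimal sketch that replays action annotations against a recording by mapping timestamps to frame indices. The JSONL layout, field names, and 30 fps rate are assumptions for illustration, not the released recording format.

```python
# Sketch under assumptions: the JSONL layout, field names, and frame rate
# below are illustrative, not the released recording format.
import json


def load_trajectory(path: str) -> list[dict]:
    """Read one action per line, e.g. {"t": 3.2, "action": "click", "thought": "..."}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def frame_index(timestamp_s: float, fps: float = 30.0) -> int:
    """Map an action timestamp (in seconds) to the nearest frame of the recording."""
    return round(timestamp_s * fps)


if __name__ == "__main__":
    for step in load_trajectory("trajectories/example.jsonl"):
        print(f"frame {frame_index(step['t']):>6}  {step['action']:<8}  {step['thought'][:60]}")
```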
Research and projects using UI-Vision resources and building on top of them
Surveys LLM-based GUI agents, tracing the evolution of visual perception and interaction in GUI environments.
Advances GUI agent grounding capabilities through self-evolutionary reinforcement learning, building on visual understanding benchmarks.
Establishes open foundations for computer-use agents, leveraging vision-language models for diverse computer task automation.
Using UI-Vision resources in your research? Share your work with the community.
Interested in collaborating or contributing to our research? We'd love to hear from you.