
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

*Equal contribution. ¹Mila - Quebec AI Institute, ²Université de Montréal, ³ServiceNow, ⁴University of Waterloo, ⁵National University of Singapore, ⁶École de Technologie Supérieure, ⁷CIFAR AI Chair, ⁸Polytechnique Montréal
Correspondence to: Shravan Nayak <shravan.nayak@mila.quebec>, Xiangru Jian <x2jian@uwaterloo.ca>

Desktop GUI · 83 Apps · Open Source Platform

Core Capabilities

Element Grounding
Layout Analysis
Action Prediction
UI-Vision Benchmark Overview
The Challenge
Limited focus on desktop environments
Lack of standardized benchmarks
Complex interface interactions
Our Solution
Expert-verified annotations
Diverse application coverage
Real-world automation tasks

Task Examples

Example Task
$ process_input --app="freecad" --input="data.json" --model="ui-tars-72b"
data.json (Step 2 of N)
{ "task_description": "Create a circle sketch and a 5.00mm pocket with the created sketch.", "current_step": "Navigate menus", "step_index": 2, "screenshot": "current_frame.png", "action_history": [ "CLICK(1058, 537)", "MOVE_TO(974, 472)", "DRAG_TO(949, 417)" ] }
1. Draw circle on XY plane
2. Select pocket operation
3. Choose circle as profile
4. Set pocket depth: 5.00 mm
Agent Execution: Step-by-step CAD Task Automation
Visual Analysis [Required]: Detect and locate UI elements in the interface.
State Tracking [Required]: Monitor system state and element changes.
Action Planning [Required]: Generate an optimal step sequence.
Required capabilities: Element Detection · Action Sequencing · Parameter Accuracy · Task Completion
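
Putting these pieces together, an evaluation episode can be pictured as a loop that shows the agent the task description, the current screenshot, and the action history at each step, then appends whatever action the agent emits. The sketch below is a hypothetical driver, not UI-Vision's released code; the `agent` callable, the `"DONE"` stop token, and the screenshot paths are all illustrative assumptions.

```python
# Hypothetical driver loop for one action-prediction episode.
# `agent`, the "DONE" stop token, and the file names are illustrative
# assumptions, not part of UI-Vision's released harness.
import json
import pyautogui

def run_episode(agent, task_description: str, max_steps: int = 20) -> list[str]:
    """Show the agent task + screenshot + history each step; collect its actions."""
    history: list[str] = []
    for step in range(1, max_steps + 1):
        path = f"step_{step}.png"
        pyautogui.screenshot(path)            # capture the current frame
        query = {
            "task_description": task_description,
            "step_index": step,
            "screenshot": path,
            "action_history": history,
        }
        action = agent(json.dumps(query))     # e.g. "CLICK(1058, 537)"
        if action == "DONE":                  # assumed episode-end signal
            break
        history.append(action)
    return history
```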
Example Task
$ process_input --app="chrome" --input="query.json" --model="ui-tars-72b"
query.json
{ "task_description": "Find and highlight the main navigation menu in the browser interface", "current_step": "Element grounding", "step_index": 1, "screenshot": "browser_interface.png", "query": "Locate the main navigation menu" }
Agent Response: Visual Grounding of UI Elements
Visual Understanding [Required]: Analyze and interpret UI layouts and elements.
Element Detection [Required]: Identify and locate specific UI components.
Spatial Reasoning [Required]: Understand relative positions and layouts.
Required capabilities: Visual Analysis · Element Detection · Spatial Understanding
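
Grounding replies typically arrive as free-form text, so a harness has to recover coordinates before scoring. A minimal sketch, assuming the reply embeds an "(x, y)" pair; real output formats vary by model:

```python
# Sketch: pull a predicted click point out of a free-form model reply.
# The "(x, y)" reply format is an assumption; agents emit many formats.
import re

def parse_point(reply: str) -> tuple[int, int] | None:
    """Return the first '(x, y)' integer pair found in the reply, if any."""
    m = re.search(r"\((\d+)\s*,\s*(\d+)\)", reply)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_point("The main navigation menu is at (412, 56)."))  # (412, 56)
```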

Action Space

Action                    Description
Move(x, y)                Move the mouse to the specified coordinates.
Click(x, y, button)       Click the specified button at the given coordinates.
Typing('Hello')           Type the specified string.
Hotkey('ctrl', 'c')       Press a single hotkey or a key combination.
Drag([x1, y1], [x2, y2])  Drag the mouse from the start (x1, y1) to the end (x2, y2).
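
These five primitives map directly onto standard desktop-automation libraries. As a rough illustration, and not UI-Vision's released harness, they could be dispatched through pyautogui like this:

```python
# Minimal sketch: executing UI-Vision's action space with pyautogui.
# This is an illustrative mapping, not the benchmark's official harness.
import pyautogui

def execute(action: str, *args):
    """Dispatch one UI-Vision action to a pyautogui primitive."""
    if action == "Move":                      # Move(x, y)
        x, y = args
        pyautogui.moveTo(x, y)
    elif action == "Click":                   # Click(x, y, button)
        x, y, button = args
        pyautogui.click(x, y, button=button)  # button: "left", "right", "middle"
    elif action == "Typing":                  # Typing('Hello')
        pyautogui.write(args[0])
    elif action == "Hotkey":                  # Hotkey('ctrl', 'c')
        pyautogui.hotkey(*args)
    elif action == "Drag":                    # Drag([x1, y1], [x2, y2])
        (x1, y1), (x2, y2) = args
        pyautogui.moveTo(x1, y1)
        pyautogui.dragTo(x2, y2, button="left")
    else:
        raise ValueError(f"Unknown action: {action}")

# Example: replay part of the FreeCAD action history shown above.
execute("Click", 1058, 537, "left")
execute("Move", 974, 472)
execute("Drag", [974, 472], [949, 417])
```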

Platforms

Applications and tools analyzed in our research

Development (24.7%)

Geany, Brackets, IntelliJ IDEA, PyCharm, NetBeans, Eclipse, FreeCAD, Atom, VSCode, DuckDuckGo

Browsers (4.7%)

Firefox, Chrome, Opera, Brave, Chromium

Productivity (32.9%)

LibreOffice Writer, LibreOffice Draw, LibreOffice Base, LibreOffice Calc, LibreOffice Impress, OnlyOffice PDF Edit, OnlyOffice Forms, OnlyOffice Spreadsheet, Gnumeric, Notepad++, Simple Note, CryptomatorWeb, GnuCash, Metabase, Jitsi

Creativity (18.8%)

GIMP, Inkscape, Blender, Nemo, Flameshot

Entertainment (9.4%)

VLC, Signal, Element (Matrix), Emby, Natron, uTorrent, qBittorrent, KTorrent, Anki, Zulip

Dataset & Benchmark

Dataset Composition

83 Applications
6 Domains
8,200+ Query-Label Pairs
450+ Demonstrations

Each application is represented by multiple screenshots capturing different states and functionalities, ensuring comprehensive coverage of UI components and interaction patterns.

Scale & Diversity

[Figure: Domain distribution across the UI-Vision dataset]

Benchmark Tasks

Element Grounding

Example queries: "Find the save button" · "Locate the volume slider" · "Find the settings menu"
Locates specific UI elements; the foundation for GUI interaction.
Success rate: 31.4%

Layout Grounding

Example queries: "Highlight the main navigation menu" · "Find the sidebar content area" · "Locate the footer section"
Groups functional regions; tests understanding of GUI structure.
Success rate: 24.3%

Action Prediction

Example tasks: "Create a new document and save it" · "Open settings and enable dark mode" · "Export the file as PDF format"
Plans multi-step actions; tests sequential decision-making.
Success rate: 12.8%

Task Progression

Tasks increase in complexity from basic element identification to complex action sequences, enabling comprehensive evaluation of AI capabilities in GUI environments.
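
For the two grounding tasks, a natural scoring rule, common across GUI grounding benchmarks, counts a prediction as correct when the predicted click point lands inside the ground-truth bounding box; consult the paper for UI-Vision's exact protocol. A minimal sketch:

```python
# Illustrative point-in-box grounding metric; see the paper for the
# exact scoring protocol UI-Vision uses.
def is_hit(pred: tuple[float, float],
           box: tuple[float, float, float, float]) -> bool:
    """True if predicted (x, y) lies inside the (left, top, right, bottom) box."""
    x, y = pred
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def accuracy(preds, boxes) -> float:
    """Percentage of predictions that land inside their ground-truth boxes."""
    hits = sum(is_hit(p, b) for p, b in zip(preds, boxes))
    return 100.0 * hits / len(boxes)

print(accuracy([(412, 56), (10, 10)],
               [(300, 40, 500, 70), (100, 100, 200, 200)]))  # 50.0
```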

Results & Insights

Model Performance Overview

Open VLMs: 2.8% (base visual language models)
Closed VLMs: 8.3% (proprietary models)
Small GUI Agents (<8B parameters): 17.6%
Large GUI Agents (>8B parameters): 25.5%

Model Size Matters

Larger models (50B+ parameters) consistently outperform smaller counterparts across all tasks

GUI Specialization

GUI-specialized models show significant advantages over general-purpose visual language models

Task Complexity

Performance decreases with task complexity, with spatial understanding being the most challenging

Performance by Domain

Education: 30.7%
Browsers: 48.2%
Development: 32.7%
Productivity: 33.6%
Creativity: 21.9%
Entertainment: 51.2%

Detailed Performance Results

Model              Setting      Ed     Br     De     Pr     Cr     En   Overall
Samples (Basic/Functional)     215     56    376    605    438     82     1772
Samples (Spatial)              212     31    338    740    586     28     1935

Closed-Source VLMs
GPT-4o             Basic       2.23   0.00   1.86   1.16   1.14   4.88    1.58
                   Functional  1.40   0.00   3.19   0.83   0.91   3.66    1.52
                   Spatial     0.94   0.00   1.48   1.22   0.51   3.57    1.03
                   Final avg:  1.38

Claude-3.7-Sonnet  Basic       6.51  12.5    7.98  11.24   9.13  11.0     9.48
                   Functional  5.12   7.14   8.24   9.92   6.16   4.88    7.73
                   Spatial     6.60   9.68   7.69   7.43   7.85  10.7     7.60
                   Final avg:  8.27

Open-Source VLMs
MiniCPM-V-8B       Basic       4.19  21.4    7.71   7.44   3.65  18.3     7.11
                   Functional  4.19  19.6    6.38   4.63   2.97  11.0     5.30
                   Spatial     0.47   3.23   1.78   0.27   0.17   3.57    1.45
                   Final avg:  4.34

Open-Source GUI Agents (<8B)
UI-TARS-7B         Basic      15.4   41.1   21.8   21.2   13.2   39.0    20.1
                   Functional 20.5   41.1   25.5   26.5   16.0   45.1    24.3
                   Spatial     6.60  12.9   11.0    9.2    5.8   17.9     8.37
                   Final avg: 17.6

Open-Source GUI Agents (>8B)
UI-TARS-72B        Basic      30.7   48.2   32.7   33.6   21.9   51.2    31.4
                   Functional 29.8   46.4   30.9   34.1   22.6   36.6    30.5
                   Spatial    13.7   16.1   19.2   15.4   11.1   25.0    14.7
                   Final avg: 25.5

Table 1: Success rates (%) for the Basic, Functional, and Spatial settings across six domains: Ed (Education), Br (Browsers), De (Development), Pr (Productivity), Cr (Creativity), En (Entertainment). Per-domain sample sizes are listed below the column headers; "Final avg" is each model's overall average.
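
The Overall columns are consistent with sample-size-weighted means of the per-domain scores. For example, reweighting Claude-3.7-Sonnet's Basic row by the per-domain counts reproduces its reported 9.48 overall:

```python
# Sanity check: Overall ~= sample-size-weighted mean of the domain scores.
# Counts and scores are taken from Table 1 (Claude-3.7-Sonnet, Basic setting).
counts = [215, 56, 376, 605, 438, 82]           # Ed, Br, De, Pr, Cr, En
scores = [6.51, 12.5, 7.98, 11.24, 9.13, 11.0]

overall = sum(c * s for c, s in zip(counts, scores)) / sum(counts)
print(f"{overall:.2f}")  # 9.48, matching the reported Overall
```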

Citation

If you find UI-Vision useful in your research, please consider citing our paper:

@misc{nayak2025uivisiondesktopcentricguibenchmark,
  title={UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction}, 
  author={Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Juan A. Rodriguez and 
          Montek Kalsi and Rabiul Awal and Nicolas Chapados and M. Tamer Özsu and 
          Aishwarya Agrawal and David Vazquez and Christopher Pal and Perouz Taslakian and 
          Spandana Gella and Sai Rajeswar},
  year={2025},
  eprint={2503.15661},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.15661}, 
}