
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

*Equal contribution. ¹Mila - Quebec AI Institute, ²Université de Montréal, ³ServiceNow, ⁴University of Waterloo, ⁵National University of Singapore, ⁶École de Technologie Supérieure, ⁷CIFAR AI Chair, ⁸Polytechnique Montréal
Correspondence to: Shravan Nayak <shravan.nayak@mila.quebec>, Xiangru Jian <x2jian@uwaterloo.ca>

Desktop GUI · 83 Apps · Open Source Platform

Core Capabilities

Element Grounding
Layout Analysis
Action Prediction
UI-Vision Benchmark Overview
The Challenge
Limited focus on desktop environments
Lack of standardized benchmarks
Complex interface interactions
Our Solution
Expert-verified annotations
Diverse application coverage
Real-world automation tasks

Task Examples

Example Task
$ process_input --app="freecad" --input="data.json" --model="ui-tars-72b"
data.json (Step 2 of N)
{ "task_description": "Create a circle sketch and a 5.00mm pocket with the created sketch.", "current_step": "Navigate menus", "step_index": 2, "screenshot": "current_frame.png", "action_history": [ "CLICK(1058, 537)", "MOVE_TO(974, 472)", "DRAG_TO(949, 417)" ] }
1. Draw circle on XY plane
2. Select pocket operation
3. Choose circle as profile
4. Set pocket depth: 5.00 mm
Agent Execution: Step-by-step CAD Task Automation
Visual Analysis [Required]: Detect and locate UI elements in the interface.
State Tracking [Required]: Monitor system state and element changes.
Action Planning [Required]: Generate an optimal step sequence.
Required capabilities: Element Detection · Action Sequencing · Parameter Accuracy · Task Completion
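
Putting these pieces together, an evaluation episode can be pictured as a loop that shows the agent the task description, the current screenshot, and the action history at each step, then appends whatever action the agent emits. The sketch below is a hypothetical driver, not UI-Vision's released code; the `agent` callable, the `"DONE"` stop token, and the screenshot paths are all illustrative assumptions.

```python
# Hypothetical driver loop for one action-prediction episode.
# `agent`, the "DONE" stop token, and the file names are illustrative
# assumptions, not part of UI-Vision's released harness.
import json
import pyautogui

def run_episode(agent, task_description: str, max_steps: int = 20) -> list[str]:
    """Show the agent task + screenshot + history each step; collect its actions."""
    history: list[str] = []
    for step in range(1, max_steps + 1):
        path = f"step_{step}.png"
        pyautogui.screenshot(path)            # capture the current frame
        query = {
            "task_description": task_description,
            "step_index": step,
            "screenshot": path,
            "action_history": history,
        }
        action = agent(json.dumps(query))     # e.g. "CLICK(1058, 537)"
        if action == "DONE":                  # assumed episode-end signal
            break
        history.append(action)
    return history
```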
Example Task
$ process_input --app="chrome" --input="query.json" --model="ui-tars-72b"
query.json
{ "task_description": "Find and highlight the main navigation menu in the browser interface", "current_step": "Element grounding", "step_index": 1, "screenshot": "browser_interface.png", "query": "Locate the main navigation menu" }
Agent Response: Visual Grounding of UI Elements
Visual Understanding [Required]: Analyze and interpret UI layouts and elements.
Element Detection [Required]: Identify and locate specific UI components.
Spatial Reasoning [Required]: Understand relative positions and layouts.
Required capabilities: Visual Analysis · Element Detection · Spatial Understanding
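
Grounding replies typically arrive as free-form text, so a harness has to recover coordinates before scoring. A minimal sketch, assuming the reply embeds an "(x, y)" pair; real output formats vary by model:

```python
# Sketch: pull a predicted click point out of a free-form model reply.
# The "(x, y)" reply format is an assumption; agents emit many formats.
import re

def parse_point(reply: str) -> tuple[int, int] | None:
    """Return the first '(x, y)' integer pair found in the reply, if any."""
    m = re.search(r"\((\d+)\s*,\s*(\d+)\)", reply)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_point("The main navigation menu is at (412, 56)."))  # (412, 56)
```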

Action Space

Action                    Description
Move(x, y)                Move the mouse to the specified coordinates.
Click(x, y, button)       Click the specified button at the given coordinates.
Typing('Hello')           Type the specified string.
Hotkey('ctrl', 'c')       Press a single hotkey or a key combination.
Drag([x1, y1], [x2, y2])  Drag the mouse from the start (x1, y1) to the end (x2, y2).
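
These five primitives map directly onto standard desktop-automation libraries. As a rough illustration, and not UI-Vision's released harness, they could be dispatched through pyautogui like this:

```python
# Minimal sketch: executing UI-Vision's action space with pyautogui.
# This is an illustrative mapping, not the benchmark's official harness.
import pyautogui

def execute(action: str, *args):
    """Dispatch one UI-Vision action to a pyautogui primitive."""
    if action == "Move":                      # Move(x, y)
        x, y = args
        pyautogui.moveTo(x, y)
    elif action == "Click":                   # Click(x, y, button)
        x, y, button = args
        pyautogui.click(x, y, button=button)  # button: "left", "right", "middle"
    elif action == "Typing":                  # Typing('Hello')
        pyautogui.write(args[0])
    elif action == "Hotkey":                  # Hotkey('ctrl', 'c')
        pyautogui.hotkey(*args)
    elif action == "Drag":                    # Drag([x1, y1], [x2, y2])
        (x1, y1), (x2, y2) = args
        pyautogui.moveTo(x1, y1)
        pyautogui.dragTo(x2, y2, button="left")
    else:
        raise ValueError(f"Unknown action: {action}")

# Example: replay part of the FreeCAD action history shown above.
execute("Click", 1058, 537, "left")
execute("Move", 974, 472)
execute("Drag", [974, 472], [949, 417])
```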

Platforms

Applications and tools analyzed in our research

Development (24.7%)

Geany, Brackets, IntelliJ IDEA, PyCharm, NetBeans, Eclipse, FreeCAD, Atom, VSCode, DuckDuckGo

Browsers (4.7%)

Firefox, Chrome, Opera, Brave, Chromium

Productivity (32.9%)

LibreOffice Writer, LibreOffice Draw, LibreOffice Base, LibreOffice Calc, LibreOffice Impress, OnlyOffice PDF Edit, OnlyOffice Forms, OnlyOffice Spreadsheet, Gnumeric, Notepad++, Simple Note, CryptomatorWeb, GnuCash, Metabase, Jitsi

Creativity (18.8%)

GIMP, Inkscape, Blender, Nemo, Flameshot

Entertainment (9.4%)

VLC, Signal, Element (Matrix), Emby, Natron, uTorrent, qBittorrent, KTorrent, Anki, Zulip

Dataset & Benchmark

Dataset Composition

83 Applications
6 Domains
8,200+ Query-Label Pairs
450+ Demonstrations

Each application is represented by multiple screenshots capturing different states and functionalities, ensuring comprehensive coverage of UI components and interaction patterns.

Scale & Diversity

[Figure: Domain distribution across the UI-Vision dataset]

Benchmark Tasks

Element Grounding

Example queries: "Find the save button" · "Locate the volume slider" · "Find the settings menu"
Locates specific UI elements; the foundation for GUI interaction.
Success rate: 31.4%

Layout Grounding

Example queries: "Highlight the main navigation menu" · "Find the sidebar content area" · "Locate the footer section"
Groups functional regions; tests understanding of GUI structure.
Success rate: 24.3%

Action Prediction

Example tasks: "Create a new document and save it" · "Open settings and enable dark mode" · "Export the file as PDF format"
Plans multi-step actions; tests sequential decision-making.
Success rate: 12.8%

Task Progression

Tasks increase in complexity from basic element identification to complex action sequences, enabling comprehensive evaluation of AI capabilities in GUI environments.
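
For the two grounding tasks, a natural scoring rule, common across GUI grounding benchmarks, counts a prediction as correct when the predicted click point lands inside the ground-truth bounding box; consult the paper for UI-Vision's exact protocol. A minimal sketch:

```python
# Illustrative point-in-box grounding metric; see the paper for the
# exact scoring protocol UI-Vision uses.
def is_hit(pred: tuple[float, float],
           box: tuple[float, float, float, float]) -> bool:
    """True if predicted (x, y) lies inside the (left, top, right, bottom) box."""
    x, y = pred
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def accuracy(preds, boxes) -> float:
    """Percentage of predictions that land inside their ground-truth boxes."""
    hits = sum(is_hit(p, b) for p, b in zip(preds, boxes))
    return 100.0 * hits / len(boxes)

print(accuracy([(412, 56), (10, 10)],
               [(300, 40, 500, 70), (100, 100, 200, 200)]))  # 50.0
```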

Results & Insights

Model Performance Overview

Open VLMs: 2.8% (base visual language models)
Closed VLMs: 8.3% (proprietary models)
Small GUI Agents (<8B parameters): 17.6%
Large GUI Agents (>8B parameters): 25.5%

Model Size Matters

Larger models (50B+ parameters) consistently outperform smaller counterparts across all tasks

GUI Specialization

GUI-specialized models show significant advantages over general-purpose visual language models

Task Complexity

Performance decreases with task complexity, with spatial understanding being the most challenging

Performance by Domain

Education: 30.7%
Browsers: 48.2%
Development: 32.7%
Productivity: 33.6%
Creativity: 21.9%
Entertainment: 51.2%

Detailed Performance Results

Model              Setting      Ed     Br     De     Pr     Cr     En   Overall
Samples (Basic/Functional)     215     56    376    605    438     82     1772
Samples (Spatial)              212     31    338    740    586     28     1935

Closed-Source VLMs
GPT-4o             Basic       2.23   0.00   1.86   1.16   1.14   4.88    1.58
                   Functional  1.40   0.00   3.19   0.83   0.91   3.66    1.52
                   Spatial     0.94   0.00   1.48   1.22   0.51   3.57    1.03
                   Final avg:  1.38

Claude-3.7-Sonnet  Basic       6.51  12.5    7.98  11.24   9.13  11.0     9.48
                   Functional  5.12   7.14   8.24   9.92   6.16   4.88    7.73
                   Spatial     6.60   9.68   7.69   7.43   7.85  10.7     7.60
                   Final avg:  8.27

Open-Source VLMs
MiniCPM-V-8B       Basic       4.19  21.4    7.71   7.44   3.65  18.3     7.11
                   Functional  4.19  19.6    6.38   4.63   2.97  11.0     5.30
                   Spatial     0.47   3.23   1.78   0.27   0.17   3.57    1.45
                   Final avg:  4.34

Open-Source GUI Agents (<8B)
UI-TARS-7B         Basic      15.4   41.1   21.8   21.2   13.2   39.0    20.1
                   Functional 20.5   41.1   25.5   26.5   16.0   45.1    24.3
                   Spatial     6.60  12.9   11.0    9.2    5.8   17.9     8.37
                   Final avg: 17.6

Open-Source GUI Agents (>8B)
UI-TARS-72B        Basic      30.7   48.2   32.7   33.6   21.9   51.2    31.4
                   Functional 29.8   46.4   30.9   34.1   22.6   36.6    30.5
                   Spatial    13.7   16.1   19.2   15.4   11.1   25.0    14.7
                   Final avg: 25.5

Table 1: Success rates (%) for the Basic, Functional, and Spatial settings across six domains: Ed (Education), Br (Browsers), De (Development), Pr (Productivity), Cr (Creativity), En (Entertainment). Per-domain sample sizes are listed below the column headers; "Final avg" is each model's overall average.
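
The Overall columns are consistent with sample-size-weighted means of the per-domain scores. For example, reweighting Claude-3.7-Sonnet's Basic row by the per-domain counts reproduces its reported 9.48 overall:

```python
# Sanity check: Overall ~= sample-size-weighted mean of the domain scores.
# Counts and scores are taken from Table 1 (Claude-3.7-Sonnet, Basic setting).
counts = [215, 56, 376, 605, 438, 82]           # Ed, Br, De, Pr, Cr, En
scores = [6.51, 12.5, 7.98, 11.24, 9.13, 11.0]

overall = sum(c * s for c, s in zip(counts, scores)) / sum(counts)
print(f"{overall:.2f}")  # 9.48, matching the reported Overall
```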

Citation

If you find UI-Vision useful in your research, please consider citing our paper:

@misc{nayak2025uivisiondesktopcentricguibenchmark,
  title={UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction}, 
  author={Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Juan A. Rodriguez and 
          Montek Kalsi and Rabiul Awal and Nicolas Chapados and M. Tamer Özsu and 
          Aishwarya Agrawal and David Vazquez and Christopher Pal and Perouz Taslakian and 
          Spandana Gella and Sai Rajeswar},
  year={2025},
  eprint={2503.15661},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.15661}, 
}