Desktop Agent Ops

Use this skill as a main-agent operating manual for desktop GUI tasks.

MANDATORY: Auto-setup gate (FIRST ACTION, every time)

python3 DIR>/scripts/firstrun_setup.py --check

If "ready": false, run setup (installs EVERYTHING automatically):

python3 DIR>/scripts/firstrun_setup.py

Auto-installs on first run:

1. Platform detection (macOS / Windows / Linux)
2. cliclick + tesseract (macOS via brew; Linux guide printed)
3. OCR language packs auto-detected from system locale (Chinese → chi_sim, Japanese → jpn, etc.)
4. Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
5. OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings
6. Smoke test (screenshot + mouse move verification)
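
The locale-to-language-pack step can be pictured as a small lookup. This is only a sketch: the table is hypothetical (the setup script's real mapping is not shown in this document), and always including eng is an assumption. The codes themselves (eng, chi_sim, chi_tra, jpn, kor) are standard Tesseract language names.

```python
# Hypothetical mapping for illustration; not the setup script's actual table.
LOCALE_TO_TESSERACT = {
    "zh": "chi_sim",
    "zh_CN": "chi_sim",
    "zh_TW": "chi_tra",
    "ja": "jpn",
    "ko": "kor",
}

def ocr_languages(locale_name: str) -> list[str]:
    """Return Tesseract language codes for a locale, always including English."""
    normalized = locale_name.split(".")[0]      # drop an encoding suffix such as ".UTF-8"
    language = normalized.split("_")[0]         # "ja_JP" -> "ja"
    extra = LOCALE_TO_TESSERACT.get(normalized) or LOCALE_TO_TESSERACT.get(language)
    return ["eng"] + ([extra] if extra else [])
```

The full locale is tried first so that regional variants (zh_TW) can override the language-level default.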

After setup, set $PY for ALL subsequent calls:

PY=AGENTOPS_PYTHON>

Do NOT proceed if setup is not ready.
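
A caller can enforce the gate by parsing the check output before doing anything else. The sketch below assumes the check prints a single JSON object with a "ready" key, which is what the "ready": false wording suggests; the function name is illustrative and not part of the skill's scripts.

```python
import json

def is_ready(check_output: str) -> bool:
    """Return True only when the check output is a JSON object with "ready": true."""
    try:
        report = json.loads(check_output)
    except json.JSONDecodeError:
        return False                      # unparseable output is treated as "not ready"
    return isinstance(report, dict) and report.get("ready") is True
```

Treating anything other than an explicit true as "not ready" keeps the gate conservative: a crash or partial output never lets the task proceed.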

Core Execution Loop

Every desktop task follows this loop. No exceptions.

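1. Auto-setup gate          ← run once per session
2. Init task context        ← create an isolated temp directory
3. Focus target app         ← bring the app to the foreground, confirm it is frontmost
4. Get window bounds        ← know the exact position and size
5. Capture that window      ← screenshot only the target window
6. Analyze the capture      ← read the screenshot or run OCR
7. Locate target via OCR    ← find text/button within window bounds
8. Verify before acting     ← move cursor, screenshot with cursor, confirm
9. Execute ONE action       ← click, type, scroll, or press a key
10. Capture again           ← screenshot to see the result
11. Verify result           ← did the UI change as expected?
    → if more steps remain, go to step 5
12. Cleanup                 ← delete the task temp directory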

Key principles:

- One action at a time. Never chain blind actions.
- Always verify after each action. If verification fails, recapture and retry; do NOT guess.
- Always work within a specific window. Never click based on full-screen assumptions.

Window-Scoped Targeting (THE CORRECT WAY)

NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.
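
To make the scoping rule concrete, here is a minimal sketch of filtering OCR hits down to one window. The bounds dictionary uses the {x, y, width, height} shape of logical window bounds; the hit records with cx/cy centers are a hypothetical shape chosen for illustration, not the actual OCR output format.

```python
def inside_window(x, y, bounds):
    """True when the point lies inside the window rectangle (logical coordinates)."""
    return (bounds["x"] <= x < bounds["x"] + bounds["width"]
            and bounds["y"] <= y < bounds["y"] + bounds["height"])

def scope_hits(hits, bounds):
    """Keep only OCR hits whose center falls inside the target window."""
    return [hit for hit in hits if inside_window(hit["cx"], hit["cy"], bounds)]

# Example data: the same label appears in two apps on screen.
wechat_bounds = {"x": 100, "y": 80, "width": 900, "height": 700}
hits = [
    {"text": "搜索", "cx": 180, "cy": 120},    # inside the WeChat window
    {"text": "搜索", "cx": 1500, "cy": 60},    # belongs to another app
]
scoped = scope_hits(hits, wechat_bounds)
```

With the example data only the first hit survives, so a click can no longer land in the wrong app.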

The 6-Step Pipeline

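┌────────────────────────────────────────────────────────────┐
│ Step 1: Focus the target app                               │
│   $PY desktop_ops.py focus-app --name "AppName"            │
│   → brings the app to the foreground                       │
├────────────────────────────────────────────────────────────┤
│ Step 2: Get window bounds                                  │
│   $PY desktop_ops.py front-window-bounds --app "AppName"   │
│   → returns logical coordinates {x, y, width, height}      │
├────────────────────────────────────────────────────────────┤
│ Step 3: Capture only that window                           │
│   $PY desktop_ops.py capture-region --x X --y Y            │
│     --width W --height H --output /tmp/window.png          │
├────────────────────────────────────────────────────────────┤
│ Step 4: OCR within the window                              │
│   $PY ocr_text.py --app "AppName" --python $PY             │
│   → abs_box coordinates are inside the window              │
├────────────────────────────────────────────────────────────┤
│ Step 5: Verify before clicking                             │
│   $PY desktop_ops.py move --x TX --y TY                    │
│   $PY desktop_ops.py screenshot --with-cursor              │
│   → confirm the cursor is on the correct element           │
├────────────────────────────────────────────────────────────┤
│ Step 6: Click only after verification                      │
│   $PY desktop_ops.py click --x TX --y TY                   │
│   $PY desktop_ops.py screenshot → verify the result        │
└────────────────────────────────────────────────────────────┘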

Shortcut (RECOMMENDED for most targeting):

$PY scripts/target_resolver.py --app "AppName" --text "button text" --python $PY

This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.
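
A caller would typically refuse to click unless the candidate is flagged as inside the window. The sketch below assumes the resolver prints JSON with a top-level best_candidate object carrying x, y and within_window; only those field names come from this document, while the surrounding JSON layout and the function name are assumptions.

```python
import json

def click_point(resolver_output: str):
    """Return (x, y) for the best candidate, or None if it is missing or outside the window."""
    result = json.loads(resolver_output)
    candidate = result.get("best_candidate")
    if not candidate or not candidate.get("within_window"):
        return None                       # nothing safe to click; recapture instead
    return candidate["x"], candidate["y"]
```

Returning None rather than raw coordinates forces the caller back into the recapture path whenever the window check fails.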

Why window-scoped matters:

Approach	Risk
❌ Full-screen OCR	"搜索" in WeChat AND Chrome → clicks wrong app
✅ Window-scoped	"搜索" ONLY in WeChat window → correct click

Failure Recovery (CRITICAL)

When something fails, follow these rules:

OCR finds nothing

1. Re-focus the app: focus-app --name "AppName"
2. Re-get bounds: front-window-bounds --app "AppName" (window may have moved/resized)
3. Take a fresh screenshot and read it visually
4. Try a different region label (e.g. content_area instead of bottom_input)
5. Try lowering OCR confidence: --min-conf 30

Click doesn't work

1. Screenshot with cursor to check cursor position
2. The window may have moved, so re-get bounds
3. Try clicking a few pixels offset from the OCR center
4. Check if a dialog/popup is blocking the target
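
The "few pixels offset" idea can be sketched as a short list of alternative click points that never leave the window. The step size of 4 logical pixels and the order of offsets are arbitrary choices for illustration.

```python
def offset_candidates(cx, cy, bounds, step=4):
    """Return the OCR center first, then nearby points, skipping any outside the window."""
    offsets = [(0, 0), (step, 0), (-step, 0), (0, step), (0, -step)]
    points = []
    for dx, dy in offsets:
        x, y = cx + dx, cy + dy
        inside_x = bounds["x"] <= x < bounds["x"] + bounds["width"]
        inside_y = bounds["y"] <= y < bounds["y"] + bounds["height"]
        if inside_x and inside_y:
            points.append((x, y))
    return points
```

Each candidate still counts as one attempt: recapture and verify after every click rather than firing the whole list blindly.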

App state changed (login screen, dialog, etc.)

1. ALWAYS re-get window bounds after any major UI change
2. ALWAYS re-run OCR after navigation or state change
3. Never reuse old coordinates; they may be stale

General retry rule

- Maximum 3 retries per action
- Each retry must recapture fresh state
- If 3 retries fail, report the failure with screenshots and stop
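
The retry rule can be sketched as a loop that recaptures state on every attempt. This reading counts one initial attempt plus up to three retries, which is one interpretation of "maximum 3 retries"; capture, act and verify are caller-supplied callables, hypothetical stand-ins for the real screenshot, action and check steps.

```python
def run_with_retries(capture, act, verify, max_retries=3):
    """Attempt an action once, then retry up to max_retries times with fresh state each time."""
    evidence = []
    for attempt in range(1 + max_retries):
        state = capture()                 # fresh bounds / screenshot / OCR on every attempt
        act(state)
        after = capture()                 # capture again to see the result
        evidence.append(after)
        if verify(after):
            return {"ok": True, "attempts": attempt + 1}
    # All attempts failed: hand back the captures so the failure can be reported.
    return {"ok": False, "attempts": len(evidence), "evidence": evidence}
```

On failure the collected captures are returned instead of raising, so the agent can attach them to its report and stop.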

Generalization: How to Apply This to ANY App

The pipeline works for any desktop application. Here is how to reason about new apps:

Step-by-step for ANY new app:

1. Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
2. Focus and get bounds: this tells you the window's exact position
3. Screenshot the window: look at what's on screen
4. Identify the target: what text, button, or area do you need to interact with?
5. Use OCR to find it: target_resolver.py --app "AppName" --text "target text"
6. Verify and click
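
The steps above can be sketched as argument lists ready for a process launcher. Subcommand and flag names are the ones used in this document; the helper names and the scripts/ path prefix for desktop_ops.py are assumptions. Argument lists avoid shell-quoting problems with app names that contain spaces, such as "Google Chrome".

```python
def locate_commands(py, app, text):
    """Argv lists for: focus the app, read its bounds, then resolve the target text."""
    ops = [py, "scripts/desktop_ops.py"]
    return [
        ops + ["focus-app", "--name", app],
        ops + ["front-window-bounds", "--app", app],
        [py, "scripts/target_resolver.py", "--app", app, "--text", text, "--python", py],
    ]

def verify_and_click_commands(py, x, y):
    """Argv lists for: move the cursor, screenshot with cursor, then click."""
    ops = [py, "scripts/desktop_ops.py"]
    point = ["--x", str(x), "--y", str(y)]
    return [
        ops + ["move"] + point,
        ops + ["screenshot", "--with-cursor"],
        ops + ["click"] + point,
    ]
```

These lists are meant to be run one command at a time; the cursor screenshot has to be inspected before the click is issued.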

Common patterns across apps:

Task	How to do it
Click a button	OCR find text → verify → click
Type in a field

App-specific adaptations:

App type	Special considerations
Chat apps (WeChat, Slack, etc.)	Verify conversation title before typing; use `insert-newline` for multi-line; verify send mechanism
Browsers (Chrome, Safari, etc.)	Address bar at top; content area varies; may need to handle tabs
System Settings	Deep navigation; panels change; re-get bounds after each navigation
File managers (Finder, Explorer)	Sidebar + content area; double-click to open; path bar for navigation
Editors (VS Code, TextEdit, etc.)	Tab bar + editor area; use hotkeys for save/undo; type in editor area

Text Input and Send Rules

Typing text

$PY scripts/desktop_ops.py type --text "your message"

- Uses clipboard paste as the primary method on all platforms, which is reliable for all languages including CJK
- macOS: AppleScript set the clipboard to, then Cmd+V (single osascript call)
- Windows: PowerShell Set-Clipboard + Ctrl+V (falls back to clip.exe)
- Linux: xclip + INLINECODE22
- Click the input field first to focus it before typing

Multi-line messages

$PY scripts/desktop_ops.py type --text "first line"
$PY scripts/desktop_ops.py insert-newline
$PY scripts/desktop_ops.py type --text "second line"

- Use insert-newline for literal line breaks
- Do NOT use \n in type --text; it may trigger send in some apps
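
A message that already contains line breaks can be split mechanically into the type / insert-newline sequence shown above. This is a minimal sketch; the function name is illustrative, and each returned list holds the subcommand arguments that would follow $PY scripts/desktop_ops.py.

```python
def message_commands(text):
    """Split a message on newlines into alternating type / insert-newline subcommands."""
    commands = []
    lines = text.split("\n")
    for index, line in enumerate(lines):
        if line:                                   # an empty segment has nothing to type
            commands.append(["type", "--text", line])
        if index < len(lines) - 1:                 # one line break between adjacent segments
            commands.append(["insert-newline"])
    return commands
```

No argument produced this way ever contains a raw newline, so the text can never trigger an accidental send.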

Sending a message

1. Preferred: Look for a visible send button (e.g., 发送) via OCR, then click it
2. Alternative: Use press --key return ONLY when the app is verified to use Enter-to-send
3. Never guess which send method to use; verify first
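
The send rules reduce to a small decision function. In this sketch send_button is the (x, y) of a send button located via OCR, or None, and enter_to_send_verified is a flag the caller sets only after confirming the app's behavior; both parameter names are illustrative.

```python
def choose_send_action(send_button, enter_to_send_verified):
    """Pick a send action per the rules above; return None when nothing is verified."""
    if send_button is not None:
        x, y = send_button
        return ["click", "--x", str(x), "--y", str(y)]     # preferred: visible send button
    if enter_to_send_verified:
        return ["press", "--key", "return"]                # only when Enter-to-send is confirmed
    return None                                            # do not guess; verify first
```

A None result means the agent should stop and investigate the send mechanism instead of pressing keys speculatively.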

Backend priority (macOS)

Operation	Primary	Fallback
INLINECODE28	Clipboard paste	cliclick (ASCII only)
INLINECODE29	AppleScript key code	cliclick kp:

Important: cliclick kp:return is NOT recognized by WeChat — always use AppleScript for key press.
Important: cliclick t: silently drops CJK characters — always use clipboard paste for text input.
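
The ASCII-only restriction on the cliclick fallback can be expressed as a simple guard. The backend labels below are illustrative names, not identifiers from the scripts.

```python
def typing_backends(text):
    """Return the allowed macOS typing backends for this text, in priority order."""
    backends = ["clipboard_paste"]            # primary method for every language
    if text.isascii():
        backends.append("cliclick")           # fallback is safe only for pure-ASCII text
    return backends
```

For CJK text the list has a single entry, so a clipboard failure surfaces as an error instead of silently dropping characters.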

DPI / HiDPI / Retina (All Platforms)

Handled automatically. No manual DPI work needed.

Platform	Common scales	Detection method
macOS Retina	2.0x	screenshot pixels ÷ logical screen bounds
Windows HiDPI

OCR output: box = logical (use for mouse), pixel_box = raw pixels, dpi_scale = factor.
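
Although the conversion is automatic, the arithmetic behind it is worth seeing once. The sketch assumes boxes are (x, y, width, height) tuples, which is a guess at the layout rather than the documented format, and the function names are illustrative.

```python
def dpi_scale(screenshot_width_px, logical_width):
    """Scale factor = screenshot pixels divided by logical screen width."""
    return screenshot_width_px / logical_width

def to_logical(pixel_box, scale):
    """Convert a raw pixel box into logical coordinates usable for mouse actions."""
    x, y, w, h = pixel_box
    return (x / scale, y / scale, w / scale, h / scale)

def center(box):
    """Center point of a box, the usual click target."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

scale = dpi_scale(2880, 1440)                       # a 2x Retina display
logical = to_logical((200, 100, 80, 40), scale)
target = center(logical)
```

Here a 2880-pixel-wide capture of a 1440-point screen gives a factor of 2.0, so a pixel box at (200, 100) maps to a logical box at (100, 50) and a click target of (120, 60).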

CLI Quick Reference (EXACT parameter names)

CRITICAL: Use EXACTLY these names. Do NOT guess.

desktop_ops.py

CODEBLOCK8

ocr_text.py

CODEBLOCK9

target_resolver.py

CODEBLOCK10

taskcontext.py / cleanuptask.py

CODEBLOCK11

window_regions.py

CODEBLOCK12

Labels: top_search, left_sidebar, left_sidebar_top, title_header, content_area, toolbar_row, bottom_input, primary_action

Workflow Examples

Example 1: Click a button by text (any app)

CODEBLOCK13

Example 2: Type and search

CODEBLOCK14

Example 3: Send a chat message (WeChat, Slack, etc.)

CODEBLOCK15

Example 4: Scroll a list and find an item

CODEBLOCK16

Example 5: Handle an unexpected dialog

CODEBLOCK17

Reference Documents

Load as needed:

Document	When to read
INLINECODE48	Core 8-step closed loop
INLINECODE49

Scope

Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API.

Hard Rules

1. Always run auto-setup gate first
2. Always use EXACT parameter names from CLI reference; never guess
3. Always scope OCR to the target app window; NEVER full-screen OCR
4. Always: focus-app → front-window-bounds → OCR within window → verify → act
5. Always pass --python $PY to ocr_text.py and target_resolver.py
6. Always verify coordinates are within window bounds before clicking
7. Always re-get window bounds after any UI state change (login, dialog, navigation)
8. Use insert-newline for line breaks; never use \n in type --text
9. For send actions: prefer visible send button; use press --key return only when verified
10. One action at a time; verify after each
11. Maximum 3 retries per action; each retry must recapture fresh state
12. Cleanup is mandatory at task end
13. If verification fails, recapture and rebuild; do not retry blindly

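Desktop Agent Ops

Use this skill as a main-agent operating manual for desktop GUI tasks.

MANDATORY: Auto-setup gate (first action of every session)

python3 DIR>/scripts/firstrun_setup.py --check

If ready: false, run setup (installs everything automatically):

python3 DIR>/scripts/firstrun_setup.py

Auto-installs on first run:

1. Platform detection (macOS / Windows / Linux)
2. cliclick + tesseract (installed via brew on macOS; a guide is printed on Linux)
3. OCR language packs auto-detected from the system locale (Chinese → chi_sim, Japanese → jpn, etc.)
4. Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
5. OS permissions (Screen Recording, Accessibility, Automation), with System Settings opened automatically
6. Smoke test (screenshot + mouse move verification)

After setup, set $PY for all subsequent calls:

PY=AGENTOPS_PYTHON>

Do not proceed if setup is not ready.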

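Core Execution Loop

Every desktop task follows this loop. No exceptions.

1. Auto-setup gate          ← run once per session
2. Init task context        ← create an isolated temp directory
3. Focus target app         ← bring the app to the foreground, confirm it is frontmost
4. Get window bounds        ← know the exact position and size
5. Capture that window      ← screenshot only the target window
6. Analyze the capture      ← read the screenshot or run OCR
7. Locate target via OCR    ← find text/button within window bounds
8. Verify before acting     ← move cursor, screenshot with cursor, confirm
9. Execute one action       ← click, type, scroll, or press a key
10. Capture again           ← screenshot to see the result
11. Verify result           ← did the UI change as expected?
    → if more steps remain, go to step 5
12. Cleanup                 ← delete the task temp directory

Key principles:

- One action at a time. Never chain blind actions.
- Always verify after each action. If verification fails, recapture and retry; do not guess.
- Always work within a specific window. Never click based on full-screen assumptions.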

Window-Scoped Targeting (the correct way)

Never do OCR or clicking on a full-screen screenshot. Always scope to the target app window.

The 6-Step Pipeline

┌────────────────────────────────────────────────────────────┐
│ Step 1: Focus the target app                               │
│   $PY desktop_ops.py focus-app --name "AppName"            │
│   → brings the app to the foreground                       │
├────────────────────────────────────────────────────────────┤
│ Step 2: Get window bounds                                  │
│   $PY desktop_ops.py front-window-bounds --app "AppName"   │
│   → returns logical coordinates {x, y, width, height}      │
├────────────────────────────────────────────────────────────┤
│ Step 3: Capture only that window                           │
│   $PY desktop_ops.py capture-region --x X --y Y            │
│     --width W --height H --output /tmp/window.png          │
├────────────────────────────────────────────────────────────┤
│ Step 4: OCR within the window                              │
│   $PY ocr_text.py --app "AppName" --python $PY             │
│   → abs_box coordinates are inside the window              │
├────────────────────────────────────────────────────────────┤
│ Step 5: Verify before clicking                             │
│   $PY desktop_ops.py move --x TX --y TY                    │
│   $PY desktop_ops.py screenshot --with-cursor              │
│   → confirm the cursor is on the correct element           │
├────────────────────────────────────────────────────────────┤
│ Step 6: Click only after verification                      │
│   $PY desktop_ops.py click --x TX --y TY                   │
│   $PY desktop_ops.py screenshot → verify the result        │
└────────────────────────────────────────────────────────────┘

Shortcut (recommended for most targeting):

$PY scripts/target_resolver.py --app "AppName" --text "button text" --python $PY

This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.

Why window-scoped matters:

Approach	Risk
❌ Full-screen OCR	"搜索" in WeChat AND Chrome → clicks wrong app
✅ Window-scoped	"搜索" ONLY in WeChat window → correct click

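Failure Recovery (critical)

When something fails, follow these rules:

OCR finds nothing

1. Re-focus the app: focus-app --name "AppName"
2. Re-get bounds: front-window-bounds --app "AppName" (window may have moved/resized)
3. Take a fresh screenshot and read it visually
4. Try a different region label (e.g. content_area instead of bottom_input)
5. Lower the OCR confidence: --min-conf 30

Click doesn't work

1. Screenshot with cursor to check cursor position
2. The window may have moved, so re-get bounds
3. Try clicking a few pixels offset from the OCR center
4. Check if a dialog/popup is blocking the target

App state changed (login screen, dialog, etc.)

1. Always re-get window bounds after any major UI change
2. Always re-run OCR after navigation or state change
3. Never reuse old coordinates; they may be stale

General retry rule

- Maximum 3 retries per action
- Each retry must recapture fresh state
- If 3 retries fail, report the failure with screenshots and stop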

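Generalization: How to Apply This to Any App

The pipeline works for any desktop application. Here is how to approach a new app:

Step-by-step for any new app:

1. Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
2. Focus and get bounds: this tells you the window's exact position
3. Screenshot the window: look at what's on screen
4. Identify the target: what text, button, or area do you need to interact with?
5. Use OCR to find it: target_resolver.py --app "AppName" --text "target text"
6. Verify and click

Common patterns across apps:

Task	How to do it
Click a button	OCR find text → verify → click
Type in a field

App-specific adaptations:

App type	Special considerations
Chat apps (WeChat, Slack, etc.)	Verify conversation title before typing; use insert-newline for line breaks; verify send mechanism
Browsers (Chrome, Safari, etc.)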

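Text Input and Send Rules

Typing text

$PY scripts/desktop_ops.py type --text "your message"

- Uses clipboard paste as the primary method on all platforms, which is reliable for all languages including CJK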
