Desktop Agent Ops
Use this skill as a main-agent operating manual for desktop GUI tasks.
MANDATORY: Auto-setup gate (FIRST ACTION, every time)
CODEBLOCK0
If "ready": false, run setup (installs EVERYTHING automatically):
CODEBLOCK1
Auto-installs on first run:
1. Platform detection (macOS / Windows / Linux)
2. `cliclick` + `tesseract` (macOS via brew; Linux guide printed)
3. OCR language packs auto-detected from system locale (中文→chi_sim, 日本語→jpn, etc.)
4. Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
5. OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings
6. Smoke test (screenshot + mouse move verification)
After setup, set $PY for ALL subsequent calls:
CODEBLOCK2
Do NOT proceed if setup is not ready.
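The gate can be enforced in code rather than by reading the output by eye. This is a minimal sketch, assuming the check command prints a single JSON object with a boolean `ready` field; the manual only shows `"ready": false`, so the rest of the output shape is an assumption, and `parse_ready` is a hypothetical helper name.

```python
import json


def parse_ready(check_output: str) -> bool:
    """Return True only if the output is a JSON object with "ready": true.

    Assumption: the setup check prints one JSON object. Output that cannot
    be parsed, or that lacks an explicit true value, counts as not ready.
    """
    try:
        data = json.loads(check_output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("ready") is True


print(parse_ready('{"ready": false}'))  # False
print(parse_ready('{"ready": true}'))   # True
```

Treating unparseable output as not ready keeps the gate strict: the task stops unless readiness is stated explicitly.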
Core Execution Loop
Every desktop task follows this loop. No exceptions.
CODEBLOCK3
Key principles:
- One action at a time. Never chain blind actions.
- Always verify after each action. If verification fails, recapture and retry — do NOT guess.
- Always work within a specific window. Never click based on full-screen assumptions.
Window-Scoped Targeting (THE CORRECT WAY)
NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.
The 6-Step Pipeline
CODEBLOCK4
Shortcut (RECOMMENDED for most targeting):
CODEBLOCK5
This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.
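A caller should still check `within_window` before clicking. The sketch below assumes the resolver prints JSON with a `best_candidate` object holding `x`, `y`, and `within_window`; only those three field names come from this manual, and `click_args_from_resolver` is a hypothetical helper.

```python
import json
from typing import List, Optional


def click_args_from_resolver(output: str) -> Optional[List[str]]:
    """Return click arguments for the best candidate, or None if unsafe."""
    result = json.loads(output)
    candidate = result.get("best_candidate")
    if not candidate or not candidate.get("within_window"):
        return None  # no match, or the match lies outside the target window
    return ["click", "--x", str(candidate["x"]), "--y", str(candidate["y"])]


# Hypothetical resolver output for illustration.
sample = '{"best_candidate": {"x": 412, "y": 96, "within_window": true}}'
print(click_args_from_resolver(sample))  # ['click', '--x', '412', '--y', '96']
```

The returned list is meant to be appended to `$PY scripts/desktop_ops.py`; a `None` result means recapture and retry instead of clicking.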
Why window-scoped matters:
| Approach | Risk |
|---|---|
| ❌ Full-screen OCR | "搜索" in WeChat AND Chrome → clicks wrong app |
| ✅ Window-scoped | "搜索" ONLY in WeChat window → correct click |
Failure Recovery (CRITICAL)
When something fails, follow these rules:
OCR finds nothing
1. Re-focus the app: `focus-app --name "AppName"`
2. Re-get bounds: `front-window-bounds --app "AppName"` (window may have moved/resized)
3. Take a fresh screenshot and read it visually
4. Try a different region label (e.g. `content_area` instead of `bottom_input`)
5. Try lowering OCR confidence: `--min-conf 30`
Click doesn't work
1. Screenshot with cursor to check cursor position
2. The window may have moved; re-get bounds
3. Try clicking a few pixels offset from the OCR center
4. Check if a dialog/popup is blocking the target
App state changed (login screen, dialog, etc.)
1. ALWAYS re-get window bounds after any major UI change
2. ALWAYS re-run OCR after navigation or state change
3. Never reuse old coordinates; they may be stale
General retry rule
- Maximum 3 retries per action
- Each retry must recapture fresh state
- If 3 retries fail, report the failure with screenshots and stop
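The retry rule can be sketched as a small loop. This is an illustration under stated assumptions, not the skill's implementation: `capture`, `act`, and `verify` are stand-in callables supplied by the caller, and the limit is read as three attempts in total.

```python
from typing import Any, Callable, List


def run_with_retries(capture: Callable[[], Any],
                     act: Callable[[Any], None],
                     verify: Callable[[], bool],
                     max_attempts: int = 3) -> bool:
    """Attempt one action up to max_attempts times, recapturing each time."""
    for _ in range(max_attempts):
        state = capture()   # never reuse state from an earlier attempt
        act(state)          # exactly one action per attempt
        if verify():
            return True
    return False            # caller reports the failure with screenshots and stops


# Demo with stand-in callables: verification succeeds on the second attempt.
log: List[str] = []
outcomes = iter([False, True])
ok = run_with_retries(
    capture=lambda: log.append("capture") or len(log),
    act=lambda state: log.append("act"),
    verify=lambda: next(outcomes),
)
print(ok, log)  # True ['capture', 'act', 'capture', 'act']
```

Putting `capture()` inside the loop is the point of the design: every attempt starts from fresh state, so stale coordinates cannot carry over.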
Generalization: How to Apply This to ANY App
The pipeline works for any desktop application. Here is how to reason about new apps:
Step-by-step for ANY new app:
1. Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
2. Focus and get bounds: this tells you the window's exact position
3. Screenshot the window and look at what's on screen
4. Identify the target: what text, button, or area do you need to interact with?
5. Use OCR to find it: `target_resolver.py --app "AppName" --text "target text"`
6. Verify and click
Common patterns across apps:
| Task | How to do it |
|---|---|
| Click a button | OCR find text → verify → click |
| Type in a field | OCR find field label → click field → `type --text` |
| Search for something | OCR find search box → click → type query → press return |
| Scroll a list | Get window bounds → scroll at window center with `--x --y` |
| Switch between apps | `focus-app --name "OtherApp"` → re-get bounds |
| Handle a dialog | Screenshot → OCR for dialog buttons → click appropriate one |
| Navigate menus | Click menu item → wait → screenshot → OCR new menu → click |
| Select from dropdown | Click dropdown → wait → OCR options → click selection |
| Read screen content | OCR the window → extract all text boxes |
| Verify an action | Screenshot before and after → compare or OCR for expected text |
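For the scroll pattern, the window center follows directly from the bounds. A minimal sketch, assuming bounds arrive as a dict with `x`, `y`, `width`, and `height` in logical coordinates; the helper names are hypothetical, and the scroll subcommand name and amount flag are left out because this section does not show them.

```python
from typing import Dict, List, Tuple


def window_center(bounds: Dict[str, int]) -> Tuple[int, int]:
    """Center point of a window given logical bounds {x, y, width, height}."""
    return (bounds["x"] + bounds["width"] // 2,
            bounds["y"] + bounds["height"] // 2)


def center_point_args(bounds: Dict[str, int]) -> List[str]:
    """Build the --x/--y arguments that aim an action at the window center."""
    cx, cy = window_center(bounds)
    return ["--x", str(cx), "--y", str(cy)]


bounds = {"x": 100, "y": 80, "width": 900, "height": 700}
print(window_center(bounds))      # (550, 430)
print(center_point_args(bounds))  # ['--x', '550', '--y', '430']
```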
App-specific adaptations:
| App type | Special considerations |
|---|---|
| Chat apps (WeChat, Slack, etc.) | Verify conversation title before typing; use `insert-newline` for multi-line; verify send mechanism |
| Browsers (Chrome, Safari, etc.) | Address bar at top; content area varies; may need to handle tabs |
| System Settings | Deep navigation; panels change; re-get bounds after each navigation |
| File managers (Finder, Explorer) | Sidebar + content area; double-click to open; path bar for navigation |
| Editors (VS Code, TextEdit, etc.) | Tab bar + editor area; use hotkeys for save/undo; type in editor area |
Text Input and Send Rules
Typing text
```bash
$PY scripts/desktop_ops.py type --text "your message"
```
- Uses clipboard paste as the primary method on all platforms; reliable for all languages, including CJK
- macOS: `set the clipboard to` + Cmd+V (single osascript call)
- Windows: PowerShell `Set-Clipboard` + Ctrl+V (falls back to `clip.exe`)
- Linux: `xclip` + INLINECODE22
- First click on the input field to focus it before typing
Multi-line messages
```bash
$PY scripts/desktop_ops.py type --text "first line"
$PY scripts/desktop_ops.py insert-newline
$PY scripts/desktop_ops.py type --text "second line"
```
- Use `insert-newline` for literal line breaks
- Do NOT use `\n` in `type --text`; it may trigger send in some apps
Sending a message
1. Preferred: Look for a visible send button (e.g., 发送) via OCR, then click it
2. Alternative: Use `press --key return` ONLY when the app is verified to use Enter-to-send
3. Never guess which send method to use; verify first
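The typing rules above (one `type` call per line, `insert-newline` between lines, no send step until the mechanism is verified) can be sketched as a command builder. `message_commands` is a hypothetical helper; it only uses the `type --text` and `insert-newline` forms shown in this section and returns argument lists without running anything.

```python
from typing import List


def message_commands(text: str) -> List[List[str]]:
    """Split a message on line breaks into type / insert-newline arguments."""
    commands: List[List[str]] = []
    for i, line in enumerate(text.splitlines()):
        if i > 0:
            commands.append(["insert-newline"])   # literal line break
        if line:
            commands.append(["type", "--text", line])
    return commands


for args in message_commands("first line\nsecond line"):
    print(args)
```

Each list is meant to be appended to `$PY scripts/desktop_ops.py` and run one at a time, with verification between calls. Sending is deliberately not part of the builder.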
Backend priority (macOS)
| Operation | Primary | Fallback |
|---|---|---|
| INLINECODE28 | Clipboard paste | cliclick (ASCII only) |
| INLINECODE29 | AppleScript `key code` | cliclick `kp:` |
| `hotkey` | cliclick `kd:`/`t:`/`ku:` | pyautogui |
| `click` | cliclick | pyautogui |
Important: cliclick kp:return is NOT recognized by WeChat — always use AppleScript for key press.
Important: cliclick t: silently drops CJK characters — always use clipboard paste for text input.
DPI / HiDPI / Retina (All Platforms)
Handled automatically. No manual DPI work needed.
| Platform | Common scales | Detection method |
|---|---|---|
| macOS Retina | 2.0x | screenshot pixels ÷ logical screen bounds |
| Windows HiDPI | 1.25x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |
| Linux X11 | 1.0x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |
OCR output: box = logical (use for mouse), pixel_box = raw pixels, dpi_scale = factor.
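The relationship between `pixel_box`, `box`, and `dpi_scale` is a single division. The sketch below illustrates it under stated assumptions: boxes are taken to be `(x, y, width, height)` tuples, which this section does not confirm, and all function names are hypothetical. The tools already do this conversion, so the code is only for reasoning about coordinates.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)


def dpi_scale(screenshot_px_width: int, logical_width: int) -> float:
    """Scale factor = screenshot pixels ÷ logical screen width."""
    return screenshot_px_width / logical_width


def to_logical(pixel_box: Box, scale: float) -> Box:
    """Convert a raw-pixel box to logical coordinates for mouse actions."""
    x, y, w, h = pixel_box
    return (round(x / scale), round(y / scale), round(w / scale), round(h / scale))


def center(box: Box) -> Tuple[int, int]:
    """Click point at the middle of a logical box."""
    x, y, w, h = box
    return (x + w // 2, y + h // 2)


scale = dpi_scale(2880, 1440)                     # Retina-style display
logical = to_logical((400, 300, 120, 40), scale)
print(scale, logical, center(logical))            # 2.0 (200, 150, 60, 20) (230, 160)
```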
CLI Quick Reference (EXACT parameter names)
CRITICAL: Use EXACTLY these names. Do NOT guess.
desktop_ops.py
CODEBLOCK8
ocr_text.py
CODEBLOCK9
target_resolver.py
CODEBLOCK10
taskcontext.py / cleanuptask.py
CODEBLOCK11
window_regions.py
CODEBLOCK12
Labels: top_search, left_sidebar, left_sidebar_top, title_header, content_area, toolbar_row, bottom_input, primary_action
Workflow Examples
Example 1: Click a button by text (any app)
CODEBLOCK13
Example 2: Type and search
CODEBLOCK14
Example 3: Send a chat message (WeChat, Slack, etc.)
CODEBLOCK15
Example 4: Scroll a list and find an item
CODEBLOCK16
Example 5: Handle an unexpected dialog
CODEBLOCK17
Reference Documents
Load as needed:
| Document | When to read |
|---|---|
| INLINECODE48 | Core 8-step closed loop |
| INLINECODE49 | macOS-specific tools and permissions |
| `references/platform-windows.md` | Windows setup |
| `references/platform-linux.md` | Linux X11/Wayland setup |
| `references/operation-patterns.md` | Reusable task templates |
| `references/validation-patterns.md` | Two-stage validation |
| `references/precise-targeting.md` | 5-layer precision targeting |
| `references/target-providers.md` | Provider ordering and fallback contract |
| `references/coordinate-reconstruction.md` | Rebuild click coordinates from screenshot evidence |
| `references/chat-app-macos.md` | Chat app workflow |
| `references/app-wechat-desktop.md` | Cross-platform WeChat guidance |
| `references/cleanup-rules.md` | Cleanup timing and scope |
| `references/collaboration-rules.md` | When multi-agent collaboration is justified |
| `references/example-cases.md` | Repeatable task examples |
| `references/reproducible-setup.md` | Host bring-up checklist |
Scope
Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API.
Hard Rules
1. Always run auto-setup gate first
2. Always use EXACT parameter names from CLI reference; never guess
3. Always scope OCR to the target app window; NEVER full-screen OCR
4. Always: focus-app → front-window-bounds → OCR within window → verify → act
5. Always pass `--python $PY` to `ocr_text.py` and `target_resolver.py`
6. Always verify coordinates are within window bounds before clicking
7. Always re-get window bounds after any UI state change (login, dialog, navigation)
8. Use `insert-newline` for line breaks; never use `\n` in `type --text`
9. For send actions: prefer visible send button; use `press --key return` only when verified
10. One action at a time; verify after each
11. Maximum 3 retries per action; each retry must recapture fresh state
12. Cleanup is mandatory at task end
13. If verification fails, recapture and rebuild; do not retry blindly
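Desktop Agent Ops
Use this skill as the main-agent operating manual for desktop GUI tasks.
MANDATORY: Auto-setup gate (first action of every session)
```bash
python3 DIR>/scripts/firstrun_setup.py --check
```
If "ready": false, run setup (installs everything automatically):
```bash
python3 DIR>/scripts/firstrun_setup.py
```
Auto-installs on first run:
1. Platform detection (macOS / Windows / Linux)
2. `cliclick` + `tesseract` (macOS via brew; Linux guide printed)
3. OCR language packs auto-detected from system locale (中文→chi_sim, 日本語→jpn, etc.)
4. Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
5. OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings
6. Smoke test (screenshot + mouse move verification)
After setup, set $PY for ALL subsequent calls:
```bash
PY=AGENTOPS_PYTHON>
```
Do NOT proceed if setup is not ready.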
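Core Execution Loop
Every desktop task follows this loop. No exceptions.
```
 1. Auto-setup gate          ← run once per session
 2. Init task context        ← create an isolated temp dir
 3. Focus target app         ← bring the app to the front, confirm it is frontmost
 4. Get window bounds        ← know the exact position and size
 5. Capture that window      ← screenshot only the target window
 6. Analyze the capture      ← read the screenshot or run OCR
 7. Locate target via OCR    ← find text/button within window bounds
 8. Verify before acting     ← move cursor, screenshot with cursor, confirm
 9. Execute ONE action       ← click, type, scroll, press key
10. Capture again            ← screenshot to see the result
11. Verify the result        ← did the UI change as expected?
    → if more steps remain, go to step 5
12. Cleanup                  ← delete the task temp dir
```
Key principles:
- One action at a time. Never chain blind actions.
- Always verify after each action. If verification fails, recapture and retry; do NOT guess.
- Always work within a specific window. Never click based on full-screen assumptions.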
Window-Scoped Targeting (THE CORRECT WAY)
NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.
The 6-Step Pipeline
```
┌────────────────────────────────────────────────────────────┐
│  Step 1: Focus the target app                              │
│    $PY desktop_ops.py focus-app --name "AppName"           │
│    → brings the app to the front                           │
├────────────────────────────────────────────────────────────┤
│  Step 2: Get window bounds                                 │
│    $PY desktop_ops.py front-window-bounds --app "AppName"  │
│    → returns logical coordinates {x, y, width, height}     │
├────────────────────────────────────────────────────────────┤
│  Step 3: Capture only that window                          │
│    $PY desktop_ops.py capture-region --x X --y Y           │
│      --width W --height H --output /tmp/window.png         │
├────────────────────────────────────────────────────────────┤
│  Step 4: OCR within the window                             │
│    $PY ocr_text.py --app "AppName" --python $PY            │
│    → abs_box coordinates lie inside the window             │
├────────────────────────────────────────────────────────────┤
│  Step 5: Verify before clicking                            │
│    $PY desktop_ops.py move --x TX --y TY                   │
│    $PY desktop_ops.py screenshot --with-cursor             │
│    → confirm the cursor is on the correct element          │
├────────────────────────────────────────────────────────────┤
│  Step 6: Click only after verification passes              │
│    $PY desktop_ops.py click --x TX --y TY                  │
│    $PY desktop_ops.py screenshot → verify the result       │
└────────────────────────────────────────────────────────────┘
```
Shortcut (RECOMMENDED for most targeting):
```bash
$PY scripts/target_resolver.py --app "AppName" --text "button text" --python $PY
```
This single command focuses the app, gets its bounds, runs OCR within the window, and returns `best_candidate` with `{x, y, within_window}`.
Why window-scoped matters:
| Approach | Risk |
|---|---|
| ❌ Full-screen OCR | "搜索" in WeChat AND Chrome → clicks wrong app |
| ✅ Window-scoped | "搜索" ONLY in WeChat window → correct click |
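The difference between the two rows comes down to a bounds check. A minimal sketch with hypothetical data: OCR hits are assumed to be dicts with a center point `x`, `y` in logical coordinates, and bounds use the `{x, y, width, height}` shape returned by `front-window-bounds`. The helper names are illustrative only.

```python
from typing import Dict, List

Bounds = Dict[str, float]  # {"x", "y", "width", "height"} in logical coordinates


def inside(bounds: Bounds, x: float, y: float) -> bool:
    """True if the point (x, y) lies inside the window rectangle."""
    return (bounds["x"] <= x < bounds["x"] + bounds["width"]
            and bounds["y"] <= y < bounds["y"] + bounds["height"])


def hits_in_window(hits: List[Dict], bounds: Bounds) -> List[Dict]:
    """Keep only OCR hits whose center falls inside the target window."""
    return [h for h in hits if inside(bounds, h["x"], h["y"])]


# Hypothetical data: the same label matched in two different apps.
wechat = {"x": 100, "y": 80, "width": 900, "height": 700}
hits = [
    {"text": "搜索", "x": 180, "y": 110},   # inside the WeChat window
    {"text": "搜索", "x": 1400, "y": 60},   # belongs to another app
]
print(hits_in_window(hits, wechat))  # [{'text': '搜索', 'x': 180, 'y': 110}]
```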
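Failure Recovery (CRITICAL)
When something fails, follow these rules:
OCR finds nothing
1. Re-focus the app: `focus-app --name "AppName"`
2. Re-get bounds: `front-window-bounds --app "AppName"` (window may have moved/resized)
3. Take a fresh screenshot and read it visually
4. Try a different region label (e.g. `content_area` instead of `bottom_input`)
5. Lower the OCR confidence: `--min-conf 30`
Click doesn't work
1. Screenshot with cursor to check cursor position
2. The window may have moved; re-get bounds
3. Try clicking a few pixels offset from the OCR center
4. Check if a dialog/popup is blocking the target
App state changed (login screen, dialog, etc.)
1. ALWAYS re-get window bounds after any major UI change
2. ALWAYS re-run OCR after navigation or state change
3. Never reuse old coordinates; they may be stale
General retry rule
- Maximum 3 retries per action
- Each retry must recapture fresh state
- If 3 retries fail, report the failure with screenshots and stop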
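Generalization: How to Apply This to ANY App
The pipeline works for any desktop application. Here is how to approach a new app:
Step-by-step for ANY new app:
1. Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
2. Focus and get bounds: this tells you the window's exact position
3. Screenshot the window and look at what's on screen
4. Identify the target: what text, button, or area do you need to interact with?
5. Use OCR to find it: `target_resolver.py --app "AppName" --text "target text"`
6. Verify and click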
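Common patterns across apps:
| Task | How to do it |
|---|---|
| Click a button | OCR find text → verify → click |
| Type in a field | OCR find field label → click field → `type --text` |
| Search for something | OCR find search box → click → type query → press return |
| Scroll a list | Get window bounds → scroll at window center with `--x --y` |
| Switch between apps | `focus-app --name "OtherApp"` → re-get bounds |
| Handle a dialog | Screenshot → OCR for dialog buttons → click appropriate one |
| Navigate menus | Click menu item → wait → screenshot → OCR new menu → click |
| Select from dropdown | Click dropdown → wait → OCR options → click selection |
| Read screen content | OCR the window → extract all text boxes |
| Verify an action | Screenshot before and after → compare or OCR for expected text |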
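App-specific adaptations:
| App type | Special considerations |
|---|---|
| Chat apps (WeChat, Slack, etc.) | Verify conversation title before typing; use `insert-newline` for line breaks; verify send mechanism |
| Browsers (Chrome, Safari, etc.) | Address bar at top; content area varies; may need to handle tabs |
| System Settings | Deep navigation; panels change; re-get bounds after each navigation |
| File managers (Finder, Explorer) | Sidebar + content area; double-click to open; path bar for navigation |
| Editors (VS Code, TextEdit, etc.) | Tab bar + editor area; use hotkeys for save/undo; type in editor area |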
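Text Input and Send Rules
Typing text
```bash
$PY scripts/desktop_ops.py type --text "your message"
```
- Uses clipboard paste as the primary method on all platforms; reliable for all languages, including CJK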