Windows Control Skill

Full desktop automation for Windows. Control mouse, keyboard, and screen like a human user.

Quick Start

All scripts are in skills/windows-control/scripts/

Screenshot

py screenshot.py > output.b64

Returns base64 PNG of entire screen.
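
The redirected file holds base64 text rather than an image, so it needs decoding before it can be viewed. Below is a minimal Python sketch, assuming the script prints plain base64 with no extra prefix. The function names are illustrative and not part of the skill. Windows PowerShell 5.1 writes `>` redirects as UTF-16, so the sketch checks for a byte-order mark first.

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(raw: bytes) -> bytes:
    """Turn the redirected output of screenshot.py into PNG bytes."""
    # Windows PowerShell 5.1 saves `>` redirects as UTF-16 with a BOM;
    # cmd.exe and PowerShell 7 write single-byte text. Handle both.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = raw.decode("utf-16")
    else:
        text = raw.decode("ascii", errors="ignore")
    png = base64.b64decode("".join(text.split()))
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return png


def save_screenshot(b64_path: str, png_path: str) -> None:
    """Read output.b64 (hypothetical path) and write the decoded PNG."""
    Path(png_path).write_bytes(decode_screenshot(Path(b64_path).read_bytes()))
```

Calling `save_screenshot("output.b64", "screen.png")` would then produce a file that any image viewer can open.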

Click

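py click.py 500 300              # Left click at (500, 300)
py click.py 500 300 right        # Right click
py click.py 500 300 left 2       # Double left click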

Type Text

py type_text.py "Hello World"

Types text at current cursor position (10ms between keys).

Press Keys

py key_press.py enter
py key_press.py ctrl+s
py key_press.py alt+tab
py key_press.py ctrl+shift+esc

Move Mouse

py mouse_move.py 500 300

Moves mouse to coordinates (smooth 0.2s animation).
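
The requirements list pyautogui, whose `moveTo` function accepts a `duration` argument for animated movement. The following is a sketch of how a script like mouse_move.py could be written under that assumption. It is not the skill's actual source, and the `--duration` flag and the `gui` parameter are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse target coordinates; --duration is a hypothetical extra flag."""
    parser = argparse.ArgumentParser(description="Move the mouse smoothly.")
    parser.add_argument("x", type=int)
    parser.add_argument("y", type=int)
    parser.add_argument("--duration", type=float, default=0.2)
    return parser.parse_args(argv)


def move_mouse(argv, gui=None):
    """Move the pointer to (x, y) over `duration` seconds."""
    args = parse_args(argv)
    if gui is None:
        # Imported lazily so argument parsing works without a display.
        import pyautogui as gui
    gui.moveTo(args.x, args.y, duration=args.duration)
    return args.x, args.y
```

Passing a stand-in object as `gui` makes the function testable on a machine without a desktop session.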

Scroll

py scroll.py up 5      # Scroll up 5 notches
py scroll.py down 10   # Scroll down 10 notches

Window Management (NEW!)

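py focus_window.py "Chrome"        # Bring window to front
py minimize_window.py "Notepad"    # Minimize window
py maximize_window.py "VS Code"    # Maximize window
py close_window.py "Calculator"    # Close window
py get_active_window.py            # Get title of active window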

Advanced Actions (NEW!)

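# Click by text (no coordinates needed!)
py click_text.py "Save"              # Click the "Save" button anywhere
py click_text.py "Submit" "Chrome"   # Click "Submit" in Chrome only

# Drag and drop
py drag.py 100 100 500 300           # Drag from (100,100) to (500,300)

# Robust automation (wait/find)
py wait_for_text.py "Ready" "App" 30   # Wait up to 30 seconds for text to appear
py wait_for_window.py "Notepad" 10     # Wait for window to appear
py find_text.py "Login" "Chrome"       # Get coordinates of text
py list_windows.py                     # List all open windows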

Read Window Text

py read_window.py "Notepad"           # Read all text from Notepad
py read_window.py "Visual Studio"     # Read text from VS Code
py read_window.py "Chrome"            # Read text from browser

Uses Windows UI Automation to extract actual text (not OCR). Much faster and more accurate than screenshots!

Read UI Elements (NEW!)

py read_ui_elements.py "Chrome"               # All interactive elements
py read_ui_elements.py "Chrome" --buttons-only  # Just buttons
py read_ui_elements.py "Chrome" --links-only    # Just links
py read_ui_elements.py "Chrome" --json          # JSON output

Returns buttons, links, tabs, checkboxes, dropdowns with coordinates for clicking.
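
With `--json`, the output can be filtered in code before clicking. The JSON schema is not documented here, so the sketch below assumes a list of objects with `name`, `type`, `x`, and `y` keys. Those keys and the helper name are hypothetical and should be checked against the script's real output.

```python
import json


def find_element(json_text, name, element_type=None):
    """Return the first element whose name contains `name` (case-insensitive).

    Assumes a JSON list of objects with "name" and "type" keys (hypothetical).
    """
    for element in json.loads(json_text):
        if name.lower() not in element.get("name", "").lower():
            continue
        if element_type and element.get("type") != element_type:
            continue
        return element
    return None
```

If the assumed keys hold, the `x` and `y` values of the returned element could be passed straight to click.py.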

Read Webpage Content (NEW!)

py read_webpage.py                     # Read active browser
py read_webpage.py "Chrome"            # Target Chrome specifically
py read_webpage.py "Chrome" --buttons  # Include buttons
py read_webpage.py "Chrome" --links    # Include links with coords
py read_webpage.py "Chrome" --full     # All elements (inputs, images)
py read_webpage.py "Chrome" --json     # JSON output

Enhanced browser content extraction with headings, text, buttons, and links.

Handle Dialogs (NEW!)

# List all open dialogs
py handle_dialog.py list

# Read current dialog content
py handle_dialog.py read
py handle_dialog.py read --json

# Click button in dialog
py handle_dialog.py click "OK"
py handle_dialog.py click "Save"
py handle_dialog.py click "Yes"

# Type into dialog text field
py handle_dialog.py type "myfile.txt"
py handle_dialog.py type "C:\path\to\file" --field 0

# Dismiss dialog (auto-finds OK/Close/Cancel)
py handle_dialog.py dismiss

# Wait for dialog to appear
py handle_dialog.py wait --timeout 10
py handle_dialog.py wait "Save As" --timeout 5

Handles Save/Open dialogs, message boxes, alerts, confirmations, etc.
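
The subcommands above combine naturally into a wait, type, click sequence. Here is a small Python sketch for a Save As dialog, using only the flags shown above. It assumes it runs from the scripts directory and that handle_dialog.py exits with a non-zero status on failure, which is not confirmed here. The helper names are illustrative.

```python
import subprocess


def save_as_commands(filename, timeout=5):
    """Build the handle_dialog.py calls for a Save As dialog, in order."""
    base = ["py", "handle_dialog.py"]
    return [
        base + ["wait", "Save As", "--timeout", str(timeout)],
        base + ["type", filename],
        base + ["click", "Save"],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in turn; check=True stops at the first failure."""
    for command in commands:
        runner(command, check=True)
```

`run_all(save_as_commands("myfile.txt"))` would wait for the dialog, fill in the name, and confirm.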

Click Element by Name (NEW!)

py click_element.py "Save"                    # Click "Save" anywhere
py click_element.py "OK" --window "Notepad"   # In specific window
py click_element.py "Submit" --type Button    # Only buttons
py click_element.py "File" --type MenuItem    # Menu items
py click_element.py --list                    # List clickable elements
py click_element.py --list --window "Chrome"  # List in specific window

Click buttons, links, menu items by name without needing coordinates.

Read Screen Region (OCR - Optional)

py read_region.py 100 100 500 300     # Read text from coordinates

Note: Requires Tesseract OCR installation. Use read_window.py instead for better results.

Workflow Pattern

1. Read window - Extract text from specific window (fast, accurate)
2. Read UI elements - Get buttons, links with coordinates
3. Screenshot (if needed) - See visual layout
4. Act - Click element by name or coordinates
5. Handle dialogs - Interact with popups/save dialogs
6. Read window - Verify changes
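
The read, act, verify loop can be driven from Python by calling the scripts as subprocesses. This is a sketch under stated assumptions: the scripts live in skills/windows-control/scripts/ relative to the working directory, they print their results to stdout, and `run_script` and `click_and_verify` are illustrative names, not part of the skill.

```python
import subprocess
from pathlib import Path

SCRIPTS = Path("skills/windows-control/scripts")


def run_script(name, *args, runner=subprocess.run):
    """Run one skill script with the `py` launcher and return its stdout."""
    result = runner(
        ["py", str(SCRIPTS / name), *map(str, args)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def click_and_verify(window, button, expected, run=run_script):
    """Read the window, click a named element, then read again to verify."""
    before = run("read_window.py", window)
    run("click_element.py", button, "--window", window)
    after = run("read_window.py", window)
    return before != after and expected in after
```

Keeping the runner injectable lets the sequence be exercised without touching a real desktop.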

Screen Coordinates

- Origin (0, 0) is top-left corner
- Your screen: 2560x1440 (check with screenshot)
- Use coordinates from screenshot analysis
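
Because the scripts take absolute screen coordinates, a point measured relative to a window has to be offset by that window's top-left corner. The helper below is a minimal sketch assuming a single 2560x1440 monitor. The function name is illustrative, and multi-monitor layouts (where coordinates can be negative) are out of scope.

```python
def to_screen(window_left, window_top, rel_x, rel_y, screen=(2560, 1440)):
    """Convert a point relative to a window's top-left corner to screen coordinates."""
    x, y = window_left + rel_x, window_top + rel_y
    width, height = screen
    # Reject points that would land off-screen instead of clicking blindly.
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} screen")
    return x, y
```

On a live system, `pyautogui.size()` reports the actual primary screen size and could replace the hard-coded default.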

Examples

Open Notepad and type

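# Press the Windows key
py key_press.py win

# Type "notepad"
py type_text.py "notepad"

# Press Enter
py key_press.py enter

# Wait a moment, then type
py type_text.py "Hello from AI!"

# Save
py key_press.py ctrl+s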

Click in VS Code

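# Read current VS Code content
py read_window.py "Visual Studio Code"

# Click at a specific position (e.g., file explorer)
py click.py 50 100

# Type the filename
py type_text.py "test.js"

# Press Enter
py key_press.py enter

# Verify the new file opened
py read_window.py "Visual Studio Code"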

Monitor Notepad changes

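# Read current content
py read_window.py "Notepad"

# User types something...

# Read updated content (no screenshot needed!)
py read_window.py "Notepad"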
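
Monitoring amounts to reading the window repeatedly and comparing the results. The generator below is a hedged sketch of that idea. The `read` callable stands in for whatever returns the window text (for example a wrapper around read_window.py), and the name `watch` and its parameters are illustrative.

```python
import difflib
import time


def watch(read, interval=1.0, rounds=10, sleep=time.sleep):
    """Poll `read()` and yield a unified diff each time the text changes."""
    previous = read()
    for _ in range(rounds):
        sleep(interval)
        current = read()
        if current != previous:
            yield "\n".join(difflib.unified_diff(
                previous.splitlines(), current.splitlines(),
                "before", "after", lineterm="",
            ))
            previous = current
```

Each yielded diff marks added lines with `+` and removed lines with `-`, so only the change needs to be inspected rather than the whole window text.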

Text Reading Methods

Method 1: Windows UI Automation (BEST)

- Use read_window.py for any window
- Use read_ui_elements.py for buttons/links with coordinates
- Use read_webpage.py for browser content with structure
- Gets actual text data (not image-based)

Method 2: Click by Name (NEW)

- Use click_element.py to click buttons/links by name
- No coordinates needed - finds elements automatically
- Works across all windows or targets a specific window

Method 3: Dialog Handling (NEW)

- Use handle_dialog.py for popups, save dialogs, alerts
- Read dialog content, click buttons, type text
- Auto-dismiss with common buttons (OK, Cancel, etc.)

Method 4: Screenshot + Vision (Fallback)

- Take full screenshot
- AI reads text visually
- Slower but works for any content

Method 5: OCR (Optional)

- Use read_region.py with Tesseract
- Requires additional installation
- Good for images/PDFs with text

Safety Features

- pyautogui.FAILSAFE = True (move mouse to top-left to abort)
- Small delays between actions
- Smooth mouse movements (not instant jumps)
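
In pyautogui, the fail-safe is controlled by the module attribute `FAILSAFE`, the delay after each call by `PAUSE`, and an abort surfaces as `FailSafeException`. The sketch below shows how these settings might be applied. The helper names and the 0.1 second default are assumptions, and the module is passed in as `gui` so the code can be tried without a display.

```python
def configure_safety(gui, pause=0.1):
    """Enable the corner fail-safe and a short pause after every action."""
    gui.FAILSAFE = True   # moving the mouse to the top-left corner aborts
    gui.PAUSE = pause     # seconds to wait after each pyautogui call
    return gui


def run_guarded(action, gui):
    """Run `action`; return False if the user triggered the fail-safe."""
    try:
        action()
        return True
    except gui.FailSafeException:
        return False
```

In real use, `gui` would be the imported `pyautogui` module, e.g. `run_guarded(lambda: pyautogui.click(500, 300), configure_safety(pyautogui))`.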

Requirements

- Python 3.11+
- pyautogui (installed ✅)
- pillow (installed ✅)

Tips

- Always take a screenshot first to see the current state
- Coordinates are absolute (not relative to windows)
- Wait briefly after clicks for the UI to update
- Prefer actions that can be undone with Ctrl+Z when possible

Status: ✅ READY FOR USE (v2.0 - Dialog & UI Elements)
Created: 2026-02-01
Updated: 2026-02-02

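Windows Control Skill

Full desktop automation for Windows. Control the mouse, keyboard, and screen like a human user.

Quick Start

All scripts are in the skills/windows-control/scripts/ directory.

Screenshot

py screenshot.py > output.b64

Returns a base64-encoded PNG image of the entire screen.

Click

py click.py 500 300              # Left click at (500, 300)
py click.py 500 300 right        # Right click
py click.py 500 300 left 2       # Double left click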

Type Text

py type_text.py "Hello World"

Types text at the current cursor position (10 ms between keys).

Press Keys

py key_press.py enter
py key_press.py ctrl+s
py key_press.py alt+tab
py key_press.py ctrl+shift+esc

Move Mouse

py mouse_move.py 500 300

Moves the mouse to the given coordinates (smooth 0.2 s animation).

Scroll

py scroll.py up 5      # Scroll up 5 notches
py scroll.py down 10   # Scroll down 10 notches

Window Management (NEW!)

py focus_window.py "Chrome"        # Bring window to front
py minimize_window.py "Notepad"    # Minimize window
py maximize_window.py "VS Code"    # Maximize window
py close_window.py "Calculator"    # Close window
py get_active_window.py            # Get title of active window

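Advanced Actions (NEW!)

# Click by text (no coordinates needed!)
py click_text.py "Save"              # Click the "Save" button anywhere
py click_text.py "Submit" "Chrome"   # Click "Submit" in Chrome only

# Drag and drop
py drag.py 100 100 500 300           # Drag from (100,100) to (500,300)

# Robust automation (wait/find)
py wait_for_text.py "Ready" "App" 30   # Wait up to 30 seconds for text to appear
py wait_for_window.py "Notepad" 10     # Wait for window to appear
py find_text.py "Login" "Chrome"       # Get coordinates of text
py list_windows.py                     # List all open windows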

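Read Window Text

py read_window.py "Notepad"           # Read all text from Notepad
py read_window.py "Visual Studio"     # Read text from VS Code
py read_window.py "Chrome"            # Read text from browser

Uses Windows UI Automation to extract actual text (not OCR). Faster and more accurate than screenshots!

Read UI Elements (NEW!)

py read_ui_elements.py "Chrome"                 # All interactive elements
py read_ui_elements.py "Chrome" --buttons-only  # Just buttons
py read_ui_elements.py "Chrome" --links-only    # Just links
py read_ui_elements.py "Chrome" --json          # JSON output

Returns buttons, links, tabs, checkboxes, and dropdowns with coordinates for clicking.

Read Webpage Content (NEW!)

py read_webpage.py                     # Read active browser
py read_webpage.py "Chrome"            # Target Chrome specifically
py read_webpage.py "Chrome" --buttons  # Include buttons
py read_webpage.py "Chrome" --links    # Include links with coordinates
py read_webpage.py "Chrome" --full     # All elements (inputs, images)
py read_webpage.py "Chrome" --json     # JSON output

Enhanced browser content extraction with headings, text, buttons, and links.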

Handle Dialogs (NEW!)

# List all open dialogs
py handle_dialog.py list

# Read current dialog content
py handle_dialog.py read
py handle_dialog.py read --json

# Click button in dialog
py handle_dialog.py click "OK"
py handle_dialog.py click "Save"
py handle_dialog.py click "Yes"

# Type into dialog text field
py handle_dialog.py type "myfile.txt"
py handle_dialog.py type "C:\path\to\file" --field 0

# Dismiss dialog (auto-finds OK/Close/Cancel)
py handle_dialog.py dismiss

# Wait for dialog to appear
py handle_dialog.py wait --timeout 10
py handle_dialog.py wait "Save As" --timeout 5

Handles Save/Open dialogs, message boxes, alerts, confirmations, etc.

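Click Element by Name (NEW!)

py click_element.py "Save"                    # Click "Save" anywhere
py click_element.py "OK" --window "Notepad"   # In specific window
py click_element.py "Submit" --type Button    # Only buttons
py click_element.py "File" --type MenuItem    # Menu items
py click_element.py --list                    # List clickable elements
py click_element.py --list --window "Chrome"  # List in specific window

Click buttons, links, and menu items by name without needing coordinates.

Read Screen Region (OCR - Optional)

py read_region.py 100 100 500 300     # Read text from a coordinate region

Note: Requires a Tesseract OCR installation. Use read_window.py instead for better results.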

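Workflow Pattern

1. Read window - Extract text from specific window (fast, accurate)
2. Read UI elements - Get buttons, links with coordinates
3. Screenshot (if needed) - See visual layout
4. Act - Click element by name or coordinates
5. Handle dialogs - Interact with popups/save dialogs
6. Read window - Verify changes

Screen Coordinates

- Origin (0, 0) is the top-left corner
- Your screen: 2560x1440 (confirm with a screenshot)
- Use coordinates obtained from screenshot analysis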

Examples

Open Notepad and type

# Press the Windows key
py key_press.py win

# Type "notepad"
py type_text.py "notepad"

# Press Enter
py key_press.py enter

# Wait a moment, then type
py type_text.py "Hello from AI!"

# Save
py key_press.py ctrl+s

Click in VS Code

# Read current VS Code content
py read_window.py "Visual Studio Code"

# Click at a specific position (e.g., file explorer)
py click.py 50 100

# Type the filename
py type_text.py "test.js"

# Press Enter
py key_press.py enter

# Verify the new file opened
py read_window.py "Visual Studio Code"

Monitor Notepad changes

# Read current content
py read_window.py "Notepad"

# User types something...

# Read updated content (no screenshot needed!)
py read_window.py "Notepad"

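Text Reading Methods

Method 1: Windows UI Automation (BEST)

- Use read_window.py for any window
- Use read_ui_elements.py for buttons/links with coordinates
- Use read_webpage.py for browser content with structure
- Gets actual text data (not image-based)

Method 2: Click by Name (NEW)

- Use click_element.py to click buttons/links by name
- No coordinates needed - finds elements automatically
- Works across all windows or targets a specific window

Method 3: Dialog Handling (NEW)

- Use handle_dialog.py for popups, save dialogs, alerts
- Read dialog content, click buttons, type text
- Auto-dismiss with common buttons (OK, Cancel, etc.)

Method 4: Screenshot + Vision (Fallback)

- Take a full screenshot
- AI reads text visually
- Slower but works for any content

Method 5: OCR (Optional)

- Use read_region.py with Tesseract
- Requires additional installation
- Good for images/PDFs with text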

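Safety Features

- pyautogui.FAILSAFE = True (move mouse to top-left to abort)
- Small delays between actions
- Smooth mouse movements (not instant jumps)

Requirements

- Python 3.11+
- pyautogui (installed ✅)
- pillow (installed ✅)

Tips

- Always take a screenshot first to see the current state
- Coordinates are absolute (not relative to windows)
- Wait briefly after clicks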
