UI Element Ops
Parse one or more screenshots into a machine-readable JSON schema with:
- type (normalized UI element type)
- bbox_px and bbox_norm
- text (OCR/caption content when available)
- clickable flag
- optional overlay image with labeled boxes
- desktop actions via scripts/operate_ui.py (click/type/key/hotkey/screenshot)
- element query and orchestration via scripts/operate_ui.py (find, wait)
- coordinate calibration profile for multi-display/DPI/window offset (calibrate)
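The calibration bullet can be pictured as a scale-plus-offset mapping from the coordinate space of the parsed screenshot to the coordinate space of the real screen. The sketch below is only an illustration of that idea: the `Calibration` class, its fields, and the example sizes are hypothetical, and nothing here describes what the calibrate subcommand actually stores in its profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    """Hypothetical profile: per-axis scale plus a window/display offset."""

    scale_x: float
    scale_y: float
    offset_x: float = 0.0
    offset_y: float = 0.0

    @classmethod
    def from_sizes(cls, parsed_size, actual_size, offset=(0.0, 0.0)):
        """Derive scales from (width, height) of the parsed image and the real screen."""
        parsed_w, parsed_h = parsed_size
        actual_w, actual_h = actual_size
        return cls(actual_w / parsed_w, actual_h / parsed_h, offset[0], offset[1])

    def to_screen(self, x, y):
        """Map a point from screenshot pixels to screen coordinates."""
        return (x * self.scale_x + self.offset_x, y * self.scale_y + self.offset_y)


if __name__ == "__main__":
    # Example: a 2x HiDPI screenshot (2880x1800) of a 1440x900 logical screen.
    cal = Calibration.from_sizes((2880, 1800), (1440, 900))
    print(cal.to_screen(1000, 500))
```

With a mapping of this kind, a click target computed from the parsed image is rescaled before any pointer action, which is why a wrong parsed or actual size shows up as clicks that drift further off the farther they are from the origin.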
Quick Start
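1. Prepare the runtime once per machine:
    skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh $PWD
2. Parse one screenshot:
    skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/1.jpeg
3. Read the outputs:
   - the parsed elements file, ending in .elements.json
   - the overlay image, ending in .overlay.png
4. Capture and parse in one step, with randomized output names:
    skills/ui-element-ops/scripts/capture_and_parse.sh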
Workflow
1. Confirm the screenshot path and the desired output path.
2. Run scripts/bootstrap_omniparser_env.sh when .venv or the OmniParser weights are missing.
3. Run scripts/run_parse_ui.sh for standard parsing.
4. Report absolute output paths and summary counts: total, clickable, by_type.
5. Call out obvious quality risks for tiny text or dense icon layouts.
6. Execute desktop actions when requested:
   - list elements:
     python3 skills/ui-element-ops/scripts/operate_ui.py list --elements <json>
   - find elements:
     python3 skills/ui-element-ops/scripts/operate_ui.py find --elements <json> --type button --text-contains login
   - wait for appear/disappear:
     python3 skills/ui-element-ops/scripts/operate_ui.py wait --elements <json> --state appear --text-contains continue
   - click by id:
     python3 skills/ui-element-ops/scripts/operate_ui.py click --elements <json> --id e_0001
   - screenshot (saved to the user's temp directory by default):
     python3 skills/ui-element-ops/scripts/operate_ui.py screenshot
   - calibrate coordinates:
     python3 skills/ui-element-ops/scripts/operate_ui.py calibrate --parsed-size <size> --actual-size <size>
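When these commands are driven from a script rather than typed by hand, it can help to build the argument lists in one place. The following is a minimal sketch, not part of the skill: `build_command` is a hypothetical helper, the `/tmp/shot.elements.json` path is made up, and only the subcommands and flags shown above are used.

```python
import shlex
import sys

# Script path as it appears in the commands above; adjust if the skill lives elsewhere.
OPERATE_UI = "skills/ui-element-ops/scripts/operate_ui.py"


def build_command(action, elements_json=None, **options):
    """Build an argv list for one operate_ui.py subcommand.

    Keyword options become long flags, e.g. text_contains="login"
    turns into ["--text-contains", "login"].
    """
    argv = [sys.executable, OPERATE_UI, action]
    if elements_json is not None:
        argv += ["--elements", str(elements_json)]
    for name, value in options.items():
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


if __name__ == "__main__":
    elements = "/tmp/shot.elements.json"  # hypothetical parse output
    find = build_command("find", elements, type="button", text_contains="login")
    click = build_command("click", elements, id="e_0001")
    # Print the commands instead of running them; pass a list to
    # subprocess.run(argv, check=True) to execute it.
    print(shlex.join(find))
    print(shlex.join(click))
```

Passing the list form to `subprocess.run` avoids shell quoting problems when the text filter contains spaces. How the output of find should be parsed is not covered here, since its format is not described in this document.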
Tunables
- Edit type mapping keywords in references/type_rules.example.json.
- Use advanced parser args via scripts/parse_ui.py --help.
- Use --use-paddleocr only when paddleocr/paddlepaddle are installed.
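To make the idea of keyword-based type mapping concrete, here is a small sketch of how raw detector labels could be normalized with a keyword table. The rules layout, the keywords, and the `normalize_type` function are all hypothetical; the real structure of references/type_rules.example.json is not shown in this document and may differ.

```python
import json

# HYPOTHETICAL rules layout: normalized type -> keywords that trigger it.
# This is not the actual content of type_rules.example.json.
EXAMPLE_RULES = json.loads("""
{
  "button": ["button", "btn", "submit"],
  "input": ["input", "textbox", "search field"],
  "icon": ["icon", "logo"]
}
""")


def normalize_type(raw_label, rules, default="unknown"):
    """Return the first type whose keyword occurs in the label (case-insensitive)."""
    label = raw_label.lower()
    for ui_type, keywords in rules.items():
        if any(keyword in label for keyword in keywords):
            return ui_type
    return default


if __name__ == "__main__":
    for label in ["Submit BTN", "Search field", "a picture"]:
        print(label, "->", normalize_type(label, EXAMPLE_RULES))
```

In a first-match scheme like this one, rule order matters: a label that matches keywords from two types gets whichever type is listed first.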
Outputs
- elements JSON (.elements.json): schema_version, pipeline, image, counts, elements
  - each element has id, type, bbox_px, bbox_norm, text, clickable
- overlay image (.overlay.png): same screenshot with labeled detection boxes
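As a rough illustration of consuming the elements JSON, the sketch below recomputes the total, clickable, and by_type counts from an elements list and derives a box center. The sample data is hand-written to mimic the field names listed above. The `[x1, y1, x2, y2]` layout of bbox_px is an assumption, since the document does not state it, and `summarize` and `bbox_center` are hypothetical helper names.

```python
import json
from collections import Counter


def summarize(doc):
    """Recompute total / clickable / by_type from the elements list."""
    elements = doc.get("elements", [])
    return {
        "total": len(elements),
        "clickable": sum(1 for e in elements if e.get("clickable")),
        "by_type": dict(Counter(e.get("type", "unknown") for e in elements)),
    }


def bbox_center(bbox_px):
    """Center of a box, ASSUMING bbox_px is [x1, y1, x2, y2] in pixels."""
    x1, y1, x2, y2 = bbox_px
    return ((x1 + x2) / 2, (y1 + y2) / 2)


if __name__ == "__main__":
    # Hand-written sample; real files come from the parse scripts.
    sample = {
        "elements": [
            {"id": "e_0001", "type": "button", "bbox_px": [10, 20, 110, 60],
             "text": "Login", "clickable": True},
            {"id": "e_0002", "type": "text", "bbox_px": [10, 80, 300, 100],
             "text": "Welcome back", "clickable": False},
        ]
    }
    print(json.dumps(summarize(sample), indent=2))
    print(bbox_center(sample["elements"][0]["bbox_px"]))
```

For a real file, replace the sample with `json.load(open(path))` and compare the recomputed summary against the counts object the parser wrote.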
Failure Handling
- Missing dependencies or weights: run the bootstrap script again.
- Permission/cache errors under $HOME: keep temporary caches under /tmp (handled by the run script).
- CPU-only machine: expect slower inference.
- Performance note: the parse and capture-and-parse commands are heavy; avoid very tight loops and reuse a recent elements.json when possible.
- Headless environment limitation:
  - usable without GUI: parse/list/find/wait/calibrate on existing files.
  - requires GUI session: click/click-xy/type/key/hotkey/screenshot/screen-info.
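The last two points lend themselves to small guard functions in a calling script. The sketch below is one possible shape under stated assumptions: `needs_gui` and `is_fresh` are hypothetical helpers, and the 10-second default freshness window is an arbitrary choice, not a value from the skill. The two action sets simply restate the lists above.

```python
import os
import tempfile
import time

# Subcommand groups as listed in the headless limitation above.
GUI_ACTIONS = {"click", "click-xy", "type", "key", "hotkey", "screenshot", "screen-info"}
FILE_ACTIONS = {"parse", "list", "find", "wait", "calibrate"}


def needs_gui(action):
    """True if the action requires a GUI session, False if it works on files."""
    if action in GUI_ACTIONS:
        return True
    if action in FILE_ACTIONS:
        return False
    raise ValueError(f"unknown action: {action}")


def is_fresh(path, max_age_s=10.0, now=None):
    """True if the file exists and was modified within the last max_age_s seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # a missing or unreadable file is never fresh
    now = time.time() if now is None else now
    return (now - mtime) <= max_age_s


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(suffix=".elements.json") as fh:
        # A file written just now counts as fresh, so a re-parse could be skipped.
        print("reuse existing parse:", is_fresh(fh.name))
    print("click needs GUI:", needs_gui("click"))
    print("find needs GUI:", needs_gui("find"))
```

A caller would run the heavy parse step only when `is_fresh` returns False, and would refuse GUI actions up front on a headless machine instead of failing partway through.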