Ghosthand
Ghosthand is a loopback HTTP server on the Android phone. All interaction happens over HTTP (GET, POST, and occasionally DELETE) against http://127.0.0.1:5583.
Always do this first:
| Step | Command | Purpose |
|---|---|---|
| 1 | GET /ping | Is Ghosthand alive? |
| 2 | GET /state | Is the runtime healthy, and is the capability you need usable now? |
| 3 | GET /screen?source=accessibility | What is the current actionable surface? |
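The three reads in the table can be sketched as a small preflight routine. This is a minimal sketch, not a Ghosthand client: the `preflight` name, the injected `fetch` callable, and treating HTTP 200 as success are assumptions made for illustration. Only the base URL and the three routes come from this document, and response bodies are not interpreted because their fields are not described here.

```python
from typing import Callable, List, Tuple

BASE_URL = "http://127.0.0.1:5583"  # loopback address stated in this document

# The three reads from the table, in the order they should be issued.
PREFLIGHT_ROUTES = ["/ping", "/state", "/screen?source=accessibility"]


def preflight(fetch: Callable[[str], int]) -> Tuple[bool, List[str]]:
    """Issue the preflight GETs in order and stop at the first failure.

    `fetch` is a hypothetical transport: it takes a full URL and returns
    an HTTP status code. Injecting it keeps the sketch testable offline.
    Treating 200 as the only success code is an assumption.
    """
    completed: List[str] = []
    for route in PREFLIGHT_ROUTES:
        status = fetch(BASE_URL + route)
        if status != 200:
            # Later reads are not meaningful if an earlier one failed.
            return False, completed
        completed.append(route)
    return True, completed
```

On the device, `fetch` could wrap any HTTP client. The returned list shows how far the sequence got, which identifies the first failing step.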
Use this skill to operate Ghosthand as an Android agent substrate.
Ghosthand is not generic Android advice. It is a local runtime with a route-based control surface. Use this skill only when the task is actually about Ghosthand routes, Ghosthand capability state, or acting through Ghosthand.
What Ghosthand is
Ghosthand exposes a local HTTP API for Android observation and control. The important categories are:
- runtime and health: /ping, /health, /state, /device, /foreground, /commands, /capabilities
- structured UI inspection: /screen, /tree, /focused, /find
- semantic or coordinate interaction: /click, /tap, /input, /type, /setText, /scroll, /swipe, /longpress, /gesture
- app and navigation control: /back, /home, /recents
- sensing and transport: /screenshot, /wait, /clipboard, /notify
Treat /commands as the current machine-readable capability catalog when route details matter.
When to use this skill
Use it when the task requires any of the following:
- checking whether Ghosthand is running or ready
- checking whether a capability is both authorized by Android and allowed by Ghosthand policy
- inspecting the current Android surface before acting
- finding or clicking UI targets by text, desc, or id
- recovering from Ghosthand misses or ambiguous action results
- using Ghosthand to type, scroll, swipe, wait, read clipboard, or read notifications
- debugging Ghosthand-specific behaviors such as partial output, stale assumptions about selectors, or snapshot-scoped node IDs
Do not use it for:
- generic Android usage advice unrelated to Ghosthand
- root-only methods that Ghosthand does not expose
- imaginary routes or undocumented behavior when /commands can answer directly
Operating model
1. Start from truth, not intent
Before acting, establish three things:
1. Is Ghosthand alive and usable?
2. What surface is actually visible now?
3. Which selector surface and route shape are most plausible for the target?
Typical order:
1. read /ping
2. read /state
3. read /commands if route shape, selector support, or response fields are uncertain
4. read /screen?source=accessibility for the current actionable surface
5. if accessibility read is unavailable or clearly insufficient, retry with /screen?source=hybrid or /screen?source=ocr
6. only then choose /find, /click, or /tap
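The fallback from the accessibility read to the other screen sources can be sketched as a short ladder. This is an illustration under stated assumptions: `fetch_json` and `is_sufficient` are hypothetical callables supplied by the caller, and trying hybrid before ocr is an ordering chosen for the sketch, since the text names both without ranking them.

```python
from typing import Callable, Optional, Tuple

# Sources named in this document; the hybrid-before-ocr order is an assumption.
SCREEN_SOURCES = ("accessibility", "hybrid", "ocr")


def read_screen(
    fetch_json: Callable[[str], Optional[dict]],
    is_sufficient: Callable[[dict], bool],
) -> Tuple[Optional[str], Optional[dict]]:
    """Try each /screen source in order and return the first sufficient read.

    `fetch_json` (hypothetical) takes a route and returns parsed JSON, or
    None when the read is unavailable. `is_sufficient` (hypothetical) is
    the caller's own judgment of whether the surface is good enough.
    """
    for source in SCREEN_SOURCES:
        screen = fetch_json(f"/screen?source={source}")
        if screen is not None and is_sufficient(screen):
            return source, screen
    # No source produced a usable surface; do not pick an action route yet.
    return None, None
```

Only after this returns a surface does it make sense to choose between /find, /click, and /tap.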
2. Capability access has two layers
A capability is usable only when both are true:
- Android/system authorization exists
- Ghosthand policy allows the capability
Do not confuse “permission granted” with “usable now”. Read /state before diagnosing failures, especially for accessibility and screenshot capture.
/state is the best live summary. /capabilities is the fuller catalog-style view when an agent needs route-capability mapping and availability details.
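The two-layer rule can be made explicit with a small helper that names the blocking layer. This is a sketch only: the `CapabilityStatus` type and its field names are hypothetical, because this document does not describe the shape of the /state response. The caller would fill the two booleans from whatever /state actually reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityStatus:
    """Two independent gates for one capability.

    Field names are hypothetical; they are not taken from the /state
    response, whose shape is not documented here.
    """
    system_authorized: bool  # Android/system authorization exists
    policy_allowed: bool     # Ghosthand policy allows the capability


def diagnose(status: CapabilityStatus) -> str:
    """Name the layer that blocks the capability, or report it usable."""
    if not status.system_authorized and not status.policy_allowed:
        return "blocked: system authorization and Ghosthand policy"
    if not status.system_authorized:
        return "blocked: system authorization"
    if not status.policy_allowed:
        return "blocked: Ghosthand policy"
    return "usable"
```

Keeping the two gates separate avoids reporting "permission granted" as if it meant "usable now".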
3. Node IDs are snapshot-scoped
Treat nodeId as ephemeral. Do not cache it across fresh observations unless the snapshot context is clearly the same. Prefer re-resolving via /screen, /find, or selector-based /click instead of assuming old node IDs remain valid.
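One way to enforce this on the caller side is to tag every nodeId with the observation it came from. The sketch below assumes a caller-maintained `snapshot` counter; this document does not say whether Ghosthand returns a snapshot identifier, so `NodeRef` and `usable_node_id` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeRef:
    """A nodeId tagged with the observation it came from.

    `snapshot` is a hypothetical caller-side counter, incremented on
    every fresh /screen, /tree, or /find read.
    """
    node_id: str
    snapshot: int


def usable_node_id(ref: NodeRef, current_snapshot: int) -> Optional[str]:
    """Return the nodeId only if it belongs to the current observation.

    None means: re-resolve via /screen, /find, or a selector-based
    /click instead of reusing the old ID.
    """
    if ref.snapshot != current_snapshot:
        return None
    return ref.node_id
```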
Primitive selection
/screen
Use /screen first when you need a compact actionable view. The default mode is source=accessibility.
Use it to answer:
- what is visible now
- which elements are actionable, editable, or scrollable
- whether coordinates are trustworthy enough for /tap
- whether the current surface even contains the target
Important details:
- source=accessibility is the default and supports editable, scrollable, clickable, and package filters
- source=hybrid or source=ocr is useful when accessibility is temporarily unavailable or operationally insufficient
- summaryOnly=true is for compact orientation, not detailed targeting
- previewPath is a hint that a lightweight screenshot fetch is available; /screen does not embed image bytes
If /screen reports partialOutput=true, warnings, foreground drift, or fallback hints, do not assume you saw the whole surface. Escalate to /tree, /screenshot, or a non-accessibility screen mode before blaming the app.
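A possible shape for building /screen reads and deciding when to escalate is sketched below. Several details are assumptions: passing filters as `true`/`false` query parameters is inferred from the `summaryOnly=true` and `source=accessibility` forms above, and the `warnings` response key is a guess at how warnings are reported. Only `partialOutput` is a field name taken from this document, and foreground drift or fallback hints are left out because their field names are not given.

```python
from typing import Optional
from urllib.parse import urlencode


def screen_route(
    source: str = "accessibility",
    clickable: Optional[bool] = None,
    editable: Optional[bool] = None,
    summary_only: bool = False,
) -> str:
    """Build a /screen route with optional filters.

    Encoding filters as true/false query parameters is an assumption.
    """
    params = {"source": source}
    if clickable is not None:
        params["clickable"] = "true" if clickable else "false"
    if editable is not None:
        params["editable"] = "true" if editable else "false"
    if summary_only:
        params["summaryOnly"] = "true"
    return "/screen?" + urlencode(params)


def needs_escalation(screen: dict) -> bool:
    """True when the read should not be trusted as the whole surface.

    `partialOutput` comes from this document; `warnings` is an assumed key.
    """
    return bool(screen.get("partialOutput")) or bool(screen.get("warnings"))
```

When `needs_escalation` returns True, the next read would be /tree, /screenshot, or a non-accessibility screen mode, as described above.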
/tree
Use /tree when you need fuller structure, raw hierarchy, or to inspect why /screen may have omitted or shaped output. Use it for diagnosis and structural truth, not as your default first read.
/find
Use /find when you already have a selector hypothesis and want a bounded lookup.
Prefer it when you need:
- selector testing before interaction
- disambiguation by index
- confirmation that a target exists before a coordinate fallback
- inspection of whether a visible label is discoverable on text, contentDesc, resourceId, or only as a focused node
A miss usually means one of four things:
- wrong screen
- wrong selector surface
- wrong match semantics
- target not exposed the way you assumed
Supported strategies are text, textContains, contentDesc, contentDescContains, resourceId, and focused. text, desc, and id are convenience aliases in the request body; Ghosthand normalizes them internally.
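The alias handling can be mirrored on the client side so that a request is checked before it is sent. This is a sketch under assumptions: the `strategy`/`query` body shape is hypothetical (the real request fields should be confirmed through /commands), and mapping `desc` to `contentDesc` and `id` to `resourceId` is the presumed meaning of the aliases rather than documented behavior.

```python
from typing import Optional

# Alias -> strategy. The desc/id mappings are presumed from the names.
ALIASES = {"text": "text", "desc": "contentDesc", "id": "resourceId"}

# Strategies listed in this document.
STRATEGIES = {
    "text", "textContains", "contentDesc",
    "contentDescContains", "resourceId", "focused",
}


def find_body(selector: str, query: Optional[str] = None) -> dict:
    """Build a hypothetical /find request body from an alias or strategy."""
    strategy = ALIASES.get(selector, selector)
    if strategy not in STRATEGIES:
        raise ValueError(f"unsupported strategy: {selector!r}")
    body = {"strategy": strategy}
    if query is not None:
        body["query"] = query
    return body
```

Rejecting unknown strategies locally turns one kind of miss (wrong selector surface) into an immediate error instead of an empty lookup.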
/click
Prefer /click over /tap when you have a plausible semantic target. Ghosthand can resolve wrapper targets, bounded selector fallbacks, and clickable ancestors, then expose how it actually landed on an actionable node.
Use /click first for:
- text-labeled controls
- content-description labeled controls
- stable resource IDs
- cases where ancestor click resolution may help
For selector-based /click, Ghosthand treats clickable=true as the default unless you explicitly set clickable=false. That default is optimized for action, not inspection. Use /find or disable clickable resolution when you need to inspect the raw matched node.
/tap
Use /tap only when coordinates come from the current trusted surface. Do not guess coordinates. Coordinate fallback is justified only after semantic targeting has narrowed the uncertainty.
/input and /setText
Use /input for the focused editable field. Prefer it over /type when you need explicit text mutation or Enter dispatch semantics.
Use /type only for simpler focused text entry when the current focus is already correct.
Use /setText only when you have a trusted same-snapshot editable nodeId and need to target that exact node.
When entering text, do not assume the Enter key will successfully submit or confirm the input. If Enter does not work or the field remains uncommitted, use the on-screen IME confirmation action instead, typically the confirm button in the bottom-right corner of the keyboard.
/scroll and /swipe
Use /scroll when the goal is container movement or list advancement.
Use /swipe when the task is truly geometric.
Do not interpret performed=true as proof that content changed. Check returned change fields, then verify with /screen, /tree, or /wait.
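One caller-side way to verify a scroll is to compare the surface before and after it. The sketch below is not a Ghosthand feature: it hashes two parsed /screen responses and treats a difference as evidence of movement. `performed` is the field named above; everything else, including the function names, is an assumption. If the responses carry volatile fields such as timestamps, those would need to be stripped before hashing.

```python
import hashlib
import json


def surface_fingerprint(screen: dict) -> str:
    """Stable hash of a parsed /screen response for before/after comparison."""
    canonical = json.dumps(screen, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def scroll_moved_content(result: dict, before: dict, after: dict) -> bool:
    """Decide whether a scroll actually changed the surface.

    `performed` comes from this document. Comparing two /screen reads is
    the caller's own verification step.
    """
    if not result.get("performed"):
        return False
    # performed=true alone is not proof; require an observable difference.
    return surface_fingerprint(before) != surface_fingerprint(after)
```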
/wait
Use /wait after actions that may change UI state.
There are two different uses:
- GET /wait: wait for UI change and inspect final settled state
- POST /wait: wait for a selector condition
Do not confuse changed=false with action failure. It only means a transition was not observed during the wait window. Re-check the final surface before concluding the action failed.
For POST /wait, the supported strategies are bounded and query rules matter: focused takes no query, while text/content-description/resource-id waits require one.
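The query rule can be checked before a request is sent. This is a minimal sketch: the `strategy`/`query` body shape is hypothetical, and only the strategies this paragraph names are accepted. Whether contains-style strategies are valid for POST /wait is not stated here, so the sketch leaves them out.

```python
from typing import Optional

# Wait strategies that require a query, per the paragraph above.
QUERY_STRATEGIES = {"text", "contentDesc", "resourceId"}


def wait_condition(strategy: str, query: Optional[str] = None) -> dict:
    """Validate and build a hypothetical POST /wait condition."""
    if strategy == "focused":
        if query is not None:
            raise ValueError("focused takes no query")
        return {"strategy": "focused"}
    if strategy not in QUERY_STRATEGIES:
        raise ValueError(f"unsupported wait strategy: {strategy!r}")
    if not query:
        raise ValueError(f"{strategy} requires a query")
    return {"strategy": strategy, "query": query}
```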
/clipboard, /notify, /screenshot
Use /clipboard as a transport primitive for long text or repeated entry.
Use /notify to read or post local notifications only when the task is explicitly notification-related.
Use /screenshot when visual truth is needed and structured UI output is insufficient, ambiguous, or suspected stale.
Important details:
- /screenshot supports GET and POST
- width and height must be provided together or omitted together
- screenshot capability is separately policy-gated from accessibility
- if /screen publishes previewPath, use that exact path before inventing a new screenshot size
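The width/height pairing rule is easy to enforce before a request is made. The sketch below assumes the dimensions travel as query parameters on a GET request, which this document does not confirm; `screenshot_route` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import urlencode


def screenshot_route(width: Optional[int] = None, height: Optional[int] = None) -> str:
    """Build a /screenshot route, enforcing the width/height pairing rule.

    Sending the size as query parameters is an assumption.
    """
    if (width is None) != (height is None):
        raise ValueError("width and height must be provided together or omitted together")
    if width is None:
        return "/screenshot"
    return "/screenshot?" + urlencode({"width": width, "height": height})
```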
Selector judgment
Selectors are not interchangeable.
text
Use text when the visible label is likely the actual text field of the node.
desc
Use desc when the control is icon-like, accessibility-labeled, nav-like, or visibly sparse. Many controls that look label-based are actually better matched through content description.
id
Use id when a meaningful resourceId is present. This is often the strongest selector.
Exact vs contains
Do not over-read exact-match misses.
If the visible phrase may be part of a longer text block, retry with a contains-style strategy where the route supports it. A visible phrase on screen is not proof that exact text lookup should succeed.
/find supports explicit contains strategies. /click can use bounded contains fallback internally and tells you when it did so; do not mistake that for an exact selector hit.
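The exact-then-contains retry can be planned as a bounded list of attempts. The strategy names below are the ones listed for /find in this document; the helper names and the tuple-based plan are assumptions for illustration. Note that resourceId has no contains counterpart in that list, so it yields a single attempt.

```python
from typing import List, Optional, Tuple

# Exact strategy -> contains-style counterpart, from the /find strategy list.
CONTAINS_COUNTERPART = {
    "text": "textContains",
    "contentDesc": "contentDescContains",
}


def relax_strategy(strategy: str) -> Optional[str]:
    """Return the contains-style counterpart of an exact strategy, if any."""
    return CONTAINS_COUNTERPART.get(strategy)


def lookup_plan(strategy: str, query: str) -> List[Tuple[str, str]]:
    """Exact attempt first, then at most one contains-style retry."""
    plan = [(strategy, query)]
    relaxed = relax_strategy(strategy)
    if relaxed is not None:
        plan.append((relaxed, query))
    return plan
```

Capping the plan at two attempts keeps the retry bounded rather than open-ended.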
Recovery rules
When a Ghosthand action misses, do not branch into random retries. Make one bounded correction:
- re-read /screen
- if accessibility is unavailable or weak, re-read /screen?source=hybrid or /screen?source=ocr
- switch text to desc or id
- switch exact semantics to contains semantics when justified
- if text entry succeeded but submission did not, use the on-screen IME confirm action instead of retrying Enter
- move from /click to /tap only after trustworthy coordinates exist
- use /capabilities when the route exists but capability availability is ambiguous
- use /wait to settle state before the next action
Repeated misses should be classified, not brute-forced.
Minimal workflows
Check whether Ghosthand is ready
1. read /ping
2. read /state
3. if needed, read /commands
4. if needed, read /capabilities
Operate a visible control safely
1. read /screen
2. choose text, desc, or id
3. call /click
4. call /wait or re-read /screen
5. if accessibility surface truth is weak, retry /screen?source=hybrid or /screen?source=ocr
6. only use /tap if semantic action remains weak but coordinates are trusted
Enter text and confirm it reliably
1. focus the intended editable field
2. use /input for the focused field, /type for simple focused typing, or /setText for a trusted same-snapshot editable nodeId
3. verify the text appears in the field or the focused surface reflects the update
4. if Enter does not submit or confirm the input, use the on-screen IME confirm action, typically the bottom-right keyboard button
5. call /wait or re-read /screen to confirm the post-input state
Diagnose a miss
1. confirm Ghosthand and capability state with /state
2. re-read /screen
3. inspect selector surface mismatch
4. escalate to /screen?source=hybrid, /tree, or /screenshot if accessibility output is partial, unavailable, or misleading
5. retry one bounded correction
Reporting standard
When summarizing a Ghosthand run, report only:
- what route you used
- what state changed
- whether the target was achieved
- the first narrow failing step if it was not
- the next best correction
Do not dump logs unless the task is explicitly diagnostic.
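The five reporting items map naturally onto a small record. The `RunReport` type and its output wording below are hypothetical, offered only as one way to keep summaries to the listed fields. The failure details are printed only when the target was not achieved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunReport:
    """The five items a Ghosthand run summary should contain."""
    route: str
    state_change: str
    achieved: bool
    first_failing_step: Optional[str] = None
    next_correction: Optional[str] = None

    def summary(self) -> str:
        lines = [
            f"route: {self.route}",
            f"state change: {self.state_change}",
            f"target achieved: {'yes' if self.achieved else 'no'}",
        ]
        if not self.achieved:
            # Failure details matter only when the target was missed.
            lines.append(f"first failing step: {self.first_failing_step or 'unknown'}")
            lines.append(f"next correction: {self.next_correction or 'none identified'}")
        return "\n".join(lines)
```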
Reference files
Detailed route notes are in resources/references/ghosthand-api-quick-reference.md.