If a human can do it on a screen, you can too. No API? No integration? No problem.

USE AS A FALLBACK — NOT FIRST CHOICE
Before reaching for any ClawdCursor tool, ask:

1. Is there a native API? (Gmail API, GitHub API, Slack API) → use the API
2. Is there a CLI? (git, npm, curl) → use the CLI
3. Can you edit the file directly? → do that
4. Is there a browser automation layer? (Playwright, Puppeteer) → use that

None of the above work? Now use ClawdCursor. It's for the last mile.

Modes at a Glance

Mode	Command	Brain	Tools available
INLINECODE3	INLINECODE4	You (REST client)	All 42 tools via HTTP
INLINECODE5

In serve and mcp modes: you reason, ClawdCursor acts. There is no built-in LLM. You call tools, interpret results, decide next steps.

Connecting

Option A — REST (`clawdcursor serve`)

CODEBLOCK0

All POST endpoints require: Authorization: Bearer <token> (token saved to ~/.clawdcursor/token)

CODEBLOCK1

Example:
CODEBLOCK2
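
As a rough illustration of the REST flow, the following Python sketch builds an authenticated tool call without sending it. The address `http://127.0.0.1:3847`, the `POST /execute/{name}` shape and the token path come from this document. The helper names `read_token` and `build_execute_request` are hypothetical, and it is an assumption that the token file holds only the raw token and that the server accepts a JSON body with a `Content-Type: application/json` header.

```python
import json
import urllib.request
from pathlib import Path

BASE_URL = "http://127.0.0.1:3847"                 # address documented for `clawdcursor serve`
TOKEN_PATH = Path.home() / ".clawdcursor" / "token"


def read_token(path: Path = TOKEN_PATH) -> str:
    """Read the bearer token (assumption: the file holds only the token text)."""
    return path.read_text(encoding="utf-8").strip()


def build_execute_request(tool: str, params: dict, token: str,
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build, but do not send, a POST /execute/{tool} request."""
    return urllib.request.Request(
        url=f"{base_url}/execute/{tool}",
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",    # assumed, not confirmed by the source
        },
        method="POST",
    )
```

To actually send it, pass the result to `urllib.request.urlopen(req, timeout=10)` and decode the response body as JSON. Building the request separately from sending it keeps the header logic easy to check without a running server.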

If the server isn't running, start it yourself — don't ask the user:
CODEBLOCK3

Option B — MCP (`clawdcursor mcp`)

CODEBLOCK4

Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.

Option C — Autonomous agent (`clawdcursor start`)

CODEBLOCK5

Use the delegate_to_agent tool to submit tasks from within MCP/REST sessions. This requires clawdcursor start to be running on port 3847.

Polling pattern:
CODEBLOCK6
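
The polling pattern can be sketched as a small loop. This is a minimal sketch, not the project's own client: the status values (`acting`, `waiting_confirm`, `idle`), the 2-second interval and the 60-second abort threshold are taken from this document, while the function name `poll_task` and the injected callables are hypothetical. It also assumes that a user confirmation counts as progress and resets the timer; the source only says to abort after 60s or more without progress.

```python
import time
from typing import Callable


def poll_task(get_status: Callable[[], str],
              ask_user: Callable[[], bool],
              confirm: Callable[[bool], None],
              abort: Callable[[], None],
              interval_s: float = 2.0,
              timeout_s: float = 60.0,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the agent is idle. Returns "done" or "aborted"."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        status = get_status()                 # e.g. wraps GET /status
        if status == "idle":
            return "done"
        if status == "waiting_confirm":
            confirm(ask_user())               # the user decides; never self-approve
            deadline = clock() + timeout_s    # assumption: confirmation is progress
        sleep(interval_s)
    abort()                                   # e.g. wraps POST /abort
    return "aborted"
```

Injecting `get_status`, `confirm` and `abort` keeps the loop independent of the transport, so the same code works whether the calls go over REST or MCP.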

returnPartial mode — send {"returnPartial": true} with POST /task:
ClawdCursor skips Stage 3 (expensive vision) and returns control to you if Stage 2 fails:

{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}

You finish the task with MCP tools, then call POST /learn to save what worked.

POST /learn — adaptive learning:
After completing a task with your own tool calls, teach ClawdCursor for next time:

POST /learn
{
  "processName": "EXCEL",
  "task": "create table with headers",
  "actions": [
    {"action": "key", "description": "Ctrl+Home to go to A1"},
    {"action": "type", "description": "Type header name"},
    {"action": "key", "description": "Tab to next column"}
  ],
  "shortcuts": {"next_cell": "Tab", "next_row": "Enter"},
  "tips": ["Use Tab between columns, Enter between rows"]
}

This enriches the app's guide JSON. Stage 2 reads it on the next run — no vision fallback needed.
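
One way to tie returnPartial and POST /learn together is sketched below. The field names (`partial`, `processName`, `task`, `actions`, `shortcuts`, `tips`) are the ones shown in the examples above. The function names are hypothetical, and which fields the server actually requires is not stated in the source, so the validation here is an assumption.

```python
from typing import Any, Dict, List, Optional


def needs_manual_finish(task_response: Dict[str, Any]) -> bool:
    """True when the agent handed control back with {"partial": true, ...}."""
    return task_response.get("partial") is True


def build_learn_payload(process_name: str,
                        task: str,
                        actions: List[Dict[str, str]],
                        shortcuts: Optional[Dict[str, str]] = None,
                        tips: Optional[List[str]] = None) -> Dict[str, Any]:
    """Assemble a POST /learn body with the fields shown in the example."""
    for step in actions:
        # Assumption: every step carries both keys, as in the documented example.
        if "action" not in step or "description" not in step:
            raise ValueError(f"action needs 'action' and 'description': {step!r}")
    return {
        "processName": process_name,
        "task": task,
        "actions": list(actions),
        "shortcuts": dict(shortcuts or {}),
        "tips": list(tips or []),
    }
```

A caller would check `needs_manual_finish(response)`, finish the remaining steps with its own tool calls, then post the payload built from the steps that worked.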

The Universal Loop

Every GUI task follows the same pattern regardless of transport:

CODEBLOCK9
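
The loop can be pictured as orient, act, verify, repeat. The Python sketch below is transport-agnostic and purely illustrative: `run_gui_task` and its callable parameters are hypothetical names, and the step and failure limits are assumptions added so the loop cannot run forever.

```python
from typing import Any, Callable


def run_gui_task(orient: Callable[[], Any],
                 is_done: Callable[[Any], bool],
                 act: Callable[[Any], Any],
                 verify: Callable[[Any], bool],
                 max_steps: int = 20,
                 max_failures: int = 3) -> bool:
    """Orient, act, verify, repeat. Returns True once is_done(state) holds."""
    failures = 0
    for _ in range(max_steps):
        state = orient()              # e.g. read_screen() or get_windows()
        if is_done(state):
            return True
        result = act(state)           # e.g. smart_click(), smart_type(), key_press()
        if not verify(result):        # start with the tool's own return value
            failures += 1
            if failures >= max_failures:
                return False          # give up instead of looping blindly
    return False
```

Re-orienting at the top of every iteration means a failed action is always followed by a fresh read of the screen rather than a blind retry.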

Verification (cheapest to most expensive)

1. Tool return value: every tool reports success/failure. Check it first.
2. Window state: get_active_window(), get_windows(). Did a dialog appear? Did the title change?
3. Text check: read_screen() or smart_read(). Is the expected text visible?
4. Screenshot: desktop_screenshot(). Use only when text methods fail. Costs the most.
5. Negative check: look for error dialogs, the wrong window, or an unchanged screen.

Always verify after: sends, saves, deletes, form submissions.
Skip verification for: mid-sequence keystrokes, scrolling.
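
The cheapest-first ordering above can be expressed as a short escalation helper. This is a sketch under stated assumptions: the name `verify_cheapest_first` is hypothetical, and the convention that each check returns `True` (confirmed), `False` (refuted) or `None` (inconclusive) is not part of ClawdCursor.

```python
from typing import Callable, Optional, Sequence, Tuple

# A named check: returns True (confirmed), False (refuted) or None (inconclusive).
Check = Tuple[str, Callable[[], Optional[bool]]]


def verify_cheapest_first(checks: Sequence[Check]) -> Tuple[Optional[bool], str]:
    """Run checks in the given order and stop at the first definite answer.

    Pass the checks ordered from cheapest to most expensive, e.g. return value,
    window state, text check, screenshot. Later checks never run once an
    earlier one is conclusive.
    """
    for name, check in checks:
        verdict = check()
        if verdict is not None:
            return verdict, name
    return None, "none"
```

Because the function stops at the first conclusive answer, the screenshot check only runs when every cheaper method was inconclusive.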

Tool Decision Trees

Perception — always start here

CODEBLOCK10

Clicking

CODEBLOCK11

Typing

CODEBLOCK12

Browser / CDP

CODEBLOCK13

If CDP isn't connected, switch tabs with keyboard:
CODEBLOCK14

Window Management

CODEBLOCK15

Rule: Always call focus_window() before key_press() or type_text(). Keystrokes go to whatever has focus; if that is your terminal, they land there instead of in the target app.
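
A wrapper can make the focus rule hard to forget. In this sketch `call_tool` stands in for whatever transport is in use (REST or MCP) and is hypothetical, as is the helper name `type_into_window`. The `processId` parameter for focus_window and the `{"success": ...}` result shape are assumptions; only the `text` parameter of type_text appears in this document.

```python
from typing import Any, Callable, Dict

# Hypothetical transport hook: (tool name, params) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def type_into_window(call_tool: ToolCaller, process_id: int, text: str) -> Dict[str, Any]:
    """Focus the target window first, then type; refuse to type if focus failed."""
    focus = call_tool("focus_window", {"processId": process_id})  # assumed parameter name
    if not focus.get("success"):                                  # assumed result shape
        raise RuntimeError(f"could not focus process {process_id}; not typing")
    return call_tool("type_text", {"text": text})
```

Raising instead of typing anyway avoids sending keystrokes to the terminal or another unrelated window.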

Canvas apps (Google Docs, Figma, Notion)

DOM has no readable text. Pattern:

ocr_read_screen()          → read content (DOM extraction fails)
mouse_click(x, y)          → click into the canvas area
type_text("your text")     → clipboard paste works even on canvas
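
The canvas fallback can be sketched as: try DOM text first, and use OCR when it comes back empty. `call_tool` is a hypothetical transport hook and `read_page_text` a hypothetical helper. The tool name `cdp_read_text` is inferred from this page (where the underscores were lost in extraction), and the `text` field in the results is an assumption about the response shape.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical transport hook: (tool name, params) -> result dict.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def read_page_text(call_tool: ToolCaller) -> Tuple[str, str]:
    """Return (text, source), where source is "dom" or "ocr"."""
    dom = call_tool("cdp_read_text", {})          # inferred tool name
    text = (dom.get("text") or "").strip()        # assumed result field
    if text:
        return text, "dom"
    # Canvas apps yield no DOM text, so fall back to OCR.
    ocr = call_tool("ocr_read_screen", {})
    return (ocr.get("text") or "").strip(), "ocr"
```

Returning the source alongside the text lets the caller know whether coordinates and structure from the DOM are available or only OCR output.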

Quick Patterns

Open app and type:
CODEBLOCK17

Read a webpage:
CODEBLOCK18

Fill a web form:
CODEBLOCK19

Cross-app copy/paste:
CODEBLOCK20

Send email via Outlook:
CODEBLOCK21

Autonomous complex task (requires clawdcursor start):

delegate_to_agent("Open Gmail, find latest email from Stripe, forward to billing@x.com")
→ poll GET /status every 2s
→ if waiting_confirm: ask user → POST /confirm {"approved": true}
→ if idle: task done

Full Tool Reference (42 tools)

Speed: ⚡ Free/instant · 🔵 Cheap · 🟡 Moderate · 🔴 Vision (expensive)

Perception (6)

Tool	What it does	When
INLINECODE28	A11y tree: buttons, inputs, text, coords	⚡ Default first read
INLINECODE29	OCR + a11y combined	🔵 When unsure which to use

Mouse (7)

Tool	What it does	When
INLINECODE34	Find element by text/label, click	🔵 First choice for clicking
INLINECODE35	Left click at (x, y)	⚡ Last resort

Keyboard (5)

Tool	What it does	When
INLINECODE43	Find input by label, focus it, type	🔵 First choice for form fields
INLINECODE44	Clipboard paste into focused element	⚡ After manually focusing

Window Management (5)

Tool	What it does	When
INLINECODE51	List all open windows with PIDs and bounds	⚡ Situational awareness
INLINECODE52	Current foreground window	⚡ Check current focus

UI Elements (2)

Tool	What it does	When
INLINECODE56	Search UI tree by name or type	⚡ Find automation IDs
INLINECODE57	Invoke element by automation ID or name	⚡ When ID known from read_screen

Clipboard (2)

Tool	What it does	When
INLINECODE58	Read clipboard text	⚡ After copy operations
INLINECODE59	Write text to clipboard	⚡ Before paste operations

Browser / CDP (11)

Tool	What it does	When
INLINECODE60	Connect to browser DevTools Protocol	⚡ First step for any browser task
INLINECODE61	List interactive elements on page	⚡ After connect

Orchestration (4)

Tool	What it does	When
INLINECODE73	Launch application by name	⚡ First step for desktop tasks
INLINECODE74	Open URL (auto-enables CDP)	⚡ First step for browser tasks

Provider Setup (agent mode only)

Provider	Setup	Cost
Ollama (local)	INLINECODE78	$0 — fully offline, no data leaves machine
Any cloud

Run clawdcursor doctor to auto-detect and validate providers.

Security

- Network isolation: Binds to 127.0.0.1 only. Verify with netstat -an | findstr 3847; the output should show 127.0.0.1:3847, never INLINECODE88
- Ollama: 100% offline. Screenshots stay in RAM and never leave the machine.
- Cloud providers: Screenshots/text are sent only to your configured provider. No telemetry, no analytics, no third-party logging.
- Token auth: All mutating POST endpoints require Authorization: Bearer <token>. Token at ~/.clawdcursor/token.
- Safety tiers: Auto / Preview / Confirm. Agents must never self-approve Confirm actions.

Coordinate System

All mouse tools use image-space coordinates from a 1280px-wide viewport — matching screenshots from desktop_screenshot. DPI scaling is handled automatically. Do not pre-scale coordinates.

Safety

Tier	Actions	Behavior
🟢 Auto	Navigation, reading, opening apps	Runs immediately
🟡 Preview

- Never self-approve Confirm actions.
- INLINECODE92 and Ctrl+Alt+Delete are blocked.
- Server binds to 127.0.0.1 only.
- First run requires explicit user consent for desktop control.

Error Recovery

Problem	Fix
Port 3847 not responding	INLINECODE95 — wait 2s — INLINECODE96
401 Unauthorized

Platform Support

Platform	A11y	OCR	CDP
Windows (x64/ARM64)	PowerShell + .NET UIA	Windows.Media.Ocr	Chrome/Edge
macOS (Intel/Apple Silicon)

macOS: Grant Accessibility in System Settings → Privacy → Accessibility.
Linux: sudo apt install tesseract-ocr for OCR support.

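Skill name: clawdcursor

If a human can do it on a screen, you can too. No API? No integration? No problem.

Use as a fallback, not as a first choice
Before using any ClawdCursor tool, ask:

1. Is there a native API? (Gmail API, GitHub API, Slack API) → use the API
2. Is there a CLI? (git, npm, curl) → use the CLI
3. Can you edit the file directly? → edit it directly
4. Is there a browser automation layer? (Playwright, Puppeteer) → use that layer

None of the above work? Now use ClawdCursor. It is for the last mile.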

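Modes at a Glance

Mode	Command	Brain	Tools available
serve	clawdcursor serve	You (REST client)	All 42 tools via HTTP
mcp

In serve and mcp modes: you reason, ClawdCursor acts. There is no built-in LLM. You call tools, interpret results, and decide next steps.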

Connecting

Option A: REST (clawdcursor serve)

clawdcursor serve   # starts on http://127.0.0.1:3847

All POST endpoints require: Authorization: Bearer <token> (token saved to ~/.clawdcursor/token)

GET /tools → all tool schemas (OpenAI function-calling format)
POST /execute/{name} → run a tool: {"param": "value"}
GET /health → {"status": "ok", "version": "0.7.5"}
GET /docs → full documentation

Example:

POST /execute/get_windows {}
POST /execute/mouse_click {"x": 640, "y": 400}
POST /execute/type_text {"text": "hello world"}

If the server isn't running, start it yourself; don't ask the user:

clawdcursor serve

Wait 2 seconds, then verify: GET /health

Option B: MCP (clawdcursor mcp)

{
  "mcpServers": {
    "clawdcursor": {
      "command": "clawdcursor",
      "args": ["mcp"]
    }
  }
}

Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.

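Option C: Autonomous agent (clawdcursor start)

POST /task {"task": "Open Notepad and write Hello"} → submit a task
GET /status → {"status": "acting"} | idle | waiting_confirm
POST /confirm {"approved": true} → approve a safety-gated action
POST /abort → stop the current task

Use the delegate_to_agent tool to submit tasks from within MCP/REST sessions. This requires clawdcursor start to be running on port 3847.

Polling pattern:

POST /task {"task": "...", "returnPartial": true}
→ poll GET /status every 2s:
    acting → still running, keep polling
    waiting_confirm → stop. Ask the user → POST /confirm {"approved": true}
    idle → done, check GET /task-logs for the result
→ if there is no progress for 60s or more: POST /abort, then retry with simpler phrasing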

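returnPartial mode: send {"returnPartial": true} with POST /task.
If Stage 2 fails, ClawdCursor skips Stage 3 (expensive vision) and returns control to you:

{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}

You finish the task with MCP tools, then call POST /learn to save what worked.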

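POST /learn: adaptive learning.
After completing a task with your own tool calls, teach ClawdCursor for next time:

POST /learn
{
  "processName": "EXCEL",
  "task": "create table with headers",
  "actions": [
    {"action": "key", "description": "Ctrl+Home to go to A1"},
    {"action": "type", "description": "Type header name"},
    {"action": "key", "description": "Tab to next column"}
  ],
  "shortcuts": {"next_cell": "Tab", "next_row": "Enter"},
  "tips": ["Use Tab between columns, Enter between rows"]
}

This enriches the app's guide JSON. Stage 2 reads it on the next run, so no vision fallback is needed.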

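The Universal Loop

Every GUI task follows the same pattern regardless of transport:

1. Orient → read_screen() or get_windows() to see what is open and focused
2. Act → smart_click() / smart_type() / key_press() to perform the action
3. Verify → check the return value → window state → text check → screenshot
4. Repeat → until done

Verification (cheapest to most expensive)

1. Tool return value: every tool reports success/failure. Check it first.
2. Window state: get_active_window(), get_windows(). Did a dialog appear? Did the title change?
3. Text check: read_screen() or smart_read(). Is the expected text visible?
4. Screenshot: desktop_screenshot(). Use only when text methods fail. Costs the most.
5. Negative check: look for error dialogs, the wrong window, or an unchanged screen.

Always verify after: sends, saves, deletes, form submissions.
Skip verification for: mid-sequence keystrokes, scrolling.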

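Tool Decision Trees

Perception: always start here

read_screen() → First choice. Accessibility tree: buttons, inputs, text, with coordinates.
    Fast, structured, works on native apps.
ocr_read_screen() → When the a11y tree is empty (canvas UIs, image-based apps).
smart_read() → Combines OCR + a11y. Call it first when unsure.
desktop_screenshot() → Last resort. Only when you need pixel-level visual detail.
desktop_screenshot_region(x,y,w,h) → Zoomed crop, for when you need detail in one area.

Clicking

smart_click("Save") → First choice. Finds the element by label/text via OCR + a11y and clicks it.
    Pass processId to target the correct window.
invoke_element(name="Save") → When you know the exact automation ID from read_screen.
cdp_click(text="Submit") → Browser elements. Requires cdp_connect() first.
mouse_click(x, y) → Last resort. Raw coordinates from a screenshot.

Typing

smart_type("Email", "user@x.com") → First choice. Finds the field by label, focuses it, types.
cdp_type(label="Email", text="…") → Browser inputs. Requires cdp_connect() first.
type_text("hello") → Clipboard paste into the currently focused element.
    Use after manually focusing with smart_click.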

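Browser / CDP

1. navigate_browser(url) → open a URL, auto-enables CDP
2. cdp_connect() → connect to the browser DevTools Protocol
3. cdp_page_context() → list interactive elements on the page
4. cdp_read_text() → extract DOM text (canvas apps return empty → use OCR)
5. cdp_click(text="…") → click by visible text
6. cdp_type(label, text) → fill an input by label
7. cdp_evaluate(script) → run JavaScript in the page context
8. cdp_scroll(direction, px) → scroll the page via the DOM (not the mouse wheel)
9. cdp_list_tabs() → list all open tabs
10. cdp_switch_tab(target) → switch to a specific tab

If CDP isn't connected, switch tabs with the keyboard:

key_press("ctrl+1") → tab 1
key_press("ctrl+tab") → next tab
key_press("ctrl+shift+tab") → previous tab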

Window Management

get_windows() → list all open windows (use it to find PIDs)
get_active_window() → current

