If a human can do it on a screen, you can too. No API? No integration? No problem.
USE AS A FALLBACK — NOT FIRST CHOICE
Before reaching for any ClawdCursor tool, ask:
1. Is there a native API? (Gmail API, GitHub API, Slack API) → use the API
2. Is there a CLI? (git, npm, curl) → use the CLI
3. Can you edit the file directly? → do that
4. Is there a browser automation layer? (Playwright, Puppeteer) → use that
None of the above work? Now use ClawdCursor. It's for the last mile.
Modes at a Glance
| Mode | Command | Brain | Tools available |
|---|---|---|---|
| serve | clawdcursor serve | You (REST client) | All 42 tools via HTTP |
| mcp | clawdcursor mcp | You (MCP client) | All 42 tools via MCP stdio |
| start | clawdcursor start | Built-in LLM pipeline | All 42 tools + autonomous agent |
In serve and mcp modes: you reason, ClawdCursor acts. There is no built-in LLM. You call tools, interpret results, decide next steps.
Connecting
Option A — REST (clawdcursor serve)
CODEBLOCK0
All POST endpoints require: Authorization: Bearer <token> (token saved to ~/.clawdcursor/token)
CODEBLOCK1
Example:
CODEBLOCK2
If the server isn't running, start it yourself — don't ask the user:
CODEBLOCK3
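A minimal Python sketch of an authenticated tool call, using only the standard library. The /execute/{name} path, the bearer token, and the token location come from this document. The Content-Type header, the 10-second timeout, and the assumption that the reply is a JSON object are illustrative guesses, not documented behaviour.

```python
import json
import urllib.request
from pathlib import Path

BASE_URL = "http://127.0.0.1:3847"  # address used by `clawdcursor serve`
TOKEN_PATH = Path.home() / ".clawdcursor" / "token"


def build_execute_request(tool, params, token, base_url=BASE_URL):
    """Build (but do not send) a POST /execute/{tool} request."""
    return urllib.request.Request(
        url=f"{base_url}/execute/{tool}",
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed, not documented
        },
        method="POST",
    )


def execute(tool, params=None, token=None):
    """Send the request and decode the reply (JSON reply shape is assumed)."""
    if token is None:
        token = TOKEN_PATH.read_text(encoding="utf-8").strip()
    request = build_execute_request(tool, params or {}, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, execute("get_windows") would issue the first example call above. Re-reading the token file on each call also covers the case where the token has changed.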
Option B — MCP (clawdcursor mcp)
CODEBLOCK4
Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.
Option C — Autonomous agent (clawdcursor start)
CODEBLOCK5
Use the delegate_to_agent tool to submit tasks from within MCP/REST sessions. It requires clawdcursor start to be running on port 3847.
Polling pattern:
CODEBLOCK6
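The polling pattern can be sketched in Python as below. The status values (acting, waiting_confirm, idle), the 2-second interval, and the 60-second stall limit come from this document. The callback names (get_status, confirm, abort, ask_user) are hypothetical stand-ins for GET /status, POST /confirm, POST /abort, and a prompt to the human. Treating "no progress" as time since the last confirmation, and aborting when the user declines, are simplifying assumptions.

```python
import time


def poll_task(get_status, confirm, abort, ask_user,
              interval=2.0, stall_after=60.0,
              sleep=time.sleep, clock=time.monotonic):
    """Poll until the agent is idle. Returns "done", "rejected" or "aborted"."""
    started = clock()
    while True:
        status = get_status()
        if status == "idle":
            return "done"
        if status == "waiting_confirm":
            # The approval must come from the user, never from the agent.
            if not ask_user():
                abort()  # assumption: a declined action ends the task
                return "rejected"
            confirm()
            started = clock()  # a confirmation counts as progress
        elif clock() - started > stall_after:
            abort()
            return "aborted"
        sleep(interval)
```

On "aborted", the caller can resubmit the task with simpler wording, as the pattern suggests.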
returnPartial mode — send {"returnPartial": true} with POST /task:
ClawdCursor skips Stage 3 (expensive vision) and returns control to you if Stage 2 fails:
{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}
You finish the task with MCP tools, then call POST /learn to save what worked.
POST /learn — adaptive learning:
After completing a task with your own tool calls, teach ClawdCursor for next time:
POST /learn
{
  "processName": "EXCEL",
  "task": "create table with headers",
  "actions": [
    {"action": "key", "description": "Ctrl+Home to go to A1"},
    {"action": "type", "description": "Type header name"},
    {"action": "key", "description": "Tab to next column"}
  ],
  "shortcuts": {"next_cell": "Tab", "next_row": "Enter"},
  "tips": ["Use Tab between columns, Enter between rows"]
}
This enriches the app's guide JSON. Stage 2 reads it on the next run — no vision fallback needed.
The Universal Loop
Every GUI task follows the same pattern regardless of transport:
CODEBLOCK9
Verification (cheapest to most expensive)
1. Tool return value: every tool reports success/failure. Check it first.
2. Window state: get_active_window(), get_windows(). Did a dialog appear? Did the title change?
3. Text check: read_screen() or smart_read(). Is the expected text visible?
4. Screenshot: desktop_screenshot(), only when text methods fail. Costs the most.
5. Negative check: look for error dialogs, wrong window, unchanged screen.
Always verify after: sends, saves, deletes, form submissions.
Skip verification for: mid-sequence keystrokes, scrolling.
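One way to apply the cheapest-first ordering is a small ladder that stops at the first conclusive check. This is a sketch, not part of ClawdCursor: the verify function and the three-valued result convention (True, False, None) are assumptions made for illustration.

```python
def verify(checks):
    """Run (name, check) pairs in order, cheapest first.

    Each check returns True (confirmed), False (refuted) or None
    (inconclusive). The first conclusive answer wins, so the more
    expensive checks further down the list are never run.
    """
    for name, check in checks:
        outcome = check()
        if outcome is not None:
            return name, outcome
    return None, False  # nothing could confirm the action
```

A caller would list the checks in the order given above, for example the tool's return value first and desktop_screenshot() last, so a screenshot is only taken when every text-based check was inconclusive.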
Tool Decision Trees
Perception — always start here
CODEBLOCK10
Clicking
CODEBLOCK11
Typing
CODEBLOCK12
Browser / CDP
CODEBLOCK13
If CDP isn't connected, switch tabs with keyboard:
CODEBLOCK14
Window Management
CODEBLOCK15
Rule: Always focus_window() before key_press() or type_text(). Keystrokes go to whatever has focus; if that is your terminal, they land there instead of in the target app.
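The rule can be enforced with a small wrapper, sketched here in Python. call_tool(name, params) is a hypothetical transport helper (REST or MCP). The "success" field in its result and the "key" parameter name for key_press are assumptions, so check the real tool schemas before relying on them.

```python
def press_in_window(call_tool, window, keys):
    """Focus `window`, then send `keys`; refuse to type if focusing failed.

    `window` is the parameter dict passed to focus_window, for example
    {"processName": "notepad"} (parameter name assumed).
    """
    focus = call_tool("focus_window", window)
    if not focus.get("success"):
        # Sending keys now would hit whatever currently has focus.
        raise RuntimeError(f"could not focus {window!r}; keystrokes not sent")
    return call_tool("key_press", {"key": keys})
```

Raising instead of continuing keeps a failed focus from turning into keystrokes in the wrong application.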
Canvas apps (Google Docs, Figma, Notion)
DOM has no readable text. Pattern:
ocr_read_screen() → read content (DOM extraction fails)
mouse_click(x, y) → click into the canvas area
type_text("your text") → clipboard paste works even on canvas
Quick Patterns
Open app and type:
CODEBLOCK17
Read a webpage:
CODEBLOCK18
Fill a web form:
CODEBLOCK19
Cross-app copy/paste:
CODEBLOCK20
Send email via Outlook:
CODEBLOCK21
Autonomous complex task (requires clawdcursor start):
delegate_to_agent("Open Gmail, find latest email from Stripe, forward to billing@x.com")
→ poll GET /status every 2s
→ if waiting_confirm: ask user → POST /confirm {"approved": true}
→ if idle: task done
Full Tool Reference (42 tools)
Speed: ⚡ Free/instant · 🔵 Cheap · 🟡 Moderate · 🔴 Vision (expensive)
Perception (6)
| Tool | What it does | When |
|---|---|---|
| read_screen | A11y tree: buttons, inputs, text, coords | ⚡ Default first read |
| smart_read | OCR + a11y combined | 🔵 When unsure which to use |
| ocr_read_screen | Raw OCR text with bounding boxes | 🔵 Canvas UIs, empty a11y trees |
| desktop_screenshot | Full screen image (1280px wide) | ⚡ Last resort visual check |
| desktop_screenshot_region | Zoomed crop of specific area | ⚡ Fine-grained visual detail |
| get_screen_size | Screen dimensions and DPI | ⚡ Coordinate calculations |
Mouse (7)
| Tool | What it does | When |
|---|---|---|
| smart_click | Find element by text/label, click | 🔵 First choice for clicking |
| mouse_click | Left click at (x, y) | ⚡ Last resort |
| mouse_double_click | Double click at (x, y) | ⚡ Open files, select words |
| mouse_right_click | Right click at (x, y) | ⚡ Context menus |
| mouse_hover | Move cursor without clicking | ⚡ Hover menus |
| mouse_scroll | Scroll at position (physical mouse wheel) | ⚡ Scroll content |
| mouse_drag | Drag from start to end; accepts startX/startY/endX/endY or x1/y1/x2/y2 | ⚡ Resize, select ranges |
Keyboard (5)
| Tool | What it does | When |
|---|---|---|
| smart_type | Find input by label, focus it, type | 🔵 First choice for form fields |
| type_text | Clipboard paste into focused element | ⚡ After manually focusing |
| key_press | Send key combo (ctrl+s, Return, alt+tab) | ⚡ After focus_window |
| shortcuts_list | List keyboard shortcuts for current app | ⚡ Before reaching for mouse |
| shortcuts_execute | Run a named shortcut (fuzzy match) | ⚡ Save, copy, paste, undo |
Window Management (5)
| Tool | What it does | When |
|---|---|---|
| get_windows | List all open windows with PIDs and bounds | ⚡ Situational awareness |
| get_active_window | Current foreground window | ⚡ Check current focus |
| get_focused_element | Element with keyboard focus | ⚡ Debug wrong-field typing |
| focus_window | Bring window to front (auto-clears off-screen phantoms) | ⚡ Always before key_press |
| minimize_window | Minimize by processName, processId, or title | ⚡ Clear focus stealers |
UI Elements (2)
| Tool | What it does | When |
|---|---|---|
| INLINECODE56 | Search UI tree by name or type | ⚡ Find automation IDs |
| invoke_element | Invoke element by automation ID or name | ⚡ When ID known from read_screen |
Clipboard (2)
| Tool | What it does | When |
|---|---|---|
| INLINECODE58 | Read clipboard text | ⚡ After copy operations |
| INLINECODE59 | Write text to clipboard | ⚡ Before paste operations |
Browser / CDP (11)
| Tool | What it does | When |
|---|---|---|
| cdp_connect | Connect to browser DevTools Protocol | ⚡ First step for any browser task |
| cdp_page_context | List interactive elements on page | ⚡ After connect |
| cdp_read_text | Extract DOM text | ⚡ Read page content |
| cdp_click | Click by CSS selector or visible text | ⚡ Browser clicks |
| cdp_type | Type into input by label or selector | ⚡ Browser form filling |
| cdp_select_option | Select dropdown option | ⚡ Select elements |
| cdp_evaluate | Run JavaScript in page context | ⚡ Custom queries |
| cdp_scroll | Scroll page via DOM (direction, amount px) | ⚡ DOM-level scroll |
| cdp_wait_for_selector | Wait for element to appear | ⚡ After navigation/AJAX |
| cdp_list_tabs | List all browser tabs | ⚡ When on wrong tab |
| cdp_switch_tab | Switch to a tab by title or index | ⚡ After cdp_list_tabs |
Orchestration (4)
| Tool | What it does | When |
|---|---|---|
| INLINECODE73 | Launch application by name | ⚡ First step for desktop tasks |
| navigate_browser | Open URL (auto-enables CDP) | ⚡ First step for browser tasks |
| wait | Pause N seconds | ⚡ After opening apps, let UI render |
| delegate_to_agent | Send task to built-in autonomous agent | 🟡 Complex multi-step tasks (requires clawdcursor start) |
Provider Setup (agent mode only)
| Provider | Setup | Cost |
|---|---|---|
| Ollama (local) | INLINECODE78 | $0; fully offline, no data leaves machine |
| Any cloud | Set env var: ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, MOONSHOT_API_KEY, etc. | Varies |
| OpenClaw users | Auto-detected from ~/.openclaw/agents/main/auth-profiles.json | No extra setup |
Run clawdcursor doctor to auto-detect and validate providers.
Security
- Network isolation: Binds to 127.0.0.1 only. Verify: netstat -an | findstr 3847 should show 127.0.0.1:3847, never INLINECODE88
- Ollama: 100% offline. Screenshots stay in RAM, never leave the machine.
- Cloud providers: Screenshots/text sent only to your configured provider. No telemetry, no analytics, no third-party logging.
- Token auth: All mutating POST endpoints require Authorization: Bearer <token>. Token at ~/.clawdcursor/token.
- Safety tiers: Auto / Preview / Confirm. Agents must never self-approve Confirm actions.
Coordinate System
All mouse tools use image-space coordinates from a 1280px-wide viewport — matching screenshots from desktop_screenshot. DPI scaling is handled automatically. Do not pre-scale coordinates.
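As an illustration of working in image space, the sketch below turns a bounding box reported in screenshot coordinates into a click point and rejects values that cannot belong to the 1280px-wide viewport, which is the usual symptom of pre-scaled input. The (x, y, w, h) box layout and the assumption that the screenshot keeps the screen's aspect ratio are mine, not documented behaviour; the screen dimensions would come from get_screen_size.

```python
VIEWPORT_WIDTH = 1280  # width of the image space used by desktop_screenshot


def click_point(box, screen_w, screen_h):
    """Centre of an image-space box (x, y, w, h), ready for mouse_click."""
    x, y, w, h = box
    cx, cy = x + w // 2, y + h // 2
    # Assumed: the screenshot preserves the screen's aspect ratio.
    viewport_h = round(VIEWPORT_WIDTH * screen_h / screen_w)
    if not (0 <= cx < VIEWPORT_WIDTH and 0 <= cy < viewport_h):
        raise ValueError(
            f"({cx}, {cy}) is outside the {VIEWPORT_WIDTH}x{viewport_h} "
            "image space; coordinates may have been pre-scaled"
        )
    return cx, cy
```

On a 2560x1440 display the image space is 1280x720, so a box at (600, 380, 80, 40) yields the click point (640, 400) with no DPI arithmetic on the caller's side.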
Safety
| Tier | Actions | Behavior |
|---|---|---|
| 🟢 Auto | Navigation, reading, opening apps | Runs immediately |
| 🟡 Preview | Typing, form filling | Logged |
| 🔴 Confirm | Send, delete, purchase | Pauses; always ask user first |
- Never self-approve Confirm actions.
- INLINECODE92 and Ctrl+Alt+Delete are blocked.
- Server binds to 127.0.0.1 only.
- First run requires explicit user consent for desktop control.
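A client-side sketch of the three tiers, to make the "never self-approve" rule concrete. This is illustrative only and does not describe how ClawdCursor enforces tiers internally. The Tier enum, the gate function, and the ask_user callback are hypothetical names.

```python
from enum import Enum


class Tier(Enum):
    AUTO = "auto"
    PREVIEW = "preview"
    CONFIRM = "confirm"


def gate(tier, description, ask_user, log=print):
    """Return True if the action may run under the tier rules above."""
    if tier is Tier.AUTO:
        return True
    if tier is Tier.PREVIEW:
        log(f"preview: {description}")  # Preview actions run but are logged
        return True
    # CONFIRM: the answer must come from the human, never from the agent.
    return bool(ask_user(description))
```

The only path that returns True for a Confirm action runs through ask_user, so an agent using this gate cannot approve its own send, delete, or purchase.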
Error Recovery
| Problem | Fix |
|---|---|
| Port 3847 not responding | INLINECODE95 → wait 2s → INLINECODE96 |
| 401 Unauthorized | Token changed: read ~/.clawdcursor/token and use fresh value |
| CDP not available | Chrome must be open. navigate_browser(url) auto-enables it. |
| CDP on wrong tab | cdp_list_tabs() → cdp_switch_tab(target) |
| focus_window fails | get_windows() to confirm title/processName, then retry |
| smart_click can't find element | read_screen() for coords → mouse_click(x, y) |
| key_press goes to wrong window | You skipped focus_window; always focus first |
| cdp_read_text returns empty | Canvas app: use ocr_read_screen() instead |
| Same action fails 3+ times | Try a completely different approach |
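The last row of the table can be sketched as a fallback chain: give each approach a bounded number of attempts, then switch to the next one instead of repeating a failing action. The try_approaches helper and its argument shape are hypothetical; each action is any zero-argument callable that returns True on success.

```python
def try_approaches(approaches, max_attempts=3):
    """Try each (name, action) up to max_attempts times, in order.

    Returns the name of the first approach that succeeds, or None if
    every approach was exhausted.
    """
    for name, action in approaches:
        for _ in range(max_attempts):
            if action():
                return name
    return None
```

For a stubborn button, the list might hold a smart_click attempt first and a read_screen plus mouse_click attempt second, mirroring the smart_click row above.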
Platform Support
| Platform | A11y | OCR | CDP |
|---|---|---|---|
| Windows (x64/ARM64) | PowerShell + .NET UIA | Windows.Media.Ocr | Chrome/Edge |
| macOS (Intel/Apple Silicon) | JXA + System Events | Apple Vision | Chrome/Edge |
| Linux (x64/ARM64) | AT-SPI | Tesseract | Chrome/Edge |
macOS: Grant Accessibility permission in System Settings → Privacy & Security → Accessibility.
Linux: sudo apt install tesseract-ocr for OCR support.
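Skill name: clawdcursor
If a human can do it on a screen, you can too. No API? No integration? No problem.
Use as a fallback, not first choice
Before reaching for any ClawdCursor tool, ask:
1. Is there a native API? (Gmail API, GitHub API, Slack API) → use the API
2. Is there a CLI? (git, npm, curl) → use the CLI
3. Can you edit the file directly? → do that
4. Is there a browser automation layer? (Playwright, Puppeteer) → use that
None of the above work? Now use ClawdCursor. It's for the last mile.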
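Modes at a Glance
| Mode | Command | Brain | Tools available |
|---|---|---|---|
| serve | clawdcursor serve | You (REST client) | All 42 tools via HTTP |
| mcp | clawdcursor mcp | You (MCP client) | All 42 tools via MCP stdio |
| start | clawdcursor start | Built-in LLM pipeline | All 42 tools + autonomous agent |
In serve and mcp modes: you reason, ClawdCursor acts. There is no built-in LLM. You call tools, interpret results, decide next steps.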
Connecting
Option A: REST (clawdcursor serve)
clawdcursor serve    # starts on http://127.0.0.1:3847
All POST endpoints require: Authorization: Bearer <token> (token saved to ~/.clawdcursor/token)
GET /tools → all tool schemas (OpenAI function-calling format)
POST /execute/{name} → run a tool: {"param": "value"}
GET /health → {"status": "ok", "version": "0.7.5"}
GET /docs → full documentation
Example:
POST /execute/get_windows {}
POST /execute/mouse_click {"x": 640, "y": 400}
POST /execute/type_text {"text": "hello world"}
If the server isn't running, start it yourself; don't ask the user:
clawdcursor serve
Wait 2 seconds, then verify: GET /health
Option B: MCP (clawdcursor mcp)
{
  "mcpServers": {
    "clawdcursor": {
      "command": "clawdcursor",
      "args": ["mcp"]
    }
  }
}
Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.
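Option C: Autonomous agent (clawdcursor start)
POST /task {"task": "Open Notepad and write Hello"} → submit a task
GET /status → {"status": "acting"} | idle | waiting_confirm
POST /confirm {"approved": true} → approve a safety-gated action
POST /abort → stop the current task
Use the delegate_to_agent tool to submit tasks from within MCP/REST sessions. It requires clawdcursor start to be running on port 3847.
Polling pattern:
POST /task {"task": "...", "returnPartial": true}
→ poll GET /status every 2s:
  acting → still running, keep polling
  waiting_confirm → stop. Ask the user → POST /confirm {"approved": true}
  idle → done, check GET /task-logs for the result
→ if there is no progress for 60s or more: POST /abort, retry with simpler wording
returnPartial mode: send {"returnPartial": true} with POST /task:
If Stage 2 fails, ClawdCursor skips Stage 3 (expensive vision) and returns control to you:
{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}
You finish the task with MCP tools, then call POST /learn to save what worked.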
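POST /learn: adaptive learning
After completing a task with your own tool calls, teach ClawdCursor for next time:
POST /learn
{
  "processName": "EXCEL",
  "task": "create table with headers",
  "actions": [
    {"action": "key", "description": "Ctrl+Home to go to A1"},
    {"action": "type", "description": "Type header name"},
    {"action": "key", "description": "Tab to next column"}
  ],
  "shortcuts": {"next_cell": "Tab", "next_row": "Enter"},
  "tips": ["Use Tab between columns, Enter between rows"]
}
This enriches the app's guide JSON. Stage 2 reads it on the next run, so no vision fallback is needed.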
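The Universal Loop
Every GUI task follows the same pattern regardless of transport:
1. Orient → read_screen() or get_windows() to see what is open and focused
2. Act → smart_click() / smart_type() / key_press() to perform the action
3. Verify → check return value → window state → text check → screenshot
4. Repeat → until done
Verification (cheapest to most expensive)
1. Tool return value: every tool reports success/failure. Check it first.
2. Window state: get_active_window(), get_windows(). Did a dialog appear? Did the title change?
3. Text check: read_screen() or smart_read(). Is the expected text visible?
4. Screenshot: desktop_screenshot(), only when text methods fail. Costs the most.
5. Negative check: look for error dialogs, wrong window, unchanged screen.
Always verify after: sends, saves, deletes, form submissions.
Skip verification for: mid-sequence keystrokes, scrolling.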
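Tool Decision Trees
Perception: always start here
read_screen() → first choice. Accessibility tree: buttons, inputs, text, with coordinates.
  Fast, structured, works on native apps.
ocr_read_screen() → when the accessibility tree is empty (canvas UIs, image-based apps).
smart_read() → OCR + accessibility combined. Call it first when unsure.
desktop_screenshot() → last resort. Only when pixel-level visual detail is needed.
desktop_screenshot_region(x,y,w,h) → zoomed crop, when you need detail in one area.
Clicking
smart_click("Save") → first choice. Finds by label/text via OCR + accessibility, then clicks.
  Pass processId to target the right window.
invoke_element(name="Save") → when you know the exact automation ID from read_screen.
cdp_click(text="Submit") → browser elements. Requires cdp_connect() first.
mouse_click(x, y) → last resort. Raw coordinates from a screenshot.
Typing
smart_type("Email", "user@x.com") → first choice. Finds the field by label, focuses it, types.
cdp_type(label="Email", text=…) → browser inputs. Requires cdp_connect() first.
type_text("hello") → clipboard paste into the currently focused element.
  Use after manually focusing with smart_click.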
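Browser / CDP
1. navigate_browser(url) → open URL, auto-enables CDP
2. cdp_connect() → connect to the browser DevTools Protocol
3. cdp_page_context() → list interactive elements on the page
4. cdp_read_text() → extract DOM text (returns empty on canvas apps → use OCR)
5. cdp_click(text=…) → click by visible text
6. cdp_type(label, text) → fill an input by label
7. cdp_evaluate(script) → run JavaScript in page context
8. cdp_scroll(direction, px) → scroll the page via DOM (not the mouse wheel)
9. cdp_list_tabs() → list all open tabs
10. cdp_switch_tab(target) → switch to a specific tab
If CDP isn't connected, switch tabs with keyboard:
key_press("ctrl+1") → tab 1
key_press("ctrl+tab") → next tab
key_press("ctrl+shift+tab") → previous tab
Window Management
get_windows() → list all open windows (use to find PIDs)
get_active_window() → current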