Requirements

Before using AgentKVM, ensure the following are installed and available:

- AgentKVM CLI: `npm install -g agentkvm`
- Node.js >= 18
- ffmpeg: required for screenshot capture (`brew install ffmpeg` on macOS, `apt install ffmpeg` on Linux)
- NanoKVM-USB hardware connected to the host machine via USB
- HDMI output of the target device connected to the NanoKVM-USB

Run `agentkvm status` to verify everything is set up correctly. If the CLI is not found, install it first. If the device is not detected, run `agentkvm list` to see the available serial ports.
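A preflight check along these lines can catch a missing dependency before `agentkvm status` is run. This is a minimal sketch: `missing_tools` and `node_major` are hypothetical helper names, not part of the AgentKVM CLI, and the only external commands used are `command -v` and `sed`.

```bash
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Extract the major version from a string such as "v18.17.0"
# (the format printed by `node --version`).
node_major() {
  printf '%s\n' "$1" | sed 's/^v//; s/\..*$//'
}

# Usage (not run here):
#   missing_tools agentkvm node ffmpeg
#   [ "$(node_major "$(node --version)")" -ge 18 ] || echo "Node.js >= 18 required"
```

An empty result from `missing_tools` means every listed tool was found; hardware and HDMI connections still have to be checked with `agentkvm status`.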

AgentKVM — AI-Driven Device Control

AgentKVM lets you see and operate physical devices (iPhones, Android phones, PCs, Macs, Linux machines) connected via NanoKVM-USB hardware. You take screenshots to observe the screen, then send mouse clicks, keyboard input, and scrolls to interact — just like a human sitting in front of the device.

Core Loop

Every interaction with a physical device follows the same pattern:

```
Screenshot → Analyze → Act → Verify
```

1. Screenshot: capture what's currently on screen
2. Analyze: look at the image to understand the UI state
3. Act: click, type, scroll, or drag based on what you see
4. Verify: take another screenshot to confirm the action worked

This loop is your fundamental building block. Chain multiple iterations to accomplish complex tasks.

Quick Start

Check connection

```bash
agentkvm --json status
```

If this fails, the device is most likely not connected. Check the serial port with `agentkvm list`.

See the screen

```bash
agentkvm --json screenshot
```

Returns `{ "path": "/path/to/screenshot.png", ... }`. Read the image to see what's on screen.
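To use the screenshot path in a script, the `path` field has to be pulled out of the JSON. The sketch below does this with `sed` so that it needs nothing beyond a POSIX shell. `json_path` is a hypothetical helper name, and it assumes `path` is a string field without escaped quotes. A real JSON parser such as `jq` would be more robust where it is available.

```bash
# Read JSON on stdin and print the value of its "path" field
# (taken from the first line that contains one).
json_path() {
  sed -n 's/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (not run here):
#   shot=$(agentkvm --json screenshot | json_path)
#   echo "screenshot saved at: $shot"
```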

Interact

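```bash
# Click at pixel coordinates (relative to the cropped image)
agentkvm mouse click 223 485

# Type text
agentkvm type "hello world"

# Press a key or key combo
agentkvm key enter
agentkvm key ctrl+c
agentkvm key cmd+space

# Scroll (positive = up, negative = down)
agentkvm mouse scroll 300 500 --delta -3

# Drag from point A to point B
agentkvm mouse drag 100 200 400 600
```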

Remote operation

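If AgentKVM is running on another machine, all commands work identically with `--remote`:

```bash
agentkvm --remote http://192.168.1.100:7070 --token my-secret screenshot --json
agentkvm --remote http://192.168.1.100:7070 --token my-secret mouse click 223 485
```

Or use the HTTP API directly; see references/api.md.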
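Repeating `--remote` and `--token` on every command gets noisy in longer scripts. One option is a small wrapper that adds them when two environment variables are set. This is a sketch: the function name `akvm` and the variables `AGENTKVM_REMOTE` and `AGENTKVM_TOKEN` are conventions invented for this example, not something the CLI is known to read itself.

```bash
# Run agentkvm, prepending --remote/--token when AGENTKVM_REMOTE is set.
# AGENTKVM_REMOTE and AGENTKVM_TOKEN are this script's own variables.
akvm() {
  if [ -n "${AGENTKVM_REMOTE:-}" ]; then
    set -- --remote "$AGENTKVM_REMOTE" --token "${AGENTKVM_TOKEN:-}" "$@"
  fi
  agentkvm "$@"
}

# Usage (not run here):
#   AGENTKVM_REMOTE=http://192.168.1.100:7070
#   AGENTKVM_TOKEN=my-secret
#   akvm mouse click 223 485
```

With the variables unset, `akvm` behaves exactly like a local `agentkvm` call, so the same script can be used in both setups.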

How Coordinates Work

This is critical to get right. When you analyze a screenshot and identify a UI element at pixel (x, y), those coordinates are relative to the screenshot image itself, with the top-left corner at (0, 0). Pass these coordinates directly to `agentkvm mouse click x y`.

AgentKVM handles the translation to the actual hardware coordinates internally, based on the device type and crop settings. You don't need to do any math.

Two coordinate modes

The device type determines how coordinates are translated:

"device" mode (iPhone, Android) — The cropped region IS the device's full screen. HID absolute coordinates 0–4096 map to the device's own display. Use this when the HDMI output shows the device screen within a larger capture frame.

"frame" mode (PC, Mac, Linux) — The cropped region is just a visual focus area; HID coordinates still map to the full monitor. Use this when you're controlling a computer where the capture resolution matches the target display.

The mode is selected automatically from the `deviceType` setting in the config. You rarely need to think about it.
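For intuition only, the two modes can be written out as arithmetic. The sketch below is an interpretation of the descriptions above, not AgentKVM's actual implementation: it assumes a linear mapping onto the 0 to 4096 HID range, integer truncation, and, in frame mode, that the crop offset is added back before scaling by the full capture resolution. The function names are hypothetical.

```bash
# Device mode: the crop is the whole device screen, so scale within the crop.
# Args: x y crop_width crop_height
to_hid_device() {
  echo "$(( $1 * 4096 / $3 )) $(( $2 * 4096 / $4 ))"
}

# Frame mode: HID coordinates span the full frame, so add the crop offset
# back, then scale by the full resolution.
# Args: x y crop_x crop_y frame_width frame_height
to_hid_frame() {
  echo "$(( ($1 + $3) * 4096 / $5 )) $(( ($2 + $4) * 4096 / $6 ))"
}

# Example: a click at (223, 485) with a 447x970 crop at (738, 55)
# inside a 1920x1080 capture frame.
to_hid_device 223 485 447 970          # prints "2043 2048"
to_hid_frame 223 485 738 55 1920 1080  # prints "2050 2048"
```

In both cases the caller still passes screenshot-relative pixels to the CLI; the arithmetic only illustrates why the same click lands differently depending on the mode.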

Implementing a Task

When asked to perform a GUI task (e.g., "open Safari and search for X"):

Step 1: Observe first

Always start with a screenshot. Never assume what's on screen.

```bash
agentkvm --json screenshot
```

Read the returned image file. Describe what you see — this grounds your actions in reality.

Step 2: Plan your actions

Break the task into individual interactions. For "open Safari and search for X":

1. Find the Safari icon → click it
2. Wait for Safari to load → screenshot to verify
3. Find the address bar → click it
4. Type the search query
5. Press Enter
6. Screenshot to verify results

Step 3: Execute with verification

After each significant action, take a screenshot to verify it worked. Screens can be slow to update, so add brief waits between actions when needed (use `sleep` in your script).

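A common pattern in a bash script:

```bash
# Click the Safari icon at the observed position
agentkvm mouse click 223 950
sleep 1

# Verify it opened
agentkvm --json screenshot
# (read and analyze the screenshot)

# Click the address bar
agentkvm mouse click 300 50
sleep 0.3

# Type the search query
agentkvm type "weather today"
agentkvm key enter
sleep 2

# Verify the search results loaded
agentkvm --json screenshot
```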

Step 4: Handle failures

If an action didn't produce the expected result:

- The element might have moved: take a fresh screenshot and re-locate it
- The screen might not have updated yet: wait and retry
- You might have clicked the wrong spot: re-analyze and adjust coordinates
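The "wait and retry" case can be wrapped in a small helper. This is a generic shell sketch, not an AgentKVM feature: `retry` is a hypothetical function that re-runs any command until it succeeds or the attempt budget is spent. What counts as success is up to the check you pass in; with AgentKVM that check usually involves taking a fresh screenshot and analyzing it.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0; returns 1 after ATTEMPTS failures.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Usage (not run here; assumes `agentkvm --json status` exits non-zero on failure):
#   retry 3 1 agentkvm --json status || echo "device still not reachable"
```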

Config Reference

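All settings live in `~/.config/agentkvm/config.json`. A typical setup:

```json
{
  "serialPort": "/dev/tty.usbserial-2140",
  "resolution": { "width": 1920, "height": 1080 },
  "videoDevice": "USB3 Video",
  "deviceType": "iphone",
  "crop": { "x": 738, "y": 55, "width": 447, "height": 970 }
}
```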

Key fields:

- `serialPort`: path to the NanoKVM-USB serial device
- `resolution`: HDMI capture resolution
- `videoDevice`: video capture device name or index
- `deviceType`: determines the coordinate mode (`iphone`/`android` = device, `pc`/`mac`/`linux` = frame)
- `crop`: sub-region of the capture frame to use as the working area
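The mapping from device type to coordinate mode in the list above can be expressed as a lookup. This is an illustrative sketch with a hypothetical function name; AgentKVM performs this selection internally.

```bash
# Print the coordinate mode implied by a deviceType value.
coordinate_mode() {
  case "$1" in
    iphone|android) echo "device" ;;
    pc|mac|linux)   echo "frame" ;;
    *)              echo "unknown deviceType: $1" >&2; return 1 ;;
  esac
}

coordinate_mode iphone   # prints "device"
coordinate_mode linux    # prints "frame"
```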

Once the config is set, you can run commands without extra flags: `agentkvm screenshot`, `agentkvm mouse click 100 200`, etc.

Tips for Reliable Automation

Prefer clicking on text labels over icons: text is easier to locate precisely in screenshots.

Use `--json` for programmatic access: all commands support it and return structured data you can parse.

Double-click when a single click gets no response: some UI elements need `--double`.

Scroll in small increments: `--delta 1` or `--delta -1` is one scroll step (positive scrolls up, negative scrolls down). Use multiple steps with verification screenshots in between.
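Scrolling one step at a time with a screenshot after each step could look like the sketch below. `scroll_steps` is a hypothetical helper; it relies only on `agentkvm mouse scroll X Y --delta N` and `agentkvm --json screenshot`, and leaves the analysis of each screenshot to the caller. The 0.3 second pause is an arbitrary choice.

```bash
# scroll_steps X Y STEPS DELTA
# Sends STEPS single scroll events at (X, Y); DELTA is 1 (up) or -1 (down).
# After each step it takes a screenshot and passes through the JSON output.
scroll_steps() {
  i=0
  while [ "$i" -lt "$3" ]; do
    agentkvm mouse scroll "$1" "$2" --delta "$4" || return 1
    sleep 0.3   # give the screen a moment to update
    agentkvm --json screenshot || return 1
    i=$((i + 1))
  done
}

# Usage (not run here): scroll down three steps at (300, 500)
#   scroll_steps 300 500 3 -1
```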

Type slowly on unreliable connections: increase `--delay` (default 50ms) if characters get dropped.

Use key combos for navigation: `cmd+space` (Spotlight), `alt+tab` (window switch), and `ctrl+c` (cancel) are often faster than finding and clicking UI elements.

For the full CLI reference, key combo syntax, and HTTP API details, see references/api.md.

