Requirements
Before using AgentKVM, ensure the following are installed and available:
- AgentKVM CLI: npm install -g agentkvm
- Node.js >= 18
- ffmpeg: required for screenshot capture (brew install ffmpeg on macOS, apt install ffmpeg on Linux)
- NanoKVM-USB hardware connected to the host machine via USB
- HDMI output from the target device connected to the NanoKVM-USB
Run agentkvm status to verify everything is set up correctly. If the CLI is not found, install it first. If the device is not detected, check agentkvm list for available serial ports.
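The requirements above can be checked before the first run. The following is a minimal sketch, not part of AgentKVM: the helper names missing_tools and node_version_ok are hypothetical, and the sketch only assumes that node --version prints a string of the form v18.17.0.

```bash
# Print the name of every listed tool that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Succeed if a version string such as "v18.17.0" has a major version >= 18.
node_version_ok() {
  major=${1#v}        # drop the leading "v"
  major=${major%%.*}  # keep only the major component
  [ "$major" -ge 18 ] 2>/dev/null
}

# Usage:
#   missing_tools agentkvm node ffmpeg
#   node_version_ok "$(node --version)" || echo "Node.js 18 or newer is required"
```

An empty result from missing_tools means all three tools were found; agentkvm status remains the authoritative check for the hardware itself.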
AgentKVM — AI-Driven Device Control
AgentKVM lets you see and operate physical devices (iPhones, Android phones, PCs, Macs, Linux machines) connected via NanoKVM-USB hardware. You take screenshots to observe the screen, then send mouse clicks, keyboard input, and scrolls to interact — just like a human sitting in front of the device.
Core Loop
Every interaction with a physical device follows the same pattern:
```
Screenshot → Analyze → Act → Verify
```
1. Screenshot: capture what's currently on screen
2. Analyze: look at the image to understand the UI state
3. Act: click, type, scroll, or drag based on what you see
4. Verify: take another screenshot to confirm the action worked
This loop is your fundamental building block. Chain multiple iterations to accomplish complex tasks.
Quick Start
Check connection
```bash
agentkvm --json status
```
If this fails, the device isn't connected. Check the serial port with agentkvm list.
See the screen
```bash
agentkvm --json screenshot
```
Returns { "path": "/path/to/screenshot.png", ... }. Read the image to see what's on screen.
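In a script, the path field has to be pulled out of that JSON before the image can be read. The sketch below uses sed so that it has no extra dependencies; the function name screenshot_path is hypothetical, and it assumes the JSON is written to stdout and that the path contains no escaped quotes. A real JSON parser is the more robust choice if one is available.

```bash
# Read JSON on stdin and print the value of its "path" field.
# Sketch only: assumes the path value contains no escaped quotes.
screenshot_path() {
  sed -n 's/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (assumes the JSON is printed on stdout):
#   img=$(agentkvm --json screenshot | screenshot_path)
```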
Interact
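```bash
# Click at pixel coordinates (relative to the cropped image)
agentkvm mouse click 223 485

# Type text
agentkvm type "hello world"

# Press keys and key combos
agentkvm key enter
agentkvm key ctrl+c
agentkvm key cmd+space

# Scroll (positive = up, negative = down)
agentkvm mouse scroll 300 500 --delta -3

# Drag from point A to point B
agentkvm mouse drag 100 200 400 600
```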
Remote operation
If AgentKVM is running on another machine, all commands work identically with --remote:
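```bash
agentkvm --remote http://192.168.1.100:7070 --token my-secret screenshot --json
agentkvm --remote http://192.168.1.100:7070 --token my-secret mouse click 223 485
```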
Or use the HTTP API directly — see references/api.md.
How Coordinates Work
This is critical to get right. When you analyze a screenshot and identify a UI element at pixel (x, y), those coordinates are relative to the screenshot image itself — top-left is (0, 0). Pass these coordinates directly to agentkvm mouse click x y.
AgentKVM handles the translation to the actual hardware coordinates internally, based on the device type and crop settings. You don't need to do any math.
Two coordinate modes
The device type determines how coordinates are translated:
"device" mode (iPhone, Android) — The cropped region IS the device's full screen. HID absolute coordinates 0–4096 map to the device's own display. Use this when the HDMI output shows the device screen within a larger capture frame.
"frame" mode (PC, Mac, Linux) — The cropped region is just a visual focus area; HID coordinates still map to the full monitor. Use this when you're controlling a computer where the capture resolution matches the target display.
The mode is selected automatically from the config. You rarely need to think about it.
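To make the two modes concrete, here is an illustrative sketch of the arithmetic they imply. It is not AgentKVM's actual implementation: the function names and the integer rounding are assumptions, and the formulas are inferred only from the descriptions above (scale within the crop for device mode; offset by the crop origin, then scale within the full capture frame, for frame mode).

```bash
HID_MAX=4096  # upper end of the HID absolute range quoted above

# Device mode: the crop is the whole device screen, so scale within the crop.
# Args: x y crop_width crop_height
device_to_hid() {
  echo "$(( $1 * $HID_MAX / $3 )) $(( $2 * $HID_MAX / $4 ))"
}

# Frame mode: shift by the crop origin, then scale within the full frame.
# Args: x y crop_x crop_y frame_width frame_height
frame_to_hid() {
  echo "$(( ($3 + $1) * $HID_MAX / $5 )) $(( ($4 + $2) * $HID_MAX / $6 ))"
}

device_to_hid 223 485 447 970          # prints "2043 2048"
frame_to_hid 223 485 738 55 1920 1080  # prints "2050 2048"
```

Both calls use the same screenshot pixel (223, 485) with a 447x970 crop at (738, 55) inside a 1920x1080 frame. The point of the sketch is only that the same screenshot coordinates land on different HID values depending on the mode, which is why you pass screenshot coordinates and let the tool convert them.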
Implementing a Task
When asked to perform a GUI task (e.g., "open Safari and search for X"):
Step 1: Observe first
Always start with a screenshot. Never assume what's on screen.
```bash
agentkvm --json screenshot
```
Read the returned image file. Describe what you see — this grounds your actions in reality.
Step 2: Plan your actions
Break the task into individual interactions. For "open Safari and search for X":
1. Find the Safari icon → click it
2. Wait for Safari to load → screenshot to verify
3. Find the address bar → click it
4. Type the search query
5. Press Enter
6. Screenshot to verify results
Step 3: Execute with verification
After each significant action, take a screenshot to verify it worked. Screens can be slow to update, so add brief waits between actions when needed (use sleep in your script).
Common pattern in a bash script:
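```bash
# Click the Safari icon at the observed position
agentkvm mouse click 223 950
sleep 1

# Verify it opened
agentkvm --json screenshot
# (read and analyze the screenshot)

# Click the address bar
agentkvm mouse click 300 50
sleep 0.3

# Type the search query
agentkvm type "weather today"
agentkvm key enter
sleep 2

# Verify the search results loaded
agentkvm --json screenshot
```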
Step 4: Handle failures
If an action didn't produce the expected result:
- The element might have moved — take a fresh screenshot and re-locate it
- The screen might not have updated yet — wait and retry
- You might have clicked the wrong spot — re-analyze and adjust coordinates
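The "wait and retry" case can be wrapped in a small helper. This is a minimal sketch under one stated assumption: that the wrapped command signals failure with a non-zero exit status (the source does not confirm this for agentkvm). The function name retry is hypothetical. It does not replace re-reading the screenshot, which is still how you judge whether the UI actually changed.

```bash
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Args: attempts delay command [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0                        # success: stop retrying
    [ "$n" -ge "$attempts" ] && return 1    # out of attempts: give up
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage: try the connection check up to 3 times, 2 seconds apart.
#   retry 3 2 agentkvm --json status
```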
Config Reference
All settings live in ~/.config/agentkvm/config.json. A typical setup:
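```json
{
  "serialPort": "/dev/tty.usbserial-2140",
  "resolution": { "width": 1920, "height": 1080 },
  "videoDevice": "USB3 Video",
  "deviceType": "iphone",
  "crop": { "x": 738, "y": 55, "width": 447, "height": 970 }
}
```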
Key fields:
- serialPort: path to the NanoKVM-USB serial device
- resolution: HDMI capture resolution
- videoDevice: video capture device name or index
- deviceType: determines coordinate mode (iphone/android = device, pc/mac/linux = frame)
- crop: sub-region of the capture frame to use as the working area
When config is set, you can run bare commands without flags: agentkvm screenshot, agentkvm mouse click 100 200, etc.
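Because crop is described as a sub-region of the capture frame, a quick sanity check is that the rectangle fits inside resolution. The sketch below is a hypothetical helper (crop_fits is not an AgentKVM command) that takes the numbers as arguments rather than parsing the config file.

```bash
# Succeed if the crop rectangle lies entirely inside the capture frame.
# Args: frame_width frame_height crop_x crop_y crop_width crop_height
crop_fits() {
  [ "$3" -ge 0 ] && [ "$4" -ge 0 ] &&
  [ "$5" -gt 0 ] && [ "$6" -gt 0 ] &&
  [ $(( $3 + $5 )) -le "$1" ] && [ $(( $4 + $6 )) -le "$2" ]
}

# Usage with the typical setup shown above:
#   crop_fits 1920 1080 738 55 447 970 && echo "crop is inside the frame"
```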
Tips for Reliable Automation
Prefer clicking on text labels over icons — text is easier to locate precisely in screenshots.
Use --json for programmatic access — all commands support it and return structured data you can parse.
Double-click when single-click doesn't respond — some UI elements need --double.
Scroll in small increments — --delta 1 or --delta -1 is one scroll step. Use multiple steps with verification screenshots in between.
Type slowly for unreliable connections — increase --delay (default 50ms) if characters get dropped.
Use key combos for navigation — cmd+space (Spotlight), alt+tab (window switch), ctrl+c (cancel) are often faster than finding and clicking UI elements.
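The small-increment scrolling tip can be scripted as follows. This is a sketch with a hypothetical helper name (scroll_steps); it uses only the agentkvm mouse scroll x y --delta form shown in Quick Start, and the one-second default pause is an arbitrary assumption.

```bash
# Scroll one step at a time, pausing after each step so the screen can settle.
# Args: x y steps delta [pause_seconds]; delta is 1 (up) or -1 (down).
scroll_steps() {
  sx=$1; sy=$2; steps=$3; delta=$4; pause=${5:-1}
  i=0
  while [ "$i" -lt "$steps" ]; do
    agentkvm mouse scroll "$sx" "$sy" --delta "$delta" || return 1
    sleep "$pause"
    i=$((i + 1))
  done
}

# Usage: scroll down three steps at (300, 500), then take a screenshot to verify.
#   scroll_steps 300 500 3 -1 && agentkvm --json screenshot
```

For content whose length is unknown, calling it with steps set to 1 and taking a screenshot after each call keeps the verification as tight as the tip recommends.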
For the full CLI reference, key combo syntax, and HTTP API details, see references/api.md.
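Requirements
Before using AgentKVM, ensure the following are installed and available:
- AgentKVM CLI: npm install -g agentkvm
- Node.js >= 18
- ffmpeg: required for screenshot capture (brew install ffmpeg on macOS, apt install ffmpeg on Linux)
- NanoKVM-USB hardware connected to the host machine via USB
- HDMI output from the target device connected to the NanoKVM-USB
Run agentkvm status to verify everything is set up correctly. If the CLI is not found, install it first. If the device is not detected, check agentkvm list for available serial ports.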
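AgentKVM: AI-Driven Device Control
AgentKVM lets you see and operate physical devices (iPhones, Android phones, PCs, Macs, Linux machines) connected via NanoKVM-USB hardware. You take screenshots to observe the screen, then send mouse clicks, keyboard input, and scrolls to interact, just like a human sitting in front of the device.
Core Loop
Every interaction with a physical device follows the same pattern:
```
Screenshot → Analyze → Act → Verify
```
1. Screenshot: capture what's currently on screen
2. Analyze: look at the image to understand the UI state
3. Act: click, type, scroll, or drag based on what you see
4. Verify: take another screenshot to confirm the action worked
This loop is your fundamental building block. Chain multiple iterations to accomplish complex tasks.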
Quick Start
Check connection
```bash
agentkvm --json status
```
If this fails, the device isn't connected. Check the serial port with agentkvm list.
See the screen
```bash
agentkvm --json screenshot
```
Returns { "path": "/path/to/screenshot.png", ... }. Read the image to see what's on screen.
Interact
```bash
# Click at pixel coordinates (relative to the cropped image)
agentkvm mouse click 223 485

# Type text
agentkvm type "hello world"

# Press keys and key combos
agentkvm key enter
agentkvm key ctrl+c
agentkvm key cmd+space

# Scroll (positive = up, negative = down)
agentkvm mouse scroll 300 500 --delta -3

# Drag from point A to point B
agentkvm mouse drag 100 200 400 600
```
Remote operation
If AgentKVM is running on another machine, all commands work identically with --remote:
```bash
agentkvm --remote http://192.168.1.100:7070 --token my-secret screenshot --json
agentkvm --remote http://192.168.1.100:7070 --token my-secret mouse click 223 485
```
Or use the HTTP API directly; see references/api.md.
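How Coordinates Work
This is critical to get right. When you analyze a screenshot and identify a UI element at pixel (x, y), those coordinates are relative to the screenshot image itself, with the top-left corner at (0, 0). Pass these coordinates directly to agentkvm mouse click x y.
AgentKVM handles the translation to the actual hardware coordinates internally, based on the device type and crop settings. You don't need to do any math.
Two coordinate modes
The device type determines how coordinates are translated:
"device" mode (iPhone, Android): The cropped region is the device's full screen. HID absolute coordinates 0–4096 map to the device's own display. Use this when the HDMI output shows the device screen within a larger capture frame.
"frame" mode (PC, Mac, Linux): The cropped region is just a visual focus area; HID coordinates still map to the full monitor. Use this when you're controlling a computer where the capture resolution matches the target display.
The mode is selected automatically from the config. You rarely need to think about it.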
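Implementing a Task
When asked to perform a GUI task (e.g., "open Safari and search for X"):
Step 1: Observe first
Always start with a screenshot. Never assume what's on screen.
```bash
agentkvm --json screenshot
```
Read the returned image file. Describe what you see; this grounds your actions in reality.
Step 2: Plan your actions
Break the task into individual interactions. For "open Safari and search for X":
1. Find the Safari icon → click it
2. Wait for Safari to load → screenshot to verify
3. Find the address bar → click it
4. Type the search query
5. Press Enter
6. Screenshot to verify results
Step 3: Execute with verification
After each significant action, take a screenshot to verify it worked. Screens can be slow to update, so add brief waits between actions when needed (use sleep in your script).
Common pattern in a bash script:
```bash
# Click the Safari icon at the observed position
agentkvm mouse click 223 950
sleep 1

# Verify it opened
agentkvm --json screenshot
# (read and analyze the screenshot)

# Click the address bar
agentkvm mouse click 300 50
sleep 0.3

# Type the search query
agentkvm type "weather today"
agentkvm key enter
sleep 2

# Verify the search results loaded
agentkvm --json screenshot
```
Step 4: Handle failures
If an action didn't produce the expected result:
- The element might have moved: take a fresh screenshot and re-locate it
- The screen might not have updated yet: wait and retry
- You might have clicked the wrong spot: re-analyze and adjust coordinates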
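Config Reference
All settings live in ~/.config/agentkvm/config.json. A typical setup:
```json
{
  "serialPort": "/dev/tty.usbserial-2140",
  "resolution": { "width": 1920, "height": 1080 },
  "videoDevice": "USB3 Video",
  "deviceType": "iphone",
  "crop": { "x": 738, "y": 55, "width": 447, "height": 970 }
}
```
Key fields:
- serialPort: path to the NanoKVM-USB serial device
- resolution: HDMI capture resolution
- videoDevice: video capture device name or index
- deviceType: determines coordinate mode (iphone/android = device, pc/mac/linux = frame)
- crop: sub-region of the capture frame to use as the working area
When config is set, you can run bare commands without flags: agentkvm screenshot, agentkvm mouse click 100 200, etc.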
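Tips for Reliable Automation
Prefer clicking on text labels over icons: text is easier to locate precisely in screenshots.
Use --json for programmatic access: all commands support it and return structured data you can parse.
Double-click when single-click doesn't respond: some UI elements need --double.
Scroll in small increments: --delta 1 or --delta -1 is one scroll step. Use multiple steps with verification screenshots in between.
Type slowly for unreliable connections: increase --delay (default 50ms) if characters get dropped.
Use key combos for navigation: cmd+space (Spotlight), alt+tab (window switch), ctrl+c (cancel) are often faster than finding and clicking UI elements.
For the full CLI reference, key combo syntax, and HTTP API details, see references/api.md.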