MobiZen-GUI
VLM-based mobile automation framework — control Android devices via natural language.
Repo: https://github.com/alibaba/MobiZen-GUI
1. Environment Setup
1.1 Install ADB
CODEBLOCK0
1.2 Connect Device & Install ADBKeyboard
CODEBLOCK1
Then on device: Settings → System → Languages & Input → Virtual Keyboard → Enable ADBKeyboard.
1.3 Install Project
CODEBLOCK2
2. Quick Start (Config Only, No Code Changes)
Copy example config:
CODEBLOCK3
Only 3 fields need to be configured — api_key, base_url, model_name:
CODEBLOCK4
How to set these 3 fields: When the user asks to run a phone task but hasn't configured yet, the AI should ask the user to provide api_key, base_url, and model_name, then write them into my_config.yaml. The user can also manually edit the file. Any OpenAI-compatible API works.
Provider examples:
CODEBLOCK5
Run:
CODEBLOCK6
3. Configuration Reference
| Field | Default | Description |
|---|
| INLINECODE7 | INLINECODE8 (auto) | ADB device; null = first available |
| INLINECODE9 |
"" | Model API key |
|
base_url |
null | Model API endpoint |
|
model_name |
"gpt-4o" | Model identifier |
|
model_type |
"qwen3vl" | Coordinate system (999x999 virtual space) |
|
max_steps |
25 | Max execution steps |
|
step_delay |
2.0 | Delay between steps (seconds) |
|
first_step_delay |
4.0 | Delay after first step |
|
temperature |
0.1 | Sampling temperature |
|
top_p |
0.001 | Top-p sampling |
|
max_tokens |
1024 | Max output tokens |
|
timeout |
60 | Request timeout (seconds) |
|
use_adbkeyboard |
true | Chinese text input via ADBKeyboard |
|
screenshot_dir |
"./screenshots" | Screenshot save directory |
4. Advanced: Deploy MobiZen-GUI-4B Locally
For best results on Chinese mobile tasks, deploy the dedicated 4B model.
4.1 Download Model
CODEBLOCK7
Alternatively from ModelScope: https://modelscope.cn/models/GUIAgent/MobiZen-GUI-4B
4.2 Serve with vLLM
CODEBLOCK8
4.3 Point Config to Local Model
CODEBLOCK9
Then run as usual: python main.py --config my_config.yaml --instruction "..."
5. Customization (Requires Code Changes)
The framework uses a plugin architecture — three components can be swapped via config class paths:
| Component | Role | Base Class | Default Implementation |
|---|
| MessageBuilder | Builds prompt + screenshot for model | INLINECODE36 | INLINECODE37 |
| ModelClient |
Calls the model API |
core.model_clients.base.BaseModelClient |
core.model_clients.openai.OpenAIClient |
|
ResponseParser | Parses model output → action |
core.response_parsers.base.BaseResponseParser |
core.response_parsers.qwen.QwenResponseParser |
5.1 Custom Model Client
For non-OpenAI-compatible APIs:
CODEBLOCK10
Config:
CODEBLOCK11
5.2 Custom Message Builder
To change the system prompt or how screenshots/history are formatted:
CODEBLOCK12
Config:
CODEBLOCK13
5.3 Custom Response Parser
To parse a different model output format:
CODEBLOCK14
Action dict format: {"arguments": {"action": "<type>", ...}} — supported types: click, long_press, swipe, type, system_button, wait, terminate.
Config:
CODEBLOCK15
5.4 Add New Action Type
- 1. Add
_execute_<action>(self, args) method in INLINECODE51 - Add dispatch branch in INLINECODE52
- Update system prompt in INLINECODE53
6. Troubleshooting
- - Device not found: Run
adb devices — check USB/wireless connection - ADBKeyboard not working: Ensure enabled in device settings; test: INLINECODE55
- Model connection error: Verify
base_url + api_key; check network - Coordinate mismatch: Ensure
model_type matches your model; check screen size: INLINECODE59 - Duplicate action loop: Agent auto-stops after 5 identical actions; may indicate model confusion
技能名称:mobizen-gui
MobiZen-GUI
基于VLM的移动端自动化框架——通过自然语言控制Android设备。
仓库地址:https://github.com/alibaba/MobiZen-GUI
1. 环境配置
1.1 安装ADB
bash
macOS
brew install android-platform-tools
Linux
sudo apt-get install android-tools-adb
Windows:从 https://developer.android.com/studio/releases/platform-tools 下载
adb version # 验证安装
1.2 连接设备并安装ADBKeyboard
bash
adb devices # USB连接;或:adb tcpip 5555 && adb connect :5555
adb install ADBKeyboard.apk # 从 https://github.com/senzhk/ADBKeyBoard 下载
然后在设备上操作:设置 → 系统 → 语言与输入法 → 虚拟键盘 → 启用ADBKeyboard。
1.3 安装项目
bash
git clone https://github.com/alibaba/MobiZen-GUI.git && cd MobiZen-GUI
pip install -r requirements.txt # openai, pillow, pyyaml
2. 快速开始(仅需配置,无需修改代码)
复制示例配置文件:
bash
cp configexample.yaml myconfig.yaml
只需配置 3个字段 — apikey、baseurl、model_name:
yaml
api_key: your-api-key-here
base_url: https://api.openai.com/v1 # 模型端点地址
model_name: gpt-4o # 模型标识符
如何设置这3个字段:当用户要求执行手机任务但尚未配置时,AI应要求用户提供apikey、baseurl和modelname,然后将它们写入myconfig.yaml。用户也可以手动编辑该文件。任何兼容OpenAI的API均可使用。
提供商示例:
yaml
OpenAI
base_url: https://api.openai.com/v1
api_key: sk-...
model_name: gpt-4o
DeepSeek / Moonshot / 智谱AI 等
base_url: https://api.deepseek.com/v1
api_key: your-key
model_name: deepseek-chat
Ollama(本地)
base_url: http://localhost:11434/v1
api_key: dummy
model_name: llava
运行:
bash
python main.py --config my_config.yaml --instruction 打开微信并发送消息
3. 配置参考
| 字段 | 默认值 | 描述 |
|---|
| deviceid | null(自动) | ADB设备;null表示第一个可用设备 |
| apikey |
| 模型API密钥 |
| base_url | null | 模型API端点 |
| model_name | gpt-4o | 模型标识符 |
| model_type | qwen3vl | 坐标系(999x999虚拟空间) |
| max_steps | 25 | 最大执行步数 |
| step_delay | 2.0 | 步骤间延迟(秒) |
| first
stepdelay | 4.0 | 第一步后延迟 |
| temperature | 0.1 | 采样温度 |
| top_p | 0.001 | Top-p采样 |
| max_tokens | 1024 | 最大输出令牌数 |
| timeout | 60 | 请求超时时间(秒) |
| use_adbkeyboard | true | 通过ADBKeyboard输入中文文本 |
| screenshot_dir | ./screenshots | 截图保存目录 |
4. 进阶:本地部署MobiZen-GUI-4B
为在中文移动端任务上获得最佳效果,请部署专用4B模型。
4.1 下载模型
bash
pip install -U huggingface_hub
中国镜像(可选)
export HF_ENDPOINT=https://hf-mirror.com
hf download alibabagroup/MobiZen-GUI-4B --local-dir ./MobiZen-GUI-4B
或从ModelScope下载:https://modelscope.cn/models/GUIAgent/MobiZen-GUI-4B
4.2 使用vLLM提供服务
bash
pip install vllm==0.11.0
vllm serve ./MobiZen-GUI-4B --host 0.0.0.0 --port 8000 --trust-remote-code
4.3 将配置指向本地模型
yaml
api_key: dummy
base_url: http://localhost:8000/v1
model_name: MobiZen-GUI-4B
model_type: qwen3vl
然后正常运行:python main.py --config my_config.yaml --instruction ...
5. 自定义(需要修改代码)
该框架采用插件架构——三个组件可通过配置类路径进行替换:
| 组件 | 角色 | 基类 | 默认实现 |
|---|
| MessageBuilder | 为模型构建提示词和截图 | core.messagebuilders.base.BaseMessageBuilder | core.messagebuilders.qwen.QwenMessageBuilder |
| ModelClient |
调用模型API | core.model
clients.base.BaseModelClient | core.modelclients.openai.OpenAIClient |
|
ResponseParser | 解析模型输出→动作 | core.response
parsers.base.BaseResponseParser | core.responseparsers.qwen.QwenResponseParser |
5.1 自定义模型客户端
适用于非OpenAI兼容的API:
python
core/modelclients/myclient.py
from .base import BaseModelClient
class MyClient(BaseModelClient):
def init(self, apikey: str, baseurl: str = None, model: str = , timeout: int = 60):
pass # 初始化客户端
def chat(self, messages, kwargs):
pass # 必须返回包含 .choices[0].message.content 的对象
配置:
yaml
modelclientclass: core.modelclients.myclient.MyClient
modelclientkwargs: {} # 传递给init的额外参数
5.2 自定义消息构建器
用于更改系统提示词或截图/历史记录的格式:
python
core/messagebuilders/mybuilder.py
from .base import BaseMessageBuilder
from utils.image import image
todata_url
class MyBuilder(BaseMessageBuilder):
def buildsystemprompt(self, kwargs) -> str:
return 你的系统提示词
def buildmessages(self, instruction, currentscreenshot, history, kwargs):
return [{role: system, content: [...]}, {role: user, content: [...]}]
配置:
yaml
messagebuilderclass: core.messagebuilders.mybuilder.MyBuilder
5.3 自定义响应解析器
用于解析不同模型输出格式:
python
core/responseparsers/myparser.py
from .base import BaseResponseParser, ParsedResponse
class MyParser(BaseResponseParser):
def parse(self, response) -> ParsedResponse:
content = response.choices[0].message.content
# 将内容解析为结构化字段
return ParsedResponse(
thought=...,
summary=...,
action={arguments: {action: click, coordinate: [x, y]}},
subtask=...
)
动作字典格式:{arguments: {action: , ...}} — 支持的类型:click、longpress、swipe、type、systembutton、wait、terminate。
配置:
yaml
responseparserclass: core.responseparsers.myparser.MyParser
5.4 添加新动作类型
- 1. 在 core/executor/actionexecutor.py 中添加 execute(self, args) 方法
- 在 ActionExecutor.execute() 中添加分发分支
- 更新 QwenMessageBuilder.buildsystem_prompt() 中的系统提示词
6. 故障排除
- - 设备未找到:运行 adb devices — 检查USB/无线连接
- ADBKeyboard不工作:确保在设备设置中已启用;测试:adb shell am broadcast -a ADBINPUTTEXT --es msg test
- 模型连接错误:验证 baseurl + apikey;