Faster Whisper Local Service
Provision a local STT backend used by voice skills.
What this sets up
- Python venv for faster-whisper
- `transcribe-server.py` HTTP endpoint at `http://127.0.0.1:18790/transcribe`
- systemd user service: `openclaw-transcribe.service`
Important: Model download on first run
On first startup, faster-whisper downloads model weights from Hugging Face (~1.5 GB for medium). This requires internet access and disk space. After the initial download, models are cached locally and the service runs fully offline.
| Model | Download size | RAM usage |
|---|---|---|
| tiny | ~75 MB | ~400 MB |
| base | ~150 MB | ~500 MB |
| small | ~500 MB | ~800 MB |
| medium | ~1.5 GB | ~1.4 GB |
| large-v3 | ~3.0 GB | ~3.5 GB |
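To turn the RAM column into a quick sizing check, here is a hypothetical helper (not part of the service) that hard-codes the approximate figures from the table and returns the largest model whose RAM usage fits a given budget. The figures are rough, so leave headroom for the rest of the system.

```python
from typing import Optional

# Approximate RAM usage in MB, copied from the table above (smallest to largest).
MODEL_RAM_MB = {
    "tiny": 400,
    "base": 500,
    "small": 800,
    "medium": 1400,
    "large-v3": 3500,
}


def largest_model_for(ram_budget_mb: int) -> Optional[str]:
    """Return the largest model whose approximate RAM usage fits the budget, or None."""
    fitting = [name for name, ram_mb in MODEL_RAM_MB.items() if ram_mb <= ram_budget_mb]
    # Dicts keep insertion order, so the last fitting entry is the largest model.
    return fitting[-1] if fitting else None


print(largest_model_for(2048))  # medium
```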
To pre-download models for an air-gapped environment, see the faster-whisper documentation.
Security notes
Network isolation
- Binds to `127.0.0.1` only, so it is not reachable from the network.
- CORS restricted to a single origin (`https://127.0.0.1:8443` by default).
- No credentials, API keys, or secrets are used or stored.
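A minimal sketch of both network restrictions using Python's standard `http.server`, assuming a plain-HTTP handler on loopback. It is not the actual `transcribe-server.py`: reading the origin from `TRANSCRIBE_ALLOWED_ORIGIN`, the handler name, and the helper names are assumptions, and only CORS preflight (`OPTIONS`) handling is shown.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional

# Default from the notes above; reading TRANSCRIBE_ALLOWED_ORIGIN is an assumption.
ALLOWED_ORIGIN = os.environ.get("TRANSCRIBE_ALLOWED_ORIGIN", "https://127.0.0.1:8443")


def cors_headers(request_origin: Optional[str], allowed: str = ALLOWED_ORIGIN) -> Dict[str, str]:
    """Return CORS headers only for the single allowed origin."""
    if request_origin != allowed:
        # No Access-Control-Allow-Origin header: browsers block the response.
        return {}
    return {"Access-Control-Allow-Origin": allowed, "Vary": "Origin"}


class PreflightHandler(BaseHTTPRequestHandler):
    def do_OPTIONS(self) -> None:
        # Answer CORS preflight; other origins get no Access-Control-Allow-Origin.
        self.send_response(204)
        for name, value in cors_headers(self.headers.get("Origin")).items():
            self.send_header(name, value)
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.end_headers()


def make_server(port: int) -> HTTPServer:
    # Loopback only: other hosts cannot connect to this socket directly.
    return HTTPServer(("127.0.0.1", port), PreflightHandler)
```

Calling `make_server(18790).serve_forever()` would serve on the documented port.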
Input validation
- Upload size limit: requests exceeding the configured limit are rejected before processing (HTTP 413). Default: 50 MB, configurable via `MAX_UPLOAD_MB`.
- Magic-byte check: only files with recognized audio signatures (WAV, OGG, FLAC, MP3, WebM, M4A) are accepted. Unrecognized formats are rejected (HTTP 415) before reaching GStreamer.
- Subprocess safety: all arguments to `gst-launch-1.0` are passed as a list, so no shell is involved and shell expansion or injection is not possible.
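The two pre-decode checks can be sketched as pure functions. This assumes the server compares `Content-Length` against `MAX_UPLOAD_MB * 1024 * 1024` (reading "MB" as MiB is an assumption) and sniffs the first dozen bytes of the body. The signatures below are common, rough container signatures, and the helper names are hypothetical.

```python
import os
from typing import Optional

# 50 MB default from the notes above, interpreted here as MiB (an assumption).
MAX_UPLOAD_MB = int(os.environ.get("MAX_UPLOAD_MB", "50"))


def sniff_audio(head: bytes) -> Optional[str]:
    """Guess the audio container from its leading bytes; None if unrecognized."""
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "wav"
    if head[:4] == b"OggS":
        return "ogg"
    if head[:4] == b"fLaC":
        return "flac"
    if head[:3] == b"ID3":
        return "mp3"  # MP3 with a leading ID3v2 tag
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "mp3"  # bare MPEG audio frame: 11-bit frame sync
    if head[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"  # EBML header (shared with Matroska/MKV)
    if head[4:8] == b"ftyp":
        return "m4a"  # ISO base media 'ftyp' box, as used by M4A/MP4
    return None


def reject_status(content_length: int, head: bytes, max_mb: int = MAX_UPLOAD_MB) -> Optional[int]:
    """Return an HTTP error status for a bad upload, or None to accept it."""
    if content_length > max_mb * 1024 * 1024:
        return 413  # Payload Too Large: decided from the header, before reading the body
    if sniff_audio(head) is None:
        return 415  # Unsupported Media Type: never handed to GStreamer
    return None
```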
GStreamer dependency
The service uses GStreamer's `decodebin` for audio format conversion. Like any media library, GStreamer's parsers process binary data and should be kept up to date.
Mitigation: install `gst-launch-1.0` from your OS vendor's trusted packages and apply security updates regularly. The magic-byte pre-filter above reduces the attack surface by rejecting non-audio payloads before they reach GStreamer.
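The exact pipeline is not shown here, so the following is an assumed `gst-launch-1.0` invocation that decodes any supported input with `decodebin` and writes 16 kHz mono WAV, the sample format Whisper models work with. What it illustrates is the argv list: with `subprocess.run` and no `shell=True`, a file name containing spaces or `;` stays a single argument and is never interpreted by a shell. The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def gst_decode_argv(src: Path, dst: Path) -> List[str]:
    """Build an argv list that decodes `src` to 16 kHz mono WAV at `dst` (assumed pipeline)."""
    return [
        "gst-launch-1.0", "-q",
        "filesrc", f"location={src}",
        "!", "decodebin",  # auto-detects container and codec
        "!", "audioconvert",
        "!", "audioresample",
        "!", "audio/x-raw,rate=16000,channels=1",  # caps filter: 16 kHz mono
        "!", "wavenc",
        "!", "filesink", f"location={dst}",
    ]


def decode_to_wav(src: Path, dst: Path, timeout: float = 60.0) -> None:
    # A list with shell=False (the default): arguments go straight to exec, no shell parsing.
    subprocess.run(
        gst_decode_argv(src, dst),
        check=True,
        timeout=timeout,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
    )
```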
No data exfiltration
- No outbound network calls (after the initial model download).
- No telemetry, analytics, or phone-home behavior.
- Temporary files are created in a per-request `TemporaryDirectory` and cleaned up immediately.
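A sketch of the per-request temporary directory pattern with Python's `tempfile.TemporaryDirectory`, assuming each upload is written to a fresh directory and processed there. The function and file names are hypothetical. The directory and everything in it are deleted when the `with` block exits, including when processing raises.

```python
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def handle_upload(upload: bytes, process: Callable[[Path], str]) -> Tuple[str, Path]:
    """Run `process` on the upload inside a per-request temporary directory.

    Returns the result and the (already deleted) directory path so a caller
    can confirm the cleanup happened.
    """
    # Exiting the with block removes workdir, also when process() raises.
    with tempfile.TemporaryDirectory(prefix="transcribe-") as tmp:
        workdir = Path(tmp)
        src = workdir / "upload.bin"  # hypothetical file name
        src.write_bytes(upload)
        result = process(src)  # e.g. decode with GStreamer, then transcribe
    return result, workdir
```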
Reproducibility defaults
- Pinned package: `faster-whisper==1.1.1` (override via env)
- Explicit dependency check for `gst-launch-1.0`
- CORS restricted to one origin by default
- Configurable workspace/service paths (no hardcoded user path)
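A sketch of the first two defaults. The pinned version comes from the list above; the override variable name `FASTER_WHISPER_VERSION` is a hypothetical stand-in, since the document does not name it. The dependency check uses `shutil.which` behind an injectable lookup so it can be exercised without GStreamer installed.

```python
import os
import shutil
from typing import Callable, Iterable, List, Optional

DEFAULT_FASTER_WHISPER = "1.1.1"  # pinned version from the list above


def faster_whisper_requirement() -> str:
    # FASTER_WHISPER_VERSION is a hypothetical name for the override variable.
    version = os.environ.get("FASTER_WHISPER_VERSION", DEFAULT_FASTER_WHISPER)
    return f"faster-whisper=={version}"


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

A deploy step could abort when `missing_tools(["gst-launch-1.0"])` is non-empty and otherwise pass `faster_whisper_requirement()` to `pip install`.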
Deploy
```bash
bash scripts/deploy.sh
```
With custom settings:
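```bash
WORKSPACE=~/.openclaw/workspace \
TRANSCRIBE_PORT=18790 \
WHISPER_MODEL_SIZE=medium \
WHISPER_LANGUAGE=auto \
TRANSCRIBE_ALLOWED_ORIGIN=https://10.0.0.42:8443 \
bash scripts/deploy.sh
```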
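How a Python process might read these settings, as a sketch: the variable names `WORKSPACE`, `TRANSCRIBE_PORT`, `WHISPER_MODEL_SIZE`, `WHISPER_LANGUAGE`, and `TRANSCRIBE_ALLOWED_ORIGIN` match the custom-settings example, and `MAX_UPLOAD_MB` comes from the security notes. The port, language, origin, and upload-limit defaults are the ones this document states; the workspace and model-size defaults are assumptions, and the class and function names are hypothetical.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class TranscribeConfig:
    workspace: Path
    port: int
    model_size: str
    language: str
    allowed_origin: str
    max_upload_mb: int


def load_config(env: Mapping[str, str] = os.environ) -> TranscribeConfig:
    """Read deploy-time settings from the environment, falling back to defaults."""
    return TranscribeConfig(
        # Assumed default: the workspace used in this document's examples.
        workspace=Path(env.get("WORKSPACE", "~/.openclaw/workspace")).expanduser(),
        port=int(env.get("TRANSCRIBE_PORT", "18790")),
        # Assumed default: medium is the size quoted for the first-run download.
        model_size=env.get("WHISPER_MODEL_SIZE", "medium"),
        language=env.get("WHISPER_LANGUAGE", "auto"),
        allowed_origin=env.get("TRANSCRIBE_ALLOWED_ORIGIN", "https://127.0.0.1:8443"),
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "50")),
    )
```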
Language setting
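Default: `auto` (detect the language automatically). Set `WHISPER_LANGUAGE=de` for German only, `en` for English only, and so on. If you only use one language, a fixed setting is faster and more accurate, because no per-request language detection runs and speech cannot be misdetected as another language.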
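faster-whisper's `WhisperModel.transcribe()` takes a `language` argument and detects the language itself when that argument is `None`. A small, hypothetical mapping from `WHISPER_LANGUAGE` to that argument might look like this; the faster-whisper call in the comment is illustrative and not executed here.

```python
from typing import Optional


def resolve_language(setting: str) -> Optional[str]:
    """Map WHISPER_LANGUAGE to faster-whisper's `language` argument.

    "auto" (or an empty value) becomes None, which lets faster-whisper detect
    the language; anything else is passed through as a code such as "de".
    """
    value = setting.strip().lower()
    return None if value in ("", "auto") else value


# Illustrative use (requires faster-whisper; not run here):
#   from faster_whisper import WhisperModel
#   model = WhisperModel("medium")
#   segments, info = model.transcribe("audio.wav", language=resolve_language("de"))
```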
Idempotent: `scripts/deploy.sh` is safe to run repeatedly.
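A common way to make a deploy step idempotent is to compare before writing, so a rerun leaves files that are already correct untouched and the caller can skip follow-up work (such as `systemctl --user daemon-reload`) when nothing changed. This is an illustration of the pattern in Python with a hypothetical helper name, not how `scripts/deploy.sh` is implemented.

```python
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write `content` to `path` only if it differs; return True if a write happened."""
    if path.exists() and path.read_text(encoding="utf-8") == content:
        return False  # already up to date: a rerun changes nothing
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```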
What this skill modifies
| What | Path | Action |
|---|---|---|
| Python venv | `$WORKSPACE/.venv-faster-whisper/` | Creates venv, installs faster-whisper via pip |
| Transcribe server | `$WORKSPACE/voice-input/transcribe-server.py` | Writes server script |
| Systemd service | `~/.config/systemd/user/openclaw-transcribe.service` | Creates and enables persistent service |
| Model cache | `~/.cache/huggingface/` | Downloads model weights on first run |
Uninstall
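```bash
systemctl --user stop openclaw-transcribe.service
systemctl --user disable openclaw-transcribe.service
rm -f ~/.config/systemd/user/openclaw-transcribe.service
systemctl --user daemon-reload
```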
Optional full cleanup:
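```bash
rm -rf ~/.openclaw/workspace/.venv-faster-whisper
rm -f ~/.openclaw/workspace/voice-input/transcribe-server.py
```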
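After uninstalling, a quick check of what remains can be scripted from the paths in the table of modified files. This hypothetical helper only reports and deletes nothing; it includes the Hugging Face model cache, which the commands in this section leave in place.

```python
from pathlib import Path
from typing import List, Optional


def leftover_artifacts(workspace: Path, home: Optional[Path] = None) -> List[Path]:
    """Return the paths from the 'What this skill modifies' table that still exist."""
    home = home or Path.home()
    candidates = [
        workspace / ".venv-faster-whisper",
        workspace / "voice-input" / "transcribe-server.py",
        home / ".config" / "systemd" / "user" / "openclaw-transcribe.service",
        # Default Hugging Face cache; other tools may share it, so it is only reported.
        home / ".cache" / "huggingface",
    ]
    return [path for path in candidates if path.exists()]
```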
Verify
```bash
bash scripts/status.sh
```
Expected:
- Service state is `active`
- The endpoint responds (HTTP 200 or 500 is acceptable for the invalid sample payload)
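A Python equivalent of the endpoint check, as a sketch: it assumes the endpoint accepts a raw POST body (the real request format is not documented here) and treats any HTTP status as proof that the service is listening. Given the magic-byte filter, a non-audio payload like this one could be answered with 415 rather than 200 or 500. Proxies are bypassed because the target is loopback; the function name is hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional

DEFAULT_URL = "http://127.0.0.1:18790/transcribe"


def probe(url: str = DEFAULT_URL, payload: bytes = b"not-audio", timeout: float = 5.0) -> Optional[int]:
    """POST a tiny payload; return the HTTP status, or None if nothing answered."""
    # An empty ProxyHandler ignores any http_proxy settings for this loopback call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    # Assumption: a raw request body; adjust if the server expects multipart.
    req = urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx still proves the service is listening
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, and similar failures
```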
Notes
- This skill provides backend transcription only.
- Pair with `webchat-voice-proxy` for browser microphone capture and HTTPS/WSS integration.
- For a one-step install, use `webchat-voice-full-stack` (deploys the backend and proxy in order).