Voice Assistant
Real-time voice interface for your OpenClaw agent. Talk to your agent and hear it respond — with configurable STT and TTS providers, full streaming at every stage, and sub-2 second time-to-first-audio.
Architecture
CODEBLOCK0
The voice interface connects to your running OpenClaw gateway's OpenAI-compatible endpoint. It's the same agent with all its context, tools, and memory — just with a voice.
Quick Start
CODEBLOCK1
Supported Providers
STT (Speech-to-Text)
| Provider | Model | Latency | Notes |
|---|
| Deepgram | nova-2 (streaming) | ~200-300ms | WebSocket streaming, best accuracy/speed |
| ElevenLabs |
Scribe v1 | ~300-500ms | REST-based, good multilingual |
TTS (Text-to-Speech)
| Provider | Model | Latency | Notes |
|---|
| Deepgram | aura-2 | ~200ms | WebSocket streaming, low cost |
| ElevenLabs |
Turbo v2.5 | ~300ms | Best voice quality, streaming |
Configuration
All configuration is via environment variables in .env:
CODEBLOCK2
Provider Combinations
| Setup | Best For |
|---|
| Deepgram STT + ElevenLabs TTS | Best quality voice output |
| Deepgram STT + Deepgram TTS |
Lowest latency, single vendor |
| ElevenLabs STT + ElevenLabs TTS | Best multilingual support |
How It Works
- 1. Browser captures mic audio via Web Audio API and streams raw PCM over a WebSocket
- Server receives audio and pipes it to the configured STT provider's streaming endpoint
- STT returns partial transcripts in real-time; on end-of-utterance the full text is sent to the OpenClaw gateway
- OpenClaw gateway streams the LLM response token-by-token via SSE (Server-Sent Events)
- Tokens are accumulated into sentence-sized chunks and streamed to the TTS provider
- TTS returns audio chunks that are immediately forwarded to the browser over the same WebSocket
- Browser plays audio using the Web Audio API with a jitter buffer for smooth playback
Interruption Handling (Barge-In)
When the user starts speaking while the agent is still talking:
- - Current TTS audio is immediately cancelled
- The agent stops its current response
- New STT session begins capturing the user's interruption
Usage Examples
CODEBLOCK3
Troubleshooting
- - No audio output? Check that your TTS API key is valid and the provider is set correctly
- High latency? Use Deepgram for both STT and TTS; ensure your gateway is on the same network
- Cuts off speech? Increase
VOICE_VAD_SILENCE_MS to 600-800ms - Echo/feedback? Use headphones, or enable the built-in echo cancellation in the browser UI
Latency Budget
| Stage | Target | Actual (typical) |
|---|
| Audio capture + VAD | <200ms | ~100-150ms |
| STT transcription |
<400ms | ~200-400ms |
| OpenClaw LLM first token| <1500ms | ~500-1500ms |
| TTS first audio chunk | <400ms | ~200-400ms |
|
Total first audio |
<2.5s |
~1.0-2.5s |
技能名称:语音助手
语音助手
为您的OpenClaw智能体提供实时语音接口。与您的智能体对话并聆听其回应——支持可配置的语音转文字(STT)和文字转语音(TTS)提供商,全程流式传输,首次音频响应时间低于2秒。
架构
浏览器麦克风 → WebSocket → STT(Deepgram / ElevenLabs)→ 文本
→ OpenClaw网关(/v1/chat/completions,流式传输)→ 响应文本
→ TTS(Deepgram Aura / ElevenLabs)→ 音频块 → 浏览器扬声器
语音接口连接到您正在运行的OpenClaw网关的兼容OpenAI的端点。它使用相同的智能体,包含所有上下文、工具和记忆——只是增加了语音功能。
快速开始
bash
cd {baseDir}
cp .env.example .env
填写您的API密钥和网关URL
uv run scripts/server.py
打开 http://localhost:7860 并点击麦克风
支持的提供商
STT(语音转文字)
| 提供商 | 模型 | 延迟 | 说明 |
|---|
| Deepgram | nova-2(流式) | ~200-300ms | WebSocket流式,最佳准确率/速度 |
| ElevenLabs |
Scribe v1 | ~300-500ms | 基于REST,多语言支持良好 |
TTS(文字转语音)
| 提供商 | 模型 | 延迟 | 说明 |
|---|
| Deepgram | aura-2 | ~200ms | WebSocket流式,低成本 |
| ElevenLabs |
Turbo v2.5 | ~300ms | 最佳语音质量,流式传输 |
配置
所有配置通过.env文件中的环境变量完成:
bash
=== 必填项 ===
OPENCLAW
GATEWAYURL=http://localhost:4141/v1 # 您的OpenClaw网关
OPENCLAW_MODEL=claude-sonnet-4-5-20250929 # 网关路由到的模型
=== STT提供商(选择其一) ===
VOICE
STTPROVIDER=deepgram # deepgram 或 elevenlabs
DEEPGRAM
APIKEY=your-key-here # 如果STT=deepgram则必填
ELEVENLABS
APIKEY=your-key-here # 如果STT=elevenlabs则必填
=== TTS提供商(选择其一) ===
VOICE
TTSPROVIDER=elevenlabs # deepgram 或 elevenlabs
使用与上述相同的API密钥
=== 可选调优 ===
VOICE
TTSVOICE=rachel # ElevenLabs语音名称/ID
VOICE
TTSVOICE_DG=aura-2-theia-en # Deepgram Aura语音
VOICE
VADSILENCE_MS=400 # 轮次结束前的静默时间(毫秒)
VOICE
SAMPLERATE=16000 # 音频采样率
VOICE
SERVERPORT=7860 # 服务器端口
VOICE
SYSTEMPROMPT= # 可选的系统提示覆盖
提供商组合
| 配置 | 最佳适用场景 |
|---|
| Deepgram STT + ElevenLabs TTS | 最佳语音输出质量 |
| Deepgram STT + Deepgram TTS |
最低延迟,单一供应商 |
| ElevenLabs STT + ElevenLabs TTS | 最佳多语言支持 |
工作原理
- 1. 浏览器通过Web Audio API捕获麦克风音频,并通过WebSocket流式传输原始PCM数据
- 服务器接收音频并将其传输到配置的STT提供商的流式端点
- STT实时返回部分转录文本;在话语结束时,完整文本被发送到OpenClaw网关
- OpenClaw网关通过SSE(服务器发送事件)逐令牌流式传输LLM响应
- 令牌被累积成句子大小的块,并流式传输到TTS提供商
- TTS返回音频块,这些块通过同一WebSocket立即转发到浏览器
- 浏览器使用Web Audio API播放音频,并带有抖动缓冲区以实现平滑播放
中断处理(插话)
当用户在智能体仍在说话时开始讲话:
- - 当前TTS音频立即取消
- 智能体停止当前响应
- 新的STT会话开始捕获用户的中断
使用示例
用户:嘿,设置我的语音助手
→ OpenClaw执行:cd {baseDir} && cp .env.example .env
→ 打开.env供用户填写API密钥
→ 运行:uv run scripts/server.py
用户:开始语音聊天
→ 在浏览器中打开 http://localhost:7860
用户:将TTS切换到Deepgram
→ 在.env中更新 VOICETTSPROVIDER=deepgram
→ 重启服务器
故障排除
- - 没有音频输出? 检查您的TTS API密钥是否有效,以及提供商设置是否正确
- 高延迟? 对STT和TTS均使用Deepgram;确保您的网关在同一网络上
- 语音被截断? 将VOICEVADSILENCE_MS增加到600-800ms
- 回声/反馈? 使用耳机,或启用浏览器UI中的内置回声消除功能
延迟预算
| 阶段 | 目标 | 实际(典型值) |
|---|
| 音频捕获 + VAD | <200ms | ~100-150ms |
| STT转录 |
<400ms | ~200-400ms |
| OpenClaw LLM首个令牌 | <1500ms | ~500-1500ms |
| TTS首个音频块 | <400ms | ~200-400ms |
|
首次音频总延迟 |
<2.5s |
~1.0-2.5s |