Clack
WebSocket relay server that enables real-time voice conversations with an OpenClaw agent.
Flow: Client audio (PCM 16kHz/16-bit/mono) → STT → OpenClaw Gateway → TTS → PCM audio back to client.
Per-session provider selection: The client can independently choose STT and TTS providers per call — any combination of on-device (Apple speech frameworks) and server-side providers (ElevenLabs, OpenAI, Deepgram). The server auto-detects all available providers based on configured API keys and exposes them via /info.
Prerequisites
- - Python 3.10+
- API key for at least one provider (ElevenLabs, OpenAI, or Deepgram) — not needed for local speech mode
- OpenClaw Gateway with
chatCompletions endpoint enabled - Root/sudo access (for systemd)
- Secure connection: Domain with SSL (recommended) or Tailscale
Setup
Run the setup script. It creates a venv, installs deps, prompts for API keys, configures a systemd service, and optionally sets up SSL.
CODEBLOCK0
The script auto-detects your OpenClaw gateway config and interactively prompts for provider API keys (ElevenLabs, OpenAI, Deepgram — all optional). On re-runs, existing keys can be kept, updated, or deleted.
Options
CODEBLOCK1
| Flag | Default | Description |
|---|
| INLINECODE2 | INLINECODE3 | Relay server port |
| INLINECODE4 |
(none) | Domain for SSL setup (enables WSS) |
Connection modes
All connections are encrypted. The app supports two modes:
Domain with SSL (recommended):
bash scripts/setup.sh --domain clack.yourdomain.com
# → wss://clack.yourdomain.com/voice
Requires a DNS A record pointing the domain to your server IP. The setup script auto-configures SSL via Caddy. You can use a free domain from
DuckDNS or your own.
Tailscale:
# Install Tailscale on your server, then connect from the app using your Tailscale IP
# → ws://100.x.x.x:9878/voice (encrypted at network level)
No domain or SSL setup needed. Tailscale encrypts all traffic at the network layer. Install Tailscale on both your server and phone, then use the server's Tailscale IP in the app.
Security note: Port 9878 should be firewalled from the public internet. Only allow access via localhost (for Caddy reverse proxy) and Tailscale. The app does not support unencrypted public connections.
Enable OpenClaw Gateway endpoint
The gateway must have chatCompletions enabled. Apply this config patch:
CODEBLOCK4
Management
CODEBLOCK5
Client App
📱 iOS — Available on the App Store (or build from source at github.com/fbn3799/clack-app)
🤖 Android — Coming soon!
Security
Authentication
All endpoints except
GET /health and
POST /pair require a valid auth token (
RELAY_AUTH_TOKEN). Tokens are verified using constant-time HMAC comparison to prevent timing attacks.
Pairing System
- - 6-character alphanumeric one-time codes (~2.1 billion combinations)
- Codes expire after 5 minutes (TTL) and are single-use
- Rate limited: 5 attempts per IP per 5 minutes — returns HTTP 429 after
- 2-second delay on failed attempts to slow brute force
- Generating a code requires the admin auth token (
GET /pair) - Redeeming a code is public but rate-limited (
POST /pair)
Encrypted Connections
- - Domain mode: WSS (WebSocket Secure) via Caddy with automatic SSL certificates
- Tailscale mode: WireGuard encryption at the network layer
- The app enforces encrypted connections — no unencrypted public access
- Port 9878 should be firewalled; only accessible via localhost and Tailscale
Input Sanitization
All user-facing text inputs are sanitized before processing:
- - Voice transcripts: Capped at 300 characters (
CLACK_MAX_INPUT_CHARS), echo detection filters feedback loops, hallucination detection discards nonsense STT output - User context: Stripped to natural-language characters only (letters, numbers, common punctuation, whitespace). Control characters, escape sequences, and non-printable characters are removed. Capped at 1000 characters. Context is wrapped in explicit delimiters before injection into the system prompt.
- No shell execution: All external communication uses structured HTTP/WebSocket APIs. No user input is ever passed to a shell.
Data Privacy
- - No analytics, tracking, or telemetry
- Voice audio goes to your server and only to the providers you choose
- The iOS app stores only settings locally (server address, token, preferences)
- Third-party API usage depends on your provider config (ElevenLabs, OpenAI, Deepgram)
Session Routing
Each voice call creates a clack:<uuid> session in OpenClaw. These are small, isolated sessions — one per call — so voice conversations don't pollute your main agent context.
Session Picker
The session picker in the iOS app provides
context injection only. When you select a session key, it is added as text context to the LLM prompt — it does not change routing. All voice calls still create their own
clack:<uuid> session.
User Context
Users can provide persistent context that gets injected into the system prompt for every voice call. This lets the AI know about the user's preferences, notes, or any background information.
How to set context
- - App text field: In the Clack app under Settings → Context, enter free-form text
- Session picker: Select an OpenClaw session to inject its content as context
- WebSocket message: Send
{"type": "set_context", "text": "..."} during a voice session - HTTP API:
PUT /context?token=...&text=... or POST /context with JSON body INLINECODE17
Context is sanitized before saving — only natural-language characters are kept (letters, numbers, common punctuation). IP addresses and domains are stripped. The server returns the sanitized text in the response so the app can show the user exactly what will be sent as context.
Context persists across calls and server restarts. Clear it via DELETE /context or by sending an empty set_context message.
Conversation History
The relay maintains a shared history file across calls for continuity. History is stored as JSON in CLACK_HISTORY_DIR (default: /var/lib/clack/history).
- - Max messages: 50 (configurable via
CLACK_MAX_HISTORY) - History persists across calls and server restarts
- Viewable via
GET /history, clearable via INLINECODE24
Echo Test Mode
For testing audio round-trips without using LLM credits:
- - Server-wide: Set
CLACK_ECHO_MODE=true environment variable - Per-session: Send
{"type":"start","config":{"echo":true}} from the client
In echo mode, transcribed text is echoed back through TTS instead of being sent to the LLM. Audio is peak-normalized with capped gain to ensure consistent playback volume.
Provider Selection
STT and TTS providers can be configured independently per session. The server auto-detects all available providers at startup based on which API keys are set (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY).
Available modes per direction (STT / TTS):
- - On-device (local): Uses Apple's built-in speech frameworks. Zero API costs.
- Server provider: ElevenLabs, OpenAI, or Deepgram — whichever keys are configured.
How it works:
- 1. App fetches
GET /info to discover available providers - User picks STT and TTS providers independently in Settings → Voice
- On call start, the app sends
sttProvider and ttsProvider in the session config - Server creates the appropriate provider instances per session
Example combinations:
| STT | TTS | Use case |
|---|
| ElevenLabs | ElevenLabs | Full cloud — best quality |
| On-device |
ElevenLabs | Save STT costs, keep premium voices |
| On-device | On-device | Fully local — zero API usage, works offline |
| OpenAI | Deepgram | Mix providers freely |
Cost optimization: Use on-device STT (free, unlimited) with a premium cloud TTS voice — get great output quality while eliminating transcription costs entirely. Or go fully on-device for zero API spend.
Text input mode
When STT is set to on-device, the client sends transcribed text instead of audio:
CODEBLOCK6
When TTS is set to on-device, the server returns response_text only and skips audio synthesis.
AI Response Rules
- - Responses are enforced to 1–3 sentences for natural voice conversation
- Server-side max_tokens: 150 to prevent runaway responses
- Server-side max input: 300 characters (
CLACK_MAX_INPUT_CHARS) — transcripts exceeding this are truncated
HTTP Endpoints
| Endpoint | Method | Auth | Description |
|---|
| INLINECODE35 | GET | No | Health check — returns service status |
| INLINECODE36 |
POST | No | Redeem pairing code → get auth token (rate-limited) |
|
GET /pair | GET | Yes | Generate one-time pairing code |
|
GET /info | GET | Yes | Server info: agent name, available STT/TTS providers |
|
GET /voices | GET | Yes | List available TTS voices |
|
GET /sessions | GET | Yes | List active sessions |
|
GET /history | GET | Yes | Get conversation history |
|
DELETE /history | DELETE | Yes | Clear conversation history |
|
GET /context | GET | Yes | Get current user context |
|
PUT /context | PUT | Yes | Set user context (query param
text) |
|
POST /context | POST | Yes | Set user context (JSON body
{"text": "..."}) |
|
DELETE /context | DELETE | Yes | Clear user context |
|
WebSocket /voice | WS | Yes | Voice relay connection |
WebSocket Protocol
Endpoint: INLINECODE50
Client → Server
| Message | Format | Description |
|---|
| INLINECODE51 | JSON | Start session. Config: voice, systemPrompt, echo, sttProvider, INLINECODE56 |
| Binary frames |
bytes | Raw PCM audio (16kHz, 16-bit, mono) |
|
{"type":"text_input","text":"..."} | JSON | Local speech mode — send text directly |
|
{"type":"end_speech"} | JSON | Signal end of speech, triggers processing |
|
{"type":"interrupt"} | JSON | Cancel current TTS playback |
|
{"type":"ping"} | JSON | Keepalive |
|
{"type":"set_context","text":"..."} | JSON | Set user context (sanitized before saving) |
|
{"type":"auth","token":"..."} | JSON | Authenticate (alternative to query param) |
Server → Client
| Message | Format | Description |
|---|
| INLINECODE63 | JSON | Session ready |
| INLINECODE64 / INLINECODE65 |
JSON | Auth result |
|
{"type":"processing","stage":"..."} | JSON | Stage:
transcribing,
thinking,
speaking,
filtered |
|
{"type":"transcript","text":"...","final":true} | JSON | STT result |
|
{"type":"response_text","text":"..."} | JSON | LLM text response |
|
{"type":"response_start","format":"pcm_16000"} | JSON | Audio stream starting |
| Binary frames | bytes | TTS audio (PCM 16kHz, 16-bit, mono) |
|
{"type":"response_end"} | JSON | Audio stream done |
|
{"type":"tts_cancelled"} | JSON | TTS playback was interrupted |
|
{"type":"context_updated","text":"..."} | JSON | Context saved —
text contains the sanitized version |
|
{"type":"context_cleared"} | JSON | Context was cleared |
Features
- - Multi-provider STT/TTS: ElevenLabs, OpenAI, and Deepgram support
- Independent voice input/output configuration: Choose STT and TTS providers separately — full control over how your voice is transcribed and how the AI speaks back
- On-device speech: Apple speech frameworks for STT and/or TTS — zero API costs, mix with cloud providers freely
- Cost optimization: Use free on-device transcription with premium cloud voices, or go fully local for zero spend
- Voice response rules: AI responses enforced short (1-3 sentences, max_tokens 150)
- Input length limiting: Configurable max transcript length (default 300 chars)
- Confidence filtering: Low-confidence STT results are discarded
- Echo detection: Prevents feedback loops (TTS → mic → STT)
- Echo test mode: Test audio pipeline without LLM (server-wide or per-session)
- Audio normalization: Peak normalization with capped gain for echo mode playback
- Audio chunking: Long recordings auto-split for reliable transcription
- Hallucination detection: Filters repetitive/nonsense STT output
- Interrupt/TTS cancellation: Cancel in-progress TTS for all providers
- Pairing system: Rate-limited one-time codes for secure device pairing
- Session isolation: Each call gets its own
clack:<uuid> session - Conversation history: Shared across calls, 50 messages max, persistent
- Token auth: Constant-time HMAC verification
- Keepalive pings: Prevents client timeout during long LLM responses
- Silence detection: Default threshold 220, configurable range 20–1000
- Auto-restart: systemd restarts on crash
Voice Configuration
20 built-in ElevenLabs voices available. Default: Will. Pass voice name or ID in session config:
CODEBLOCK7
Available aliases: will, aria, roger, sarah, laura, charlie, george, callum, river, liam, charlotte, alice, matilda, jessica, eric, chris, brian, daniel, lily, bill.
Environment Variables
| Variable | Default | Description |
|---|
| INLINECODE81 | — | Required. Client auth token (32-char) |
| INLINECODE82 |
http://127.0.0.1:18789 | OpenClaw Gateway URL |
|
OPENCLAW_GATEWAY_TOKEN | — | Gateway bearer token |
|
STT_PROVIDER |
elevenlabs | STT provider (
elevenlabs,
openai,
deepgram) |
|
TTS_PROVIDER |
elevenlabs | TTS provider (
elevenlabs,
openai,
deepgram) |
|
TTS_VOICE |
Will | Default voice (name or ID) |
|
VOICE_RELAY_PORT |
9878 | Server port |
|
CLACK_ECHO_MODE |
false | Enable echo test mode server-wide |
|
CLACK_MAX_INPUT_CHARS |
300 | Max transcript length (chars) |
|
CLACK_HISTORY_DIR |
/var/lib/clack/history | History file storage directory |
|
CLACK_MAX_HISTORY |
50 | Max conversation history messages |
|
CLACK_AGENT_NAME |
Storm | Agent name shown in the iOS app |
Provider API keys (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY) are stored in config.json with restricted file permissions, not as environment variables. The setup script manages these interactively.
Clack
WebSocket 中继服务器,支持与 OpenClaw 代理进行实时语音对话。
流程: 客户端音频(PCM 16kHz/16位/单声道)→ STT → OpenClaw 网关 → TTS → PCM 音频返回客户端。
每会话提供商选择: 客户端可独立为每次通话选择 STT 和 TTS 提供商——支持设备端(Apple 语音框架)和服务器端提供商(ElevenLabs、OpenAI、Deepgram)的任意组合。服务器根据配置的 API 密钥自动检测所有可用提供商,并通过 /info 接口暴露。
前置条件
- - Python 3.10+
- 至少一个提供商的 API 密钥(ElevenLabs、OpenAI 或 Deepgram)——本地语音模式无需密钥
- 启用了 chatCompletions 端点的 OpenClaw 网关
- Root/sudo 权限(用于 systemd)
- 安全连接: 带 SSL 的域名(推荐)或 Tailscale
安装
运行安装脚本。它会创建虚拟环境、安装依赖、提示输入 API 密钥、配置 systemd 服务,并可选择设置 SSL。
bash
sudo bash scripts/setup.sh
该脚本会自动检测您的 OpenClaw 网关配置,并交互式提示输入提供商 API 密钥(ElevenLabs、OpenAI、Deepgram——均为可选)。重新运行时,可保留、更新或删除现有密钥。
选项
bash
bash scripts/setup.sh [--port 9878] [--domain clack.example.com]
| 标志 | 默认值 | 描述 |
|---|
| --port | 9878 | 中继服务器端口 |
| --domain |
(无) | SSL 设置的域名(启用 WSS) |
连接模式
所有连接均已加密。应用支持两种模式:
带 SSL 的域名(推荐):
bash
bash scripts/setup.sh --domain clack.yourdomain.com
→ wss://clack.yourdomain.com/voice
需要将域名的 DNS A 记录指向您的服务器 IP。安装脚本通过 Caddy 自动配置 SSL。您可以使用 DuckDNS 的免费域名或自己的域名。
Tailscale:
bash
在服务器上安装 Tailscale,然后使用您的 Tailscale IP 从应用连接
→ ws://100.x.x.x:9878/voice(在网络层加密)
无需域名或 SSL 设置。Tailscale 在网络层加密所有流量。在服务器和手机上安装 Tailscale,然后在应用中使用服务器的 Tailscale IP。
安全说明: 端口 9878 应对公共互联网进行防火墙保护。仅允许通过 localhost(用于 Caddy 反向代理)和 Tailscale 访问。应用不支持未加密的公共连接。
启用 OpenClaw 网关端点
网关必须启用 chatCompletions。应用以下配置补丁:
json
{http: {endpoints: {chatCompletions: {enabled: true}}}}
管理
bash
clack status # 检查服务状态
clack restart # 重启服务器
clack logs # 查看日志
clack pair # 生成新的配对码
clack update # 拉取最新代码并重启
clack setup # 重新运行交互式设置(稍后添加 SSL、更新密钥等)
clack uninstall # 移除服务和虚拟环境
客户端应用
📱 iOS — 可在 App Store 获取(或从 github.com/fbn3799/clack-app 源码构建)
🤖 Android — 即将推出!
安全性
身份验证
除 GET /health 和 POST /pair 外的所有端点都需要有效的身份验证令牌(RELAY
AUTHTOKEN)。令牌使用恒定时间 HMAC 比较进行验证,以防止时序攻击。
配对系统
- - 6 位字母数字一次性代码(约 21 亿种组合)
- 代码在 5 分钟(TTL)后过期,且为一次性使用
- 速率限制: 每 IP 每 5 分钟 5 次尝试——超过后返回 HTTP 429
- 失败尝试后 2 秒延迟以减缓暴力破解
- 生成代码需要管理员身份验证令牌(GET /pair)
- 兑换代码是公开的但受速率限制(POST /pair)
加密连接
- - 域名模式: 通过 Caddy 使用自动 SSL 证书的 WSS(WebSocket Secure)
- Tailscale 模式: 网络层的 WireGuard 加密
- 应用强制使用加密连接——不支持未加密的公共访问
- 端口 9878 应受防火墙保护;仅可通过 localhost 和 Tailscale 访问
输入清理
所有面向用户的文本输入在处理前均经过清理:
- - 语音转录: 上限为 300 个字符(CLACKMAXINPUT_CHARS),回声检测过滤反馈循环,幻觉检测丢弃无意义的 STT 输出
- 用户上下文: 仅保留自然语言字符(字母、数字、常见标点、空白)。控制字符、转义序列和不可打印字符被移除。上限为 1000 个字符。上下文在注入系统提示前被包裹在显式分隔符中。
- 无 shell 执行: 所有外部通信使用结构化的 HTTP/WebSocket API。用户输入从不传递给 shell。
数据隐私
- - 无分析、跟踪或遥测
- 语音音频仅发送到您的服务器和您选择的提供商
- iOS 应用仅在本地存储设置(服务器地址、令牌、偏好设置)
- 第三方 API 使用取决于您的提供商配置(ElevenLabs、OpenAI、Deepgram)
会话路由
每次语音通话在 OpenClaw 中创建一个 clack: 会话。这些是小型、隔离的会话——每次通话一个——因此语音对话不会污染您的主代理上下文。
会话选择器
iOS 应用中的会话选择器仅提供
上下文注入。当您选择会话密钥时,它作为文本上下文添加到 LLM 提示中——它不会改变路由。所有语音通话仍然创建自己的 clack:
会话。
用户上下文
用户可以提供持久上下文,该上下文被注入到每次语音通话的系统提示中。这让 AI 了解用户的偏好、笔记或任何背景信息。
如何设置上下文
- - 应用文本字段: 在 Clack 应用的设置 → 上下文中,输入自由格式文本
- 会话选择器: 选择一个 OpenClaw 会话以将其内容作为上下文注入
- WebSocket 消息: 在语音会话期间发送 {type: set_context, text: ...}
- HTTP API: PUT /context?token=...&text=... 或 POST /context 带 JSON 主体 {text: ...}
上下文在保存前经过清理——仅保留自然语言字符(字母、数字、常见标点)。IP 地址和域名被移除。服务器在响应中返回清理后的文本,以便应用向用户准确显示将作为上下文发送的内容。
上下文在通话和服务器重启之间持久存在。通过 DELETE /context 或发送空的 set_context 消息清除。
对话历史
中继维护一个跨通话的共享历史文件以实现连续性。历史记录以 JSON 格式存储在 CLACKHISTORYDIR(默认:/var/lib/clack/history)中。
- - 最大消息数: 50(可通过 CLACKMAXHISTORY 配置)
- 历史记录在通话和服务器重启之间持久存在
- 可通过 GET /history 查看,通过 DELETE /history 清除
回声测试模式
用于测试音频往返而不消耗 LLM 额度:
- - 服务器范围: 设置 CLACKECHOMODE=true 环境变量
- 每会话: 从客户端发送 {type:start,config:{echo:true}}
在回声模式下,转录的文本通过 TTS 回显,而不是发送到 LLM。音频经过峰值归一化,增益上限确保一致的播放音量。
提供商选择
STT 和 TTS 提供商可独立配置每个会话。服务器在启动时根据设置的 API 密钥(ELEVENLABSAPIKEY、OPENAIAPIKEY、DEEPGRAMAPIKEY)自动检测所有可用提供商。
每个方向(STT / TTS)的可用模式:
- - 设备端(本地): 使用 Apple 内置的语音框架。零 API 成本。
- 服务器提供商: ElevenLabs、OpenAI 或 Deepgram——取决于配置了哪些密钥。
工作原理:
- 1. 应用获取 GET /info 以发现可用提供商
- 用户在设置