Clack

WebSocket relay server that enables real-time voice conversations with an OpenClaw agent.

Flow: Client audio (PCM 16kHz/16-bit/mono) → STT → OpenClaw Gateway → TTS → PCM audio back to client.

Per-session provider selection: The client can independently choose STT and TTS providers per call — any combination of on-device (Apple speech frameworks) and server-side providers (ElevenLabs, OpenAI, Deepgram). The server auto-detects all available providers based on configured API keys and exposes them via /info.

Prerequisites

- Python 3.10+
API key for at least one provider (ElevenLabs, OpenAI, or Deepgram) — not needed for local speech mode
OpenClaw Gateway with chatCompletions endpoint enabled
Root/sudo access (for systemd)
Secure connection: Domain with SSL (recommended) or Tailscale

Setup

Run the setup script. It creates a venv, installs deps, prompts for API keys, configures a systemd service, and optionally sets up SSL.

CODEBLOCK0

The script auto-detects your OpenClaw gateway config and interactively prompts for provider API keys (ElevenLabs, OpenAI, Deepgram — all optional). On re-runs, existing keys can be kept, updated, or deleted.

Options

CODEBLOCK1

Flag	Default	Description
INLINECODE2	INLINECODE3	Relay server port
INLINECODE4

(none) | Domain for SSL setup (enables WSS) |

Connection modes

All connections are encrypted. The app supports two modes:

Domain with SSL (recommended):

bash scripts/setup.sh --domain clack.yourdomain.com
# → wss://clack.yourdomain.com/voice

Requires a DNS A record pointing the domain to your server IP. The setup script auto-configures SSL via Caddy. You can use a free domain from DuckDNS or your own.

Tailscale:

# Install Tailscale on your server, then connect from the app using your Tailscale IP
# → ws://100.x.x.x:9878/voice (encrypted at network level)

No domain or SSL setup needed. Tailscale encrypts all traffic at the network layer. Install Tailscale on both your server and phone, then use the server's Tailscale IP in the app.

Security note: Port 9878 should be firewalled from the public internet. Only allow access via localhost (for Caddy reverse proxy) and Tailscale. The app does not support unencrypted public connections.

Enable OpenClaw Gateway endpoint

The gateway must have chatCompletions enabled. Apply this config patch:

CODEBLOCK4

Management

CODEBLOCK5

Client App

📱 iOS — Available on the App Store (or build from source at github.com/fbn3799/clack-app)
🤖 Android — Coming soon!

Security

Authentication

All endpoints except GET /health and POST /pair require a valid auth token (RELAY_AUTH_TOKEN). Tokens are verified using constant-time HMAC comparison to prevent timing attacks.

Pairing System

- 6-character alphanumeric one-time codes (~2.1 billion combinations)
Codes expire after 5 minutes (TTL) and are single-use
Rate limited: 5 attempts per IP per 5 minutes — returns HTTP 429 after
2-second delay on failed attempts to slow brute force
Generating a code requires the admin auth token (GET /pair)
Redeeming a code is public but rate-limited (POST /pair)

Encrypted Connections

- Domain mode: WSS (WebSocket Secure) via Caddy with automatic SSL certificates
Tailscale mode: WireGuard encryption at the network layer
The app enforces encrypted connections — no unencrypted public access
Port 9878 should be firewalled; only accessible via localhost and Tailscale

Input Sanitization

All user-facing text inputs are sanitized before processing:

- Voice transcripts: Capped at 300 characters (CLACK_MAX_INPUT_CHARS), echo detection filters feedback loops, hallucination detection discards nonsense STT output
User context: Stripped to natural-language characters only (letters, numbers, common punctuation, whitespace). Control characters, escape sequences, and non-printable characters are removed. Capped at 1000 characters. Context is wrapped in explicit delimiters before injection into the system prompt.
No shell execution: All external communication uses structured HTTP/WebSocket APIs. No user input is ever passed to a shell.

Data Privacy

- No analytics, tracking, or telemetry
Voice audio goes to your server and only to the providers you choose
The iOS app stores only settings locally (server address, token, preferences)
Third-party API usage depends on your provider config (ElevenLabs, OpenAI, Deepgram)

Session Routing

Each voice call creates a clack:<uuid> session in OpenClaw. These are small, isolated sessions — one per call — so voice conversations don't pollute your main agent context.

Session Picker

The session picker in the iOS app provides context injection only. When you select a session key, it is added as text context to the LLM prompt — it does not change routing. All voice calls still create their own clack:<uuid> session.

User Context

Users can provide persistent context that gets injected into the system prompt for every voice call. This lets the AI know about the user's preferences, notes, or any background information.

How to set context

- App text field: In the Clack app under Settings → Context, enter free-form text
Session picker: Select an OpenClaw session to inject its content as context
WebSocket message: Send {"type": "set_context", "text": "..."} during a voice session
HTTP API: PUT /context?token=...&text=... or POST /context with JSON body INLINECODE17

Context is sanitized before saving — only natural-language characters are kept (letters, numbers, common punctuation). IP addresses and domains are stripped. The server returns the sanitized text in the response so the app can show the user exactly what will be sent as context.

Context persists across calls and server restarts. Clear it via DELETE /context or by sending an empty set_context message.

Conversation History

The relay maintains a shared history file across calls for continuity. History is stored as JSON in CLACK_HISTORY_DIR (default: /var/lib/clack/history).

- Max messages: 50 (configurable via CLACK_MAX_HISTORY)
History persists across calls and server restarts
Viewable via GET /history, clearable via INLINECODE24

Echo Test Mode

For testing audio round-trips without using LLM credits:

- Server-wide: Set CLACK_ECHO_MODE=true environment variable
Per-session: Send {"type":"start","config":{"echo":true}} from the client

In echo mode, transcribed text is echoed back through TTS instead of being sent to the LLM. Audio is peak-normalized with capped gain to ensure consistent playback volume.

Provider Selection

STT and TTS providers can be configured independently per session. The server auto-detects all available providers at startup based on which API keys are set (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY).

Available modes per direction (STT / TTS):

- On-device (local): Uses Apple's built-in speech frameworks. Zero API costs.
Server provider: ElevenLabs, OpenAI, or Deepgram — whichever keys are configured.

How it works:

1. App fetches GET /info to discover available providers
User picks STT and TTS providers independently in Settings → Voice
On call start, the app sends sttProvider and ttsProvider in the session config
Server creates the appropriate provider instances per session

Example combinations:
STT TTS Use case
ElevenLabs ElevenLabs Full cloud — best quality
On-device
ElevenLabs | Save STT costs, keep premium voices |

STT	TTS	Use case
ElevenLabs	ElevenLabs	Full cloud — best quality
On-device

Cost optimization: Use on-device STT (free, unlimited) with a premium cloud TTS voice — get great output quality while eliminating transcription costs entirely. Or go fully on-device for zero API spend.

Text input mode

When STT is set to on-device, the client sends transcribed text instead of audio:

CODEBLOCK6

When TTS is set to on-device, the server returns response_text only and skips audio synthesis.

AI Response Rules

- Responses are enforced to 1–3 sentences for natural voice conversation
Server-side max_tokens: 150 to prevent runaway responses
Server-side max input: 300 characters (CLACK_MAX_INPUT_CHARS) — transcripts exceeding this are truncated

HTTP Endpoints

Endpoint	Method	Auth	Description
INLINECODE35	GET	No	Health check — returns service status
INLINECODE36

POST | No | Redeem pairing code → get auth token (rate-limited) | | GET /pair | GET | Yes | Generate one-time pairing code | | GET /info | GET | Yes | Server info: agent name, available STT/TTS providers | | GET /voices | GET | Yes | List available TTS voices | | GET /sessions | GET | Yes | List active sessions | | GET /history | GET | Yes | Get conversation history | | DELETE /history | DELETE | Yes | Clear conversation history | | GET /context | GET | Yes | Get current user context | | PUT /context | PUT | Yes | Set user context (query param text) | | POST /context | POST | Yes | Set user context (JSON body {"text": "..."}) | | DELETE /context | DELETE | Yes | Clear user context | | WebSocket /voice | WS | Yes | Voice relay connection |

WebSocket Protocol

Endpoint: INLINECODE50

Client → Server

Message	Format	Description
INLINECODE51	JSON	Start session. Config: `voice`, `systemPrompt`, `echo`, `sttProvider`, INLINECODE56
Binary frames

bytes | Raw PCM audio (16kHz, 16-bit, mono) | | {"type":"text_input","text":"..."} | JSON | Local speech mode — send text directly | | {"type":"end_speech"} | JSON | Signal end of speech, triggers processing | | {"type":"interrupt"} | JSON | Cancel current TTS playback | | {"type":"ping"} | JSON | Keepalive | | {"type":"set_context","text":"..."} | JSON | Set user context (sanitized before saving) | | {"type":"auth","token":"..."} | JSON | Authenticate (alternative to query param) |

Server → Client

Message	Format	Description
INLINECODE63	JSON	Session ready
INLINECODE64 / INLINECODE65

JSON | Auth result | | {"type":"processing","stage":"..."} | JSON | Stage: transcribing, thinking, speaking, filtered | | {"type":"transcript","text":"...","final":true} | JSON | STT result | | {"type":"response_text","text":"..."} | JSON | LLM text response | | {"type":"response_start","format":"pcm_16000"} | JSON | Audio stream starting | | Binary frames | bytes | TTS audio (PCM 16kHz, 16-bit, mono) | | {"type":"response_end"} | JSON | Audio stream done | | {"type":"tts_cancelled"} | JSON | TTS playback was interrupted | | {"type":"context_updated","text":"..."} | JSON | Context saved — text contains the sanitized version | | {"type":"context_cleared"} | JSON | Context was cleared |

Features

- Multi-provider STT/TTS: ElevenLabs, OpenAI, and Deepgram support
Independent voice input/output configuration: Choose STT and TTS providers separately — full control over how your voice is transcribed and how the AI speaks back
On-device speech: Apple speech frameworks for STT and/or TTS — zero API costs, mix with cloud providers freely
Cost optimization: Use free on-device transcription with premium cloud voices, or go fully local for zero spend
Voice response rules: AI responses enforced short (1-3 sentences, max_tokens 150)
Input length limiting: Configurable max transcript length (default 300 chars)
Confidence filtering: Low-confidence STT results are discarded
Echo detection: Prevents feedback loops (TTS → mic → STT)
Echo test mode: Test audio pipeline without LLM (server-wide or per-session)
Audio normalization: Peak normalization with capped gain for echo mode playback
Audio chunking: Long recordings auto-split for reliable transcription
Hallucination detection: Filters repetitive/nonsense STT output
Interrupt/TTS cancellation: Cancel in-progress TTS for all providers
Pairing system: Rate-limited one-time codes for secure device pairing
Session isolation: Each call gets its own clack:<uuid> session
Conversation history: Shared across calls, 50 messages max, persistent
Token auth: Constant-time HMAC verification
Keepalive pings: Prevents client timeout during long LLM responses
Silence detection: Default threshold 220, configurable range 20–1000
Auto-restart: systemd restarts on crash

Voice Configuration

20 built-in ElevenLabs voices available. Default: Will. Pass voice name or ID in session config:

CODEBLOCK7

Available aliases: will, aria, roger, sarah, laura, charlie, george, callum, river, liam, charlotte, alice, matilda, jessica, eric, chris, brian, daniel, lily, bill.

Environment Variables

Variable	Default	Description
INLINECODE81	—	Required. Client auth token (32-char)
INLINECODE82

Provider API keys (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY) are stored in config.json with restricted file permissions, not as environment variables. The setup script manages these interactively.

Clack

WebSocket 中继服务器，支持与 OpenClaw 代理进行实时语音对话。

流程： 客户端音频（PCM 16kHz/16位/单声道）→ STT → OpenClaw 网关 → TTS → PCM 音频返回客户端。

每会话提供商选择： 客户端可独立为每次通话选择 STT 和 TTS 提供商——支持设备端（Apple 语音框架）和服务器端提供商（ElevenLabs、OpenAI、Deepgram）的任意组合。服务器根据配置的 API 密钥自动检测所有可用提供商，并通过 /info 接口暴露。

前置条件

- Python 3.10+
至少一个提供商的 API 密钥（ElevenLabs、OpenAI 或 Deepgram）——本地语音模式无需密钥
启用了 chatCompletions 端点的 OpenClaw 网关
Root/sudo 权限（用于 systemd）
安全连接： 带 SSL 的域名（推荐）或 Tailscale

安装

运行安装脚本。它会创建虚拟环境、安装依赖、提示输入 API 密钥、配置 systemd 服务，并可选择设置 SSL。

bash
sudo bash scripts/setup.sh

该脚本会自动检测您的 OpenClaw 网关配置，并交互式提示输入提供商 API 密钥（ElevenLabs、OpenAI、Deepgram——均为可选）。重新运行时，可保留、更新或删除现有密钥。

选项

bash
bash scripts/setup.sh [--port 9878] [--domain clack.example.com]

标志	默认值	描述
--port	9878	中继服务器端口
--domain

（无） | SSL 设置的域名（启用 WSS） |

连接模式

所有连接均已加密。应用支持两种模式：

带 SSL 的域名（推荐）：
bash
bash scripts/setup.sh --domain clack.yourdomain.com

→ wss://clack.yourdomain.com/voice

需要将域名的 DNS A 记录指向您的服务器 IP。安装脚本通过 Caddy 自动配置 SSL。您可以使用 DuckDNS 的免费域名或自己的域名。

Tailscale：
bash

在服务器上安装 Tailscale，然后使用您的 Tailscale IP 从应用连接

→ ws://100.x.x.x:9878/voice（在网络层加密）

无需域名或 SSL 设置。Tailscale 在网络层加密所有流量。在服务器和手机上安装 Tailscale，然后在应用中使用服务器的 Tailscale IP。

安全说明： 端口 9878 应对公共互联网进行防火墙保护。仅允许通过 localhost（用于 Caddy 反向代理）和 Tailscale 访问。应用不支持未加密的公共连接。

启用 OpenClaw 网关端点

网关必须启用 chatCompletions。应用以下配置补丁：

json
{http: {endpoints: {chatCompletions: {enabled: true}}}}

管理

bash
clack status # 检查服务状态
clack restart # 重启服务器
clack logs # 查看日志
clack pair # 生成新的配对码
clack update # 拉取最新代码并重启
clack setup # 重新运行交互式设置（稍后添加 SSL、更新密钥等）
clack uninstall # 移除服务和虚拟环境

客户端应用

📱 iOS — 可在 App Store 获取（或从 github.com/fbn3799/clack-app 源码构建）
🤖 Android — 即将推出！

安全性

身份验证

除 GET /health 和 POST /pair 外的所有端点都需要有效的身份验证令牌（RELAYAUTHTOKEN）。令牌使用恒定时间 HMAC 比较进行验证，以防止时序攻击。

配对系统

- 6 位字母数字一次性代码（约 21 亿种组合）
代码在 5 分钟（TTL）后过期，且为一次性使用
速率限制： 每 IP 每 5 分钟 5 次尝试——超过后返回 HTTP 429
失败尝试后 2 秒延迟以减缓暴力破解
生成代码需要管理员身份验证令牌（GET /pair）
兑换代码是公开的但受速率限制（POST /pair）

加密连接

- 域名模式： 通过 Caddy 使用自动 SSL 证书的 WSS（WebSocket Secure）
Tailscale 模式： 网络层的 WireGuard 加密
应用强制使用加密连接——不支持未加密的公共访问
端口 9878 应受防火墙保护；仅可通过 localhost 和 Tailscale 访问

输入清理

所有面向用户的文本输入在处理前均经过清理：

- 语音转录： 上限为 300 个字符（CLACKMAXINPUT_CHARS），回声检测过滤反馈循环，幻觉检测丢弃无意义的 STT 输出
用户上下文： 仅保留自然语言字符（字母、数字、常见标点、空白）。控制字符、转义序列和不可打印字符被移除。上限为 1000 个字符。上下文在注入系统提示前被包裹在显式分隔符中。
无 shell 执行： 所有外部通信使用结构化的 HTTP/WebSocket API。用户输入从不传递给 shell。

数据隐私

- 无分析、跟踪或遥测
语音音频仅发送到您的服务器和您选择的提供商
iOS 应用仅在本地存储设置（服务器地址、令牌、偏好设置）
第三方 API 使用取决于您的提供商配置（ElevenLabs、OpenAI、Deepgram）

会话路由

每次语音通话在 OpenClaw 中创建一个 clack: 会话。这些是小型、隔离的会话——每次通话一个——因此语音对话不会污染您的主代理上下文。

会话选择器

iOS 应用中的会话选择器仅提供上下文注入。当您选择会话密钥时，它作为文本上下文添加到 LLM 提示中——它不会改变路由。所有语音通话仍然创建自己的 clack: 会话。

用户上下文

用户可以提供持久上下文，该上下文被注入到每次语音通话的系统提示中。这让 AI 了解用户的偏好、笔记或任何背景信息。

如何设置上下文

- 应用文本字段： 在 Clack 应用的设置 → 上下文中，输入自由格式文本
会话选择器： 选择一个 OpenClaw 会话以将其内容作为上下文注入
WebSocket 消息： 在语音会话期间发送 {type: set_context, text: ...}
HTTP API： PUT /context?token=...&text=... 或 POST /context 带 JSON 主体 {text: ...}

上下文在保存前经过清理——仅保留自然语言字符（字母、数字、常见标点）。IP 地址和域名被移除。服务器在响应中返回清理后的文本，以便应用向用户准确显示将作为上下文发送的内容。

上下文在通话和服务器重启之间持久存在。通过 DELETE /context 或发送空的 set_context 消息清除。

对话历史

中继维护一个跨通话的共享历史文件以实现连续性。历史记录以 JSON 格式存储在 CLACKHISTORYDIR（默认：/var/lib/clack/history）中。

- 最大消息数： 50（可通过 CLACKMAXHISTORY 配置）
历史记录在通话和服务器重启之间持久存在
可通过 GET /history 查看，通过 DELETE /history 清除

回声测试模式

用于测试音频往返而不消耗 LLM 额度：

- 服务器范围： 设置 CLACKECHOMODE=true 环境变量
每会话： 从客户端发送 {type:start,config:{echo:true}}

在回声模式下，转录的文本通过 TTS 回显，而不是发送到 LLM。音频经过峰值归一化，增益上限确保一致的播放音量。

提供商选择

STT 和 TTS 提供商可独立配置每个会话。服务器在启动时根据设置的 API 密钥（ELEVENLABSAPIKEY、OPENAIAPIKEY、DEEPGRAMAPIKEY）自动检测所有可用提供商。

每个方向（STT / TTS）的可用模式：

- 设备端（本地）： 使用 Apple 内置的语音框架。零 API 成本。
服务器提供商： ElevenLabs、OpenAI 或 Deepgram——取决于配置了哪些密钥。

工作原理：

1. 应用获取 GET /info 以发现可用提供商
用户在设置

clack部署管理Clack

clack

Clack

Prerequisites

Setup

Options

Connection modes

Enable OpenClaw Gateway endpoint

Management

Client App

Security

Authentication

Pairing System

Encrypted Connections

Input Sanitization

Data Privacy

Session Routing

Session Picker

User Context

How to set context

Conversation History

Echo Test Mode

Provider Selection

Available modes per direction (STT / TTS):

How it works:

Example combinations:STTTTSUse caseElevenLabsElevenLabsFull cloud — best qualityOn-device ElevenLabs | Save STT costs, keep premium voices |

Text input mode

AI Response Rules

HTTP Endpoints

WebSocket Protocol

Client → Server

Server → Client

Features

Voice Configuration

Environment Variables

Clack

前置条件

安装

选项

连接模式

→ wss://clack.yourdomain.com/voice

在服务器上安装 Tailscale，然后使用您的 Tailscale IP 从应用连接

→ ws://100.x.x.x:9878/voice（在网络层加密）

启用 OpenClaw 网关端点

管理

客户端应用

安全性

身份验证

配对系统

加密连接

输入清理

数据隐私

会话路由

会话选择器

用户上下文

如何设置上下文

对话历史

回声测试模式

提供商选择

每个方向（STT / TTS）的可用模式：

工作原理：

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement

Example combinations:
STT TTS Use case
ElevenLabs ElevenLabs Full cloud — best quality
On-device
ElevenLabs | Save STT costs, keep premium voices |