Local TTS Workflow
Use this skill to debug the actual speech pipeline and to prepare text so the model reads it sanely.
Do not hardcode 127.0.0.1 blindly. Read the active OpenClaw config first and use the current messages.tts.openai.baseUrl as the source of truth.
Current known deployment in this workspace: http://127.0.0.1:8000/v1.
Current local model-path fallback worth remembering: if the server did not pull a model by registry name, it may be loading directly from a local path such as ./models/qwen3-tts-0.6b-mlx.
When exact route shape matters, the local OpenAPI document is available at:
- http://localhost:8000/openapi.json
Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.
Core rule: normalize numbers before synthesis
If text is meant to be spoken aloud, do not leave Arabic numerals in the final TTS input.
Convert them into words first.
Examples:
- Chinese output: write 一 二 三, not 123
- English output: write one two three, not 123
This rule matters because the TTS model can go weird or read digits badly when fed raw numerals.
When preparing spoken text, normalize:
- dates
- times
- counts
- version-like strings if they will be read aloud
- mixed Chinese/English numeric snippets
If preserving exact machine-readable formatting matters, keep one copy for display and a separate normalized copy for TTS.
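A minimal sketch of the display/TTS split, assuming a plain digit-by-digit reading is acceptable. The function names and word tables are illustrative and not part of the server or OpenClaw. Dates, times, and counts usually need context-aware wording that this sketch does not attempt.

```python
import re

# Digit-to-word tables. Only single digits are covered: this is a
# digit-by-digit reading, not a full number-to-words converter.
DIGIT_WORDS = {
    "en": ["zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"],
    "zh": ["零", "一", "二", "三", "四", "五", "六", "七", "八", "九"],
}


def normalize_digits(text: str, lang: str = "en") -> str:
    """Replace every run of ASCII digits with space-separated digit words."""
    words = DIGIT_WORDS[lang]

    def spell(match: re.Match) -> str:
        return " ".join(words[int(ch)] for ch in match.group())

    return re.sub(r"[0-9]+", spell, text)


def prepare(text: str, lang: str = "en") -> dict:
    """Keep the exact text for display and a normalized copy for TTS."""
    return {"display": text, "tts": normalize_digits(text, lang)}
```

Calling `prepare("Room 42")` keeps `Room 42` for display and produces `Room four two` for synthesis.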
Workflow
1. Verify the server before touching OpenClaw
Read ~/.openclaw/openclaw.json first and extract:
- messages.tts.provider
- messages.tts.openai.baseUrl
- messages.tts.openai.model
- messages.tts.openai.voice
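The extraction of messages.tts.provider, messages.tts.openai.baseUrl, messages.tts.openai.model, and messages.tts.openai.voice can be sketched as below. It assumes the dotted key names map onto nested JSON objects and that the file is plain JSON; both are assumptions, and the helper names are hypothetical.

```python
import json
from pathlib import Path

# Dotted key names from the list above. That they correspond to nested
# JSON objects is an assumption of this sketch.
TTS_KEYS = [
    "messages.tts.provider",
    "messages.tts.openai.baseUrl",
    "messages.tts.openai.model",
    "messages.tts.openai.voice",
]


def lookup(config: dict, dotted_key: str):
    """Walk a dotted path through nested dicts; return None if a part is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def extract_tts_settings(config: dict) -> dict:
    """Collect the four TTS settings, with None for anything not configured."""
    return {key: lookup(config, key) for key in TTS_KEYS}


def load_openclaw_config(path: str = "~/.openclaw/openclaw.json") -> dict:
    """Assumes plain JSON (no comments or trailing commas)."""
    return json.loads(Path(path).expanduser().read_text(encoding="utf-8"))
```

A `None` value for any key is a prompt to look at the config before testing anything else.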
Check the basics against the actual configured host:
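```bash
# Replace the host with the one from messages.tts.openai.baseUrl if it differs.
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```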
Confirm that the intended TTS model exists.
If the model does not appear by pulled registry name, do not assume TTS is broken — this server may be loading a local-path model such as ./models/qwen3-tts-0.6b-mlx.
If the server is task-gated, ensure TTS is enabled:
```bash
MLX_AUDIO_SERVER_TASKS=tts uv run python server.py
```
2. Prove the raw TTS endpoint works
Always isolate the server from the client stack.
Minimal non-streaming test:
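```bash
# The input text means: "Hello, this is a test that returns the complete audio in one response."
curl http://127.0.0.1:8000/v1/audio/speech \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/lj-qwen3-tts/",
    "voice": "lj",
    "input": "你好,这是一次性返回完整音频的测试。",
    "response_format": "wav",
    "stream": false
  }' \
  --output sample.wav
```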
Basic streaming test:
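```bash
# The input text means: "Hello, this is a real-time streaming speech synthesis test."
curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{
    "model": "/models/lj-qwen3-tts/",
    "voice": "lj",
    "input": "你好,这是实时流式语音合成测试。",
    "response_format": "wav",
    "stream": true,
    "streaming_interval": 2.0
  }' \
  | ffplay -i -
```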
If direct curl works but OpenClaw does not, the bug is probably in the TTS integration or provider selection layer, not the TTS backend.
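The same non-streaming check can be scripted so the URL comes from the configured base URL instead of a hardcoded host. This is a sketch using only the standard library. The function names are hypothetical, and it assumes the speech route sits at `/audio/speech` under the base URL, as it does for the known deployment.

```python
import json
import urllib.request


def build_speech_request(base_url: str, model: str, voice: str, text: str,
                         response_format: str = "wav") -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST to <base_url>/audio/speech."""
    payload = {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": response_format,
        "stream": False,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request: urllib.request.Request, out_path: str) -> int:
    """Send the request and write the audio bytes to disk; returns the byte count."""
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Building the request separately from sending it makes it easy to print the exact URL and body and compare them with what OpenClaw sends.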
3. Distinguish server failure from integration failure
Use this rule:
- Direct curl fails → fix the local TTS server first
- Direct curl works, but OpenClaw sounds wrong or falls back → inspect OpenClaw provider selection, fallback, and request shape
- OpenClaw sends requests but voice/mode is wrong → inspect fields like model, voice, instructions, ref_audio, ref_text, and streaming flags
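The rule can be written down as a small decision function. This is only an illustration of the ordering; the function and parameter names are invented for the sketch, and the three booleans are observations made by hand.

```python
def triage(direct_curl_ok: bool, openclaw_reaches_server: bool,
           voice_and_mode_ok: bool = True) -> str:
    """Map three observations to the next thing to inspect, in priority order."""
    if not direct_curl_ok:
        return "fix the local TTS server first"
    if not openclaw_reaches_server:
        return "inspect OpenClaw provider selection, fallback, and request shape"
    if not voice_and_mode_ok:
        return ("inspect model, voice, instructions, ref_audio, ref_text, "
                "and streaming flags")
    return "no fault isolated"
```

The server check comes first on purpose: nothing observed through OpenClaw is meaningful until direct curl succeeds.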
4. Know the four TTS modes
Use the right request shape for the right model type.
Base speaker
Use built-in speaker playback.
Typical shape:
- model type: base
- no full ref_audio + ref_text pair
- voice.id means built-in speaker name
Base clone
Use clone-style synthesis.
Typical shape:
- model type: base
- must provide both ref_audio and ref_text, or supply a consent voice identity that resolves to both
Hard rule: do not attempt clone with only ref_audio.
CustomVoice
Use a model with prebuilt custom speakers.
Typical shape:
- model type: custom_voice
- voice may be accepted either as a plain string or as {"id":"..."} depending on the server
- for this workspace, lj-qwen3-tts / /models/lj-qwen3-tts/ must use speaker/voice lj
- do not send clone payloads
VoiceDesign
Use style-description-driven synthesis.
Typical shape:
- model type: voice_design
- must provide instructions
- do not send voice, ref_audio, or ref_text
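The four shapes can be checked before a request is sent. The sketch below encodes only the rules stated in this section. The mode labels `base_speaker` and `base_clone` are hypothetical names for the two uses of the base model type, and the consent-identity route for clone is not modeled.

```python
def check_payload(mode: str, payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    has_audio = bool(payload.get("ref_audio"))
    has_text = bool(payload.get("ref_text"))
    problems: list[str] = []

    if mode == "base_speaker":
        if has_audio and has_text:
            problems.append("base speaker must not carry a full ref_audio + ref_text pair")
    elif mode == "base_clone":
        if not (has_audio and has_text):
            problems.append("base clone needs both ref_audio and ref_text")
    elif mode == "custom_voice":
        if has_audio or has_text:
            problems.append("custom_voice must not receive clone payloads")
    elif mode == "voice_design":
        if not payload.get("instructions"):
            problems.append("voice_design needs instructions")
        for field in ("voice", "ref_audio", "ref_text"):
            if field in payload:
                problems.append(f"voice_design must not send {field}")
    else:
        problems.append(f"unknown mode: {mode}")
    return problems
```

For example, `check_payload("base_clone", {"ref_audio": "a.wav"})` reports the missing ref_text rather than letting the request reach the server.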
5. Treat streaming as a real transport choice
This server supports real incremental generation, not fake post-hoc slicing.
Important behavior:
- Current OpenAPI says stream defaults to false
- response_format defaults to mp3
- streaming_interval defaults to 2.0
- Required fields are only model and input
- Extra optional fields exposed by this local server include instruct, voice, speed, gender, pitch, lang_code, ref_audio, ref_text, temperature, top_p, top_k, repetition_penalty, response_format, stream, streaming_interval, max_tokens, and verbose
Do not assume OpenAI parity on names or defaults — check the local OpenAPI schema first.
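As a sketch of what those defaults imply, the helper below fills in the three documented defaults and reports any field that is not in the list above. The values are copied from this section, not read from the live schema, so treat them as a snapshot. The helper name is hypothetical.

```python
# Snapshot of the defaults and field names listed in this section.
LOCAL_DEFAULTS = {"stream": False, "response_format": "mp3", "streaming_interval": 2.0}
REQUIRED_FIELDS = ("model", "input")
OPTIONAL_FIELDS = {
    "instruct", "voice", "speed", "gender", "pitch", "lang_code",
    "ref_audio", "ref_text", "temperature", "top_p", "top_k",
    "repetition_penalty", "response_format", "stream",
    "streaming_interval", "max_tokens", "verbose",
}


def effective_request(payload: dict) -> tuple[dict, list[str]]:
    """Return the payload with local defaults applied, plus unlisted field names."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    unlisted = sorted(set(payload) - set(REQUIRED_FIELDS) - OPTIONAL_FIELDS)
    return {**LOCAL_DEFAULTS, **payload}, unlisted
```

Unlisted names are reported rather than rejected, so a field carried over from another API can be checked against the local schema instead of being silently dropped.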
6. Use consent uploads properly
For consent-based clone flows, upload voice material through /v1/audio/voice_consents.
Use ref_text with the recording. That is not optional in spirit, even if a workflow tries to pretend otherwise.
If later synthesis depends on stored consent voices, verify that the saved identity actually maps to both:
- reference audio
- reference text
7. OpenClaw-specific debugging pattern
When OpenClaw TTS appears broken:
1. Confirm messages.tts points at the actual configured endpoint in openclaw.json
2. Confirm the intended model exists in /v1/models or is otherwise accepted by the server; if not, check whether it is a local-path-backed deployment such as ./models/qwen3-tts-0.6b-mlx
3. Confirm the selected provider is really the OpenAI-compatible path and not Microsoft fallback
4. Test direct curl with the same effective model/voice/mode assumptions
5. Inspect whether OpenClaw is falling back to another provider
6. If using [[tts:...]], verify whether single-reply override keys (model, voice, maybe provider) are enabled and are being honored
7. If needed, compare raw request shape with a dump proxy
If OpenClaw reaches the server successfully, the next question is usually which mode it actually requested.
8. Preferred test ladder
Use this order:
1. GET /health
2. GET /v1/models
3. direct non-streaming TTS test
4. direct streaming TTS test
5. consent upload test if clone is involved
6. OpenAI client compatibility test if relevant
7. OpenClaw integration test
8. dump-proxy / log inspection only if still ambiguous
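The ladder amounts to running checks in order and stopping at the first failure. A minimal sketch, assuming each rung is wrapped in a callable that returns True on success; the runner and the step names passed to it are hypothetical, and conditional rungs (consent upload, OpenAI client) are simply left out of the list when they do not apply.

```python
from typing import Callable, Optional


def run_ladder(steps: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in order; return the name of the first failing step, or None."""
    for name, check in steps:
        if not check():
            return name
    return None


if __name__ == "__main__":
    # Placeholder checks; real ones would call the server and OpenClaw.
    demo_steps = [
        ("GET /health", lambda: True),
        ("GET /v1/models", lambda: True),
        ("direct non-streaming TTS test", lambda: False),
        ("direct streaming TTS test", lambda: True),
    ]
    print(run_ladder(demo_steps))
```

Stopping at the first failing rung keeps attention on the lowest broken layer, which is the point of the ordering.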
9. Common conclusions
Server good, integration bad
Typical signs:
- manual curl returns playable audio
- OpenClaw output sounds like fallback voice or wrong mode
- provider selection is inconsistent
Conclusion: fix integration, not inference.
Text normalization bug
Typical signs:
- synthesis succeeds technically
- numbers are read awkwardly, skipped, or glitched
Conclusion: normalize the spoken text first. Do not blame the transport layer for a prompt-content problem.
Mode mismatch
Typical signs:
- clone request sent to CustomVoice
- VoiceDesign called without instructions
- only ref_audio present for Base clone
Conclusion: wrong request semantics for the chosen model type.
10. Use the reference doc when exact fields matter
Read references/tts-api.md when you need exact behavior for:
- /v1/audio/speech
- INLINECODE83
- streaming vs non-streaming
- INLINECODE84 vs INLINECODE85
- mode selection and response headers
- consent storage semantics
- exact model/request mismatch errors
Do not assume generic OpenAI TTS docs fully match this local server.
Resources
references/
- references/tts-api.md: exact local API behavior, streaming semantics, mode rules, consent upload flow, and common error conditions