Local TTS Workflow

Use this skill to debug the actual speech pipeline and to prepare text so the model reads it sanely.

Do not hardcode 127.0.0.1 blindly. Read the active OpenClaw config first and use the current messages.tts.openai.baseUrl as the source of truth.

Current known deployment in this workspace: http://127.0.0.1:8000/v1.

Current local model-path fallback worth remembering: if the server did not pull a model by registry name, it may be loading directly from a local path such as ./models/qwen3-tts-0.6b-mlx.

When exact route shape matters, the local OpenAPI document is available at:

- http://localhost:8000/openapi.json

Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.

Core rule: normalize numbers before synthesis

If text is meant to be spoken aloud, do not leave Arabic numerals in the final TTS input.

Convert them into words first.

Examples:

- Chinese output: write 一二三, not 123
- English output: write one two three, not 123

This rule matters because the TTS model may mispronounce, skip, or garble raw numerals.

When preparing spoken text, normalize:

- dates
- times
- counts
- version-like strings if they will be read aloud
- mixed Chinese/English numeric snippets

If preserving exact machine-readable formatting matters, keep one copy for display and a separate normalized copy for TTS.
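A minimal sketch of the rule above, assuming plain digit-by-digit reading as in the 一二三 / one two three examples. The function names `spell_digits` and `display_and_tts` are hypothetical. Dates, times, and decimal points need their own rules and are not handled here.

```python
import re

EN_WORDS = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]
ZH_WORDS = "零一二三四五六七八九"


def spell_digits(text: str, lang: str = "en") -> str:
    """Replace each ASCII digit with its spoken form, one digit at a time."""
    if lang == "zh":
        # Chinese digit characters need no separators between them.
        return re.sub(r"[0-9]", lambda m: ZH_WORDS[int(m.group())], text)

    def to_words(match: re.Match) -> str:
        words = " ".join(EN_WORDS[int(d)] for d in match.group())
        return f" {words} "  # pad so "v2" becomes "v two", not "vtwo"

    spaced = re.sub(r"[0-9]+", to_words, text)
    # Collapse the padding that was added around each digit run.
    return re.sub(r" {2,}", " ", spaced).strip()


def display_and_tts(text: str, lang: str = "en") -> dict:
    """Keep the exact original for display and a normalized copy for TTS."""
    return {"display": text, "tts": spell_digits(text, lang)}
```

Only the `tts` copy should be sent to the speech endpoint; the `display` copy keeps the machine-readable formatting intact.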

Workflow

1. Verify the server before touching OpenClaw

Read ~/.openclaw/openclaw.json first and extract:

- messages.tts.provider
- messages.tts.openai.baseUrl
- messages.tts.openai.model
- messages.tts.openai.voice
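The extraction step could look like the sketch below. It assumes the file is strict JSON and that the settings nest exactly as the dotted path messages.tts.openai.baseUrl suggests; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_config(path=None) -> dict:
    """Read the OpenClaw config; the default path follows the text above."""
    if path is None:
        path = Path.home() / ".openclaw" / "openclaw.json"
    return json.loads(Path(path).read_text(encoding="utf-8"))


def read_tts_settings(config: dict) -> dict:
    """Pull the four TTS fields; missing keys come back as None."""
    tts = config.get("messages", {}).get("tts", {})
    openai_cfg = tts.get("openai", {})
    return {
        "provider": tts.get("provider"),
        "baseUrl": openai_cfg.get("baseUrl"),
        "model": openai_cfg.get("model"),
        "voice": openai_cfg.get("voice"),
    }
```

A `None` in the result is itself a finding: it means the config does not define that field, so any later test should not silently substitute a hardcoded value.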

Check the basics against the actual configured host:

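```bash
# Liveness probe, then the list of models the server reports.
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```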

Confirm that the intended TTS model exists.

If the model does not appear by pulled registry name, do not assume TTS is broken — this server may be loading a local-path model such as ./models/qwen3-tts-0.6b-mlx.
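The model check can be sketched as below. It assumes the configured baseUrl ends in /v1, that /health lives at the server root, and that /v1/models returns an OpenAI-style body of the form {"data": [{"id": ...}]}. Verify each of these against the local OpenAPI document; the function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def probe_urls(base_url: str) -> dict:
    """Derive the health and model-list URLs from the configured baseUrl."""
    parts = urlsplit(base_url)
    root = urlunsplit((parts.scheme, parts.netloc, "", "", ""))
    return {
        "health": root + "/health",
        "models": base_url.rstrip("/") + "/models",
    }


def listed_model_ids(models_response: dict) -> list:
    """Collect model ids from an OpenAI-style list response."""
    return [m["id"] for m in models_response.get("data", []) if "id" in m]


def model_status(wanted: str, models_response: dict) -> str:
    """Report whether the model is listed, without declaring TTS broken."""
    if wanted in listed_model_ids(models_response):
        return "listed"
    return "not listed (may be a local-path model)"
```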

If the server is task-gated, ensure TTS is enabled:

```bash
# Start the server with the TTS task enabled.
MLXAUDIOSERVER_TASKS=tts uv run python server.py
```

2. Prove the raw TTS endpoint works

Always isolate the server from the client stack.

Minimal non-streaming test:

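```bash
# One-shot request: the complete audio comes back in a single response.
# The input sentence is a Chinese test phrase for the lj voice.
curl http://127.0.0.1:8000/v1/audio/speech \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/lj-qwen3-tts/",
    "voice": "lj",
    "input": "你好，这是一次性返回完整音频的测试。",
    "response_format": "wav",
    "stream": false
  }' \
  --output sample.wav
```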

Basic streaming test:

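```bash
# Streaming request: audio is piped to ffplay as it is generated.
curl http://127.0.0.1:8000/v1/audio/speech \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/lj-qwen3-tts/",
    "voice": "lj",
    "input": "你好，这是实时流式语音合成测试。",
    "response_format": "wav",
    "stream": true,
    "streaming_interval": 2.0
  }' \
  | ffplay -i -
```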

If direct curl works but OpenClaw does not, the bug is probably in the TTS integration or provider selection layer, not the TTS backend.
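When curl is not convenient, the same direct test can be sketched with the Python standard library. This assumes the baseUrl ends in /v1 and that the request fields match the ones named in this document. `build_speech_request` and `save_speech` are hypothetical helper names, not part of any client library.

```python
import json
import urllib.request


def build_speech_request(base_url: str, model: str, voice: str,
                         text: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST to <baseUrl>/audio/speech without sending it."""
    payload = {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": "wav",
        "stream": stream,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(req: urllib.request.Request, out_path: str, timeout: float = 120) -> None:
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(out_path, "wb") as fh:
        while chunk := resp.read(8192):
            fh.write(chunk)
```

Building the request separately from sending it makes it easy to print the exact payload and compare it with what OpenClaw sends.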

3. Distinguish server failure from integration failure

Use this rule:

- Direct curl fails → fix the local TTS server first
- Direct curl works, but OpenClaw sounds wrong or falls back → inspect OpenClaw provider selection, fallback, and request shape
- OpenClaw sends requests but voice/mode is wrong → inspect fields like model, voice, instructions, ref_audio, ref_text, and streaming flags
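The rule above can be written down as a small decision function. This is only an illustrative sketch; the three boolean inputs are hypothetical observations gathered by hand.

```python
def triage(direct_curl_ok: bool, openclaw_sounds_right: bool,
           openclaw_request_seen: bool) -> str:
    """Map three observations to the next place to look."""
    if not direct_curl_ok:
        return "fix the local TTS server first"
    if openclaw_sounds_right:
        return "no fault isolated"
    if openclaw_request_seen:
        # Requests arrive at the server, so the payload itself is suspect.
        return "inspect model, voice, instructions, ref_audio, ref_text, and streaming flags"
    return "inspect OpenClaw provider selection, fallback, and request shape"
```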

4. Know the four TTS modes

Use the right request shape for the right model type.

Base speaker

Use built-in speaker playback.

Typical shape:

- model type: base
- no full ref_audio + ref_text
- voice.id means built-in speaker name

Base clone

Use clone-style synthesis.

Typical shape:

- model type: base
- must provide both ref_audio and ref_text, or supply a consent voice identity that resolves to both

Hard rule: do not attempt clone with only ref_audio.

CustomVoice

Use a model with prebuilt custom speakers.

Typical shape:

- model type: custom_voice
- voice may be accepted either as a plain string or as {"id":"..."} depending on the server
- for this workspace, lj-qwen3-tts / /models/lj-qwen3-tts/ must use speaker/voice lj
- do not send clone payloads

VoiceDesign

Use style-description-driven synthesis.

Typical shape:

- model type: voice_design
- must provide instructions
- do not send voice, ref_audio, or ref_text
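The four request shapes can be checked before sending with a sketch like the one below. The mode labels (`base_speaker`, `base_clone`, `custom_voice`, `voice_design`) are hypothetical names for this example only, and the field name `instructions` follows the wording of this section; confirm the actual field names against the local OpenAPI schema. Consent voice identities that resolve to a reference pair on the server side are not modelled here.

```python
def check_mode(mode: str, payload: dict) -> list:
    """Return a list of problems; an empty list means the shape looks right."""
    has_audio = bool(payload.get("ref_audio"))
    has_text = bool(payload.get("ref_text"))
    problems = []

    if mode == "base_speaker":
        if has_audio and has_text:
            problems.append("full ref_audio + ref_text present: this is a clone request")
    elif mode == "base_clone":
        if not (has_audio and has_text):
            problems.append("clone needs both ref_audio and ref_text")
    elif mode == "custom_voice":
        if has_audio or has_text:
            problems.append("do not send clone payloads to a CustomVoice model")
    elif mode == "voice_design":
        if not payload.get("instructions"):
            problems.append("VoiceDesign must provide instructions")
        for key in ("voice", "ref_audio", "ref_text"):
            if payload.get(key):
                problems.append(f"VoiceDesign must not send {key}")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return problems
```

For example, `check_mode("base_clone", {"ref_audio": "sample.wav"})` reports the missing ref_text, which is the hard rule stated for Base clone.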

5. Treat streaming as a real transport choice

This server supports real incremental generation, not fake post-hoc slicing.

Important behavior:

- Current OpenAPI says stream defaults to false
- response_format defaults to mp3
- streaming_interval defaults to 2.0
- Required fields are only model and input
- Extra optional fields exposed by this local server include instruct, voice, speed, gender, pitch, lang_code, ref_audio, ref_text, temperature, top_p, top_k, repetition_penalty, response_format, stream, streaming_interval, max_tokens, and verbose

Do not assume OpenAI parity on names or defaults — check the local OpenAPI schema first.
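One way to check names and defaults is to read them out of the downloaded OpenAPI document rather than trusting memory. The sketch below assumes the standard OpenAPI 3 layout, where request models live under components.schemas. The schema name passed in is something to look up in the document itself; `SpeechRequest` in the usage note is a hypothetical placeholder.

```python
def schema_summary(openapi: dict, schema_name: str) -> dict:
    """List required fields and declared defaults for one request schema."""
    schema = openapi["components"]["schemas"][schema_name]
    props = schema.get("properties", {})
    return {
        "required": sorted(schema.get("required", [])),
        "defaults": {
            name: spec["default"]
            for name, spec in props.items()
            if isinstance(spec, dict) and "default" in spec
        },
    }
```

Calling `schema_summary(doc, "SpeechRequest")` on the parsed openapi.json would then show directly whether stream, response_format, and streaming_interval still carry the defaults listed above.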

6. Use consent uploads properly

For consent-based clone flows, upload voice material through /v1/audio/voice_consents.

Always submit ref_text together with the recording. Treat it as required, even when a workflow appears to allow omitting it.

If later synthesis depends on stored consent voices, verify that the saved identity actually maps to both:

- reference audio
- reference text

7. OpenClaw-specific debugging pattern

When OpenClaw TTS appears broken:

1. Confirm messages.tts points at the actual configured endpoint in openclaw.json
2. Confirm the intended model exists in /v1/models or is otherwise accepted by the server; if not, check whether it is a local-path-backed deployment such as ./models/qwen3-tts-0.6b-mlx
3. Confirm the selected provider is really the OpenAI-compatible path and not Microsoft fallback
4. Test direct curl with the same effective model/voice/mode assumptions
5. Inspect whether OpenClaw is falling back to another provider
6. If using [[tts:...]], verify whether single-reply override keys (model, voice, maybe provider) are enabled and are being honored
7. If needed, compare raw request shape with a dump proxy

If OpenClaw reaches the server successfully, the next question is usually which mode it actually requested.
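To answer that question, it helps to diff the payload that worked in the direct test against the payload captured from OpenClaw (for example via a dump proxy). A minimal sketch, assuming both payloads have already been parsed into dictionaries; `diff_payloads` is a hypothetical helper.

```python
ABSENT = "<absent>"


def diff_payloads(expected: dict, observed: dict) -> dict:
    """Return {field: (expected_value, observed_value)} for every mismatch."""
    differences = {}
    for key in sorted(set(expected) | set(observed)):
        want = expected.get(key, ABSENT)
        got = observed.get(key, ABSENT)
        if want != got:
            differences[key] = (want, got)
    return differences
```

A field that shows up as `<absent>` on one side is often the quickest clue: a missing voice or an unexpected ref_audio usually points to a mode mismatch rather than a server fault.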

8. Preferred test ladder

Use this order:

1. GET /health
2. GET /v1/models
3. direct non-streaming TTS test
4. direct streaming TTS test
5. consent upload test if clone is involved
6. OpenAI client compatibility test if relevant
7. OpenClaw integration test
8. dump-proxy / log inspection only if still ambiguous
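The ladder is useful mainly because it stops at the first failing rung. A small sketch of that behavior, where each rung is a hypothetical zero-argument check supplied by the caller and returning True on success:

```python
from typing import Callable, List, Optional, Tuple


def first_failure(checks: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails.

    Later checks are not executed once a failure is found, so a broken
    server is never mistaken for an integration problem.
    """
    for name, check in checks:
        if not check():
            return name
    return None
```

Passing the rungs in the order listed above keeps the cheap server probes ahead of the OpenClaw integration test.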

9. Common conclusions

Server good, integration bad

Typical signs:

- manual curl returns playable audio
- OpenClaw output sounds like fallback voice or wrong mode
- provider selection is inconsistent

Conclusion: fix integration, not inference.

Text normalization bug

Typical signs:

- synthesis succeeds technically
- numbers are read awkwardly, skipped, or glitched

Conclusion: normalize the spoken text first. Do not blame the transport layer for a prompt-content problem.

Mode mismatch

Typical signs:

- clone request sent to CustomVoice
- VoiceDesign called without instructions
- only ref_audio present for Base clone

Conclusion: wrong request semantics for the chosen model type.

10. Use the reference doc when exact fields matter

Read references/tts-api.md when you need exact behavior for:

- /v1/audio/speech
- INLINECODE83
- streaming vs non-streaming
- INLINECODE84 vs INLINECODE85
- mode selection and response headers
- consent storage semantics
- exact model/request mismatch errors

Do not assume generic OpenAI TTS docs fully match this local server.

Resources

references/

- references/tts-api.md — exact local API behavior, streaming semantics, mode rules, consent upload flow, and common error conditions
