Voice Clone Skill
A self-initializing, zero-configuration voice cloning skill. It manages a background TTS daemon that keeps heavy model weights in memory for fast inference. Supports multiple engines and unlimited text length.
Quick reference
| Item | Value |
|---|---|
| Entry script | `bash scripts/run_tts.sh --text ... --ref_audio ... [--speed 1.0] [--output_dir ...]` |
| Output | Single line: absolute path to the generated `.ogg` file |
| Attachment format | `MEDIA:<output_path>` |
| Default engine | F5-TTS (env `TTS_BACKEND=f5`) |
| Host/Port config | `.env` (`TTS_SERVER_HOST`, `TTS_SERVER_PORT`) |
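As a rough illustration of the host/port setting, the sketch below reads `TTS_SERVER_HOST` and `TTS_SERVER_PORT` from `.env`-style text. The helper names `parse_env` and `server_address` and the fallback values `127.0.0.1` and `8000` are assumptions made for this example; the skill's real defaults and loading logic are not documented here.

```python
from typing import Dict, Tuple


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def server_address(env: Dict[str, str],
                   default_host: str = "127.0.0.1",  # assumed fallback
                   default_port: int = 8000) -> Tuple[str, int]:  # assumed fallback
    """Return (host, port) from the parsed settings."""
    host = env.get("TTS_SERVER_HOST", default_host)
    port = int(env.get("TTS_SERVER_PORT", default_port))
    return host, port


if __name__ == "__main__":
    sample = "# daemon address\nTTS_SERVER_HOST=127.0.0.1\nTTS_SERVER_PORT=9000\n"
    print(server_address(parse_env(sample)))
```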
When to use this skill
- The user sends a voice memo or audio file and you need to reply with audio.
- The user says "read this aloud", "speak to me", "use my voice", "voice mode".
- The conversation context implies a spoken reply is expected.
- The user provides a reference audio and asks you to mimic their voice.
Step-by-step usage
1. Identify inputs
You need two things:
- `ref_audio`: The absolute local path to the user's reference audio file (the voice to clone). This is typically the audio file the user just sent, saved by the ASR system (e.g., openai-whisper).
- `text`: The text content you want to speak. Generate it as you normally would: compose your reply, then voice it.
2. Run the synthesis
Execute this command:
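```bash
bash scripts/run_tts.sh --text "Your reply text." --ref_audio /absolute/path/to/reference_audio.ogg
```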
Optional parameters:
- `--speed 1.2`: Speak faster. Range: 0.5 to 2.0. Default: 1.0.
- `--output_dir /tmp/`: Save the generated audio file to a specific absolute folder path. Default: `server/generated_audio/`.
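Example with an optional parameter:
```bash
bash scripts/run_tts.sh \
  --text "Nice to meet you, this is my cloned voice." \
  --ref_audio /tmp/user_voice_msg.ogg \
  --speed 0.9
```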
3. Handle the output
The script prints a single absolute path on stdout (e.g., /path/to/reply_a1b2c3d4.ogg).
Append it to your response using the attachment format:
```
MEDIA:/path/to/reply_a1b2c3d4.ogg
```
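Steps 2 and 3 can be combined in a small wrapper. The following is a minimal sketch, assuming the flags are spelled `--ref_audio` and `--output_dir` and that the script prints exactly one absolute path on stdout; the function names `build_command`, `attachment_line`, and `synthesize` are hypothetical and not part of the skill.

```python
import subprocess
from typing import List, Optional


def build_command(text: str, ref_audio: str,
                  speed: Optional[float] = None,
                  output_dir: Optional[str] = None) -> List[str]:
    """Assemble the run_tts.sh invocation as an argument list (no shell quoting needed)."""
    cmd = ["bash", "scripts/run_tts.sh", "--text", text, "--ref_audio", ref_audio]
    if speed is not None:
        if not 0.5 <= speed <= 2.0:
            raise ValueError("speed must be between 0.5 and 2.0")
        cmd += ["--speed", str(speed)]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


def attachment_line(stdout: str) -> str:
    """Turn the script's stdout into a MEDIA: attachment line."""
    path = stdout.strip()
    if "\n" in path or not path.startswith("/"):
        raise ValueError(f"expected a single absolute path, got: {stdout!r}")
    return f"MEDIA:{path}"


def synthesize(text: str, ref_audio: str, **options) -> str:
    """Run the skill and return the attachment line to append to the reply."""
    result = subprocess.run(build_command(text, ref_audio, **options),
                            capture_output=True, text=True, check=True)
    return attachment_line(result.stdout)
```

Passing the arguments as a list avoids shell-quoting problems when the reply text contains quotes or other special characters.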
4. Important constraints
- Do NOT manually start `python app.py` or manage the backend. The `run_tts.sh` script auto-detects, auto-installs, and auto-starts everything.
- First run is slow (~30-60 seconds) because it downloads model weights and loads them into memory. Subsequent calls are fast.
- Long texts work automatically. The engine splits text into sentences, synthesizes each chunk, and stitches them seamlessly. No length limit.
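The long-text handling described above can be pictured roughly as follows. This is an illustrative sketch only, not the code in `core_tts.py`: the punctuation set, the `max_chars` budget of 200, and the function names are assumptions, and the naive splitter also breaks on decimal points and abbreviations.

```python
import re
from typing import List

# One "sentence" = a run of non-terminator characters plus any trailing terminators.
_SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*")


def split_sentences(text: str) -> List[str]:
    """Split text at Latin and CJK sentence terminators."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("Hello there. How are you? Fine!", max_chars=20):
        print(chunk)
```

Each chunk would then be synthesized separately and the resulting audio segments concatenated in order.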
Controlling voice characteristics
Speed (all engines)
The `--speed` parameter adjusts speaking rate:
| Value | Effect |
|---|---|
| INLINECODE16 | Slow, deliberate, suitable for elderly listeners |
| `1.0` | Natural conversational speed (default) |
| `1.3` | Brisk, suitable for news or briefings |
| `1.5+` | Fast, compressed delivery |
F5-TTS supports speed natively. Other engines use ffmpeg post-processing (atempo filter), which gives good results but may slightly affect quality at extreme values.
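For the engines that rely on post-processing, the tempo change could be applied with a command like the one built below. This is a sketch under assumptions: the function name `atempo_command` and the file names are hypothetical, and the skill's real ffmpeg invocation may differ. The `atempo` filter changes tempo without changing pitch, and the documented 0.5 to 2.0 range fits in a single filter instance.

```python
from typing import List


def atempo_command(src: str, dst: str, speed: float) -> List[str]:
    """Build an ffmpeg command that re-times src by the given speed factor."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-i", src,               # input audio
        "-filter:a", f"atempo={speed}",
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(atempo_command("raw.ogg", "reply.ogg", 1.3)))
```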
Emotion and tone
These models use acoustic feature extraction from the reference audio — they do not accept text-based emotion tags like [happy] or [sad].
The emotion of the output is determined entirely by the reference audio.
To control emotion, select or prepare reference audio that carries the desired tone:
| Desired tone | Reference audio strategy |
|---|---|
| Calm, neutral | Use a reference clip where the speaker talks normally |
| Excited, happy | Use a reference clip where the speaker sounds enthusiastic |
| Angry, intense | Use a reference clip with raised voice and sharp intonation |
| Sad, melancholic | Use a reference clip with slow, downcast delivery |
| Whispering | Use a reference clip where the speaker whispers |
Practical approach for Agents: If the user has sent multiple voice messages, choose the one whose emotional tone best matches the context of your reply. If only one reference is available, use it as-is — the model will approximate the speaker's general style.
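One way an agent might apply this selection rule is sketched below. It assumes the agent has already attached a tone label to each voice message it received; the `ReferenceClip` type, the `tone` labels, and `pick_reference` are hypothetical and not provided by the skill.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReferenceClip:
    path: str   # absolute path to a voice message saved on disk
    tone: str   # label assigned by the agent, e.g. "calm" or "excited" (assumed)


def pick_reference(clips: List[ReferenceClip], desired_tone: str) -> ReferenceClip:
    """Prefer the most recent clip with the desired tone, else the most recent clip."""
    if not clips:
        raise ValueError("at least one reference clip is required")
    for clip in reversed(clips):      # clips are ordered oldest to newest
        if clip.tone == desired_tone:
            return clip
    return clips[-1]                  # no match: use the latest clip as-is


if __name__ == "__main__":
    history = [ReferenceClip("/tmp/msg1.ogg", "excited"),
               ReferenceClip("/tmp/msg2.ogg", "calm")]
    print(pick_reference(history, "excited").path)
```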
ChatTTS specifics: Unlike the other engines, ChatTTS supports inline control tags in the text: `[laugh]` and `[uv_break]` (pause). It also supports voice cloning when a reference audio is provided.
Available engines
| Engine | ID | Install | Size | Clone | Speed support | Best for |
|---|---|---|---|---|---|---|
| F5-TTS | `f5` | `bash scripts/auto_installer.sh` | ~1.5GB | ✅ | Native | Highest quality cloning |
| CosyVoice | `cosyvoice` | `bash scripts/install_cosyvoice.sh` | ~1.5GB | ✅ | ffmpeg | Natural Chinese prosody |
| ChatTTS | `chattts` | `bash scripts/install_chattts.sh` | ~400MB | ✅ | ffmpeg | Dialogue with emotion tags |
| OpenVoice | `openvoice` | `bash scripts/install_openvoice.sh` | ~300MB | ✅ | ffmpeg | Ultra fast, tiny footprint |
Switch engines by setting the environment variable before the server starts:
```bash
export TTS_BACKEND=cosyvoice
```
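On the server side, the variable would typically be read once at startup, which is why it must be set before the daemon launches. The sketch below shows one plausible way to resolve it against the engine IDs listed in the table; `resolve_backend` is a hypothetical helper, and the case-insensitive matching is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional

ENGINE_IDS = ("f5", "cosyvoice", "chattts", "openvoice")


def resolve_backend(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the engine ID from TTS_BACKEND, defaulting to F5-TTS."""
    environ = os.environ if environ is None else environ
    backend = environ.get("TTS_BACKEND", "f5").strip().lower()
    if backend not in ENGINE_IDS:
        raise ValueError(f"unknown TTS_BACKEND {backend!r}; expected one of {ENGINE_IDS}")
    return backend


if __name__ == "__main__":
    print(resolve_backend())
```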
Uninstalling
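```bash
# Remove everything (virtual environment, daemon, registration)
bash scripts/uninstall.sh

# Remove only one engine's source code
bash scripts/uninstall.sh --engine cosyvoice

# Remove everything, including downloaded model weights (several GB)
bash scripts/uninstall.sh --purge
```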
File structure
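```
scripts/
├── run_tts.sh            # Main entry point (auto-repair, auto-starts the daemon)
├── tts_client.py         # HTTP client that talks to the backend
├── auto_installer.sh     # Installs F5-TTS (default) + registers the skill
├── install_cosyvoice.sh  # Installs the CosyVoice engine
├── install_chattts.sh    # Installs the ChatTTS engine
├── install_openvoice.sh  # Installs the OpenVoice engine
└── uninstall.sh          # Cleanup script
server/
├── app.py                # FastAPI daemon (auto-managed, do not start manually)
├── core_tts.py           # Multi-engine factory + long-text chunking
└── requirements.txt      # Base dependencies
```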
References
- Read `references/architecture.md` for system architecture and design rationale.
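Voice Clone Skill
A self-initializing, zero-configuration voice cloning skill. It manages a background TTS daemon that keeps heavy model weights in memory for fast inference. Supports multiple engines and unlimited text length.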
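Quick reference
| Item | Value |
|---|---|
| Entry script | `bash scripts/run_tts.sh --text ... --ref_audio ... [--speed 1.0] [--output_dir ...]` |
| Output | Single line: absolute path to the generated `.ogg` file |
| Attachment format | `MEDIA:<output_path>` |
| Default engine | F5-TTS (env `TTS_BACKEND=f5`) |
| Host/Port config | `.env` (`TTS_SERVER_HOST`, `TTS_SERVER_PORT`) |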
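When to use this skill
- The user sends a voice memo or audio file and you need to reply with audio.
- The user says "read this aloud", "speak to me", "use my voice", "voice mode".
- The conversation context implies a spoken reply is expected.
- The user provides a reference audio and asks you to mimic their voice.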
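Step-by-step usage
1. Identify inputs
You need two things:
- `ref_audio`: The absolute local path to the user's reference audio file (the voice to clone). This is typically the audio file the user just sent, saved by the ASR system (e.g., openai-whisper).
- `text`: The text content you want to speak. Generate it as you normally would: compose your reply, then voice it.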
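2. Run the synthesis
Execute this command:
```bash
bash scripts/run_tts.sh --text "Your reply text." --ref_audio /absolute/path/to/reference_audio.ogg
```
Optional parameters:
- `--speed 1.2`: Speak faster. Range: 0.5 to 2.0. Default: 1.0.
- `--output_dir /tmp/`: Save the generated audio file to a specific absolute folder path. Default: `server/generated_audio/`.
Example with an optional parameter:
```bash
bash scripts/run_tts.sh \
  --text "Nice to meet you, this is my cloned voice." \
  --ref_audio /tmp/user_voice_msg.ogg \
  --speed 0.9
```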
3. Handle the output
The script prints a single absolute path on stdout (e.g., `/path/to/reply_a1b2c3d4.ogg`).
Append it to your response using the attachment format:
```
MEDIA:/path/to/reply_a1b2c3d4.ogg
```
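4. Important constraints
- Do NOT manually start `python app.py` or manage the backend. The `run_tts.sh` script auto-detects, auto-installs, and auto-starts everything.
- First run is slow (~30-60 seconds) because it downloads model weights and loads them into memory. Subsequent calls are fast.
- Long texts work automatically. The engine splits text into sentences, synthesizes each chunk, and stitches them seamlessly. No length limit.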
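Controlling voice characteristics
Speed (all engines)
The `--speed` parameter adjusts speaking rate:
| Value | Effect |
|---|---|
| `1.0` | Natural conversational speed (default) |
| `1.3` | Brisk, suitable for news or briefings |
| `1.5+` | Fast, compressed delivery |

F5-TTS supports speed natively. Other engines use ffmpeg post-processing (atempo filter), which gives good results but may slightly affect quality at extreme values.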
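Emotion and tone
These models use acoustic feature extraction from the reference audio; they do not accept text-based emotion tags like `[happy]` or `[sad]`.
The emotion of the output is determined entirely by the reference audio.
To control emotion, select or prepare reference audio that carries the desired tone:
| Desired tone | Reference audio strategy |
|---|---|
| Calm, neutral | Use a reference clip where the speaker talks normally |
| Excited, happy | Use a reference clip where the speaker sounds enthusiastic |
| Angry, intense | Use a reference clip with raised voice and sharp intonation |
| Sad, melancholic | Use a reference clip with slow, downcast delivery |
| Whispering | Use a reference clip where the speaker whispers |

Practical approach for agents: If the user has sent multiple voice messages, choose the one whose emotional tone best matches the context of your reply. If only one reference is available, use it as-is; the model will approximate the speaker's general style.
ChatTTS specifics: Unlike the other engines, ChatTTS supports inline control tags in the text: `[laugh]` and `[uv_break]` (pause). It also supports voice cloning when a reference audio is provided.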
Available engines
| Engine | ID | Install | Size | Clone | Speed support | Best for |
|---|---|---|---|---|---|---|
| F5-TTS | `f5` | `bash scripts/auto_installer.sh` | ~1.5GB | ✅ | Native | Highest quality cloning |
| CosyVoice | `cosyvoice` | `bash scripts/install_cosyvoice.sh` | ~1.5GB | ✅ | ffmpeg | Natural Chinese prosody |
| ChatTTS | `chattts` | `bash scripts/install_chattts.sh` | ~400MB | ✅ | ffmpeg | Dialogue with emotion tags |
| OpenVoice | `openvoice` | `bash scripts/install_openvoice.sh` | ~300MB | ✅ | ffmpeg | Ultra fast, tiny footprint |

Switch engines by setting the environment variable before the server starts:
```bash
export TTS_BACKEND=cosyvoice
```
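Uninstalling
```bash
# Remove everything (virtual environment, daemon, registration)
bash scripts/uninstall.sh

# Remove only one engine's source code
bash scripts/uninstall.sh --engine cosyvoice

# Remove everything, including downloaded model weights (several GB)
bash scripts/uninstall.sh --purge
```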
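File structure
```
scripts/
├── run_tts.sh            # Main entry point (auto-repair, auto-starts the daemon)
├── tts_client.py         # HTTP client that talks to the backend
├── auto_installer.sh     # Installs F5-TTS (default) + registers the skill
├── install_cosyvoice.sh  # Installs the CosyVoice engine
├── install_chattts.sh    # Installs the ChatTTS engine
├── install_openvoice.sh  # Installs the OpenVoice engine
└── uninstall.sh          # Cleanup script
server/
├── app.py                # FastAPI daemon (auto-managed, do not start manually)
├── core_tts.py           # Multi-engine factory + long-text chunking
└── requirements.txt      # Base dependencies
```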
References
- Read `references/architecture.md` for system architecture and design rationale.