Voice Clone Skill

A self-initializing, zero-configuration voice cloning skill. It manages a background TTS daemon that keeps heavy model weights in memory for fast inference. Supports multiple engines and unlimited text length.

Quick reference

Item	Value
Entry script	bash scripts/run_tts.sh --text ... --ref_audio ... [--speed 1.0] [--output_dir ...]
Output

When to use this skill

- The user sends a voice memo or audio file and you need to reply with audio.
- The user says "read this aloud", "speak to me", "use my voice", or "voice mode".
- The conversation context implies a spoken reply is expected.
- The user provides reference audio and asks you to mimic their voice.

Step-by-step usage

1. Identify inputs

You need two things:

- ref_audio: The absolute local path to the user's reference audio file (the voice to clone). This is typically the audio file the user just sent, as saved by the ASR system (e.g., openai-whisper).
- text: The text you want spoken. Compose your reply as you normally would, then voice it.

2. Run the synthesis

Execute this command:

    bash scripts/run_tts.sh --text "Your reply text." --ref_audio /absolute/path/to/reference_audio.ogg

Optional parameters:

- --speed 1.2: Sets the speaking rate; values above 1.0 speak faster. Range: 0.5 to 2.0. Default: 1.0.
- --output_dir /tmp/: Saves the generated audio file to a specific absolute folder path. Default: server/generated_audio/.

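Example with an optional parameter:

    bash scripts/run_tts.sh \
      --text "Nice to meet you, this is my cloned voice." \
      --ref_audio /tmp/user_voice_msg.ogg \
      --speed 0.9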

3. Handle the output

The script prints a single absolute path on stdout (e.g., /path/to/reply_a1b2c3d4.ogg).
Append it to your response using the attachment format:

    MEDIA:/path/to/reply_a1b2c3d4.ogg
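Steps 1 to 3 can also be sketched from the calling side in Python. This is a minimal sketch, not part of the skill: the helper names build_command, media_line, and synthesize are hypothetical, and it assumes the flags are spelled --text, --ref_audio, --speed, and --output_dir as in the quick reference, and that the command runs from the skill's root directory.

```python
import os
import subprocess


def build_command(text, ref_audio, speed=1.0, output_dir=None):
    """Build the run_tts.sh argument list (hypothetical helper)."""
    if not os.path.isabs(ref_audio):
        raise ValueError("ref_audio must be an absolute path")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    cmd = ["bash", "scripts/run_tts.sh", "--text", text,
           "--ref_audio", ref_audio, "--speed", str(speed)]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


def media_line(stdout):
    """Turn the script's stdout into the MEDIA: attachment line."""
    # The script is documented to print a single path; taking the last
    # non-empty line is a defensive assumption in case of stray output.
    lines = [ln.strip() for ln in stdout.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("the script printed no output path")
    return "MEDIA:" + lines[-1]


def synthesize(text, ref_audio, **options):
    """Run the script and return the attachment line (requires the skill)."""
    result = subprocess.run(build_command(text, ref_audio, **options),
                            capture_output=True, text=True, check=True)
    return media_line(result.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the reply text contains spaces or quotation marks.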

4. Important constraints

- Do NOT manually start python app.py or manage the backend. The run_tts.sh script auto-detects, auto-installs, and auto-starts everything.
- The first run is slow (roughly 30-60 seconds) because it downloads model weights and loads them into memory. Subsequent calls are fast.
- Long texts work automatically. The engine splits the text into sentences, synthesizes each chunk, and stitches the audio together seamlessly. There is no length limit.
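The sentence-level chunking mentioned above can be pictured with a short Python sketch. It is an illustration only, not the code in core_tts.py: the function names, the punctuation set, and the 200-character chunk limit are all assumptions.

```python
import re

# Split after CJK sentence punctuation (no whitespace needed), or after
# Latin sentence punctuation when it is followed by whitespace. Requiring
# whitespace after "." keeps decimals such as "1.5" in one piece.
_SENTENCE_END = re.compile(r"(?<=[。！？])|(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences of text, in order."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio segments joined in order; that part depends on the engine and is left out here.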

Controlling voice characteristics

Speed (all engines)

The --speed parameter adjusts speaking rate:

Value	Effect
0.7	Slow, deliberate, suitable for elderly listeners
1.0

F5-TTS supports speed natively. Other engines use ffmpeg post-processing (atempo filter), which gives good results but may slightly affect quality at extreme values.
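For the engines that rely on post-processing, the tempo change might look like the following Python sketch. It only builds an ffmpeg command line using the atempo audio filter; the function name and file paths are hypothetical, and the skill's actual invocation may differ.

```python
def atempo_command(src, dst, speed):
    """Build an ffmpeg command that changes tempo without changing pitch."""
    # 0.5 to 2.0 is the range documented for --speed.
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    # -y overwrites dst if it exists; -filter:a applies an audio filter.
    return ["ffmpeg", "-y", "-i", src, "-filter:a", f"atempo={speed}", dst]
```

For example, atempo_command("reply.ogg", "reply_fast.ogg", 1.2) yields a command that could be passed to subprocess.run on a machine with ffmpeg installed.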

Emotion and tone

These models use acoustic feature extraction from the reference audio — they do not accept text-based emotion tags like [happy] or [sad].

The emotion of the output is determined entirely by the reference audio.

To control emotion, select or prepare reference audio that carries the desired tone:

Desired tone	Reference audio strategy
Calm, neutral	Use a reference clip where the speaker talks normally
Excited, happy

Practical approach for agents: If the user has sent multiple voice messages, choose the one whose emotional tone best matches the context of your reply. If only one reference is available, use it as-is; the model will approximate the speaker's general style.

ChatTTS specifics: ChatTTS is the exception to the rule above. It supports inline tags in the text, such as [laugh] and [uv_break] (pause). It also supports voice cloning when reference audio is provided.

Available engines

Engine	ID	Install	Size	Clone	Speed support	Best for
F5-TTS	f5	bash scripts/auto_installer.sh	~1.5GB	✅	Native	Highest quality cloning
CosyVoice

Switch engines by setting the environment variable before the server starts:
    export TTS_BACKEND=cosyvoice
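When the skill is launched from Python rather than an interactive shell, the same variable can be set on the child process environment. A minimal sketch, assuming a hypothetical helper name backend_env; as with the shell form, it only has an effect if the daemon is not already running.

```python
import os


def backend_env(engine_id, base_env=None):
    """Return a copy of the environment with TTS_BACKEND set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TTS_BACKEND"] = engine_id
    return env
```

The result can be passed as the env argument of subprocess.run when invoking scripts/run_tts.sh, which leaves the parent process environment untouched.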

Uninstalling

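    # Remove everything (virtual environment, daemon, registration)
    bash scripts/uninstall.sh

    # Remove only one engine's source code
    bash scripts/uninstall.sh --engine cosyvoice

    # Remove everything, including downloaded model weights (several GB)
    bash scripts/uninstall.sh --purge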

File structure

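    scripts/
    ├── run_tts.sh            # Main entry point (auto-repair, auto-starts the daemon)
    ├── tts_client.py         # HTTP client that talks to the backend
    ├── auto_installer.sh     # Installs F5-TTS (default) and registers the skill
    ├── install_cosyvoice.sh  # Installs the CosyVoice engine
    ├── install_chattts.sh    # Installs the ChatTTS engine
    ├── install_openvoice.sh  # Installs the OpenVoice engine
    └── uninstall.sh          # Cleanup script
    server/
    ├── app.py                # FastAPI daemon (managed automatically; do not start manually)
    ├── core_tts.py           # Multi-engine factory and long-text chunking
    └── requirements.txt      # Base dependencies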

References

- Read references/architecture.md for system architecture and design rationale.

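Voice Clone Skill

A self-initializing, zero-configuration voice cloning skill. It manages a background TTS daemon that keeps heavy model weights in memory for fast inference. Supports multiple engines and unlimited text length.

Quick reference

Item	Value
Entry script	bash scripts/run_tts.sh --text ... --ref_audio ... [--speed 1.0] [--output_dir ...]
Output

When to use this skill

- The user sends a voice memo or audio file and you need to reply with audio.
- The user says "read this aloud", "speak to me", "use my voice", or "voice mode".
- The conversation context implies a spoken reply is expected.
- The user provides reference audio and asks you to mimic their voice.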

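Step-by-step usage

1. Identify inputs

You need two things:

- ref_audio: The absolute local path to the user's reference audio file (the voice to clone). This is typically the audio file the user just sent, as saved by the ASR system (e.g., openai-whisper).
- text: The text you want spoken. Compose your reply as you normally would, then voice it.

2. Run the synthesis

Execute this command:

    bash scripts/run_tts.sh --text "Your reply text." --ref_audio /absolute/path/to/reference_audio.ogg

Optional parameters:

- --speed 1.2: Sets the speaking rate; values above 1.0 speak faster. Range: 0.5 to 2.0. Default: 1.0.
- --output_dir /tmp/: Saves the generated audio file to a specific absolute folder path. Default: server/generated_audio/.

Example with an optional parameter:

    bash scripts/run_tts.sh \
      --text "Nice to meet you, this is my cloned voice." \
      --ref_audio /tmp/user_voice_msg.ogg \
      --speed 0.9

3. Handle the output

The script prints a single absolute path on stdout (e.g., /path/to/reply_a1b2c3d4.ogg).
Append it to your response using the attachment format:

    MEDIA:/path/to/reply_a1b2c3d4.ogg

4. Important constraints

- Do NOT manually start python app.py or manage the backend. The run_tts.sh script auto-detects, auto-installs, and auto-starts everything.
- The first run is slow (roughly 30-60 seconds) because it downloads model weights and loads them into memory. Subsequent calls are fast.
- Long texts work automatically. The engine splits the text into sentences, synthesizes each chunk, and stitches the audio together seamlessly. There is no length limit.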

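Controlling voice characteristics

Speed (all engines)

The --speed parameter adjusts speaking rate:

Value	Effect
0.7	Slow, deliberate, suitable for elderly listeners
1.0

F5-TTS supports speed natively. Other engines use ffmpeg post-processing (atempo filter), which gives good results but may slightly affect quality at extreme values.

Emotion and tone

These models use acoustic feature extraction from the reference audio; they do not accept text-based emotion tags like [happy] or [sad].

The emotion of the output is determined entirely by the reference audio.

To control emotion, select or prepare reference audio that carries the desired tone:

Desired tone	Reference audio strategy
Calm, neutral	Use a reference clip where the speaker talks normally
Excited, happy

Practical approach for agents: If the user has sent multiple voice messages, choose the one whose emotional tone best matches the context of your reply. If only one reference is available, use it as-is; the model will approximate the speaker's general style.

ChatTTS specifics: ChatTTS is the exception to the rule above. It supports inline tags in the text, such as [laugh] and [uv_break] (pause). It also supports voice cloning when reference audio is provided.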

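Available engines

Engine	ID	Install	Size	Clone	Speed support	Best for
F5-TTS	f5	bash scripts/auto_installer.sh	~1.5GB	✅	Native	Highest quality cloning
CosyVoice

Switch engines by setting the environment variable before the server starts:

    export TTS_BACKEND=cosyvoice

Uninstalling

    # Remove everything (virtual environment, daemon, registration)
    bash scripts/uninstall.sh

    # Remove only one engine's source code
    bash scripts/uninstall.sh --engine cosyvoice

    # Remove everything, including downloaded model weights (several GB)
    bash scripts/uninstall.sh --purge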

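File structure

    scripts/
    ├── run_tts.sh            # Main entry point (auto-repair, auto-starts the daemon)
    ├── tts_client.py         # HTTP client that talks to the backend
    ├── auto_installer.sh     # Installs F5-TTS (default) and registers the skill
    ├── install_cosyvoice.sh  # Installs the CosyVoice engine
    ├── install_chattts.sh    # Installs the ChatTTS engine
    ├── install_openvoice.sh  # Installs the OpenVoice engine
    └── uninstall.sh          # Cleanup script
    server/
    ├── app.py                # FastAPI daemon (managed automatically; do not start manually)
    ├── core_tts.py           # Multi-engine factory and long-text chunking
    └── requirements.txt      # Base dependencies

References

- Read references/architecture.md for system architecture and design rationale.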
