Qwen TTS
Local text-to-speech using the Qwen3-TTS-12Hz-1.7B-CustomVoice model from Hugging Face.
Quick Start
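Generate speech from text:
```bash
scripts/tts.py "Ciao, come va?" -l Italian -o output.wav
```
With a voice instruction (emotion/style):
```bash
scripts/tts.py "Sono felice!" -i "Parla con entusiasmo" -l Italian -o happy.wav
```
With a different speaker:
```bash
scripts/tts.py "Hello world" -s Ryan -l English -o hello.wav
```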
Installation
First-time setup (one-time):
```bash
cd skills/public/qwen-tts
bash scripts/setup.sh
```
This creates a local virtual environment and installs the qwen-tts package (~500MB).
Note: The first synthesis automatically downloads the ~1.7GB model from Hugging Face.
Usage
```bash
scripts/tts.py [options] "Text to speak"
```
Options
- `-o, --output PATH` - Output file path (default: qwen_output.wav)
- `-s, --speaker NAME` - Speaker voice (default: Vivian)
- `-l, --language LANG` - Language (default: Auto)
- `-i, --instruct TEXT` - Voice instruction (emotion, style, tone)
- `--list-speakers` - Show available speakers
- `--model NAME` - Model name (default: CustomVoice 1.7B)
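The options above map naturally onto an `argparse` parser. The following is a minimal sketch of what such a parser could look like, not the actual contents of `scripts/tts.py`. The function name `build_parser` is hypothetical, and using the full model ID as the `--model` default is an assumption based on the Model Details section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the real script)."""
    parser = argparse.ArgumentParser(prog="tts.py", description="Qwen TTS (sketch)")
    parser.add_argument("text", nargs="?", help="Text to speak")
    parser.add_argument("-o", "--output", default="qwen_output.wav", metavar="PATH")
    parser.add_argument("-s", "--speaker", default="Vivian", metavar="NAME")
    parser.add_argument("-l", "--language", default="Auto", metavar="LANG")
    parser.add_argument("-i", "--instruct", default=None, metavar="TEXT")
    parser.add_argument("--list-speakers", action="store_true")
    # Assumption: the default is the full Hugging Face model ID.
    parser.add_argument("--model", default="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", metavar="NAME")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["Ciao, come va?", "-l", "Italian", "-o", "output.wav"])
print(args.speaker, args.language, args.output)  # Vivian Italian output.wav
```

Flags that are not passed keep their documented defaults, which is why the speaker prints as Vivian here.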
Examples
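Basic Italian speech:
```bash
scripts/tts.py "Benvenuto nel futuro del text-to-speech" -l Italian -o welcome.wav
```
With emotion/instruction:
```bash
scripts/tts.py "Sono molto felice di vederti!" -i "Parla con entusiasmo e gioia" -l Italian -o happy.wav
```
Different speaker:
```bash
scripts/tts.py "Hello, nice to meet you" -s Ryan -l English -o ryan.wav
```
List available speakers:
```bash
scripts/tts.py --list-speakers
```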
Available Speakers
The CustomVoice model includes 9 premium voices:
| Speaker | Language | Description |
|---|---|---|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |
| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |
| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |
| Ryan | English | Dynamic male, rhythmic |
| Aiden | English | Sunny American male |
| Ono_Anna | Japanese | Playful female, light and nimble |
| Sohee | Korean | Warm female, rich emotion |
Recommendation: Use each speaker's native language for best quality, though all speakers support all 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian).
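The recommendation to match a speaker to its native language can be expressed as a small lookup. This is an illustrative sketch only: the names `NATIVE_LANGUAGE`, `native_speakers`, and `pick_speaker` are hypothetical helpers, and the data is copied from the speaker table rather than read from the model.

```python
# Speaker -> native language, copied from the speaker table.
NATIVE_LANGUAGE = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese",    # Beijing dialect
    "Eric": "Chinese",     # Sichuan dialect
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def native_speakers(language: str) -> list[str]:
    """Return the speakers whose native language matches, in table order."""
    return [name for name, lang in NATIVE_LANGUAGE.items() if lang == language]


def pick_speaker(language: str, default: str = "Vivian") -> str:
    """Prefer a native speaker; otherwise fall back to the documented default voice."""
    matches = native_speakers(language)
    return matches[0] if matches else default


print(pick_speaker("English"))  # Ryan
print(pick_speaker("Italian"))  # Vivian (no native Italian voice in the table)
```

For languages without a native voice, such as Italian, any speaker works; the sketch simply returns the default.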
Voice Instructions
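Use `-i, --instruct` to control emotion, tone, and style:
Italian examples:
- `"Parla con entusiasmo"`
- `"Tono serio e professionale"`
- `"Voce calma e rilassante"`
- `"Leggi come un narratore"`
English examples:
- `"Speak with excitement"`
- `"Very happy and energetic"`
- `"Calm and soothing voice"`
- `"Read like a narrator"`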
Integration with OpenClaw
The script outputs the audio file path to stdout (last line), making it compatible with OpenClaw's TTS workflow:
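```bash
# OpenClaw captures the output path
cd skills/public/qwen-tts
OUTPUT=$(scripts/tts.py "Ciao" -s Vivian -l Italian -o /tmp/audio.wav 2>/dev/null)
# OUTPUT = /tmp/audio.wav
```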
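A caller written in Python could capture the same path by running the script as a subprocess and taking the last non-empty line of stdout. The sketch below assumes only what this section states (the path is the last stdout line). `last_stdout_line` and `synthesize` are hypothetical helper names, not part of the skill or of OpenClaw.

```python
import subprocess


def last_stdout_line(stdout: str) -> str:
    """Return the last non-empty line of captured stdout (expected to be the audio path)."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output captured")
    return lines[-1]


def synthesize(text: str, output: str, speaker: str = "Vivian", language: str = "Auto") -> str:
    """Hypothetical wrapper: run scripts/tts.py and return the path it prints."""
    result = subprocess.run(
        ["scripts/tts.py", text, "-s", speaker, "-l", language, "-o", output],
        capture_output=True,  # stderr is captured separately, so logs do not mix with the path
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return last_stdout_line(result.stdout)


# The parsing step can be exercised without running the model:
print(last_stdout_line("Loading model...\n/tmp/audio.wav\n"))  # /tmp/audio.wav
```

Taking the last line rather than the whole output keeps the wrapper robust if the script ever prints progress messages to stdout before the path.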
Performance
- GPU (CUDA): ~1-3 seconds for short phrases
- CPU: ~10-30 seconds for short phrases
- Model size: ~1.7GB (auto-downloads on first run)
- Venv size: ~500MB (installed dependencies)
Troubleshooting
Setup fails:
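```bash
# Make sure Python 3.10-3.12 is available
python3.12 --version
# Re-run setup
cd skills/public/qwen-tts
rm -rf venv
bash scripts/setup.sh
```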
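Setup expects Python 3.10 through 3.12, so a quick interpreter check can rule out a version mismatch before re-running it. The snippet below is a small sketch of such a check; `is_supported` and the two constants are hypothetical names, and the range is taken from the troubleshooting notes in this document.

```python
import sys

# Supported range taken from the troubleshooting notes (3.10 through 3.12).
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported(version: tuple) -> bool:
    """True when (major, minor) falls inside the supported range."""
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


current = sys.version_info[:2]
status = "supported" if is_supported(current) else "outside the supported range"
print(f"Python {current[0]}.{current[1]} is {status}")
```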
Model download slow/fails:
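```bash
# Use a mirror (mainland China)
export HF_ENDPOINT=https://hf-mirror.com
scripts/tts.py "Test" -o test.wav
```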
Out of memory (GPU):
The model automatically falls back to CPU if GPU memory is insufficient.
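The fallback behaviour can be pictured as trying devices in order and moving on when one runs out of memory. This is a generic sketch of that pattern, not the script's actual mechanism, which this document does not describe. `run_with_fallback` is a hypothetical helper, and `MemoryError` stands in for whatever out-of-memory exception the underlying framework raises.

```python
from typing import Callable, Optional, Sequence


def run_with_fallback(
    synthesize: Callable[[str], str],
    devices: Sequence[str] = ("cuda", "cpu"),
) -> str:
    """Try each device in order; move to the next one when a device runs out of memory."""
    last_error: Optional[Exception] = None
    for device in devices:
        try:
            return synthesize(device)
        except MemoryError as err:  # stand-in for a framework-specific OOM error
            last_error = err
    raise RuntimeError("all devices failed") from last_error


def fake_synthesize(device: str) -> str:
    """Simulated synthesis call: the GPU path always runs out of memory."""
    if device == "cuda":
        raise MemoryError("simulated GPU out of memory")
    return f"synthesized on {device}"


print(run_with_fallback(fake_synthesize))  # synthesized on cpu
```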
Audio quality issues:
- Try a different speaker: `--list-speakers`
- Add an instruction: `-i "Speak clearly and slowly"`
- Check that the language matches the text: `-l Italian` for Italian text
Model Details
- Model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
- Source: Hugging Face (https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice)
- License: Check the model card for current license terms
- Sample Rate: 16kHz
- Output Format: WAV (uncompressed)
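To confirm the sample rate and duration of a generated file, the standard-library `wave` module is enough. The sketch below assumes the output is PCM-encoded WAV, which is the only encoding `wave` reads. `describe_wav` is a hypothetical helper, and `qwen_output.wav` is the default output name from the options list.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    import os

    # Inspect the default output file if a synthesis run has already produced it.
    if os.path.exists("qwen_output.wav"):
        print(describe_wav("qwen_output.wav"))
```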
Qwen TTS
Local text-to-speech using the Qwen3-TTS-12Hz-1.7B-CustomVoice model from Hugging Face.
Quick Start
Generate speech from text:
```bash
scripts/tts.py "Ciao, come va?" -l Italian -o output.wav
```
With a voice instruction (emotion/style):
```bash
scripts/tts.py "Sono felice!" -i "Parla con entusiasmo" -l Italian -o happy.wav
```
With a different speaker:
```bash
scripts/tts.py "Hello world" -s Ryan -l English -o hello.wav
```
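Installation
First-time setup (one-time):
```bash
cd skills/public/qwen-tts
bash scripts/setup.sh
```
This creates a local virtual environment and installs the qwen-tts package (~500MB).
Note: The first synthesis automatically downloads the ~1.7GB model from Hugging Face.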
Usage
```bash
scripts/tts.py [options] "Text to speak"
```
Options
- `-o, --output PATH` - Output file path (default: qwen_output.wav)
- `-s, --speaker NAME` - Speaker voice (default: Vivian)
- `-l, --language LANG` - Language (default: Auto)
- `-i, --instruct TEXT` - Voice instruction (emotion, style, tone)
- `--list-speakers` - Show available speakers
- `--model NAME` - Model name (default: CustomVoice 1.7B)
Examples
Basic Italian speech:
```bash
scripts/tts.py "Benvenuto nel futuro del text-to-speech" -l Italian -o welcome.wav
```
With emotion/instruction:
```bash
scripts/tts.py "Sono molto felice di vederti!" -i "Parla con entusiasmo e gioia" -l Italian -o happy.wav
```
Different speaker:
```bash
scripts/tts.py "Hello, nice to meet you" -s Ryan -l English -o ryan.wav
```
List available speakers:
```bash
scripts/tts.py --list-speakers
```
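Available Speakers
The CustomVoice model includes 9 premium voices:
| Speaker | Language | Description |
|---|---|---|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Uncle_Fu | Chinese | Seasoned male, low mellow timbre |
| Dylan | Chinese (Beijing) | Youthful Beijing male, clear |
| Eric | Chinese (Sichuan) | Lively Chengdu male, husky |
| Ryan | English | Dynamic male, rhythmic |
| Aiden | English | Sunny American male |
| Ono_Anna | Japanese | Playful female, light and nimble |
| Sohee | Korean | Warm female, rich emotion |
Recommendation: Use each speaker's native language for best quality, though all speakers support all 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian).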
Voice Instructions
Use `-i, --instruct` to control emotion, tone, and style:
Italian examples:
- `"Parla con entusiasmo"`
- `"Tono serio e professionale"`
- `"Voce calma e rilassante"`
- `"Leggi come un narratore"`
English examples:
- `"Speak with excitement"`
- `"Very happy and energetic"`
- `"Calm and soothing voice"`
- `"Read like a narrator"`
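Integration with OpenClaw
The script outputs the audio file path to stdout (last line), making it compatible with OpenClaw's TTS workflow:
```bash
# OpenClaw captures the output path
cd skills/public/qwen-tts
OUTPUT=$(scripts/tts.py "Ciao" -s Vivian -l Italian -o /tmp/audio.wav 2>/dev/null)
# OUTPUT = /tmp/audio.wav
```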
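Performance
- GPU (CUDA): ~1-3 seconds for short phrases
- CPU: ~10-30 seconds for short phrases
- Model size: ~1.7GB (auto-downloads on first run)
- Venv size: ~500MB (installed dependencies)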
Troubleshooting
Setup fails:
```bash
# Make sure Python 3.10-3.12 is available
python3.12 --version
# Re-run setup
cd skills/public/qwen-tts
rm -rf venv
bash scripts/setup.sh
```
Model download slow/fails:
```bash
# Use a mirror (mainland China)
export HF_ENDPOINT=https://hf-mirror.com
scripts/tts.py "Test" -o test.wav
```
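Out of memory (GPU):
The model automatically falls back to CPU if GPU memory is insufficient.
Audio quality issues:
- Try a different speaker: `--list-speakers`
- Add an instruction: `-i "Speak clearly and slowly"`
- Check that the language matches the text: `-l Italian` for Italian text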
Model Details
- Model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
- Source: Hugging Face (https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice)
- License: Check the model card for current license terms
- Sample Rate: 16kHz
- Output Format: WAV (uncompressed)