Qwen3-Audio
Overview
Qwen3-Audio is a high-performance audio processing library optimized for Apple Silicon (M1/M2/M3/M4). It provides text-to-speech (TTS) and speech-to-text (STT) with support for multiple models, languages, and audio formats.
Prerequisites
- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)
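The two prerequisites above can be checked from Python before running anything else. The following is a minimal sketch, not part of the library; the function name `prerequisite_problems` is hypothetical, and it assumes that a native Apple Silicon interpreter reports `Darwin` and `arm64` through the standard `platform` module.

```python
import platform
import sys


def prerequisite_problems(version=None, system=None, machine=None):
    """Return a list of human-readable problems; an empty list means OK."""
    # Default to the running interpreter and host when no values are passed.
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine

    problems = []
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    # Assumption: native Apple Silicon Python reports Darwin / arm64.
    if system != "Darwin" or machine != "arm64":
        problems.append(f"Apple Silicon Mac required, found {system}/{machine}")
    return problems
```

Calling `prerequisite_problems()` with no arguments inspects the current machine. An x86_64 Python running under Rosetta would likely be reported as a problem here even on Apple Silicon hardware, because `platform.machine()` describes the running interpreter.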
Environment checks
Before using any capability, verify that all items in ./references/env-check-list.md are complete.
Capabilities
Text to Speech
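uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav"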
Returns (JSON):
{
  "audio_path": "/path_to_save.wav",
  "duration": 1.234,
  "sample_rate": 24000
}
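When the TTS command is driven from another program, the JSON result can be parsed into a typed object. The sketch below assumes the result has the fields `audio_path`, `duration`, and `sample_rate`, and that the script prints the JSON on stdout and exits non-zero on failure; neither stdout nor the exit code is confirmed by this document. The names `TtsResult`, `build_tts_command`, `parse_tts_result`, and `synthesize` are hypothetical helpers.

```python
import json
import subprocess
from dataclasses import dataclass

# Command prefix copied from the examples in this document.
BASE_CMD = ["uv", "run", "--python", ".venv/bin/python", "./scripts/mlx-audio.py"]


@dataclass
class TtsResult:
    audio_path: str
    duration: float
    sample_rate: int


def build_tts_command(text, output):
    """Build the argv list for a plain TTS call."""
    return BASE_CMD + ["tts", "--text", text, "--output", output]


def parse_tts_result(raw):
    """Parse the JSON text returned by the tts subcommand."""
    data = json.loads(raw)
    return TtsResult(
        audio_path=data["audio_path"],
        duration=float(data["duration"]),
        sample_rate=int(data["sample_rate"]),
    )


def synthesize(text, output):
    """Run the CLI and return the parsed result.

    Assumption: the JSON is written to stdout and failures give a
    non-zero exit code (check=True then raises CalledProcessError).
    """
    proc = subprocess.run(
        build_tts_command(text, output),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tts_result(proc.stdout)
```

Passing the arguments as a list avoids shell quoting problems when the text contains spaces or quotes.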
Voice Cloning
Clone any voice using a reference audio sample. Provide the wav file and its transcript:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."
- `--ref_audio`: reference audio to clone
- `--ref_text`: transcript of the reference audio
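Before passing a file as the reference audio, it can help to confirm that it is a readable wav file and to see how long it is. This is a minimal sketch using Python's standard `wave` module, which only handles uncompressed PCM wav files. The helper names and the `min_seconds` threshold are assumptions for illustration; this document does not state a minimum reference length for the CLI.

```python
import wave


def describe_wav(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM wav file."""
    # wave.open raises wave.Error if the file is not a valid PCM wav.
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return frames / rate, rate, channels


def check_reference(path, min_seconds=3.0):
    """Return True if the clip is at least min_seconds long.

    Assumption: 3.0 is only an illustrative default, not a documented
    requirement of the CLI.
    """
    duration, _, _ = describe_wav(path)
    return duration >= min_seconds
```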
Use Created Voice (Shortcut)
Use a voice created with `voice create` by its ID:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_voice "my-voice-id"
This automatically loads `ref_audio` and `ref_text` from the voice profile.
CustomVoice (Emotion Control)
Use predefined voices with emotion/style instructions:
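uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --speaker "Ryan" --language "English" --instruct "Very happy and excited."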
VoiceDesign (Create Any Voice)
Create any voice from a text description:
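uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --language "English" --instruct "A cheerful young female voice with high pitch and energetic tone."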
Automatic Speech Recognition (STT)
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" stt --audio "/sample_audio.wav" --output "/path_to_save.txt" --output-format srt
Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
`--output-format`: one of `txt`, `ass`, `srt`, or `all`
Returns (JSON):
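{
  "text": "transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path_to_save.txt", "/path_to_save.srt"]
}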
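A caller can validate the output format before invoking the STT command and then read the transcript and the list of written files from the result. The sketch below assumes the result JSON has a `text` field and a `files` array; the helper names `build_stt_command` and `transcript_and_files` are hypothetical. It always passes `--output-format` explicitly because the CLI's default is not stated in this document.

```python
import json

# Command prefix copied from the examples in this document.
BASE_CMD = ["uv", "run", "--python", ".venv/bin/python", "./scripts/mlx-audio.py"]

# Values listed for --output-format in this document.
OUTPUT_FORMATS = ("txt", "ass", "srt", "all")


def build_stt_command(audio, output, output_format):
    """Build the argv list for an stt call, rejecting unknown formats early."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be one of {OUTPUT_FORMATS}, got {output_format!r}"
        )
    return BASE_CMD + [
        "stt", "--audio", audio, "--output", output,
        "--output-format", output_format,
    ]


def transcript_and_files(raw):
    """Return (text, files) from the JSON text returned by the stt subcommand."""
    data = json.loads(raw)
    # Assumption: "files" may be absent, so fall back to an empty list.
    return data["text"], list(data.get("files", []))
```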
Voice Management
Voices are stored in the voices/ directory at the skill root level. Each voice has its own folder containing:
- `ref_audio.wav` - Reference audio file
- `ref_text.txt` - Reference text transcript
- `ref_instruct.txt` - Voice style description
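The folder layout above can be read directly when a program needs a voice's transcript or style description. This is a minimal sketch under stated assumptions: the file names `ref_audio.wav`, `ref_text.txt`, and `ref_instruct.txt` are taken from the layout described in this document, the text files are assumed to be UTF-8, and `VoiceProfile` and `load_voice` are hypothetical names rather than part of the skill's scripts.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class VoiceProfile:
    voice_id: str
    ref_audio: Path
    ref_text: str
    instruct: str


def load_voice(voices_dir, voice_id):
    """Load one voice profile from <voices_dir>/<voice_id>/."""
    folder = Path(voices_dir) / voice_id
    audio = folder / "ref_audio.wav"
    if not audio.is_file():
        raise FileNotFoundError(f"missing reference audio: {audio}")
    return VoiceProfile(
        voice_id=voice_id,
        ref_audio=audio,
        # Assumption: both text files are UTF-8 encoded.
        ref_text=(folder / "ref_text.txt").read_text(encoding="utf-8").strip(),
        instruct=(folder / "ref_instruct.txt").read_text(encoding="utf-8").strip(),
    )
```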
Create a Voice
Create a reusable voice profile using the VoiceDesign model. The `--instruct` parameter is required and describes the voice style:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --language "English"
Optional: pass `--id "my-voice-id"` to specify a custom voice ID.
Returns (JSON):
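{
  "id": "abc12345",
  "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000
}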
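Creating a voice and then speaking with it can be chained: the `id` field of the create result is the value later passed to `--ref_voice`. The sketch below builds both argument lists without running them. The helper names are hypothetical, and treating `language` as a required argument is an assumption of this sketch; the document only states that `--instruct` is required and `--id` is optional.

```python
import json

# Command prefix copied from the examples in this document.
BASE_CMD = ["uv", "run", "--python", ".venv/bin/python", "./scripts/mlx-audio.py"]


def build_voice_create_command(text, instruct, language, voice_id=None):
    """Build the argv list for `voice create`."""
    if not instruct.strip():
        raise ValueError("--instruct is required to describe the voice style")
    cmd = BASE_CMD + [
        "voice", "create", "--text", text,
        "--instruct", instruct, "--language", language,
    ]
    if voice_id is not None:
        # Optional custom ID; otherwise the tool assigns one.
        cmd += ["--id", voice_id]
    return cmd


def build_ref_voice_command(create_result, text, output):
    """Build a tts argv list that reuses the voice from a create result."""
    voice_id = json.loads(create_result)["id"]
    return BASE_CMD + [
        "tts", "--text", text, "--output", output, "--ref_voice", voice_id,
    ]
```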
List Voices
List all created voice profiles:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice list
Returns (JSON):
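[
  {
    "id": "abc12345",
    "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]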
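The list result is a JSON array of voice records, so it can be indexed by ID or turned into a short summary. A minimal sketch, assuming each record carries `id`, `instruct`, and `duration` fields; `index_voices` and `format_voices` are hypothetical helper names.

```python
import json


def index_voices(raw):
    """Map voice ID -> record from the JSON array returned by `voice list`."""
    return {voice["id"]: voice for voice in json.loads(raw)}


def format_voices(raw):
    """Return one tab-separated summary line per voice, sorted by ID."""
    lines = []
    for voice_id, voice in sorted(index_voices(raw).items()):
        lines.append(f"{voice_id}\t{voice['duration']:.1f}s\t{voice['instruct']}")
    return lines
```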
Use a Created Voice
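After creating a voice, use it for TTS with the `--ref_voice` parameter. The instruct is loaded automatically:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "New text to speak" --output "/output.wav" --ref_voice "abc12345"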
Predefined Speakers (CustomVoice)
For Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for best quality. Each speaker can still speak any language supported by the model.
| Speaker | Voice Description | Native Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |
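The speaker table can be mirrored in code to catch misspelled speaker names before a CustomVoice call is made. This is a sketch only: the speaker names and native languages come from the table above, while the helper names are hypothetical and treating `--instruct` as optional alongside `--speaker` is an assumption. Which `--language` values the CLI accepts beyond `English` is not stated in this document.

```python
# Command prefix copied from the examples in this document.
BASE_CMD = ["uv", "run", "--python", ".venv/bin/python", "./scripts/mlx-audio.py"]

# Speaker -> native language, as listed in the speaker table.
SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing Dialect)",
    "Eric": "Chinese (Sichuan Dialect)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def build_custom_voice_command(text, output, speaker, language, instruct=None):
    """Build a tts argv list for a predefined speaker."""
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker {speaker!r}; expected one of {sorted(SPEAKERS)}")
    cmd = BASE_CMD + [
        "tts", "--text", text, "--output", output,
        "--speaker", speaker, "--language", language,
    ]
    if instruct:
        # Assumption: --instruct may be omitted when no style control is needed.
        cmd += ["--instruct", instruct]
    return cmd


def uses_native_language(speaker, language):
    """Return True if language matches the speaker's native language entry."""
    return SPEAKERS[speaker].startswith(language)
```

`uses_native_language` is only a hint for the recommendation to prefer each speaker's native language; every speaker can still be used with any supported language.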
Released Models
| Model | Features | Language Support | Instruction Control |
|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice clone from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | |
Qwen3-Audio
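Overview
Qwen3-Audio is a high-performance audio processing library optimized for Apple Silicon (M1/M2/M3/M4). It provides fast, efficient text-to-speech and speech-to-text with support for multiple models, languages, and audio formats.
Prerequisites
- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)
Environment checks
Before using any capability, verify that all items in ./references/env-check-list.md are complete.
Capabilities
Text to Speech
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav"
Returns (JSON):
{
  "audio_path": "/path_to_save.wav",
  "duration": 1.234,
  "sample_rate": 24000
}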
Voice Cloning
Clone any voice using a reference audio sample. Provide the wav file and its transcript:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."
- `--ref_audio`: reference audio to clone
- `--ref_text`: transcript of the reference audio
Use Created Voice (Shortcut)
Use a voice created with `voice create` by its ID:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_voice "my-voice-id"
This automatically loads `ref_audio` and `ref_text` from the voice profile.
CustomVoice (Emotion Control)
Use predefined voices with emotion/style instructions:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --speaker "Ryan" --language "English" --instruct "Very happy and excited."
VoiceDesign (Create Any Voice)
Create any voice from a text description:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --language "English" --instruct "A cheerful young female voice with high pitch and energetic tone."
Automatic Speech Recognition (STT)
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" stt --audio "/sample_audio.wav" --output "/path_to_save.txt" --output-format srt
Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
`--output-format`: one of `txt`, `ass`, `srt`, or `all`
Returns (JSON):
{
  "text": "transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path_to_save.txt", "/path_to_save.srt"]
}
Voice Management
Voices are stored in the voices/ directory at the skill root level. Each voice has its own folder containing:
- `ref_audio.wav` - Reference audio file
- `ref_text.txt` - Reference text transcript
- `ref_instruct.txt` - Voice style description
Create a Voice
Create a reusable voice profile using the VoiceDesign model. The `--instruct` parameter is required and describes the voice style:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --language "English"
Optional: pass `--id "my-voice-id"` to specify a custom voice ID.
Returns (JSON):
{
  "id": "abc12345",
  "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000
}
List Voices
List all created voice profiles:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice list
Returns (JSON):
[
  {
    "id": "abc12345",
    "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]
Use a Created Voice
After creating a voice, use it for TTS with the `--ref_voice` parameter. The instruct is loaded automatically:
uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "New text to speak" --output "/output.wav" --ref_voice "abc12345"
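Predefined Speakers (CustomVoice)
For Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for best quality. Each speaker can still speak any language supported by the model.
| Speaker | Voice Description | Native Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |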
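Released Models
| Model | Features | Language Support | Instruction Control |
|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice clone from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | |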