Qwen3-Audio

Overview

Qwen3-Audio is an audio processing library optimized for Apple Silicon (M1/M2/M3/M4). It provides text-to-speech (TTS) and speech-to-text (STT) with support for multiple models, languages, and audio formats.

Prerequisites

- Python 3.10+
- Apple Silicon Mac (M1/M2/M3/M4)

Environment checks

Before using any capability, verify that all items in ./references/env-check-list.md are complete.

Capabilities

Text to Speech

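uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav"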

Returns (JSON):
{
  "audio_path": "/path_to_save.wav",
  "duration": 1.234,
  "sample_rate": 24000
}
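A caller will usually want to consume this result programmatically. The following is a minimal sketch, assuming the script prints a single JSON object with the fields audio_path, duration, and sample_rate to standard output and nothing else. The TtsResult class and parse_tts_result helper are illustrative names, not part of the skill, and the unit of duration (seconds) is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class TtsResult:
    """Hypothetical container for the JSON returned by the tts command."""
    audio_path: str
    duration: float   # assumed to be seconds
    sample_rate: int  # assumed to be Hz


def parse_tts_result(stdout: str) -> TtsResult:
    """Parse the JSON object assumed to be printed by the tts command."""
    data = json.loads(stdout)
    return TtsResult(
        audio_path=str(data["audio_path"]),
        duration=float(data["duration"]),
        sample_rate=int(data["sample_rate"]),
    )


# Sample output copied from the documented return value.
sample = '{"audio_path": "/path_to_save.wav", "duration": 1.234, "sample_rate": 24000}'
result = parse_tts_result(sample)
print(result.audio_path, result.duration, result.sample_rate)
```

A missing field raises KeyError here, which makes an unexpected output shape fail loudly instead of being silently ignored.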

Voice Cloning

Clone any voice using a reference audio sample. Provide the wav file and its transcript:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."

- ref_audio: reference audio to clone
- ref_text: transcript of the reference audio
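When the cloning command is driven from another program, passing the arguments as a list avoids shell-quoting problems with transcripts that contain spaces or quotes. The sketch below only assembles the documented command line; build_clone_command is a hypothetical helper name, and the paths are the placeholder values used in this document.

```python
import shlex


def build_clone_command(text: str, output: str, ref_audio: str, ref_text: str) -> list[str]:
    """Assemble the documented voice-cloning command as an argument list."""
    return [
        "uv", "run", "--python", ".venv/bin/python",
        "./scripts/mlx-audio.py", "tts",
        "--text", text,
        "--output", output,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
    ]


cmd = build_clone_command(
    text="hello world",
    output="/path_to_save.wav",
    ref_audio="sample_audio.wav",
    ref_text="This is what my voice sounds like.",
)
# Print a shell-safe rendering for inspection.
print(shlex.join(cmd))
# To actually run it (requires the skill environment):
# subprocess.run(cmd, check=True, capture_output=True, text=True)
```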

Use Created Voice (Shortcut)

Use a voice created with voice create by its ID:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_voice "my-voice-id"

This automatically loads ref_audio and ref_text from the voice profile.

CustomVoice (Emotion Control)

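Use predefined voices with emotion/style instructions:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --speaker "Ryan" --language "English" --instruct "Very happy and excited."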

VoiceDesign (Create Any Voice)

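Create any voice from a text description:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --language "English" --instruct "A cheerful young female voice with high pitch and energetic tone."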

Automatic Speech Recognition (STT)

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" stt --audio "/sample_audio.wav" --output "/path_to_save.txt" --output-format srt

Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav

output-format: "txt" | "ass" | "srt" | "all"

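Returns (JSON):
{
  "text": "transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path_to_save.txt", "/path_to_save.srt"]
}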
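Because --output-format accepts only four values ("txt", "ass", "srt", "all"), a wrapper can reject anything else before launching the script. This is a minimal sketch under that assumption; build_stt_command and VALID_OUTPUT_FORMATS are hypothetical names, and how the script itself reacts to an invalid value is not documented here.

```python
VALID_OUTPUT_FORMATS = ("txt", "ass", "srt", "all")


def build_stt_command(audio: str, output: str, output_format: str = "srt") -> list[str]:
    """Assemble the documented stt command, validating the output format first."""
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be one of {VALID_OUTPUT_FORMATS}, got {output_format!r}"
        )
    return [
        "uv", "run", "--python", ".venv/bin/python",
        "./scripts/mlx-audio.py", "stt",
        "--audio", audio,
        "--output", output,
        "--output-format", output_format,
    ]


stt_cmd = build_stt_command("/sample_audio.wav", "/path_to_save.txt", "srt")
print(stt_cmd)
```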

Voice Management

Voices are stored in the voices/ directory at the skill root level. Each voice has its own folder containing:

- ref_audio.wav - Reference audio file
- ref_text.txt - Reference text transcript
- ref_instruct.txt - Voice style description
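The folder layout can be read directly when a tool needs a voice's transcript or style description without calling the script. The sketch below assumes each voice folder holds ref_audio.wav, ref_text.txt, and ref_instruct.txt, and that the text files are UTF-8; load_voice_profile is a hypothetical helper, and the demo builds a throwaway voices directory rather than touching a real one.

```python
import tempfile
from pathlib import Path


def load_voice_profile(voices_dir: Path, voice_id: str) -> dict:
    """Read one voice folder into a dict (file names as documented, encoding assumed)."""
    voice_dir = voices_dir / voice_id
    return {
        "id": voice_id,
        "ref_audio": str(voice_dir / "ref_audio.wav"),
        "ref_text": (voice_dir / "ref_text.txt").read_text(encoding="utf-8").strip(),
        "instruct": (voice_dir / "ref_instruct.txt").read_text(encoding="utf-8").strip(),
    }


# Demo against a temporary directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    voices = Path(tmp) / "voices"
    demo = voices / "my-voice-id"
    demo.mkdir(parents=True)
    (demo / "ref_audio.wav").write_bytes(b"")  # empty stand-in, not real audio
    (demo / "ref_text.txt").write_text("This is a sample voice reference text.\n", encoding="utf-8")
    (demo / "ref_instruct.txt").write_text("A warm, friendly female voice.\n", encoding="utf-8")
    profile = load_voice_profile(voices, "my-voice-id")

print(profile["ref_text"])
print(profile["instruct"])
```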

Create a Voice

Create a reusable voice profile using the VoiceDesign model. The --instruct parameter is required and describes the voice style:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --language "English"

Optional: --id "my-voice-id" to specify a custom voice ID.

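Returns (JSON):
{
  "id": "abc12345",
  "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000
}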

List Voices

List all created voice profiles:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice list

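Returns (JSON):
[
  {
    "id": "abc12345",
    "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]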
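A common follow-up is to check whether a voice ID exists before passing it to --ref_voice. The sketch below assumes voice list prints a JSON array of objects that each carry an id field; find_voice is a hypothetical helper and the sample data mirrors the documented return value.

```python
import json
from typing import Optional


def find_voice(voice_list_json: str, voice_id: str) -> Optional[dict]:
    """Return the profile whose id matches, or None if no such voice is listed."""
    for voice in json.loads(voice_list_json):
        if voice.get("id") == voice_id:
            return voice
    return None


# Sample output mirroring the documented JSON array.
listing = json.dumps([
    {
        "id": "abc12345",
        "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
        "ref_text": "This is a sample voice reference text.",
        "instruct": "A warm, friendly female voice with a professional tone.",
        "duration": 3.456,
        "sample_rate": 24000,
    }
])

found = find_voice(listing, "abc12345")
missing = find_voice(listing, "no-such-voice")
print(found["instruct"] if found else "not found")
```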

Use a Created Voice

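After creating a voice, use it for TTS with the --ref_voice parameter. The stored instruct is loaded automatically:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "New text to speak" --output "/output.wav" --ref_voice "abc12345"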

Predefined Speakers (CustomVoice)

For Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for best quality. Each speaker can still speak any language supported by the model.

Speaker	Voice Description	Native Language
Vivian	Bright, slightly edgy young female voice.	Chinese
Serena

Released Models

Model	Features	Language Support	Instruction Control
Qwen3-TTS-12Hz-1.7B-VoiceDesign	Performs voice design based on user-provided descriptions.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	✅
Qwen3-TTS-12Hz-1.7B-CustomVoice	Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	✅
Qwen3-TTS-12Hz-1.7B-Base	Base model capable of 3-second rapid voice clone from user audio input; can be used for fine-tuning (FT) other models.	Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian	
