Baidu Intelligent Cloud Speech Synthesis Skill
Triggers
Use this skill when the user mentions:
- "Convert this dialogue to audio using Baidu TTS"
- "Generate male-female dialogue, male voice using Duxiaoyao, female voice using Duxiaomei"
- "Batch process all dialogues in dialogue.txt"
- "Adjust speech rate to 7, pitch to 6"
- "View available voice list"
- "baidu tts", "dialogue to audio", "multi-speaker speech synthesis"
- "baidu speech synthesis", "multi-speaker dialogue", "Baidu TTS"
Chinese triggers (for Chinese users):
- "用百度TTS把这段对话转成音频"
- "生成男女对话,男声用度逍遥,女声用度小美"
- "批量处理 dialogue.txt 里的所有对话"
- "调整语速到7,音调到6"
- "查看可用的音色列表"
Overview
This skill calls the Baidu Intelligent Cloud Speech Synthesis API, supporting multi-speaker dialogue synthesis (SSML mode or segment-merge fallback). It provides rich voice selection, speech rate/pitch/volume adjustment, and can automatically convert text dialogues into audio files with character-specific voices.
Installation Dependencies
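```bash
# Install Python dependencies
pip install requests

# Make sure ffmpeg is installed (required for audio merging)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows: download from https://ffmpeg.org/download.html

# Optional: if pydub is needed (alternative merge solution)
pip install pydub
```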
Environment Variables Setup
Choose one of three authentication methods:
Method 1: API Key + Secret Key (auto-token)
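```bash
export BAIDU_API_KEY="your API Key"        # not the bce-v3 format
export BAIDU_SECRET_KEY="your Secret Key"
```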
Method 2: Direct access_token (starts with 1.)
```bash
export BAIDU_API_KEY=1.a6b7dbd428f731035f771b8d
# BAIDU_SECRET_KEY is not needed
```
Method 3: IAM Key (starts with bce-v3/)
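```bash
export BAIDU_API_KEY=bce-v3/ALTAK-8h6t5Y7uI9o0P1q3W2e4R5t6Y7u8I9o0P
# BAIDU_SECRET_KEY is not needed
# Note: an existing bce-v3/ALTAK-... key may be dedicated to another service (such as search).
# If authentication fails, create a dedicated speech synthesis application to obtain an API Key + Secret Key.
```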
Required Environment Variables
`BAIDU_API_KEY` must be set. Whether `BAIDU_SECRET_KEY` is needed depends on the authentication method:
Method 1: API Key + Secret Key (auto-token)
```bash
BAIDU_API_KEY="your API Key"        # not the bce-v3 format
BAIDU_SECRET_KEY="your Secret Key"
```
Method 2: Direct access_token (starts with 1.)
```bash
BAIDU_API_KEY=1.a6b7dbd428f731035f771b8d
# BAIDU_SECRET_KEY is not needed
```
Method 3: IAM Key (starts with bce-v3/)
```bash
BAIDU_API_KEY=bce-v3/ALTAK-8h6t5Y7uI9o0P1q3W2e4R5t6Y7u8I9o0P
# BAIDU_SECRET_KEY is not needed
```
The skill scripts detect the key format automatically and choose the corresponding authentication method. If `BAIDU_API_KEY` is not set, the skill prompts the user.
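The prefix-based detection described above can be sketched as follows. This is a minimal illustration, not the code in `baidu_tts.py`: the function names `detect_auth_method` and `check_credentials` and the returned labels are hypothetical, and only the prefixes (`1.` and `bce-v3/`) and the variable names come from this document.

```python
import os


def detect_auth_method(api_key: str) -> str:
    """Classify a BAIDU_API_KEY value by its prefix (labels are illustrative)."""
    if api_key.startswith("bce-v3/"):
        return "iam_key"          # Method 3: IAM Key
    if api_key.startswith("1."):
        return "access_token"     # Method 2: direct access_token
    return "api_key_secret"       # Method 1: API Key + Secret Key


def check_credentials(env=os.environ) -> str:
    """Return the detected method, or raise if a required variable is missing."""
    api_key = env.get("BAIDU_API_KEY", "")
    if not api_key:
        raise RuntimeError("BAIDU_API_KEY is not set")
    method = detect_auth_method(api_key)
    # Only Method 1 needs the secret key to request a token.
    if method == "api_key_secret" and not env.get("BAIDU_SECRET_KEY"):
        raise RuntimeError("BAIDU_SECRET_KEY is required for API Key + Secret Key authentication")
    return method
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary, for example `check_credentials({"BAIDU_API_KEY": "1.abc"})`.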
Usage
1. Direct script invocation (command line)
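```bash
# Synthesize a single dialogue file
python ~/.openclaw/skills/baidu-speech-synthesis/scripts/baidu_tts.py \
  --input dialogue.txt \
  --output conversation.mp3

# Specify the voice mapping (role name → voice code)
python scripts/baidu_tts.py \
  --input script.txt \
  --map 小明:1 小红:0 老师:106

# Batch-process all .txt files in a directory
python scripts/baidu_tts.py \
  --dir ./dialogues \
  --format mp3

# Adjust parameters
python scripts/baidu_tts.py \
  --input text.txt \
  --spd 7 --pit 6 --vol 5 \
  --aue 3
```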
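The command-line flags this skill's script accepts (`--input`, `--output`, `--dir`, `--format`, `--map`, `--spd`, `--pit`, `--vol`, `--aue`) could be declared with `argparse` roughly as follows. The flag names come from this document; the types, defaults, help strings, and the rule that `--input` and `--dir` are mutually exclusive are assumptions made for the sketch, not a description of `baidu_tts.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags named in this document; types and defaults are assumptions."""
    parser = argparse.ArgumentParser(description="Dialogue-to-audio sketch")
    # Assumption: exactly one of --input / --dir is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input", help="single dialogue text file")
    source.add_argument("--dir", help="directory whose .txt files are processed in batch")
    parser.add_argument("--output", help="output audio file")
    parser.add_argument("--format", default="mp3", help="output format for batch runs")
    parser.add_argument("--map", nargs="*", default=[], metavar="ROLE:CODE",
                        help="role-to-voice assignments, e.g. 小明:1")
    # Assumption: 5 is the neutral default for speed, pitch, and volume.
    parser.add_argument("--spd", type=int, default=5, help="speech rate")
    parser.add_argument("--pit", type=int, default=5, help="pitch")
    parser.add_argument("--vol", type=int, default=5, help="volume")
    parser.add_argument("--aue", type=int, default=3, help="audio encoding code")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--input", "text.txt", "--spd", "7", "--pit", "6", "--vol", "5", "--aue", "3"]
)
print(args.spd, args.pit, args.vol, args.aue)
```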
2. Usage in OpenClaw sessions
When the user triggers the above phrases, the skill will:
1. Check the environment variable configuration.
2. Ask for, or automatically identify, the input text or file.
3. Generate SSML according to the default or specified voice assignment scheme.
4. Call the Baidu API and return the audio file, which can be played automatically or saved.
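The voice assignment step above can be sketched as a small parser that labels each dialogue line with a voice code. This is an illustration under stated assumptions rather than the logic in `dialogue_formatter.py`: the input format (one `role: text` utterance per line), the fallback voice code `0`, and the helper names `parse_map` and `split_dialogue` are all hypothetical. The default mapping is the one this document lists.

```python
import re

# Default mapping stated in this document: role name -> Baidu voice code.
DEFAULT_VOICE_MAP = {"老王": 3, "张经理": 1, "小李": 4}
FALLBACK_VOICE = 0  # assumption: voice code used for roles missing from the map

# Assumption: one utterance per line, "role: text", with an ASCII or full-width colon.
LINE_RE = re.compile(r"^\s*([^::]+?)\s*[::]\s*(.+?)\s*$")


def parse_map(entries):
    """Turn --map style entries such as "小明:1" into {"小明": 1}."""
    voice_map = {}
    for entry in entries:
        role, _, code = entry.rpartition(":")
        if not role or not code.isdecimal():
            raise ValueError(f"expected ROLE:CODE, got {entry!r}")
        voice_map[role] = int(code)
    return voice_map


def split_dialogue(text, overrides=None):
    """Return (role, voice_code, utterance) tuples, one per dialogue line."""
    voice_map = {**DEFAULT_VOICE_MAP, **(overrides or {})}
    segments = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and lines without a speaker label
        role, utterance = match.groups()
        segments.append((role, voice_map.get(role, FALLBACK_VOICE), utterance))
    return segments


script = "老王:今天几点开会?\n小李: 十点。"
for segment in split_dialogue(script, parse_map(["小李:106"])):
    print(segment)
```

Each tuple corresponds to one single-voice request, which is what the segment synthesis mode needs before the pieces are merged.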
File Structure
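```
baidu-speech-synthesis/
├── SKILL.md                      # This file
├── scripts/
│   ├── baidu_tts.py              # Main API client (token retrieval, SSML requests, segment merging)
│   ├── dialogue_formatter.py     # Dialogue text → SSML conversion and voice mapping
│   └── audio_merger.py           # ffmpeg audio merge tool (segment-merge solution)
└── references/
    ├── voice_list.md             # Voice code table, examples, recommended pairings
    ├── ssml_guide.md             # Baidu SSML tags, limits, examples
    └── api_setup.md              # How to obtain keys, free quota (5 million characters/month), authentication details
```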
Technical Points
- Intelligent Mode Selection: Detects when a dialogue needs more than one voice and defaults to segment synthesis mode, because the Baidu API only supports single-voice SSML.
- Segment Synthesis Solution: Splits multi-role dialogues into single-voice segments → synthesizes each separately → merges them with ffmpeg (works around the API limitation and is compatible with Python 3.13).
- SSML Single-Voice Support: Supports single-voice SSML (`tex_type=3`) for complex speech expressions of individual characters.
- Automatic Voice Assignment: Default mapping "老王" → Duxiaoyao (3), "张经理" → Duxiaoyu (1), "小李" → Duyaya (4), customizable via `--map`.
- Error Handling: Friendly prompts for network timeouts, quota exhaustion, audio merge failures, etc.
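The merge stage of the segment synthesis solution can be sketched with ffmpeg's concat demuxer, which joins files listed in a small text file without re-encoding. This is a hedged illustration, not the contents of `audio_merger.py`: the helper names are hypothetical, and it assumes every segment was synthesized with the same audio format so that stream copy (`-c copy`) is valid.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_concat_list(segments, list_file):
    """Write the file list that ffmpeg's concat demuxer reads."""
    lines = []
    for segment in segments:
        # Single quotes inside a path must be escaped for the concat list syntax.
        path = str(Path(segment).resolve()).replace("'", "'\\''")
        lines.append(f"file '{path}'")
    Path(list_file).write_text("\n".join(lines) + "\n", encoding="utf-8")


def build_concat_command(list_file, output):
    """Build the ffmpeg invocation; -c copy joins segments without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(output)]


def merge_segments(segments, output):
    """Merge audio segments in order, failing early if ffmpeg is missing."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH; install it before merging segments")
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "segments.txt"
        write_concat_list(segments, list_file)
        subprocess.run(build_concat_command(list_file, output), check=True)
```

Checking for the binary with `shutil.which` before doing any work mirrors the "detect and prompt" behavior described for the skill, and `check=True` turns a failed merge into an exception that can be reported to the user.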
Notes
- Free Quota: Baidu Speech Synthesis provides a free quota of 5 million characters per month (per the 2026 policy); usage beyond that is pay-as-you-go.
- Authentication Methods: Supports three authentication methods (API Key + Secret Key, access_token, IAM Key), detected automatically by the skill.
- SSML Limitations: SSML text is limited to 1024 bytes. Chinese characters occupy more than one byte each, so count bytes rather than characters; keeping each sentence under 120 characters is recommended.
- Dependencies: The segment-merge solution requires `ffmpeg` to be installed (the skill detects it and prompts if it is missing). There is no need to install pydub.
- Voice Expressiveness: Baidu's base voices are relatively flat; enhance dialogue expressiveness through text optimization, such as adding modal particles (语气词) and emotional descriptions.
- Key Security: Do not hardcode API keys in code; always use environment variables or `.env` files.
- Error Handling: Detailed guidance is provided for authentication failures; refer to `references/api_setup.md` for help.
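The 1024-byte SSML limit noted above is easiest to respect by measuring encoded bytes rather than characters. The sketch below is illustrative only: the helper names are hypothetical, and the encoding used for counting is an assumption (UTF-8 by default here), so confirm against the Baidu documentation which encoding the limit refers to and pass it explicitly.

```python
def byte_length(text, encoding="utf-8"):
    """Size of the text once encoded; the default encoding is an assumption."""
    return len(text.encode(encoding))


def split_by_bytes(text, limit=1024, encoding="utf-8"):
    """Greedily cut text into chunks whose encoded size stays within the limit."""
    chunks, current, size = [], [], 0
    for char in text:
        char_size = len(char.encode(encoding))
        if current and size + char_size > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(char)
        size += char_size
    if current:
        chunks.append("".join(current))
    return chunks


sample = "你" * 400
print(byte_length(sample), [len(chunk) for chunk in split_by_bytes(sample)])
```

A production splitter would more likely cut at sentence boundaries so that no utterance is broken mid-phrase; the greedy character split only demonstrates the byte accounting.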
Changelog
- 2026-03-31 (v1.2.3): Fixed bare `except:` statements in `audio_merger.py`; replaced them with proper exception handling to improve debugging and error visibility.
- 2026-03-26 (v1.2.2): Added MIT LICENSE file; updated metadata to declare the ffmpeg dependency; addressed ClawHub security warnings.
- 2026-03-26 (v1.2.1): Complete English translation of skill documentation; improved bilingual triggers for both English and Chinese users.
- 2026-03-26 (v1.2): Switched to ffmpeg instead of pydub, solving Python 3.13 compatibility issues; corrected Baidu API limitation description (only supports single-voice SSML); optimized documentation and default voice mapping.
- 2026-03-26 (v1.1): Enhanced authentication support, added IAM Key and direct access_token authentication, updated free quota information, improved error guidance.
- 2026-03-26 (v1.0): Initial release, supporting multi-speaker dialogue synthesis, SSML/segment-merge dual modes.
Baidu Intelligent Cloud Speech Synthesis Skill (百度智能云语音合成技能)
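Triggers
Use this skill when the user mentions:
- "用百度TTS把这段对话转成音频" (convert this dialogue to audio using Baidu TTS)
- "生成男女对话,男声用度逍遥,女声用度小美" (generate a male-female dialogue, male voice Duxiaoyao, female voice Duxiaomei)
- "批量处理 dialogue.txt 里的所有对话" (batch process all dialogues in dialogue.txt)
- "调整语速到7,音调到6" (adjust speech rate to 7, pitch to 6)
- "查看可用的音色列表" (view the available voice list)
- "baidu tts", "dialogue to audio", "multi-speaker speech synthesis"
- "baidu speech synthesis", "multi-speaker dialogue", "Baidu TTS"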
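Overview
This skill calls the Baidu Intelligent Cloud Speech Synthesis API and supports multi-speaker dialogue synthesis (SSML mode or the segment-merge fallback). It offers a wide choice of voices with adjustable speech rate, pitch, and volume, and can automatically convert text dialogues into audio files with character-specific voices.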
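Installation Dependencies
```bash
# Install Python dependencies
pip install requests

# Make sure ffmpeg is installed (required for audio merging)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows: download from https://ffmpeg.org/download.html

# Optional: if pydub is needed (alternative merge solution)
pip install pydub
```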
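Environment Variables Setup
Choose one of three authentication methods:
Method 1: API Key + Secret Key (auto-token)
```bash
export BAIDU_API_KEY="your API Key"        # not the bce-v3 format
export BAIDU_SECRET_KEY="your Secret Key"
```
Method 2: Direct access_token (starts with 1.)
```bash
export BAIDU_API_KEY=1.a6b7dbd428f731035f771b8d
# BAIDU_SECRET_KEY is not needed
```
Method 3: IAM Key (starts with bce-v3/)
```bash
export BAIDU_API_KEY=bce-v3/ALTAK-8h6t5Y7uI9o0P1q3W2e4R5t6Y7u8I9o0P
# BAIDU_SECRET_KEY is not needed
# Note: an existing bce-v3/ALTAK-... key may be dedicated to another service (such as search).
# If authentication fails, create a dedicated speech synthesis application to obtain an API Key + Secret Key.
```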
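Required Environment Variables
`BAIDU_API_KEY` must be set. Whether `BAIDU_SECRET_KEY` is needed depends on the authentication method:
Method 1: API Key + Secret Key (auto-token)
```bash
BAIDU_API_KEY="your API Key"        # not the bce-v3 format
BAIDU_SECRET_KEY="your Secret Key"
```
Method 2: Direct access_token (starts with 1.)
```bash
BAIDU_API_KEY=1.a6b7dbd428f731035f771b8d
# BAIDU_SECRET_KEY is not needed
```
Method 3: IAM Key (starts with bce-v3/)
```bash
BAIDU_API_KEY=bce-v3/ALTAK-8h6t5Y7uI9o0P1q3W2e4R5t6Y7u8I9o0P
# BAIDU_SECRET_KEY is not needed
```
The skill scripts detect the key format automatically and choose the corresponding authentication method. If `BAIDU_API_KEY` is not set, the skill prompts the user.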
Usage
1. Direct script invocation (command line)
```bash
# Synthesize a single dialogue file
python ~/.openclaw/skills/baidu-speech-synthesis/scripts/baidu_tts.py \
  --input dialogue.txt \
  --output conversation.mp3

# Specify the voice mapping (role name → voice code)
python scripts/baidu_tts.py \
  --input script.txt \
  --map 小明:1 小红:0 老师:106

# Batch-process all .txt files in a directory
python scripts/baidu_tts.py \
  --dir ./dialogues \
  --format mp3

# Adjust parameters
python scripts/baidu_tts.py \
  --input text.txt \
  --spd 7 --pit 6 --vol 5 \
  --aue 3
```
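2. Usage in OpenClaw sessions
When the user triggers the above phrases, the skill will:
1. Check the environment variable configuration.
2. Ask for, or automatically identify, the input text or file.
3. Generate SSML according to the default or specified voice assignment scheme.
4. Call the Baidu API and return the audio file, which can be played automatically or saved.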
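File Structure
```
baidu-speech-synthesis/
├── SKILL.md                      # This file
├── scripts/
│   ├── baidu_tts.py              # Main API client (token retrieval, SSML requests, segment merging)
│   ├── dialogue_formatter.py     # Dialogue text → SSML conversion and voice mapping
│   └── audio_merger.py           # ffmpeg audio merge tool (segment-merge solution)
└── references/
    ├── voice_list.md             # Voice code table, examples, recommended pairings
    ├── ssml_guide.md             # Baidu SSML tags, limits, examples
    └── api_setup.md              # How to obtain keys, free quota (5 million characters/month), authentication details
```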
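Technical Points
- Intelligent Mode Selection: Detects when a dialogue needs more than one voice and defaults to segment synthesis mode, because the Baidu API only supports single-voice SSML.
- Segment Synthesis Solution: Splits multi-role dialogues into single-voice segments → synthesizes each separately → merges them with ffmpeg (works around the API limitation and is compatible with Python 3.13).
- SSML Single-Voice Support: Supports single-voice SSML (`tex_type=3`) for complex speech expressions of individual characters.
- Automatic Voice Assignment: Default mapping 老王 → Duxiaoyao (3), 张经理 → Duxiaoyu (1), 小李 → Duyaya (4), customizable via `--map`.
- Error Handling: Friendly prompts for network timeouts, quota exhaustion, audio merge failures, etc.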
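Notes
- Free Quota: Baidu Speech Synthesis provides a free quota of 5 million characters per month (per the 2026 policy); usage beyond that is pay-as-you-go.
- Authentication Methods: Supports three authentication methods (API Key + Secret Key, access_token, IAM Key), detected automatically by the skill.
- SSML Limitations: SSML text is limited to 1024 bytes. Chinese characters occupy more than one byte each, so count bytes rather than characters; keeping each sentence under 120 characters is recommended.
- Dependencies: The segment-merge solution requires `ffmpeg` to be installed (the skill detects it and prompts if it is missing). There is no need to install pydub.
- Voice Expressiveness: Baidu's base voices are relatively flat; enhance dialogue expressiveness through text optimization, such as adding modal particles (语气词) and emotional descriptions.
- Key Security: Do not hardcode API keys in code; always use environment variables or `.env` files.
- Error Handling: Detailed guidance is provided for authentication failures; refer to `references/api_setup.md` for help.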
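Changelog
- 2026-03-31 (v1.2.3): Fixed bare `except:` statements in `audio_merger.py`; replaced them with proper exception handling to improve debugging and error visibility.
- 2026-03-26 (v1.2.2): Added MIT LICENSE file; updated metadata to declare the ffmpeg dependency; addressed ClawHub security warnings.
- 2026-03-26 (v1.2.1): Complete English translation of skill documentation; improved bilingual triggers for both English and Chinese users.
- 2026-03-26 (v1.2): Switched to ffmpeg instead of pydub, solving Python 3.13 compatibility issues; corrected Baidu API limitation description (only supports single-voice SSML); optimized documentation and default voice mapping.
- 2026-03-26 (v1.1): Enhanced authentication support, added IAM Key and direct access_token authentication, updated free quota information, improved error guidance.
- 2026-03-26 (v1.0): Initial release, supporting multi-speaker dialogue synthesis, SSML/segment-merge dual modes.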