AudioClaw Skills Voice Reply
When to use
Use this skill when AudioClaw already has the final reply text and now needs a voice version that can change on demand.
Common triggers:
- A chat bot should answer in different tones such as calm, warm, cheerful, serious, or promo.
- The caller wants to switch voice_id or voice_family in a single request without editing code.
- The same AudioClaw workflow should support both free voices and paid or custom voices.
- The runtime should keep working even when a requested voice is not available to the current key.
- A user has already said "以后一直给我发语音" ("always send me voice messages from now on"), and AudioClaw should remember that preference for future turns.
- AudioClaw needs a workspace-local file plus a stable way to deliver it as a Feishu audio message.
- The caller already has a cloned AudioClaw voice_id such as vc-... and wants the runtime to use it directly without falling back first.
Do not use this skill for ASR intake or long-form digest generation.
Workflow
1. Start from the final user-facing text. Do not pass hidden reasoning or raw markdown tables.
2. Build an AudioClaw voice request with:
  - text
  - optional scene
  - optional voice_id
  - optional voice_family
  - optional emotion
  - optional speed, pitch, volume
3. Run scripts/openclaw_voice_switchboard.py.
  - For AudioClaw on Feishu or Lark, prefer scripts/picoclaw_voice_reply.py.
4. Let the script resolve the request in this order:
  - exact voice_id
  - voice_family plus matching emotion variant
  - scene plus emotion preset
  - validated fallback voice
  - Important: if the exact voice_id is a clone-style id such as vc-..., this skill now tries that id directly first, even if it is not part of the built-in official voice catalog.
5. If preference_key is present, the script can remember:
  - reply_mode
  - default voice_id
  - default emotion
  - default scene
6. If the result should be sent through AudioClaw media upload, pass:
  - --out inside the AudioClaw workspace
  - --openclaw-workspace-root pointing at the workspace root
  - --delivery-profile feishu_voice when the downstream channel prefers .ogg/.opus
  - optional --chmod 644 if you want to be explicit, though this skill now defaults to 0644
  - if --openclaw-workspace-root is set and --out is omitted, this skill now writes to workspace/state/audio/ automatically
7. Handle the returned JSON manifest in AudioClaw as follows:
  - prefer scripts/picoclaw_voice_reply.py for AudioClaw on Feishu
  - let the wrapper send the Feishu audio message directly
  - do not send the local path or the MEDIA:... line through the message tool
  - only use send_file when you intentionally opt out of direct Feishu sending
  - log trace_id
  - persist the resolved voice choice for the next turn if desired
8. If the requested voice is not available to the current key, let the fallback stand unless the user explicitly requires strict failure.
9. If you want AudioClaw to remember a clone voice, either:
  - set it as the user's default voice with --set-default-voice-id vc-...
  - or register it explicitly with --register-clone-voice-id vc-...
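The resolution order in step 4 can be sketched as a small pure function. This is only an illustration under stated assumptions: the catalog layout, the voice ids, the FALLBACK_VOICE constant, and the resolve_voice name are all hypothetical and are not taken from scripts/openclaw_voice_switchboard.py.

```python
from typing import Dict, Tuple

# Hypothetical catalog for illustration only; the real catalog lives in the skill's scripts.
CATALOG: Dict[str, object] = {
    "voices": {"demo_a", "demo_b"},
    # (voice_family, emotion) -> voice_id
    "family_variants": {("demo", "promo"): "demo_b"},
    # (scene, emotion) -> voice_id
    "scene_presets": {("assistant", "calm"): "demo_a"},
}
FALLBACK_VOICE = "demo_a"  # assumed "validated fallback voice"


def resolve_voice(request: dict, catalog: dict = CATALOG) -> Tuple[str, str]:
    """Return (voice_id, strategy) following the documented resolution order."""
    voice_id = request.get("voice_id")
    # 1. Exact voice_id. Clone-style ids (vc-...) are tried directly,
    #    even when they are missing from the built-in catalog.
    if voice_id and (voice_id.startswith("vc-") or voice_id in catalog["voices"]):
        return voice_id, "exact_voice_id"
    # 2. voice_family plus matching emotion variant.
    key = (request.get("voice_family"), request.get("emotion"))
    if key in catalog["family_variants"]:
        return catalog["family_variants"][key], "family_emotion_variant"
    # 3. scene plus emotion preset.
    key = (request.get("scene"), request.get("emotion"))
    if key in catalog["scene_presets"]:
        return catalog["scene_presets"][key], "scene_emotion_preset"
    # 4. Validated fallback voice.
    return FALLBACK_VOICE, "fallback"
```

Returning the strategy next to the voice id makes it easy to log why a given voice was chosen, which fits the advice to persist the resolved voice choice for the next turn.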
Voice discovery
When AudioClaw needs to find a usable voice or confirm a voice_id, use this order:
1. Check the local catalog first:
  - run python3 scripts/openclaw_voice_switchboard.py --list-voices
  - use this for fast lookup of built-in voices, known emotion variants, and clone voices already registered locally
2. If the user asks for the official public voice list, package availability, or a voice not found locally, check the official voice page:
  - https://senseaudio.cn/docs/voice_api
  - page title: API 音色服务说明 (API Voice Service Guide)
3. Prefer the official page when AudioClaw needs to confirm:
  - whether a voice_id is a free, VIP, or SVIP voice
  - whether a named speaker has multiple emotion variants
  - whether the request likely needs selected-voice purchase or custom voice authorization
4. After finding a likely voice_id, still let the runtime validate access at synthesis time, because account permissions can differ by key.
Practical rule:
- Local --list-voices is the first-stop runtime catalog.
- https://senseaudio.cn/docs/voice_api is the canonical fallback reference for official voice names, voice_id, and package tier notes.
AudioClaw rule
When this skill is used inside AudioClaw for Feishu or Lark voice replies:
1. Run scripts/picoclaw_voice_reply.py.
2. Let the wrapper upload the generated .ogg/.opus file to Feishu and send it as msg_type=audio.
3. Do not call the send_file tool for that audio unless you explicitly passed --skip-direct-send.
4. Do not call the message tool with the local path or the MEDIA:... reference. AudioClaw will send them as plain text.
5. After the audio is sent, prefer no extra text confirmation.
6. If the host runtime still requires one final assistant message to finish the turn, send one short natural Chinese line such as 我已经用语音回复你了。 ("I have replied to you by voice.") instead of leaving the turn empty.
7. Use media_reference only as debug metadata or future AudioClaw compatibility data.
This rule matters because this AudioClaw environment does not render MEDIA:... as media, and the generic send_file tool sends Feishu voice notes as plain files instead of audio messages. The reliable path here is direct Feishu upload plus msg_type=audio.
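The "direct Feishu upload plus msg_type=audio" path can be sketched as two request payloads. This is a minimal sketch, not the wrapper's actual code: the helper names are hypothetical, the open.feishu.cn host and the messages endpoint reflect my reading of the Feishu Open Platform docs and should be verified there, and authentication with a tenant access token is left out.

```python
import json

FEISHU_BASE = "https://open.feishu.cn"  # assumption: Feishu host; Lark tenants use a different domain


def build_upload_form(file_name: str) -> dict:
    """Form fields for POST /open-apis/im/v1/files.

    The audio bytes themselves travel as a separate multipart part named "file".
    """
    return {"file_type": "opus", "file_name": file_name}


def build_audio_message(chat_id: str, file_key: str) -> dict:
    """Request for sending the uploaded file as a real audio message."""
    return {
        "url": f"{FEISHU_BASE}/open-apis/im/v1/messages?receive_id_type=chat_id",
        "body": {
            "receive_id": chat_id,
            "msg_type": "audio",
            # Feishu expects the message content as a JSON-encoded string.
            "content": json.dumps({"file_key": file_key}),
        },
    }
```

The file_key comes from the upload response. Sending that key with msg_type=audio, rather than a path or a MEDIA:... line, is what makes the reply render as a voice message.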
Runtime model
The official public TTS API exposes:
- voice_setting.voice_id
- voice_setting.speed
- voice_setting.vol
- voice_setting.pitch
- audio_setting.format
- audio_setting.sample_rate
- one HTTP endpoint with two modes:
  - non-stream with stream=false
  - SSE with stream=true
Important constraint:
- The public TTS API docs do not expose a standalone emotion request field.
- Emotion switching is therefore handled by choosing a matching voice_id when one exists, or by keeping the voice and shaping speed / pitch / vol.
- This skill requests final-file TTS in non-stream mode by default, because AudioClaw only needs the completed file and this avoids stream assembly edge cases.
- For this server-side HTTP TTS path, the official docs still use Authorization: Bearer API_KEY. The generated Public Key is not required by this skill.
- If the requested voice_id looks like a clone id such as vc-..., this skill now auto-routes TTS to SenseAudio-TTS-1.5 and records audio.model_used in the manifest.
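The constraints above suggest a request builder along the following lines. Treat it as a sketch under assumptions: the top-level model and text field names, the DEFAULT_MODEL placeholder, the neutral values, and the EMOTION_TUNING table are hypothetical. Only the voice and audio setting fields, stream=false, the Bearer header, and the SenseAudio-TTS-1.5 clone routing come from this document.

```python
TTS_URL = "https://api.senseaudio.cn/v1/t2a_v2"  # endpoint named under Resources
CLONE_MODEL = "SenseAudio-TTS-1.5"               # model named in this section
DEFAULT_MODEL = "default-model-placeholder"      # hypothetical; check the official docs

# Hypothetical tuning table: emotion -> (speed, pitch, vol).
# There is no emotion request field, so mood is approximated with these knobs.
EMOTION_TUNING = {"calm": (0.95, 0, 1.0), "promo": (1.08, 1, 1.05)}


def build_tts_request(text, voice_id, api_key, emotion=None):
    """Return (url, headers, body) for a non-stream, final-file TTS call."""
    speed, pitch, vol = EMOTION_TUNING.get(emotion, (1.0, 0, 1.0))
    body = {
        # Clone-style ids are routed to the clone-capable model.
        "model": CLONE_MODEL if voice_id.startswith("vc-") else DEFAULT_MODEL,
        "text": text,
        "stream": False,
        "voice_setting": {"voice_id": voice_id, "speed": speed, "vol": vol, "pitch": pitch},
        "audio_setting": {"format": "mp3", "sample_rate": 32000},
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return TTS_URL, headers, body
```

Note that emotion never appears in the body. It only selects tuning values, which mirrors how this skill shapes speed, pitch, and vol when no matching voice_id exists.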
API key lookup
This skill now treats SENSEAUDIO_API_KEY as the default API key source again.
Runtime rules:
- If the host app injects SENSEAUDIO_API_KEY as an AudioClaw login token such as v2.public..., the shared bootstrap will replace it with the real sk-... value from ~/.audioclaw/workspace/state/senseaudio_credentials.json before TTS starts.
- --api-key-env still works, but the default runtime path is SENSEAUDIO_API_KEY.
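The lookup rule can be illustrated with a short helper. This is a sketch, not the shared bootstrap itself: the resolve_api_key name and the "api_key" field inside the credentials file are assumptions, since the file's layout is not described here.

```python
import json
import os
from pathlib import Path

CREDENTIALS_PATH = Path.home() / ".audioclaw/workspace/state/senseaudio_credentials.json"


def resolve_api_key(env_name="SENSEAUDIO_API_KEY", credentials_path=CREDENTIALS_PATH, environ=None):
    """Return a usable sk-... key, preferring the environment variable."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_name, "")
    if value.startswith("sk-"):
        return value
    # Login tokens such as v2.public... cannot be used for TTS,
    # so fall back to the stored credentials file.
    path = Path(credentials_path)
    if path.is_file():
        data = json.loads(path.read_text(encoding="utf-8"))
        stored = data.get("api_key", "")  # "api_key" is an assumed field name
        if stored.startswith("sk-"):
            return stored
    raise RuntimeError(f"No usable sk-... key found via {env_name} or {path}")
```

Passing a different env_name plays the role of --api-key-env, while the default stays SENSEAUDIO_API_KEY.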
If you need the exact same speaker timbre across many emotions, use a purchased multi-variant voice family or an authorized custom voice. Otherwise this skill will approximate the requested emotion with the best available voice and tuning.
Request contract
Minimal JSON request:
CODEBLOCK0
Full request:
CODEBLOCK1
Clone voice example:
CODEBLOCK2
Supported emotion presets:
- INLINECODE90
- INLINECODE91
- INLINECODE92
- INLINECODE93
- INLINECODE94
- INLINECODE95
- INLINECODE96
- INLINECODE97
- INLINECODE98
Supported scene hints:
- INLINECODE99
- INLINECODE100
- INLINECODE101
- INLINECODE102
- INLINECODE103
- INLINECODE104
- INLINECODE105
- INLINECODE106
- INLINECODE107
AudioClaw integration pattern
Recommended handoff:
1. AudioClaw generates final reply text.
2. AudioClaw decides whether this turn should speak and what mood it wants.
3. AudioClaw calls scripts/openclaw_voice_switchboard.py with a request JSON.
4. The script returns a manifest with:
  - requested voice
  - resolved voice
  - emotion strategy
  - effective speed / pitch / volume
  - delivery profile and file mode
  - local audio path
  - AudioClaw-friendly media reference when the output is under a workspace root
  - trace_id
5. AudioClaw uploads the resulting file to Feishu or any downstream channel, or lets the AudioClaw wrapper do that directly for Feishu.
Operational rules:
- Cache by resolved voice plus text plus audio settings.
- If a paid voice is unavailable, allow fallback unless the request is marked strict.
- Always log the resolved voice and trace_id.
- Force generated audio files to mode 0644 so the AudioClaw sender can read them reliably.
- When --openclaw-workspace-root is set and --out stays inside that root, expose delivery.openclaw_media_reference.
- When --delivery-profile feishu_voice is enabled, synthesize with AudioClaw first and then transcode to ogg/opus with system ffmpeg or imageio-ffmpeg.
- This publishable skill intentionally does not bundle ffmpeg. Install ffmpeg or run python3 -m pip install imageio-ffmpeg on the target machine.
- Avoid ad-hoc temp filenames under the workspace root. Prefer workspace/state/audio/, which this skill will now use automatically when --openclaw-workspace-root is given without --out.
- For AudioClaw on Feishu, scripts/picoclaw_voice_reply.py now uses the local Feishu app credentials from ~/.audioclaw/config.json, uploads the audio through the official /open-apis/im/v1/files endpoint, and sends it as msg_type=audio.
- The wrapper infers the active Feishu chat_id from the latest agent_main_feishu_direct_*.jsonl session log unless you pass --chat-id explicitly.
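The caching, output-path, and file-mode rules can be sketched together. The function names below are hypothetical, and mapping workspace/state/audio/ to <workspace root>/state/audio is an assumption about how the root is laid out.

```python
import hashlib
import json
import os
from pathlib import Path


def cache_key(resolved_voice, text, settings):
    """Stable key over resolved voice, text, and audio settings."""
    payload = json.dumps(
        {"voice": resolved_voice, "text": text, "settings": settings},
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def default_output_path(workspace_root, key, suffix=".opus"):
    """Predictable location under the workspace instead of an ad-hoc temp name."""
    return Path(workspace_root) / "state" / "audio" / f"{key}{suffix}"


def write_audio(path, data):
    """Write the audio bytes and force mode 0644 so the sender can read them."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    os.chmod(path, 0o644)
    return path
```

Keying on the resolved voice rather than the requested one means a fallback result is never served later as if it were the paid voice.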
Commands
List voices:
CODEBLOCK3
Check the official voice catalog page:
CODEBLOCK4
List emotion presets:
CODEBLOCK5
Enable permanent voice reply for one user:
CODEBLOCK6
Enable permanent cloned-voice reply for one user:
CODEBLOCK7
Register a prepared clone voice so the runtime can list and reuse it:
CODEBLOCK8
Show registered clone voices:
CODEBLOCK9
Show current voice preference:
CODEBLOCK10
Generate one AudioClaw turn from a JSON request:
CODEBLOCK11
Direct CLI example:
CODEBLOCK12
AudioClaw Feishu one-step example:
CODEBLOCK13
Only generate, do not send:
CODEBLOCK14
Resources
- Small importable client for https://api.senseaudio.cn/v1/t2a_v2
- Handles SSE chunks and writes audio bytes
- TTS capability summary plus the official voice catalog reference at https://senseaudio.cn/docs/voice_api
- Main runtime for AudioClaw
- Resolves voice, emotion, fallback, caching, and output manifest
- AudioClaw-first wrapper
- Generates audio and, by default, sends it to Feishu as a real audio message
- Direct Feishu sender for .ogg/.opus
- Uses ~/.audioclaw/config.json app credentials by default, infers the active chat, uploads the file, and sends msg_type=audio
- Official docs summary, request examples, and AudioClaw capability boundaries
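AudioClaw Skills Voice Reply
When to use
Use this skill when AudioClaw already has the final reply text and needs a voice version that can change on demand.
Common triggers:
- A chat bot needs to reply in different tones, such as calm, warm, cheerful, serious, or promo.
- The caller wants to switch voice_id or voice_family in a single request without editing code.
- The same AudioClaw workflow needs to support both free voices and paid or custom voices.
- The runtime keeps working when the requested voice is not available to the current key.
- A user has already said "以后一直给我发语音" ("always send me voice messages from now on"), and AudioClaw needs to remember that preference in later turns.
- AudioClaw needs a workspace-local file and a stable way to send it as a Feishu voice message.
- The caller already has a cloned AudioClaw voice_id (such as vc-...) and wants the runtime to use it directly without falling back first.
Do not use this skill for ASR intake or long-form digest generation.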
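Workflow
1. Start from the final user-facing text. Do not pass hidden reasoning or raw markdown tables.
2. Build an AudioClaw voice request containing:
  - text
  - optional scene
  - optional voice_id
  - optional voice_family
  - optional emotion
  - optional speed, pitch, volume
3. Run scripts/openclaw_voice_switchboard.py.
  - For AudioClaw on Feishu or Lark, prefer scripts/picoclaw_voice_reply.py.
4. Let the script resolve the request in this order:
  - exact voice_id match
  - voice_family plus matching emotion variant
  - scene plus emotion preset
  - validated fallback voice
  - Important: if the exact voice_id is a clone-style ID (such as vc-...), this skill now tries that ID directly first, even if it is not in the built-in official voice catalog.
5. If preference_key is present, the script can remember:
  - reply_mode
  - default voice_id
  - default emotion
  - default scene
6. If the result needs to be sent through AudioClaw media upload, pass:
  - --out inside the AudioClaw workspace
  - --openclaw-workspace-root pointing at the workspace root
  - --delivery-profile feishu_voice (when the downstream channel prefers .ogg/.opus)
  - optional --chmod 644 (to be explicit; this skill now defaults to 0644)
  - if --openclaw-workspace-root is set but --out is omitted, this skill now writes to workspace/state/audio/ automatically
7. Handle the returned JSON manifest in AudioClaw as follows:
  - for AudioClaw on Feishu, prefer scripts/picoclaw_voice_reply.py
  - let the wrapper send the Feishu audio message directly
  - do not send the local path or the MEDIA:... line through the message tool
  - use send_file only when you intentionally opt out of direct Feishu sending
  - log trace_id
  - if needed, keep the resolved voice choice for the next turn
8. If the requested voice is not available to the current key, let the fallback take effect unless the user explicitly requires strict failure.
9. If you want AudioClaw to remember a clone voice, either:
  - set it as the user's default voice with --set-default-voice-id vc-...
  - or register it explicitly with --register-clone-voice-id vc-...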
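Voice discovery
When AudioClaw needs to find a usable voice or confirm a voice_id, follow this order:
1. Check the local catalog first:
  - run python3 scripts/openclaw_voice_switchboard.py --list-voices
  - use it for fast lookup of built-in voices, known emotion variants, and locally registered clone voices
2. If the user asks about the official public voice list, package availability, or a voice not found locally, check the official voice page:
  - https://senseaudio.cn/docs/voice_api
  - page title: API 音色服务说明 (API Voice Service Guide)
3. Prefer the official page when AudioClaw needs to confirm:
  - whether a voice_id is a free, VIP, or SVIP voice
  - whether a named speaker has multiple emotion variants
  - whether the request needs selected-voice purchase or custom voice authorization
4. After finding a likely voice_id, still let the runtime validate access at synthesis time, because account permissions can differ by key.
Practical rule:
- Local --list-voices is the runtime's first-choice catalog.
- https://senseaudio.cn/docs/voice_api is the authoritative reference for official voice names, voice_id, and package tier notes.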
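AudioClaw rule
When this skill is used inside AudioClaw for Feishu or Lark voice replies:
1. Run scripts/picoclaw_voice_reply.py.
2. Let the wrapper upload the generated .ogg/.opus file to Feishu and send it as msg_type=audio.
3. Do not call the send_file tool for that audio unless --skip-direct-send was passed explicitly.
4. Do not call the message tool with the local path or the MEDIA:... reference. AudioClaw will send them as plain text.
5. After the audio is sent, prefer not to send an extra text confirmation.
6. If the host runtime still needs one final assistant message to finish the turn, send one short natural Chinese line such as 我已经用语音回复你了。 ("I have replied to you by voice.") instead of leaving the turn empty.
7. Use media_reference only as debug metadata or future AudioClaw compatibility data.
This rule matters because this AudioClaw environment does not render MEDIA:... as media, and the generic send_file tool sends Feishu voice notes as plain files rather than audio messages. The reliable path here is direct Feishu upload plus msg_type=audio.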
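Runtime model
The official public TTS API exposes:
- voice_setting.voice_id
- voice_setting.speed
- voice_setting.vol
- voice_setting.pitch
- audio_setting.format
- audio_setting.sample_rate
- one HTTP endpoint with two modes:
  - non-stream, stream=false
  - SSE, stream=true
Important constraint:
- The public TTS API docs do not expose a standalone emotion request field.
- Emotion switching is therefore handled by choosing a matching voice_id (when one exists) or by keeping the voice and adjusting speed / pitch / vol.
- This skill requests final-file TTS in non-stream mode by default, because AudioClaw only needs the completed file and this avoids stream assembly edge cases.
- For this server-side HTTP TTS path, the official docs still use Authorization: Bearer API_KEY. This skill does not need the generated Public Key.
- If the requested voice_id looks like a clone ID (such as vc-...), this skill now automatically routes TTS to SenseAudio-TTS-1.5 and records audio.model_used in the manifest.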
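API key lookup
This skill now treats SENSEAUDIO_API_KEY as the default API key source again.
Runtime rules:
- If the SENSEAUDIO_API_KEY injected by the host app is an AudioClaw login token (such as v2.public...), the shared bootstrap replaces it with the real sk-... value from ~/.audioclaw/workspace/state/senseaudio_credentials.json before TTS starts.
- --api-key-env still works, but the default runtime path is SENSEAUDIO_API_KEY.
If you need exactly the same speaker timbre across several emotions, use a purchased multi-variant voice family or an authorized custom voice. Otherwise this skill approximates the requested emotion with the best available voice and tuning.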
Request contract
Minimal JSON request:
    {
      "text": "我们已经收到你的需求,今天下午会把结果发给你。",
      "scene": "assistant",
      "emotion": "calm"
    }
Full request:
    {
      "text": "新品今晚八点开售,现在下单还有首发赠品。",
      "scene": "sales",
      "voice_id": "male0027_b",
      "voice_family": "male0027",
      "emotion": "promo",
      "speed": 1.08,
      "pitch": 1,
      "volume": 1.05,
      "audio_format": "mp3",
      "sample_rate": 32000,
      "preference_key":