AudioClaw Skills Voice Intake

When to use

Use this skill when the user sends a voice message and AudioClaw should understand the content before replying.

Common triggers:

- A Feishu or chat bot receives an audio message instead of text.
AudioClaw needs a transcript plus a clean user message payload.
The workflow wants richer ASR features such as timestamps, sentiment, or speaker separation.
The team wants one stable AudioClaw intake entrypoint instead of hand-written ASR requests.
The channel stores inbound voice files as .ogg or .opus, and AudioClaw still needs one stable ASR path.

Do not use this skill for speech output. Use $audioclaw-skills-voice-reply for TTS.

Workflow

1. Save the incoming audio file locally.
Run scripts/openclaw_voice_intake.py with the audio path.
Let the script choose the best model when no model is forced:

- sense-asr-deepthink for normal single-speaker voice understanding - sense-asr when a language hint is provided - sense-asr-pro when timestamps, sentiment, speaker diarization, or punctuation are requested - sense-asr-lite when hotwords are requested

4. Use the JSON manifest it returns as the AudioClaw handoff:

- transcript.normalized_text - openclaw.turn_payload - routing.selected_model

5. If understanding.clarification_needed is true, ask the user to repeat or resend the audio.

Runtime model

Official HTTP ASR API:

- Endpoint: INLINECODE13
Content type: INLINECODE14
File size limit: INLINECODE15
Practical local input suffixes accepted by this skill: .wav, .mp3, .ogg, .opus, .flac, .aac, .m4a, INLINECODE23

Supported response goals:

- plain transcript
richer raw response passthrough
AudioClaw-ready turn payload

The skill keeps two layers separate:

- ASR output from AudioClaw ASR
AudioClaw packaging and clarification heuristics

API key lookup

This skill now treats SENSEAUDIO_API_KEY as the default API key source again.

Runtime rules:

- If the host app injects SENSEAUDIO_API_KEY as an AudioClaw login token such as v2.public..., the shared bootstrap will replace it with the real sk-... value from ~/.audioclaw/workspace/state/senseaudio_credentials.json before ASR starts.
INLINECODE29 still works, but the default runtime path is SENSEAUDIO_API_KEY.

Commands

Basic voice intake:

CODEBLOCK0

Voice intake with richer AudioClaw structure:

CODEBLOCK1

Force a specific model:

CODEBLOCK2

AudioClaw integration pattern

Recommended handoff:

1. Channel adapter stores the inbound audio.
AudioClaw calls scripts/openclaw_voice_intake.py.
AudioClaw reads:

- openclaw.turn_payload.role - openclaw.turn_payload.content - openclaw.turn_payload.metadata

4. The normal dialogue pipeline continues as if the user typed the recognized text.

Operational rules:

- Keep the original audio path in metadata for debugging.
Pass language only when you are confident; otherwise let ASR auto-detect.
If you request timestamps, sentiment, or diarization, let the script choose sense-asr-pro.
If transcript is empty, do not hallucinate a user intent. Ask for clarification.

Resources

- INLINECODE37

- Multipart HTTP client for AudioClaw ASR - Handles model routing validation and JSON or text responses

- INLINECODE38

- Main runtime for AudioClaw - Builds transcript, normalized user text, and turn payload

- INLINECODE39

- Official ASR docs summary, model support notes, and AudioClaw payload examples

AudioClaw 技能语音输入

使用时机

当用户发送语音消息且 AudioClaw 需要在回复前理解内容时使用此技能。

常见触发场景：

- 飞书或聊天机器人收到语音消息而非文本
AudioClaw 需要转录文本及干净的用户消息负载
工作流需要更丰富的 ASR 功能，如时间戳、情感或说话人分离
团队希望使用稳定的 AudioClaw 语音输入入口，而非手写 ASR 请求
频道将传入的语音文件存储为 .ogg 或 .opus 格式，AudioClaw 仍需要稳定的 ASR 路径

请勿将此技能用于语音输出。TTS 请使用 $audioclaw-skills-voice-reply。

工作流程

1. 将传入的音频文件保存到本地
使用音频路径运行 scripts/openclawvoiceintake.py
当未强制指定模型时，让脚本选择最佳模型：

- sense-asr-deepthink：用于普通单人语音理解 - sense-asr：当提供语言提示时使用 - sense-asr-pro：当需要时间戳、情感、说话人分离或标点符号时使用 - sense-asr-lite：当需要热词时使用

4. 使用脚本返回的 JSON 清单作为 AudioClaw 交接数据：

- transcript.normalized_text - openclaw.turn_payload - routing.selected_model

5. 如果 understanding.clarification_needed 为 true，请让用户重复或重新发送音频

运行时模型

官方 HTTP ASR API：

- 端点：https://api.senseaudio.cn/v1/audio/transcriptions
内容类型：multipart/form-data
文件大小限制：<=10MB
此技能支持的实际本地输入后缀：.wav、.mp3、.ogg、.opus、.flac、.aac、.m4a、.mp4

支持的响应目标：

- 纯文本转录
更丰富的原始响应透传
AudioClaw 就绪的对话负载

此技能保持两个层分离：

- AudioClaw ASR 的 ASR 输出
AudioClaw 打包和澄清启发式逻辑

API 密钥查找

此技能现在再次将 SENSEAUDIOAPIKEY 视为默认 API 密钥来源。

运行时规则：

- 如果宿主应用注入的 SENSEAUDIOAPIKEY 是 AudioClaw 登录令牌（如 v2.public...），共享引导程序将在 ASR 开始前将其替换为 ~/.audioclaw/workspace/state/senseaudiocredentials.json 中的真实 sk-... 值
--api-key-env 仍然可用，但默认运行时路径为 SENSEAUDIOAPI_KEY

命令

基础语音输入：

bash
python3 scripts/openclawvoiceintake.py \
--input /path/to/user_audio.mp3

带更丰富 AudioClaw 结构的语音输入：

bash
python3 scripts/openclawvoiceintake.py \
--input /path/to/meeting_clip.m4a \
--enable-punctuation \
--timestamp-granularity segment \
--enable-sentiment \
--out-json /tmp/openclawvoiceturn.json

强制指定特定模型：

bash
python3 scripts/openclawvoiceintake.py \
--input /path/to/user_audio.mp3 \
--model sense-asr-deepthink

AudioClaw 集成模式

推荐交接流程：

1. 频道适配器存储传入的音频
AudioClaw 调用 scripts/openclawvoiceintake.py
AudioClaw 读取：

- openclaw.turn_payload.role - openclaw.turn_payload.content - openclaw.turn_payload.metadata

4. 正常对话流程继续，如同用户键入了识别出的文本

操作规则：

- 在元数据中保留原始音频路径以便调试
仅在确定时传递 language；否则让 ASR 自动检测
如果需要时间戳、情感或说话人分离，让脚本选择 sense-asr-pro
如果转录文本为空，不要臆测用户意图。请求澄清

资源

- scripts/senseaudioasrclient.py

- AudioClaw ASR 的多部分 HTTP 客户端 - 处理模型路由验证和 JSON 或文本响应

- scripts/openclawvoiceintake.py

- AudioClaw 的主要运行时 - 构建转录文本、规范化用户文本和对话负载

- references/openclawvoiceintake.md

- 官方 ASR 文档摘要、模型支持说明和 AudioClaw 负载示例

audioclaw-skills-voice-intake语音输入处理