Local STT Workflow
Use this skill to debug the full transcription path, not just the model.
Default assumption: the local STT server lives at http://127.0.0.1:8000/v1.
Current local model-path fallback worth remembering: if the server did not pull a model by name, it may be loading directly from a local path such as ./models/Qwen3-ASR-0.6B-bf16.
When the exact route shape matters, the local OpenAPI document is available at:
- http://localhost:8000/openapi.json
Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.
Workflow
1. Verify the server before blaming OpenClaw
Check the basics first:
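```bash
# Liveness check, then list the model IDs the server accepts
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```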
Confirm that the intended STT model exists, usually qwen3-asr.
If the model does not appear by pulled registry name, do not assume STT is broken — this server may be running a local-path model such as ./models/Qwen3-ASR-0.6B-bf16.
If the server is task-gated, ensure STT is enabled:
```bash
MLX_AUDIO_SERVER_TASKS=stt uv run python server.py
```
If the model is missing, register it before testing clients — but first check whether the server is intentionally loading from a local path and verify the exact accepted model IDs through /v1/models or http://localhost:8000/openapi.json.
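The model check above can be scripted. This is a minimal sketch, assuming /v1/models returns an OpenAI-style list in which each model is an object with an "id" field; verify that shape against the local OpenAPI document first. The helper name model_listed is hypothetical and not part of the server.

```bash
# Hypothetical helper: succeed if the /v1/models JSON in $1 lists model ID $2.
# Assumes each model appears as an object with an "id" field (OpenAI-style).
model_listed() {
  # Strip whitespace so both "id":"x" and "id": "x" match the fixed string.
  printf '%s' "$1" | tr -d ' \t\n' | grep -qF "\"id\":\"$2\""
}

# Usage against the live server (not executed here):
#   model_listed "$(curl -s http://127.0.0.1:8000/v1/models)" qwen3-asr \
#     || echo "qwen3-asr not listed; check for a local-path model ID"
```

If the registry name is not listed, the same helper can be pointed at a local-path ID such as ./models/Qwen3-ASR-0.6B-bf16 before concluding that STT is broken.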
2. Prove the raw STT endpoint works
Always isolate the server from the client stack.
Minimal direct transcription test:
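```bash
# Smallest useful request: one audio file plus the model ID
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=qwen3-asr \
  -F response_format=json
```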
Useful richer test:
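```bash
# OpenAI-style extras: the local schema in step 4 does not list response_format or
# timestamp_granularities[], so confirm support in references/stt-api.md
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=qwen3-asr \
  -F response_format=verbose_json \
  -F 'timestamp_granularities[]=segment' \
  -F 'timestamp_granularities[]=word'
```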
If direct curl works but OpenClaw does not, the bug is probably in the message ingestion or routing layer, not the STT backend.
3. Distinguish server failure from routing failure
Apply this rule strictly:
- Direct curl fails → fix the local STT server first
- Direct curl works, but OpenClaw shows no transcript → inspect OpenClaw audio pipeline / attachment routing
- OpenClaw sends requests, but fields are wrong → inspect request shape compatibility
This distinction saves a shitload of time.
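The rule above can be written down as a small decision function. This is a sketch only: triage_stt is a hypothetical name, the yes/no inputs are observations you collect by hand, and treating "OpenClaw shows no transcript" as "no transcription request reached the server" is an interpretation of the second case.

```bash
# Hypothetical triage helper encoding the three-way rule.
# $1: did the direct curl succeed?                        (yes/no)
# $2: did an OpenClaw transcription request reach the server? (yes/no)
# $3: did the request fields match the local schema?      (yes/no)
triage_stt() {
  if [ "$1" != yes ]; then
    echo "fix the local STT server first"
  elif [ "$2" != yes ]; then
    echo "inspect OpenClaw audio pipeline / attachment routing"
  elif [ "$3" != yes ]; then
    echo "inspect request shape compatibility"
  else
    echo "no fault isolated by this rule"
  fi
}
```

For example, triage_stt yes no no prints the routing verdict, which matches the case where curl works but OpenClaw never dispatches a request.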
4. Check the request shape
This server is designed around OpenAI-style multipart form upload.
Expected core fields for /v1/audio/transcriptions from the current local OpenAPI schema:
- required: file, model
- optional: language, verbose, max_tokens, chunk_duration, frame_threshold, stream, context, prefill_step_size, text
This means the local server is not exposing the same form shape as OpenAI Whisper-style docs. Do not blindly assume response_format, prompt, or timestamp_granularities[] exist just because OpenAI supports them.
If a client is suspected of sending the wrong shape, inspect traffic with a temporary dump proxy or server logs.
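Once a client's field names have been captured, they can be compared with the schema list above. This is a minimal sketch, assuming the field list in this section is still current; unlisted_fields is a hypothetical helper, and the live schema at http://localhost:8000/openapi.json remains the authority.

```bash
# Hypothetical helper: print every field name that the local
# /v1/audio/transcriptions schema (as listed in this section) does not include.
unlisted_fields() {
  for field in "$@"; do
    case "$field" in
      file|model) ;;                                   # required
      language|verbose|max_tokens|chunk_duration) ;;   # optional
      frame_threshold|stream|context|prefill_step_size|text) ;;
      *) echo "$field" ;;                              # not in the local schema
    esac
  done
}

# Example: unlisted_fields file model response_format prompt
# prints response_format and prompt, one per line.
```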
5. Use the reference doc when exact fields matter
Read references/stt-api.md when you need exact behavior for:
- response_format=json|text|verbose_json|srt|vtt
- stream=true SSE events
- timestamp_granularities[]
- include[]
- translation endpoint semantics
- error envelope shape
- current compatibility limits
Do not guess field support from generic OpenAI docs when this local server may intentionally differ.
Current notable mismatch: the local schema exposes context and text, plus chunking/prefill controls like chunk_duration, frame_threshold, and prefill_step_size, which are not the usual OpenAI STT field set.
6. OpenClaw-specific debugging pattern
When OpenClaw STT appears broken:
1. Confirm tools.media.audio is configured, not messages.stt
2. Confirm base URL points at http://127.0.0.1:8000/v1
3. Confirm the chosen model exists in /v1/models
4. Send the exact inbound audio file directly to /v1/audio/transcriptions
5. Inspect gateway logs for any sign of transcription dispatch
6. If there is no /audio/transcriptions request at all, the problem is upstream of STT
If OpenClaw never hits the server, stop tweaking model params. That would be cargo-cult debugging.
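Whether a request ever reached the server can be checked mechanically. This is a rough sketch, assuming the gateway or server log is a plain-text file that records request paths; the log location and format are not specified here, so the path in the usage comment is purely hypothetical, as is the helper name.

```bash
# Hypothetical helper: succeed if log file $1 mentions a transcription request.
saw_transcription_request() {
  grep -qF '/audio/transcriptions' "$1"
}

# Usage (the log path is a placeholder, not a known location):
#   saw_transcription_request /path/to/gateway.log \
#     || echo "no transcription request seen: the problem is upstream of STT"
```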
7. Preferred test ladder
Use this order:
1. GET /health
2. GET /v1/models
3. direct curl transcription with the same audio file
4. compare request fields against http://localhost:8000/openapi.json
5. OpenAI client compatibility test
6. OpenClaw integration test
7. dump-proxy / log inspection only if still ambiguous
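The first rungs of the ladder lend themselves to a stop-at-first-failure runner. This is a sketch under assumptions: run_ladder is a hypothetical helper, each rung is passed as a command string, and only the first three rungs are plain curl calls; the later rungs still need manual judgment.

```bash
# Hypothetical runner: execute each command string in order and stop at
# the first one that exits non-zero, reporting which rung failed.
run_ladder() {
  rung=0
  for step in "$@"; do
    rung=$((rung + 1))
    if ! sh -c "$step" >/dev/null 2>&1; then
      echo "stopped at step $rung: $step"
      return 1
    fi
  done
  echo "all $rung steps passed"
}

# Usage against the live server (not executed here); curl -f turns
# HTTP error statuses into a non-zero exit code:
#   run_ladder \
#     "curl -fsS http://127.0.0.1:8000/health" \
#     "curl -fsS http://127.0.0.1:8000/v1/models" \
#     "curl -fsS -X POST http://127.0.0.1:8000/v1/audio/transcriptions -F file=@sample.wav -F model=qwen3-asr"
```

Stopping early keeps the output honest: a failure at rung 1 or 2 means the later rungs would only add noise.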
8. Common conclusions
Niche input container bug
Typical signs:
- direct upload of a less-common container like .m4a returns 500
- server logs mention unsupported format handling during temp write or normalization
- converting the same source audio to mp3 or wav makes transcription succeed immediately
Conclusion: treat this as an input-container compatibility bug, not an ASR-quality failure. For now, transcode niche formats to mp3 or wav before testing recognition quality.
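The transcode step can be wrapped in two small helpers. This is a sketch, assuming ffmpeg is installed and that its default wav output is acceptable to the server; needs_transcode and to_wav are hypothetical names, and the extension check is a heuristic rather than a real container probe.

```bash
# Hypothetical helper: succeed if the file extension is not mp3 or wav,
# meaning the file should be transcoded before judging recognition quality.
needs_transcode() {
  case "$1" in
    *.mp3|*.MP3|*.wav|*.WAV) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: write a .wav next to the input using ffmpeg defaults.
to_wav() {
  ffmpeg -y -i "$1" "${1%.*}.wav"
}

# Usage (not executed here):
#   needs_transcode voice.m4a && to_wav voice.m4a   # produces voice.wav
```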
Server good, client bad
Typical signs:
- manual curl returns { text: ... }
- OpenClaw logs show no transcription request
- changing model/language does nothing
Conclusion: fix routing, not inference.
Multipart mismatch
Typical signs:
- server is up
- model exists
- client gets 400 errors
- direct curl works but app client does not
Conclusion: compare multipart field names and values.
Feature mismatch
Typical signs:
- client expects diarization, logprobs, or richer streaming fields
- local server only implements a smaller compatible subset
Conclusion: align expectations with references/stt-api.md.
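The status codes mentioned in the signs above suggest a first guess before reading logs. This is only a sketch: first_guess_for_status is a hypothetical helper, and the mapping restates the patterns in this section rather than any documented server contract.

```bash
# Hypothetical helper: map the HTTP status of a direct transcription
# request to the conclusion this section would check first.
first_guess_for_status() {
  case "$1" in
    2??) echo "server good: if OpenClaw still shows nothing, fix routing" ;;
    400) echo "multipart mismatch: compare field names and values" ;;
    500) echo "possible input-container bug: retry with mp3 or wav" ;;
    *)   echo "unclassified: check server logs" ;;
  esac
}

# Usage (not executed here); -w prints only the status code:
#   code=$(curl -s -o /dev/null -w '%{http_code}' -X POST \
#     http://127.0.0.1:8000/v1/audio/transcriptions \
#     -F file=@sample.wav -F model=qwen3-asr)
#   first_guess_for_status "$code"
```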
Resources
references/
- references/stt-api.md: exact local API behavior, schema, response formats, SSE events, limits, and compatibility notes
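Local Speech-to-Text Workflow
Use this skill to debug the full transcription path, not just the model.
Default assumption: the local STT server lives at http://127.0.0.1:8000/v1.
Local model-path fallback worth remembering: if the server did not pull a model by name, it may be loading directly from a local path such as ./models/Qwen3-ASR-0.6B-bf16.
When the exact route shape matters, the local OpenAPI document is available at:
- http://localhost:8000/openapi.json
Use this OpenAPI document as a schema/reference source for comparing this local mlx-audio server against OpenAI's API. Do not treat it as a health check.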
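Workflow
1. Verify the server before blaming OpenClaw
Check the basics first:
```bash
# Liveness check, then list the model IDs the server accepts
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
```
Confirm that the intended STT model exists, usually qwen3-asr.
If the model does not appear under its pulled registry name, do not assume STT is broken: this server may be running a local-path model such as ./models/Qwen3-ASR-0.6B-bf16.
If the server is task-gated, ensure STT is enabled:
```bash
MLX_AUDIO_SERVER_TASKS=stt uv run python server.py
```
If the model is missing, register it before testing clients. First, though, check whether the server is intentionally loading from a local path, and verify the exact accepted model IDs through /v1/models or http://localhost:8000/openapi.json.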
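2. Prove the raw STT endpoint works
Always isolate the server from the client stack.
Minimal direct transcription test:
```bash
# Smallest useful request: one audio file plus the model ID
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=qwen3-asr \
  -F response_format=json
```
Useful richer test:
```bash
# OpenAI-style extras: the local schema in step 4 does not list response_format or
# timestamp_granularities[], so confirm support in references/stt-api.md
curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=qwen3-asr \
  -F response_format=verbose_json \
  -F 'timestamp_granularities[]=segment' \
  -F 'timestamp_granularities[]=word'
```
If direct curl works but OpenClaw does not, the bug is probably in the message ingestion or routing layer, not the STT backend.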
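3. Distinguish server failure from routing failure
Apply this rule strictly:
- Direct curl fails → fix the local STT server first
- Direct curl works, but OpenClaw shows no transcript → inspect OpenClaw audio pipeline / attachment routing
- OpenClaw sends requests, but fields are wrong → inspect request shape compatibility
This distinction saves a lot of time.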
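4. Check the request shape
This server is designed around OpenAI-style multipart form upload.
Expected core fields for /v1/audio/transcriptions from the current local OpenAPI schema:
- required: file, model
- optional: language, verbose, max_tokens, chunk_duration, frame_threshold, stream, context, prefill_step_size, text
This means the local server does not expose the same form shape as the OpenAI Whisper-style docs. Do not blindly assume response_format, prompt, or timestamp_granularities[] exist just because OpenAI supports them.
If a client is suspected of sending the wrong shape, inspect traffic with a temporary dump proxy or server logs.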
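5. Use the reference doc when exact fields matter
Read references/stt-api.md when you need exact behavior for:
- response_format=json|text|verbose_json|srt|vtt
- stream=true SSE events
- timestamp_granularities[]
- include[]
- translation endpoint semantics
- error envelope shape
- current compatibility limits
Do not guess field support from generic OpenAI docs when this local server may intentionally differ.
Current notable mismatch: the local schema exposes context and text, plus chunking/prefill controls like chunk_duration, frame_threshold, and prefill_step_size, which are not the usual OpenAI STT field set.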
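6. OpenClaw-specific debugging pattern
When OpenClaw STT appears broken:
1. Confirm tools.media.audio is configured, not messages.stt
2. Confirm base URL points at http://127.0.0.1:8000/v1
3. Confirm the chosen model exists in /v1/models
4. Send the exact inbound audio file directly to /v1/audio/transcriptions
5. Inspect gateway logs for any sign of transcription dispatch
6. If there is no /audio/transcriptions request at all, the problem is upstream of STT
If OpenClaw never hits the server, stop tweaking model params. That would be blind debugging.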
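7. Preferred test ladder
Use this order:
1. GET /health
2. GET /v1/models
3. direct curl transcription with the same audio file
4. compare request fields against http://localhost:8000/openapi.json
5. OpenAI client compatibility test
6. OpenClaw integration test
7. dump-proxy / log inspection only if still ambiguous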
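8. Common conclusions
Niche input container bug
Typical signs:
- direct upload of a less-common container like .m4a returns 500
- server logs mention unsupported format handling during temp write or normalization
- converting the same source audio to mp3 or wav makes transcription succeed immediately
Conclusion: treat this as an input-container compatibility bug, not an ASR-quality failure. For now, transcode niche formats to mp3 or wav before testing recognition quality.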
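Server good, client bad
Typical signs:
- manual curl returns { text: ... }
- OpenClaw logs show no transcription request
- changing model/language does nothing
Conclusion: fix routing, not inference.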
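Multipart mismatch
Typical signs:
- server is up
- model exists
- client gets 400 errors
- direct curl works but app client does not
Conclusion: compare multipart field names and values.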
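Feature mismatch
Typical signs:
- client expects diarization, logprobs, or richer streaming fields
- local server only implements a smaller compatible subset
Conclusion: align expectations with references/stt-api.md.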
Resources
references/
- references/stt-api.md: exact local API behavior, schema, response formats, SSE events, limits, and compatibility notes