Local STT Workflow

Use this skill to debug the full transcription path, not just the model.

Default assumption: the local STT server lives at http://127.0.0.1:8000/v1.

Current local model-path fallback worth remembering: if the server did not pull a model by name, it may be loading directly from a local path such as ./models/Qwen3-ASR-0.6B-bf16.

When exact route shape matters, the local OpenAPI document is available at:

- http://localhost:8000/openapi.json

Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.

Workflow

1. Verify the server before blaming OpenClaw

Check the basics first:

    curl http://127.0.0.1:8000/health
    curl http://127.0.0.1:8000/v1/models

Confirm that the intended STT model exists, usually qwen3-asr.

If the model does not appear by pulled registry name, do not assume STT is broken — this server may be running a local-path model such as ./models/Qwen3-ASR-0.6B-bf16.

If the server is task-gated, ensure STT is enabled:

    MLX_AUDIO_SERVER_TASKS=stt uv run python server.py

If the model is missing, register it before testing clients — but first check whether the server is intentionally loading from a local path and verify the exact accepted model IDs through /v1/models or http://localhost:8000/openapi.json.
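The model check above can be scripted. This is a minimal sketch, assuming /v1/models returns an OpenAI-style JSON listing whose entries carry an "id" field; confirm that against the local OpenAPI document. The helper name model_listed is hypothetical.

```shell
# Hypothetical helper: succeeds if the model ID given as $1 appears as an
# "id" value in the JSON read from stdin.
# Assumption: /v1/models returns OpenAI-style JSON with "id" fields.
model_listed() {
    # Strip all whitespace so '"id": "x"' and '"id":"x"' compare equal,
    # then look for the literal pair with a fixed-string match.
    tr -d '[:space:]' | grep -qF "\"id\":\"$1\""
}

# Usage (needs the local server running):
#   curl -s http://127.0.0.1:8000/v1/models | model_listed qwen3-asr \
#       && echo "model listed" || echo "model not listed"
```

Because the match is a fixed string, a local-path ID such as ./models/Qwen3-ASR-0.6B-bf16 can be passed as-is without escaping.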

2. Prove the raw STT endpoint works

Always isolate the server from the client stack.

Minimal direct transcription test:

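    curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
      -F file=@sample.wav \
      -F model=qwen3-asr \
      -F response_format=json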

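Useful richer test (response_format and timestamp_granularities[] are OpenAI-style fields that step 4 says not to assume; check them against the OpenAPI document first):

    curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
      -F file=@sample.wav \
      -F model=qwen3-asr \
      -F response_format=verbose_json \
      -F 'timestamp_granularities[]=segment' \
      -F 'timestamp_granularities[]=word'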

If direct curl works but OpenClaw does not, the bug is probably in the message ingestion or routing layer, not the STT backend.
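To make the direct test repeatable, it can be wrapped in a small function that reports the HTTP status separately from the response body. This is a sketch under stated assumptions: stt_probe and STT_BASE_URL are hypothetical names, the defaults are the base URL and model ID named in this document, and only the two required form fields are sent.

```shell
# Assumption: default base URL as stated at the top of this document.
STT_BASE_URL="${STT_BASE_URL:-http://127.0.0.1:8000/v1}"

# Hypothetical helper: POST one audio file to the transcription endpoint,
# print the HTTP status, and keep the response body in a temp file.
# Returns 0 on a 2xx status, 1 on any other status, 2 if the file is missing.
stt_probe() {
    audio_file="$1"
    model="${2:-qwen3-asr}"
    # Fail early with a distinct code so a missing file is not mistaken
    # for a server problem.
    if [ ! -f "$audio_file" ]; then
        echo "no such audio file: $audio_file" >&2
        return 2
    fi
    body_file="$(mktemp)"
    # -w prints only the status code; the response body goes to body_file.
    status="$(curl -sS -o "$body_file" -w '%{http_code}' \
        -X POST "$STT_BASE_URL/audio/transcriptions" \
        -F "file=@$audio_file" \
        -F "model=$model")"
    echo "HTTP $status (body saved to $body_file)"
    case "$status" in
        2??) return 0 ;;
        *)   return 1 ;;
    esac
}

# Usage (needs the local server running):
#   stt_probe sample.wav qwen3-asr
```

Running it against the exact audio file that OpenClaw received keeps the comparison between the direct path and the client path fair.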

3. Distinguish server failure from routing failure

Use this rule hard:

- Direct curl fails → fix the local STT server first
- Direct curl works, but OpenClaw shows no transcript → inspect OpenClaw audio pipeline / attachment routing
- OpenClaw sends requests, but fields are wrong → inspect request shape compatibility

This distinction saves a shitload of time.
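The three-way rule can be written down as a tiny decision function. This is only an illustrative sketch: stt_triage is a hypothetical name, and "OpenClaw shows no transcript" is approximated here by whether any transcription request from OpenClaw reached the server.

```shell
# Hypothetical helper encoding the rule above. Each argument is "yes" or "no":
#   $1  did the direct curl transcription succeed?
#   $2  did a transcription request from OpenClaw reach the server?
#   $3  did that request use the field names the local schema expects?
stt_triage() {
    if [ "$1" != "yes" ]; then
        echo "fix the local STT server first"
    elif [ "$2" != "yes" ]; then
        echo "inspect OpenClaw audio pipeline / attachment routing"
    elif [ "$3" != "yes" ]; then
        echo "inspect request shape compatibility"
    else
        echo "no server, routing, or request-shape fault found"
    fi
}

# Usage:
#   stt_triage yes no yes
```

The checks run in the same order as the rule, so a failing server is always reported before any client-side cause.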

4. Check the request shape

This server is designed around OpenAI-style multipart form upload.

Expected core fields for /v1/audio/transcriptions from the current local OpenAPI schema:

- required: file, model
- optional: language, verbose, max_tokens, chunk_duration, frame_threshold, stream, context, prefill_step_size, text

This means the local server is not exposing the same form shape as OpenAI Whisper-style docs. Do not blindly assume response_format, prompt, or timestamp_granularities[] exist just because OpenAI supports them.

If a client is suspected of sending the wrong shape, inspect traffic with a temporary dump proxy or server logs.
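Before reaching for a dump proxy, a client's field names can be checked against the schema fields this section names. This is a minimal sketch; unknown_stt_fields and STT_KNOWN_FIELDS are hypothetical names, and the list should be refreshed from the OpenAPI document if the server changes.

```shell
# Field names taken from the required and optional fields named in this
# section. Assumption: this list still matches the running server's schema.
STT_KNOWN_FIELDS="file model language verbose max_tokens chunk_duration frame_threshold stream context prefill_step_size text"

# Hypothetical helper: print each argument that is not a known form field.
# Returns 0 if every field is known, 1 otherwise.
unknown_stt_fields() {
    rc=0
    for field in "$@"; do
        case " $STT_KNOWN_FIELDS " in
            *" $field "*) ;;            # known field, nothing to report
            *) echo "$field"; rc=1 ;;   # not in the local schema list
        esac
    done
    return "$rc"
}

# Usage:
#   unknown_stt_fields file model response_format prompt
```

Any name it prints is a field the client sends that the local schema does not list, which is the first thing to look at when a client gets 400 errors.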

5. Use the reference doc when exact fields matter

Read references/stt-api.md when you need exact behavior for:

- response_format=json|text|verbose_json|srt|vtt
- stream=true SSE events
- timestamp_granularities[]
- include[]
- translation endpoint semantics
- error envelope shape
- current compatibility limits

Do not guess field support from generic OpenAI docs when this local server may intentionally differ.

Current notable mismatch: the local schema exposes context and text, plus chunking/prefill controls like chunk_duration, frame_threshold, and prefill_step_size, which are not the usual OpenAI STT field set.

6. OpenClaw-specific debugging pattern

When OpenClaw STT appears broken:

1. Confirm tools.media.audio is configured, not messages.stt
2. Confirm base URL points at http://127.0.0.1:8000/v1
3. Confirm the chosen model exists in /v1/models
4. Send the exact inbound audio file directly to /v1/audio/transcriptions
5. Inspect gateway logs for any sign of transcription dispatch
6. If there is no /audio/transcriptions request at all, the problem is upstream of STT

If OpenClaw never hits the server, stop tweaking model params. That would be cargo-cult debugging.
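Whether OpenClaw ever hit the server can be answered by counting requests in a log. This is a sketch, assuming the gateway or server log records request paths as plain text; the log location and format are not specified here, and count_transcription_requests is a hypothetical name.

```shell
# Hypothetical helper: count log lines that mention the transcription route.
# Assumption: request paths appear verbatim in the log file given as $1.
count_transcription_requests() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so the status is ignored to keep 0 a normal answer.
    grep -c '/audio/transcriptions' "$1" || true
}

# Usage (the log path is a placeholder):
#   count_transcription_requests /path/to/server.log
```

A count of 0 after sending a voice message through OpenClaw points upstream of STT, matching the last step in the list above.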

7. Preferred test ladder

Use this order:

1. GET /health
2. GET /v1/models
3. direct curl transcription with the same audio file
4. compare request fields against http://localhost:8000/openapi.json
5. OpenAI client compatibility test
6. OpenClaw integration test
7. dump-proxy / log inspection only if still ambiguous

8. Common conclusions

Niche input container bug

Typical signs:

- direct upload of a less-common container like .m4a returns 500
- server logs mention unsupported format handling during temp write or normalization
- converting the same source audio to mp3 or wav makes transcription succeed immediately

Conclusion: treat this as an input-container compatibility bug, not an ASR-quality failure. For now, transcode niche formats to mp3 or wav before testing recognition quality.
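One way to do the transcode is with ffmpeg. This is a sketch, assuming ffmpeg is installed; 16 kHz mono is a common ASR input format and is chosen here as an assumption, not as a documented requirement of this server. Both helper names are hypothetical.

```shell
# Hypothetical helper: derive a .wav output name from an input path by
# replacing the final extension.
wav_name_for() {
    printf '%s\n' "${1%.*}.wav"
}

# Hypothetical helper: transcode any ffmpeg-readable file to 16 kHz mono WAV.
# -y overwrites an existing output, -ar sets the sample rate,
# -ac sets the channel count.
to_wav() {
    ffmpeg -y -i "$1" -ar 16000 -ac 1 "$(wav_name_for "$1")"
}

# Usage:
#   to_wav voice-note.m4a    # writes voice-note.wav next to the input
```

If the transcoded file transcribes cleanly while the original fails, that supports the container-compatibility conclusion above.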

Server good, client bad

Typical signs:

- manual curl returns { text: ... }
- OpenClaw logs show no transcription request
- changing model/language does nothing

Conclusion: fix routing, not inference.

Multipart mismatch

Typical signs:

- server is up
- model exists
- client gets 400 errors
- direct curl works but app client does not

Conclusion: compare multipart field names and values.
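Given a raw request body captured by a dump proxy, the field names can be pulled out for comparison. This is a minimal sketch, assuming the dump contains standard Content-Disposition part headers; multipart_fields is a hypothetical name.

```shell
# Hypothetical helper: list the form field names found in a raw multipart
# body read from stdin, one per line, sorted and de-duplicated.
multipart_fields() {
    # Require ';' or a space before name= so filename="..." is not matched.
    grep -o '[; ]name="[^"]*"' \
        | sed 's/.*name="\([^"]*\)"/\1/' \
        | sort -u
}

# Usage (dump file names are placeholders):
#   multipart_fields < curl-request.dump > curl.fields
#   multipart_fields < app-request.dump  > app.fields
#   diff curl.fields app.fields
```

Comparing the working curl request with the failing app request this way shows renamed, missing, or extra fields at a glance; field values still need a manual look.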

Feature mismatch

Typical signs:

- client expects diarization, logprobs, or richer streaming fields
- local server only implements a smaller compatible subset

Conclusion: align expectations with references/stt-api.md.

Resources

references/

- references/stt-api.md — exact local API behavior, schema, response formats, SSE events, limits, and compatibility notes
