Local Speech-to-Text Transcription

You're helping someone use speech-to-text transcription on audio files — meetings, voice memos, podcast episodes, phone recordings — without sending anything to the cloud. Every audio file stays on their devices. The fleet picks the best node to handle each speech-to-text transcription automatically.

Why local speech-to-text transcription matters

Cloud speech-to-text transcription APIs charge per minute and send your audio to third-party servers. Meeting recordings contain sensitive business discussions. Voice notes contain personal thoughts. Podcast interviews contain unreleased content. None of that should leave your network. Local transcription keeps it private.

This skill routes speech-to-text transcription requests across your fleet of devices. If one machine is busy with a 3-hour transcription, the next speech-to-text request goes to a different device. Transcription queue management, health monitoring, and dashboard visibility — same infrastructure you'd get from a cloud speech-to-text API, running entirely on your hardware.

Get started with speech-to-text transcription

CODEBLOCK0

Enable speech-to-text transcription:

CODEBLOCK1

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Transcribe audio with speech-to-text

curl — basic transcription

CODEBLOCK2

Python — speech-to-text transcription

CODEBLOCK3

Speech-to-text transcription with timestamps

CODEBLOCK4

Transcription response format

CODEBLOCK5

Supported audio formats for transcription

WAV, MP3, M4A, FLAC, MP4, OGG — any format FFmpeg supports. WAV files get a ~25% transcription speed boost via native fast-path.

Speech-to-text transcription response headers

Header	Description
INLINECODE0	Which device performed the speech-to-text transcription
INLINECODE1

Transcription model used (qwen3-asr) | | X-Transcription-Time | Transcription processing time in milliseconds |

Speech-to-text transcription model

Qwen3-ASR — state-of-the-art open-source speech-to-text transcription in 2026. ~5% word error rate, runs natively on Apple Silicon via MLX. The 0.6B transcription model uses ~1.2GB memory and transcribes at 0.08x real-time factor (a 10-minute recording completes transcription in ~48 seconds).

Also available on this fleet

The same router handles three other AI workloads alongside speech-to-text transcription. All endpoints are at http://localhost:11435:

LLM inference

CODEBLOCK6

Image generation

CODEBLOCK7

Embeddings

CODEBLOCK8

Monitoring speech-to-text transcription

CODEBLOCK9

Dashboard at http://localhost:11435/dashboard — speech-to-text transcription queues show with [STT] badge alongside LLM and image queues.

Full documentation

Agent Setup Guide — complete reference for all 4 model types including speech-to-text transcription with Python, JavaScript, and curl examples.

Guardrails

- Never delete or modify audio files provided by the user for transcription.
Never send audio data to external services — all speech-to-text transcription is local.
Never delete or modify files in ~/.fleet-manager/.
If transcription fails, suggest checking node logs: tail ~/.fleet-manager/logs/herd.jsonl.
If no speech-to-text models available, suggest installing: uv tool install "mlx-qwen3-asr[serve]" --python 3.14.

本地语音转文字转录

您正在帮助某人使用语音转文字转录功能处理音频文件——会议记录、语音备忘录、播客节目、电话录音——无需将任何内容上传至云端。每个音频文件都保留在他们的设备上。设备集群会自动选择最佳节点来处理每次语音转文字转录任务。

为什么本地语音转文字转录很重要

云端语音转文字转录API按分钟收费，并将您的音频发送至第三方服务器。会议录音包含敏感的商业讨论内容。语音笔记包含个人想法。播客采访包含未发布的内容。这些都不应离开您的网络。本地转录可确保其私密性。

此技能可在您的设备集群中路由语音转文字转录请求。如果一台机器正在处理3小时的转录任务，下一个语音转文字请求将发送至另一台设备。转录队列管理、健康监控和仪表盘可视化——与云端语音转文字API相同的基础设施，完全运行在您的硬件上。

开始使用语音转文字转录

bash
pip install ollama-herd
herd # 启动转录路由器（端口11435）
herd-node # 在每台转录设备上启动
uv tool install mlx-qwen3-asr[serve] --python 3.14 # 安装语音转文字模型

启用语音转文字转录：

bash
curl -X POST http://localhost:11435/dashboard/api/settings \
-H Content-Type: application/json \
-d {transcription: true}

软件包：ollama-herd | 仓库：github.com/geeks-accelerator/ollama-herd

使用语音转文字转录音频

curl — 基础转录

bash

会议录音的语音转文字转录

curl -s http://localhost:11435/api/transcribe \
-F audio=@meeting-recording.wav | python3 -m json.tool

Python — 语音转文字转录

python
import httpx

def speechtotexttranscription(audiopath):
对音频文件执行语音转文字转录。
with open(audio_path, rb) as f:
transcription_resp = httpx.post(
http://localhost:11435/api/transcribe,
files={audio: (audio_path, f)},
timeout=300.0,
)
transcriptionresp.raisefor_status()
transcriptionresult = transcriptionresp.json()
return transcription_result[text]

执行语音转文字转录

transcriptiontext = speechtotexttranscription(meeting.wav) print(transcription_text)

带时间戳的语音转文字转录

python
def transcriptionwithtimestamps(audio_path):
返回带时间戳片段的语音转文字转录。
with open(audio_path, rb) as f:
transcription_resp = httpx.post(
http://localhost:11435/api/transcribe,
files={audio: (audio_path, f)},
timeout=300.0,
)
transcriptionresp.raisefor_status()
transcriptionresult = transcriptionresp.json()
for transcriptionchunk in transcriptionresult.get(chunks, []):
print(f[{transcriptionchunk[start]:.1f}s - {transcriptionchunk[end]:.1f}s] {transcription_chunk[text]})
return transcription_result

转录响应格式

json
{
transcription_text: 你好，这是对语音转文字转录系统的测试。,
language: 中文,
transcription_chunks: [
{
text: 你好，这是对语音转文字转录系统的测试。,
start: 0.0,
end: 3.2,
chunk_index: 0,
language: 中文
}
]
}

支持的转录音频格式

WAV、MP3、M4A、FLAC、MP4、OGG——任何FFmpeg支持的格式。WAV文件通过原生快速通道可获得约25%的转录速度提升。

语音转文字转录响应头

响应头	描述
X-Fleet-Node	执行语音转文字转录的设备
X-Fleet-Model

使用的转录模型（qwen3-asr） | | X-Transcription-Time | 转录处理时间（毫秒） |

语音转文字转录模型

Qwen3-ASR——2026年最先进的开源语音转文字转录模型。词错误率约5%，通过MLX在Apple Silicon上原生运行。0.6B转录模型使用约1.2GB内存，以0.08倍实时因子进行转录（10分钟的录音约需48秒完成转录）。

该集群也提供以下服务

同一路由器处理语音转文字转录以外的三种AI工作负载。所有端点均在http://localhost:11435：

LLM推理

bash
curl http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {model:gpt-oss:120b,messages:[{role:user,content:你好}]}

图像生成

bash
curl -o image.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model:z-image-turbo,prompt:日落,width:1024,height:1024,steps:4}

嵌入向量

bash
curl http://localhost:11435/api/embeddings \
-d {model:nomic-embed-text,prompt:搜索查询}

监控语音转文字转录

bash

转录统计（最近24小时）

curl -s http://localhost:11435/dashboard/api/transcription-stats | python3 -m json.tool

集群健康状态（包含语音转文字转录活动）

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

仪表盘位于http://localhost:11435/dashboard——语音转文字转录队列显示[STT]标签，与LLM和图像队列并列。

完整文档

代理设置指南——所有4种模型类型的完整参考，包括使用Python、JavaScript和curl示例的语音转文字转录。

安全护栏

- 切勿删除或修改用户提供的用于转录的音频文件。
切勿将音频数据发送至外部服务——所有语音转文字转录均为本地处理。
切勿删除或修改~/.fleet-manager/中的文件。
如果转录失败，建议检查节点日志：tail ~/.fleet-manager/logs/herd.jsonl。
如果没有可用的语音转文字模型，建议安装：uv tool install mlx-qwen3-asr[serve] --python 3.14。

local-transcription本地语音转录

local-transcription

Local Speech-to-Text Transcription

Why local speech-to-text transcription matters

Get started with speech-to-text transcription

Transcribe audio with speech-to-text

curl — basic transcription

Python — speech-to-text transcription

Speech-to-text transcription with timestamps

Transcription response format

Supported audio formats for transcription

Speech-to-text transcription response headers

Speech-to-text transcription model

Also available on this fleet

LLM inference

Image generation

Embeddings

Monitoring speech-to-text transcription

Full documentation

Guardrails

本地语音转文字转录

为什么本地语音转文字转录很重要

开始使用语音转文字转录

使用语音转文字转录音频

curl — 基础转录

会议录音的语音转文字转录

Python — 语音转文字转录

执行语音转文字转录

带时间戳的语音转文字转录

转录响应格式

支持的转录音频格式

语音转文字转录响应头

语音转文字转录模型

该集群也提供以下服务

LLM推理

图像生成

嵌入向量

监控语音转文字转录

转录统计（最近24小时）

集群健康状态（包含语音转文字转录活动）

完整文档

安全护栏

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement