Whisper STT Skill
Free, local speech-to-text using OpenAI Whisper.
Prerequisites
Install dependencies (one-time setup):
```bash
pip install openai-whisper torch
```
Optional: Install ffmpeg for broader format support:
- macOS: `brew install ffmpeg`
- Ubuntu: `sudo apt install ffmpeg`
Usage
Transcribe an audio file
```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py <audio_file>
```
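For readers who want to see roughly what such a script does, here is a minimal sketch that calls the `openai-whisper` Python package directly. It is an assumption about the general approach, not the actual contents of `transcribe.py`, and the function name `transcribe_file` is hypothetical.

```python
from pathlib import Path


def transcribe_file(path, model_name="base", language=None):
    """Transcribe one audio file and return Whisper's result dict.

    Hypothetical helper for illustration; not taken from transcribe.py.
    """
    audio = Path(path).expanduser()
    if not audio.is_file():
        raise FileNotFoundError(f"audio file not found: {audio}")

    # Imported lazily so the path check above works even without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    # language=None lets Whisper auto-detect the spoken language.
    return model.transcribe(str(audio), language=language)
```

The returned dictionary carries the full transcript under `"text"` and the timed pieces under `"segments"`, so `transcribe_file("audio.mp3")["text"]` would give the plain transcription.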
Options
| Option | Description |
|---|---|
| `--model` | Model size: tiny, base, small, medium, large, large-v3-turbo (default: base) |
| `--language`, `-l` | Language code: zh, en, ja, etc. (auto-detected if not specified) |
| `--output`, `-o` | Output format: json, txt, srt, vtt (default: json) |
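The options above map naturally onto `argparse`. The sketch below shows one way a script could declare them; it is a hypothetical reconstruction from the table, not the real parser in `transcribe.py`, and the name `build_parser` is an assumption.

```python
import argparse

# Output formats listed in the table above.
FORMATS = ["json", "txt", "srt", "vtt"]


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Transcribe audio with Whisper")
    parser.add_argument("audio_file", help="path to the audio or video file")
    parser.add_argument("--model", default="base",
                        help="model size, e.g. tiny, base, small, medium, large")
    parser.add_argument("--language", "-l", default=None,
                        help="language code such as zh, en, ja; omit to auto-detect")
    parser.add_argument("--output", "-o", default="json", choices=FORMATS,
                        help="output format")
    return parser
```

Leaving `--language` at `None` is what stands in for "auto-detect" in this sketch; the script would pass that value straight through to Whisper.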
Examples
Chinese audio to text:
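```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py recording.m4a --language zh --output txt
```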
Generate subtitles (SRT):
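```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py video.mp4 --output srt > subtitles.srt
```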
Use a faster model:
```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py audio.mp3 --model tiny --output txt
```
High accuracy (slower):
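```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py audio.mp3 --model large-v3 --output txt
```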
Model Selection Guide
| Model | Speed | Accuracy | VRAM/RAM | Best For |
|---|---|---|---|---|
| tiny | ~32x | Basic | ~1GB | Quick tests, low resource |
| base | ~16x | Good | ~1GB | Balanced speed/accuracy |
| small | ~6x | Better | ~2GB | Better accuracy |
| medium | ~2x | Very Good | ~5GB | High accuracy |
| large | 1x | Excellent | ~10GB | Best quality |
| large-v3-turbo | ~8x | Excellent | ~6GB | Fast + accurate (recommended) |
Troubleshooting
"ModuleNotFoundError: No module named 'whisper'"
→ Run: `pip install openai-whisper torch`
"ffmpeg not found"
→ Install ffmpeg or convert audio to WAV format first
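A quick preflight check can surface this problem before a long transcription starts. The sketch below follows the advice above: it only complains when ffmpeg is missing and the input is not already a WAV file. The function name `ffmpeg_warning` and its `ffmpeg_available` parameter are hypothetical, and nothing here is taken from `transcribe.py`.

```python
import shutil
from pathlib import Path


def ffmpeg_warning(audio_path, ffmpeg_available=None):
    """Return a warning string, or None when the input should be readable.

    ffmpeg_available can be passed explicitly; by default it is detected
    by looking for an `ffmpeg` executable on PATH.
    """
    if ffmpeg_available is None:
        ffmpeg_available = shutil.which("ffmpeg") is not None
    if ffmpeg_available:
        return None
    if Path(audio_path).suffix.lower() == ".wav":
        return None  # per the advice above, WAV input is the fallback
    return "ffmpeg not found on PATH: install it or convert the file to WAV first"
```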
Slow transcription
→ Use a smaller model (tiny/base), or make sure a GPU is available (Apple Silicon MPS, NVIDIA CUDA)
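To confirm which accelerator PyTorch can actually see, a short check like the following may help. It is a sketch that assumes only the standard `torch.cuda.is_available()` and `torch.backends.mps.is_available()` calls; whether `transcribe.py` selects a device this way is not stated here, and `pick_device` is a hypothetical name.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # without PyTorch there is no accelerator to use

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

If this returns `"cpu"` on a machine with a supported GPU, the PyTorch build is the likely culprit rather than Whisper itself.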
Poor accuracy on Chinese
→ Pass `--language zh` explicitly and consider a larger model (medium/large)
Output Formats
- json: Full result with segments, timestamps, and metadata
- txt: Plain text transcription only
- srt: SubRip subtitle format with timing
- vtt: WebVTT subtitle format for web players
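To make the relationship between the json and srt outputs concrete, the sketch below turns a list of timed segments into SRT text. It assumes each segment is a dict with `start`, `end` (in seconds), and `text` keys, which is the shape Whisper's own result uses; the helper names are hypothetical and this is not the script's actual formatter.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

WebVTT output would differ mainly in starting with a `WEBVTT` header line and using a period instead of a comma in the timestamps.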
Credits
Powered by OpenAI Whisper, an open-source speech recognition model.
Whisper STT Skill
Free, local speech-to-text using OpenAI Whisper.
Prerequisites
Install dependencies (one-time setup):
```bash
pip install openai-whisper torch
```
Optional: Install ffmpeg for broader format support:
- macOS: `brew install ffmpeg`
- Ubuntu: `sudo apt install ffmpeg`
Usage
Transcribe an audio file
```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py <audio_file>
```
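Options
| Option | Description |
|---|---|
| `--model` | Model size: tiny, base, small, medium, large, large-v3-turbo (default: base) |
| `--language`, `-l` | Language code: zh, en, ja, etc. (auto-detected if not specified) |
| `--output`, `-o` | Output format: json, txt, srt, vtt (default: json) |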
Examples
Chinese audio to text:
```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py recording.m4a --language zh --output txt
```
Generate subtitles (SRT):
```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py video.mp4 --output srt > subtitles.srt
```
Use a faster model:
```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py audio.mp3 --model tiny --output txt
```
High accuracy (slower):
```bash
python ~/.openclaw/skills/whisper-stt/scripts/transcribe.py audio.mp3 --model large-v3 --output txt
```
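Model Selection Guide
| Model | Speed | Accuracy | VRAM/RAM | Best For |
|---|---|---|---|---|
| tiny | ~32x | Basic | ~1GB | Quick tests, low resource |
| base | ~16x | Good | ~1GB | Balanced speed/accuracy |
| small | ~6x | Better | ~2GB | Better accuracy |
| medium | ~2x | Very Good | ~5GB | High accuracy |
| large | 1x | Excellent | ~10GB | Best quality |
| large-v3-turbo | ~8x | Excellent | ~6GB | Fast + accurate (recommended) |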
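Troubleshooting
"ModuleNotFoundError: No module named 'whisper'"
→ Run: `pip install openai-whisper torch`
"ffmpeg not found"
→ Install ffmpeg or convert audio to WAV format first
Slow transcription
→ Use a smaller model (tiny/base), or make sure a GPU is available (Apple Silicon MPS, NVIDIA CUDA)
Poor accuracy on Chinese
→ Pass `--language zh` explicitly and consider a larger model (medium/large)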
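Output Formats
- json: Full result with segments, timestamps, and metadata
- txt: Plain text transcription only
- srt: SubRip subtitle format with timing
- vtt: WebVTT subtitle format for web players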
Credits
Powered by OpenAI Whisper, an open-source speech recognition system.