Faster Whisper

Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).

When to Use

Use this skill when you need to:

- Transcribe audio/video files: meetings, interviews, podcasts, lectures, YouTube videos
- Generate subtitles: SRT, VTT, ASS, LRC, or TTML broadcast-standard subtitles
- Identify speakers: diarization labels who said what (--diarize)
- Transcribe from URLs: YouTube links and direct audio URLs (auto-downloads via yt-dlp)
- Transcribe podcast feeds: --rss <feed-url> fetches and transcribes episodes
- Batch process files: glob patterns, directories, skip-existing support; ETA shown automatically
- Convert speech to text locally: no API costs, works offline (after model download)
- Translate to English: translate any language to English with --translate
- Do multilingual transcription: supports 99+ languages with auto-detection
- Transcribe a batch of files in different languages: --language-map assigns a different language per file
- Transcribe multilingual audio: --multilingual for mixed-language audio
- Transcribe audio with specific terms: use --initial-prompt for jargon-heavy content or any other terms to look out for
- Preprocess noisy audio: --normalize and --denoise clean up the audio before transcription
- Stream output: --stream shows segments as they're transcribed
- Clip time ranges: --clip-timestamps to transcribe specific sections
- Search the transcript: --search "term" finds all timestamps where a word/phrase appears
- Detect chapters: --detect-chapters finds section breaks from silence gaps
- Export speaker audio: --export-speakers DIR saves each speaker's turns as separate WAV files
- Spreadsheet output: --format csv produces a properly-quoted CSV with timestamps

Trigger phrases:
"transcribe this audio", "convert speech to text", "what did they say", "make a transcript",
"audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English",
"find where X is mentioned", "search transcript for", "when did they say", "at what timestamp",
"add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording",
"TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format",
"ASS subtitles", "aegisub format", "advanced substation alpha", "mpv subtitles",
"LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics",
"HTML transcript", "confidence-colored transcript", "color-coded transcript",
"separate audio per speaker", "export speaker audio", "split by speaker",
"transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed",
"different languages in batch", "per-file language",
"transcribe in multiple formats", "srt and txt at the same time", "output both srt and text",
"remove filler words", "clean up ums and uhs", "strip hesitation sounds", "remove you know and I mean",
"transcribe left channel", "transcribe right channel", "stereo channel", "left track only",
"wrap subtitle lines", "character limit per line", "max chars per subtitle",
"detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"

⚠️ Agent guidance — keep invocations minimal:

CORE RULE: default command (./scripts/transcribe audio.mp3) is the fastest path — add flags only when the user explicitly asks for that capability.

Transcription:

- Only add --diarize if the user asks "who said what" / "identify speakers" / "label speakers"
- Only add --format srt/vtt/ass/lrc/ttml if the user asks for subtitles/captions in that format
- Only add --format csv if the user asks for CSV or spreadsheet output
- Only add --word-timestamps if the user needs word-level timing
- Only add --initial-prompt if there's domain-specific jargon to prime
- Only add --translate if the user wants non-English audio translated to English
- Only add --normalize/--denoise if the user mentions bad audio quality or noise
- Only add --stream if the user wants live/progressive output for long files
- Only add --clip-timestamps if the user wants a specific time range
- Only add --temperature 0.0 if the model is hallucinating on music/silence
- Only add --vad-threshold if VAD is aggressively cutting speech or including noise
- Only add --min-speakers/--max-speakers when you know the speaker count
- Only add --hf-token if the token is not cached at INLINECODE30
- Only add --max-words-per-line for subtitle readability on long segments
- Only add --filter-hallucinations if the transcript contains obvious artifacts (music markers, duplicates)
- Only add --merge-sentences if the user asks for sentence-level subtitle cues
- Only add --clean-filler if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)
- Only add --channel left|right if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel
- Only add --max-chars-per-line N when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); takes priority over INLINECODE37
- Only add --detect-paragraphs if the user asks for paragraph breaks or structured text output; --paragraph-gap (default 3.0s) only if they want a custom gap
- Only add --speaker-names "Alice,Bob" when the user provides real names to replace SPEAKER_1/2; always requires INLINECODE41
- Only add --hotwords WORDS when the user names specific rare terms not well served by --initial-prompt; prefer --initial-prompt for general domain jargon
- Only add --prefix TEXT when the user knows the exact words the audio starts with
- Only add --detect-language-only when the user only wants to identify the language, not transcribe
- Only add --stats-file PATH if the user asks for performance stats, RTF, or benchmark info
- Only add --parallel N for large CPU batch jobs; GPU handles one file efficiently on its own, so don't add it for single files or small batches
- Only add --retries N for unreliable inputs (URLs, network files) where transient failures are expected
- Only add --burn-in OUTPUT when the user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input
- Only add --keep-temp when the user may re-process the same URL to avoid re-downloading
- Only add --output-template when the user specifies a custom naming pattern in batch mode
- Multi-format output (--format srt,text): only when the user explicitly wants multiple formats in one pass; always pair with INLINECODE54
- Any word-level feature auto-runs wav2vec2 alignment (~5-10s overhead)
- INLINECODE55 adds ~20-30s on top of that

Search:

- Only add --search "term" when the user asks to find/locate/search for a specific word or phrase in audio
- INLINECODE57 replaces the normal transcript output; it prints only matching segments with timestamps
- Add --search-fuzzy only when the user mentions approximate/partial matching or typos
- To save search results to a file, use INLINECODE59

Chapter detection:

- Only add --detect-chapters when the user asks for chapters, sections, a table of contents, or "where does the topic change"
- Default --chapter-gap 8 (8-second silence = new chapter) works for most podcasts/lectures; tune down for dense content
- INLINECODE62 (default) outputs YouTube-ready timestamps; use json for programmatic use
- Always use --chapters-file PATH when combining chapters with a transcript output; this avoids mixing chapter markers into the transcript text
- If the user only wants chapters (not the transcript), discard the transcript with -o /dev/null and use INLINECODE66
- Batch mode limitation: --chapters-file takes a single path, so in batch mode each file's chapters overwrite the previous. For batch chapter detection, omit --chapters-file (chapters print to stdout under === CHAPTERS (N) ===) or use a separate run per file
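
The silence-gap rule described above can be sketched in a few lines of Python. This is an illustration only, not the script's actual implementation: the `detect_chapters` and `format_timestamp` helpers, the `(start, end)` segment tuples, and the `Chapter N` titles are all assumptions; only the 8-second default gap comes from the text.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as M:SS or H:MM:SS (YouTube-style chapter timestamps)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def detect_chapters(segments, chapter_gap=8.0):
    """Return chapter start times: the first segment, plus every segment
    that follows a silence of at least `chapter_gap` seconds.

    `segments` is a list of (start, end) tuples in seconds, sorted by start.
    """
    chapters = []
    prev_end = None
    for start, end in segments:
        if prev_end is None or start - prev_end >= chapter_gap:
            chapters.append(start)
        prev_end = end
    return chapters


# Hypothetical segment boundaries, for illustration only.
segments = [(0.0, 42.5), (43.0, 118.0), (127.5, 300.0), (301.0, 3700.0), (3710.0, 3800.0)]
for i, start in enumerate(detect_chapters(segments), 1):
    print(f"{format_timestamp(start)} Chapter {i}")
```

Lowering `chapter_gap` in this sketch yields more chapters, which mirrors the advice to tune the gap down for dense content.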

Speaker audio export:

- Only add --export-speakers DIR when the user explicitly asks to save each speaker's audio separately
- Always pair with --diarize; it silently skips if no speaker labels are present
- Requires ffmpeg; outputs SPEAKER_1.wav, SPEAKER_2.wav, etc. (or real names if --speaker-names is set)

Language map:

- Only add --language-map in batch mode when the user has confirmed different languages across files
- Inline format: "interview*.mp3=en,lecture*.mp3=fr" (fnmatch globs on filename)
- JSON file format: @/path/to/map.json where the file is INLINECODE78
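
To make the inline format concrete, here is a minimal Python sketch of how such a map could be parsed and resolved with `fnmatch`. The helper names `parse_language_map` and `language_for` are hypothetical, and the "first matching pattern wins" rule is an assumption; the script's real precedence rules may differ.

```python
from fnmatch import fnmatch
from pathlib import Path


def parse_language_map(spec: str) -> dict:
    """Parse 'pattern=lang,pattern=lang' into an ordered {pattern: lang} dict."""
    mapping = {}
    for entry in spec.split(","):
        pattern, _, lang = entry.strip().partition("=")
        if pattern and lang:
            mapping[pattern] = lang
    return mapping


def language_for(path: str, mapping: dict, default=None):
    """Return the language of the first pattern that matches the file name.

    Matching is done on the bare file name, not the full path.
    """
    name = Path(path).name
    for pattern, lang in mapping.items():
        if fnmatch(name, pattern):
            return lang
    return default  # None here stands for "fall back to auto-detection"


mapping = parse_language_map("interview*.mp3=en,lecture*.mp3=fr")
print(language_for("/data/interview_01.mp3", mapping))
print(language_for("lecture-3.mp3", mapping))
```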

RSS / Podcast:

- Only add --rss URL when the user provides a podcast RSS feed URL
- Default fetches 5 newest episodes; --rss-latest 0 for all; --skip-existing to resume safely
- Always use -o <dir> with --rss. Without it, all episode transcripts print to stdout concatenated, which is hard to use; each episode gets its own file when -o <dir> is set

Output format for agent relay:

- Search results (--search) → print directly to user; output is human-readable
- Chapter output → if no --chapters-file, chapters appear in stdout under === CHAPTERS (N) === header after the transcript; with --format json, chapters are also embedded in the JSON under "chapters" key
- Subtitle formats (SRT, VTT, ASS, LRC, TTML) → always write to -o file; tell the user the output path, never paste raw subtitle content
- Data formats (CSV, HTML, TTML, JSON) → always write to -o file; tell the user the output path, don't paste raw XML/CSV/HTML
- ASS format → for Aegisub, VLC, mpv; write to file and tell user they can open it in Aegisub or play it in VLC/mpv
- LRC format → timed lyrics for music players (Foobar2000, AIMP, VLC); write to file
- Multi-format (--format srt,text) → requires -o <dir>; each format goes to a separate file; tell user all paths written
- JSON format → useful for programmatic post-processing; not ideal to paste in full to user
- Text/transcript → safe to show directly to user for short files; summarise for long ones
- Stats output (--stats-file) → summarise key fields (duration, processing time, RTF) for the user rather than pasting raw JSON
- Language detection (--detect-language-only) → print the result directly; it's a single line
- ETA is printed automatically to stderr for batch jobs; no action needed

When NOT to use:

- Cloud-only environments without local compute
- Files <10 seconds where API call latency doesn't matter

faster-whisper vs whisperx:
This skill covers everything whisperx does — diarization (--diarize), word-level timestamps (--word-timestamps), SRT/VTT subtitles — so whisperx is not needed. Use whisperx only if you specifically need its pyannote pipeline or batch-GPU features not covered here.

Quick Reference

| Task | Command | Notes |
|---|---|---|
| Basic transcription | INLINECODE98 | Batched inference, VAD on, distil-large-v3.5 |
| SRT subtitles | `./scripts/transcribe audio.mp3 --format srt -o subs.srt` | Word timestamps auto-enabled |
| VTT subtitles | `./scripts/transcribe audio.mp3 --format vtt -o subs.vtt` | WebVTT format |
| Word timestamps | `./scripts/transcribe audio.mp3 --word-timestamps --format srt` | wav2vec2 aligned (~10ms) |
| Speaker diarization | `./scripts/transcribe audio.mp3 --diarize` | Requires pyannote.audio |
| Translate → English | `./scripts/transcribe audio.mp3 --translate` | Any language → English |
| Stream output | `./scripts/transcribe audio.mp3 --stream` | Live segments as transcribed |
| Clip time range | `./scripts/transcribe audio.mp3 --clip-timestamps "30,60"` | Only 30s–60s |
| Denoise + normalize | `./scripts/transcribe audio.mp3 --denoise --normalize` | Clean up noisy audio first |
| Reduce hallucination | `./scripts/transcribe audio.mp3 --hallucination-silence-threshold 1.0` | Skip hallucinated silence |
| YouTube/URL | `./scripts/transcribe https://youtube.com/watch?v=...` | Auto-downloads via yt-dlp |
| Batch process | `./scripts/transcribe *.mp3 -o ./transcripts/` | Output to directory |
| Batch with skip | `./scripts/transcribe *.mp3 --skip-existing -o ./out/` | Resume interrupted batches |
| Domain terms | `./scripts/transcribe audio.mp3 --initial-prompt 'Kubernetes gRPC'` | Boost rare terminology |
| Hotwords boost | `./scripts/transcribe audio.mp3 --hotwords 'JIRA Kubernetes'` | Bias decoder toward specific words |
| Prefix conditioning | `./scripts/transcribe audio.mp3 --prefix 'Good morning,'` | Seed the first segment with known opening words |
| Pin model version | `./scripts/transcribe audio.mp3 --revision v1.2.0` | Reproducible transcription with a pinned revision |
| Debug library logs | `./scripts/transcribe audio.mp3 --log-level debug` | Show faster_whisper internal logs |
| Turbo model | `./scripts/transcribe audio.mp3 -m turbo` | Alias for large-v3-turbo |
| Faster English | `./scripts/transcribe audio.mp3 --model distil-medium.en -l en` | English-only, 6.8x faster |
| Maximum accuracy | `./scripts/transcribe audio.mp3 --model large-v3 --beam-size 10` | Full model |
| JSON output | `./scripts/transcribe audio.mp3 --format json -o out.json` | Programmatic access with stats |
| Filter noise | `./scripts/transcribe audio.mp3 --min-confidence 0.6` | Drop low-confidence segments |
| Hybrid quantization | `./scripts/transcribe audio.mp3 --compute-type int8_float16` | Save VRAM, minimal quality loss |
| Reduce batch size | `./scripts/transcribe audio.mp3 --batch-size 4` | If OOM on GPU |
| TSV output | `./scripts/transcribe audio.mp3 --format tsv -o out.tsv` | OpenAI Whisper–compatible TSV |
| Fix hallucinations | `./scripts/transcribe audio.mp3 --temperature 0.0 --no-speech-threshold 0.8` | Lock temperature + skip silence |
| Tune VAD sensitivity | `./scripts/transcribe audio.mp3 --vad-threshold 0.6 --min-silence-duration 500` | Tighter speech detection |
| Known speaker count | `./scripts/transcribe meeting.wav --diarize --min-speakers 2 --max-speakers 3` | Constrain diarization |
| Subtitle word wrapping | `./scripts/transcribe audio.mp3 --format srt --word-timestamps --max-words-per-line 8` | Split long cues |
| Private/gated model | `./scripts/transcribe audio.mp3 --hf-token hf_xxx` | Pass token directly |
| Show version | `./scripts/transcribe --version` | Print faster-whisper version |
| Upgrade in-place | `./setup.sh --update` | Upgrade without full reinstall |
| System check | `./setup.sh --check` | Verify GPU, Python, ffmpeg, venv, yt-dlp, pyannote |
| Detect language only | `./scripts/transcribe audio.mp3 --detect-language-only` | Fast language ID, no transcription |
| Detect language JSON | `./scripts/transcribe audio.mp3 --detect-language-only --format json` | Machine-readable language detection |
| LRC subtitles | `./scripts/transcribe audio.mp3 --format lrc -o lyrics.lrc` | Timed lyrics format for music players |
| ASS subtitles | `./scripts/transcribe audio.mp3 --format ass -o subtitles.ass` | Advanced SubStation Alpha (Aegisub, mpv, VLC) |
| Merge sentences | `./scripts/transcribe audio.mp3 --format srt --merge-sentences` | Join fragments into sentence chunks |
| Stats sidecar | `./scripts/transcribe audio.mp3 --stats-file stats.json` | Write perf stats JSON after transcription |
| Batch stats | `./scripts/transcribe *.mp3 --stats-file ./stats/` | One stats file per input in dir |
| Template naming | `./scripts/transcribe audio.mp3 -o ./out/ --output-template "{stem}_{lang}.{ext}"` | Custom batch output filenames |
| Stdin input | `ffmpeg -i input.mp4 -f wav - \| ./scripts/transcribe -` | Pipe audio directly from stdin |
| Custom model dir | `./scripts/transcribe audio.mp3 --model-dir ~/my-models` | Custom HuggingFace cache dir |
| Local model | `./scripts/transcribe audio.mp3 -m ./my-model-ct2` | CTranslate2 model dir |
| HTML transcript | `./scripts/transcribe audio.mp3 --format html -o out.html` | Confidence-colored |
| Burn subtitles | `./scripts/transcribe video.mp4 --burn-in output.mp4` | Requires ffmpeg + video input |
| Name speakers | `./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob"` | Replaces SPEAKER_1/2 |
| Filter hallucinations | `./scripts/transcribe audio.mp3 --filter-hallucinations` | Removes artifacts |
| Keep temp files | `./scripts/transcribe https://... --keep-temp` | For URL re-processing |
| Parallel batch | `./scripts/transcribe *.mp3 --parallel 4 -o ./out/` | CPU multi-file |
| RTX 3070 recommended | `./scripts/transcribe audio.mp3 --compute-type int8_float16` | Saves ~1GB VRAM, minimal quality loss |
| CPU thread count | `./scripts/transcribe audio.mp3 --threads 8` | Force CPU thread count (default: auto) |
| Podcast RSS (latest 5) | `./scripts/transcribe --rss https://feeds.example.com/podcast.xml` | Downloads & transcribes newest 5 episodes |
| Podcast RSS (all episodes) | `./scripts/transcribe --rss https://... --rss-latest 0 -o ./episodes/` | All episodes, one file each |
| Podcast + SRT subtitles | `./scripts/transcribe --rss https://... --format srt -o ./subs/` | Subtitle all episodes |
| Retry on failure | `./scripts/transcribe *.mp3 --retries 3 -o ./out/` | Retry up to 3× with backoff on error |
| CSV output | `./scripts/transcribe audio.mp3 --format csv -o out.csv` | Spreadsheet-ready with header row; properly quoted |
| CSV with speakers | `./scripts/transcribe audio.mp3 --diarize --format csv -o out.csv` | Adds speaker column |
| Language map (inline) | `./scripts/transcribe *.mp3 --language-map "interview*.mp3=en,lecture.wav=fr"` | Per-file language in batch |
| Language map (JSON) | `./scripts/transcribe *.mp3 --language-map @langs.json` | JSON file: `{"pattern": "lang"}` |
| Batch with ETA | `./scripts/transcribe *.mp3 -o ./out/` | Automatic ETA shown for each file in batch |
| TTML subtitles | `./scripts/transcribe audio.mp3 --format ttml -o subtitles.ttml` | Broadcast-standard DFXP/TTML (Netflix, BBC, Amazon) |
| TTML with speaker labels | `./scripts/transcribe audio.mp3 --diarize --format ttml -o subtitles.ttml` | Speaker-labeled TTML |
| Search transcript | `./scripts/transcribe audio.mp3 --search "keyword"` | Find timestamps where keyword appears |
| Search to file | `./scripts/transcribe audio.mp3 --search "keyword" -o results.txt` | Save search results |
| Fuzzy search | `./scripts/transcribe audio.mp3 --search "aproximate" --search-fuzzy` | Approximate/partial matching |
| Detect chapters | `./scripts/transcribe audio.mp3 --detect-chapters` | Auto-detect chapters from silence gaps |
| Chapter gap tuning | `./scripts/transcribe audio.mp3 --detect-chapters --chapter-gap 5` | Chapters on gaps ≥5s (default: 8s) |
| Chapters to file | `./scripts/transcribe audio.mp3 --detect-chapters --chapters-file ch.txt` | Save YouTube-format chapter list |
| Chapters JSON | `./scripts/transcribe audio.mp3 --detect-chapters --chapter-format json` | Machine-readable chapter list |
| Export speaker audio | `./scripts/transcribe audio.mp3 --diarize --export-speakers ./speakers/` | Save each speaker's audio to separate WAV files |
| Multi-format output | `./scripts/transcribe audio.mp3 --format srt,text -o ./out/` | Write SRT + TXT in one pass |
| Remove filler words | `./scripts/transcribe audio.mp3 --clean-filler` | Strip um/uh/er/ah/hmm and discourse markers |
| Left channel only | `./scripts/transcribe audio.mp3 --channel left` | Extract left stereo channel before transcribing |
| Right channel only | `./scripts/transcribe audio.mp3 --channel right` | Extract right stereo channel |
| Max chars per line | `./scripts/transcribe audio.mp3 --format srt --max-chars-per-line 42` | Character-based subtitle wrapping |
| Detect paragraphs | `./scripts/transcribe audio.mp3 --detect-paragraphs` | Insert paragraph breaks in text output |
| Paragraph gap tuning | `./scripts/transcribe audio.mp3 --detect-paragraphs --paragraph-gap 5.0` | Tune gap threshold (default 3.0s) |

Model Selection

Choose the right model for your needs:

CODEBLOCK0

Model Table

Standard Models (Full Whisper)

Model	Size	Speed	Accuracy	Use Case
INLINECODE177 / INLINECODE178	39M	Fastest	Basic	Quick drafts
INLINECODE179 / INLINECODE180

Distilled Models (~6x Faster, ~1% WER difference)

| Model | Size | Speed vs Standard | Accuracy | Use Case |
|---|---|---|---|---|
| `distil-large-v3.5` | 756M | ~6.3x faster | 7.08% WER | Default, best balance |
| INLINECODE188 | 756M | ~6.3x faster | 7.53% WER | Previous default |
| `distil-large-v2` | 756M | ~5.8x faster | 10.1% WER | Fallback |
| `distil-medium.en` | 394M | ~6.8x faster | 11.1% WER | English-only, resource-constrained |
| `distil-small.en` | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |

INLINECODE192 models are English-only and slightly faster/better for English content.

Note for distil models: HuggingFace recommends disabling condition_on_previous_text for all distil models to prevent repetition loops. The script auto-applies --no-condition-on-previous-text whenever a distil-* model is detected. Pass --condition-on-previous-text to override if needed.

Custom & Fine-tuned Models

WhisperModel accepts local CTranslate2 model directories and HuggingFace repo names — no code changes needed.

Load a local CTranslate2 model

CODEBLOCK1

Convert a HuggingFace model to CTranslate2

CODEBLOCK2

Load a model by HuggingFace repo name (auto-downloads)

CODEBLOCK3

Custom model cache directory

By default, models are cached in ~/.cache/huggingface/. Use --model-dir to override:

CODEBLOCK4

Setup

Linux / macOS / WSL2

CODEBLOCK5

Requirements:

- Python 3.10+
- ffmpeg is not required for basic transcription; PyAV (bundled with faster-whisper) handles audio decoding. ffmpeg is only needed for --burn-in, --normalize, and --denoise.
- Optional: yt-dlp (for URL/YouTube input)
- Optional: pyannote.audio (for --diarize, installed via setup.sh --diarize)

Platform Support

Platform	Acceleration	Speed
Linux + NVIDIA GPU	CUDA	~20x realtime 🚀
WSL2 + NVIDIA GPU

\*faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.

GPU Support (IMPORTANT!)

The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available — CPU transcription is extremely slow.

| Hardware | Speed | 9-min video |
|---|---|---|
| RTX 3070 (GPU) | ~20x realtime | ~27 sec |
| CPU (int8) | ~0.3x realtime | ~30 min |

RTX 3070 tip: Use --compute-type int8_float16 for hybrid quantization — saves ~1GB VRAM with minimal quality loss. Ideal for running diarization alongside transcription.

If setup didn't detect your GPU, manually install PyTorch with CUDA:

CODEBLOCK6

- WSL2 users: Ensure you have the NVIDIA CUDA drivers for WSL installed on Windows

Usage

CODEBLOCK7

Options

CODEBLOCK8

Output Formats

Text (default)

Plain transcript text. With --diarize, speaker labels are inserted:

CODEBLOCK9

JSON (`--format json`)

Full metadata including segments, timestamps, language detection, and performance stats:

CODEBLOCK10

SRT (`--format srt`)

Standard subtitle format for video players:

CODEBLOCK11

VTT (`--format vtt`)

WebVTT format for web video players:

CODEBLOCK12

TSV (`--format tsv`)

Tab-separated values, OpenAI Whisper–compatible. Columns: start_ms, end_ms, text:

CODEBLOCK13

Useful for piping into other tools or spreadsheets. No header row.
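
As an illustration of consuming this format from another tool, the sketch below reads header-less TSV rows back into Python tuples. The `parse_tsv` helper and the sample rows are hypothetical; the only facts taken from the text are the three columns (start_ms, end_ms, text) and the absence of a header row.

```python
def parse_tsv(tsv_text: str):
    """Parse header-less 'start_ms<TAB>end_ms<TAB>text' rows.

    Returns a list of (start_seconds, end_seconds, text) tuples.
    """
    rows = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        # Split on the first two tabs only, so the text column stays intact.
        start_ms, end_ms, text = line.split("\t", 2)
        rows.append((int(start_ms) / 1000, int(end_ms) / 1000, text))
    return rows


# Hypothetical sample rows, for illustration only.
sample = "0\t2500\tHello and welcome.\n2500\t4040\tLet's get started.\n"
for start, end, text in parse_tsv(sample):
    print(f"{start:7.2f} -> {end:7.2f}  {text}")
```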

ASS/SSA (`--format ass`)

Advanced SubStation Alpha format — supported by Aegisub, VLC, mpv, MPC-HC, and most video editors. Offers richer styling than SRT (font, size, color, position) via the [V4+ Styles] section:

CODEBLOCK14

Timestamps use H:MM:SS.cc (centiseconds). Edit the [V4+ Styles] block in Aegisub to customise font, color, and position without re-transcribing.
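
For readers post-processing ASS files, the H:MM:SS.cc layout can be produced from a float number of seconds as in the sketch below. The `ass_timestamp` helper is hypothetical and is not taken from the script; it rounds to the nearest centisecond and lets the rounding carry into seconds and minutes.

```python
def ass_timestamp(seconds: float) -> str:
    """Format seconds as H:MM:SS.cc, where cc is centiseconds."""
    cs = round(seconds * 100)           # total centiseconds
    hours, cs = divmod(cs, 360_000)     # 100 cs * 3600 s
    minutes, cs = divmod(cs, 6_000)     # 100 cs * 60 s
    secs, cs = divmod(cs, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


print(ass_timestamp(3725.5))
print(ass_timestamp(59.999))
```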

LRC (`--format lrc`)

Timed lyrics format used by music players (e.g., Foobar2000, VLC, AIMP). Timestamps use [mm:ss.xx] where xx = centiseconds:

CODEBLOCK15

With diarization, speaker labels are included:

CODEBLOCK16

Default file extension: .lrc. Useful for music transcription, karaoke, and any workflow requiring timed text with music-player compatibility.

Speaker Diarization

Identifies who spoke when using pyannote.audio.

Setup:

CODEBLOCK17

Requirements:

- HuggingFace token at ~/.cache/huggingface/token (huggingface-cli login)
- Accepted model agreements:
  - https://hf.co/pyannote/speaker-diarization-3.1
  - https://hf.co/pyannote/segmentation-3.0

Usage:

CODEBLOCK18

Speakers are labeled SPEAKER_1, SPEAKER_2, etc. in order of first appearance. Diarization runs on GPU automatically if CUDA is available.

Precise Word Timestamps

Whenever word-level timestamps are computed (--word-timestamps, --diarize, or --min-confidence), a wav2vec2 forced alignment pass automatically refines them from Whisper's ~100-200ms accuracy to ~10ms. No extra flag needed.

CODEBLOCK19

Uses the MMS (Massively Multilingual Speech) model from torchaudio — supports 1000+ languages. The model is cached after first load, so batch processing stays fast.

URL & YouTube Input

Pass any URL as input — audio is downloaded automatically via yt-dlp:

CODEBLOCK20

Requires yt-dlp (checks PATH and ~/.local/share/pipx/venvs/yt-dlp/bin/yt-dlp).

Batch Processing

Process multiple files at once with glob patterns, directories, or multiple paths:

CODEBLOCK21

When outputting to a directory, files are named {input-stem}.{ext} (e.g., audio.mp3 → audio.srt).
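
The naming rule can be sketched as follows. This is an assumption-laden illustration: `output_path` is a hypothetical helper, and the `{stem}`, `{lang}`, and `{ext}` placeholders are borrowed from the `--output-template` example in the Quick Reference; the script may support a different set.

```python
from pathlib import Path


def output_path(input_file, out_dir, ext, template="{stem}.{ext}", lang=""):
    """Build the output file path for one input file in batch mode."""
    stem = Path(input_file).stem  # file name without directory or extension
    return Path(out_dir) / template.format(stem=stem, ext=ext, lang=lang)


print(output_path("recordings/audio.mp3", "out", "srt"))
print(output_path("recordings/audio.mp3", "out", "srt",
                  template="{stem}_{lang}.{ext}", lang="en"))
```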

Batch mode prints a summary after all files complete:

CODEBLOCK22

Workflows

End-to-end pipelines for common use cases.

Podcast Transcription Pipeline

Fetch and transcribe the latest 5 episodes from any podcast RSS feed:

CODEBLOCK23

Meeting Notes Pipeline

Transcribe a meeting recording with speaker labels, then output clean text:

CODEBLOCK24

Video Subtitle Pipeline

Generate ready-to-use subtitles for a video file:

CODEBLOCK25

YouTube Batch Pipeline

Transcribe multiple YouTube videos at once:

CODEBLOCK26

Noisy Audio Pipeline

Clean up poor-quality recordings before transcribing:

CODEBLOCK27

Batch Recovery Pipeline

Process a large folder with retries — safe to re-run after failures:

CODEBLOCK28

Server Mode (OpenAI-Compatible API)

speaches runs faster-whisper as an OpenAI-compatible /v1/audio/transcriptions endpoint — drop-in replacement for OpenAI Whisper API with streaming, Docker support, and live transcription.

Quick start (Docker)

CODEBLOCK29

Test it

CODEBLOCK30

Use with any OpenAI SDK

CODEBLOCK31

Useful when you want to expose transcription as a local API for other tools (Home Assistant, n8n, custom apps).

Common Mistakes

Mistake	Problem	Solution
Using CPU when GPU available	10-20x slower transcription	Check `nvidia-smi`; verify CUDA installation
Not specifying language

Performance Notes

- First run: Downloads model to ~/.cache/huggingface/ (one-time)
- Batched inference: Enabled by default via BatchedInferencePipeline; ~3x faster than standard mode; VAD on by default
- GPU: Automatically uses CUDA if available
- Quantization: INT8 used on CPU for ~4x speedup with minimal accuracy loss
- Performance stats: Every transcription shows audio duration, processing time, and realtime factor
- Benchmark (RTX 3070, 21-min file): ~24s with batched inference (both distil-large-v3 and v3.5) vs ~69s without
- --precise overhead: Adds ~5-10s for wav2vec2 model load + alignment (model cached for batch)
- Diarization overhead: Adds ~10-30s depending on audio length (runs on GPU if available)
- Memory:
  - distil-large-v3: ~2GB RAM / ~1GB VRAM
  - large-v3-turbo: ~4GB RAM / ~2GB VRAM
  - tiny/base: <1GB RAM
  - Diarization: additional ~1-2GB VRAM
- OOM: Lower --batch-size (try 4) if you hit out-of-memory errors
- Pre-convert to WAV (optional): ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav converts to 16kHz mono WAV before transcription. Benefit is minimal (~5%) for one-off use since PyAV decodes efficiently; it is most useful when re-processing the same file multiple times (research/experiments) or when a format causes PyAV decode issues. Note: --normalize and --denoise already perform this conversion automatically.
- Silero VAD V6: faster-whisper 1.2.1 upgraded to Silero VAD V6 (improved speech detection). Run ./setup.sh --update to get it.
- Batched silence removal: faster-whisper 1.2.0+ automatically removes silence in BatchedInferencePipeline (used by default). Upgrade with ./setup.sh --update to get this if you installed before August 2024.
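
The "Nx realtime" figures quoted in this document are plain ratios of audio duration to processing time. The sketch below shows the arithmetic; both helper names are hypothetical, and the inputs are the numbers already stated above (a 21-minute file in ~24s, a 10-minute file at ~20x, a 9-minute video at ~0.3x).

```python
def realtime_speed(audio_seconds: float, processing_seconds: float) -> float:
    """Speed factor: how many seconds of audio are processed per second."""
    return audio_seconds / processing_seconds


def estimated_processing_time(audio_seconds: float, speed: float) -> float:
    """Expected wall-clock seconds for a file at a given speed factor."""
    return audio_seconds / speed


# 21-minute benchmark file transcribed in ~24s
print(f"{realtime_speed(21 * 60, 24):.1f}x realtime")
# 10-minute file at ~20x realtime
print(f"{estimated_processing_time(10 * 60, 20):.0f} s")
# 9-minute video on CPU at ~0.3x realtime
print(f"{estimated_processing_time(9 * 60, 0.3) / 60:.0f} min")
```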

Why faster-whisper?

- Speed: ~4-6x faster than OpenAI's original Whisper
- Accuracy: Identical (uses same model weights)
- Efficiency: Lower memory usage via quantization
- Production-ready: Stable C++ backend (CTranslate2)
- Distilled models: ~6x faster with <1% accuracy loss
- Subtitles: Native SRT/VTT/HTML output
- Precise alignment: Automatic wav2vec2 refinement (~10ms word boundaries)
- Diarization: Optional speaker identification via pyannote; --speaker-names maps to real names
- URLs: Direct YouTube/URL input; --keep-temp preserves downloads for re-use
- Custom models: Load local CTranslate2 dirs or HuggingFace repos; --model-dir controls cache
- Quality control: --filter-hallucinations strips music/applause markers and duplicates
- Parallel batch: --parallel N for multi-threaded batch processing
- Subtitle burn-in: --burn-in overlays subtitles directly into video via ffmpeg

v1.5.0 New Features

Multi-format output:

- --format srt,text: write multiple formats in one pass (e.g. SRT + plain text simultaneously)
- Comma-separated list accepted: srt,vtt,json, srt,text, etc.
- Requires -o <dir> when writing multiple formats; single format unchanged

Filler word removal:

- --clean-filler: strip hesitation sounds (um, uh, er, ah, hmm, hm) and discourse markers (you know, I mean, you see) from transcript text; off by default
- Conservative regex matching at word boundaries to avoid false positives
- Segments that become empty after cleaning are dropped automatically
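
A rough idea of what word-boundary filler removal looks like is sketched below. This is not the script's actual regex: the pattern, the `clean_filler` helper, and the punctuation handling are assumptions built from the word lists above. Note that a pattern this simple will also remove literal uses of phrases such as "you see".

```python
import re

# Hesitation sounds and discourse markers listed above; \b keeps "um"
# from matching inside words such as "umbrella".
HESITATIONS = r"\b(?:um+|uh+|er+|ah+|hm+)\b"
DISCOURSE = r"\b(?:you know|I mean|you see)\b"
FILLER_RE = re.compile(rf"(?:{HESITATIONS}|{DISCOURSE})[,.]?\s*", re.IGNORECASE)


def clean_filler(segments):
    """Strip fillers from each segment text and drop segments left empty."""
    cleaned = []
    for text in segments:
        text = FILLER_RE.sub("", text)
        text = re.sub(r"\s{2,}", " ", text).strip(" ,")
        if text:
            cleaned.append(text)
    return cleaned


print(clean_filler([
    "Um, I think, uh, we should go.",
    "Uh, hmm.",
    "The umbrella is here, you know.",
]))
```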

Stereo channel selection:

- --channel left|right|mix: extract a specific stereo channel before transcribing (default: mix)
- Useful for dual-track recordings (interviewer on left, interviewee on right)
- Uses ffmpeg pan filter; falls back gracefully to full mix if ffmpeg not found

Character-based subtitle wrapping:

- --max-chars-per-line N: split subtitle cues so each line fits within N characters
- Works for SRT, VTT, ASS, and TTML formats; takes priority over INLINECODE270
- Requires word-level timestamps; falls back to full segment if no word data
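
The reason word-level timestamps are required becomes clear in a sketch: each new cue needs its own start and end time, taken from its first and last word. The greedy `wrap_cue` helper below is a hypothetical illustration and not the script's algorithm; the `(start, end, word)` tuple layout is an assumption.

```python
def wrap_cue(words, max_chars):
    """Greedily group (start, end, word) tuples into cues whose text is at
    most `max_chars` characters. A single word longer than the limit gets
    a cue of its own rather than being split.
    """
    cues, current, length = [], [], 0

    def flush():
        text = " ".join(w for _, _, w in current)
        cues.append((current[0][0], current[-1][1], text))

    for start, end, word in words:
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > max_chars:
            flush()
            current, length = [], 0
            extra = len(word)
        current.append((start, end, word))
        length += extra
    if current:
        flush()
    return cues


# Hypothetical word timings, for illustration only.
words = [(0.0, 0.4, "The"), (0.4, 0.9, "quick"), (0.9, 1.3, "brown"),
         (1.3, 1.6, "fox"), (1.6, 2.0, "jumps")]
for cue in wrap_cue(words, max_chars=11):
    print(cue)
```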

Paragraph detection:

- --detect-paragraphs: insert \n\n paragraph breaks in text output at natural boundaries
- INLINECODE273: minimum silence gap for a paragraph (default: 3.0s)
- Also detects paragraph breaks when the previous segment ends a sentence and gap ≥ 1.5s
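
The two rules above (a long gap, or a shorter gap after a finished sentence) can be combined as in this sketch. The `join_paragraphs` helper, the `(start, end, text)` tuple layout, and the choice of `.`, `!`, `?` as sentence endings are assumptions; only the 3.0s and 1.5s thresholds come from the text.

```python
def join_paragraphs(segments, paragraph_gap=3.0, sentence_gap=1.5):
    """Join (start, end, text) segments into text, inserting a blank line
    where the silence before a segment suggests a new paragraph."""
    parts = []
    prev_end, prev_text = None, ""
    for start, end, text in segments:
        text = text.strip()
        if prev_end is not None:
            gap = start - prev_end
            ends_sentence = prev_text.endswith((".", "!", "?"))
            if gap >= paragraph_gap or (ends_sentence and gap >= sentence_gap):
                parts.append("\n\n")
            else:
                parts.append(" ")
        parts.append(text)
        prev_end, prev_text = end, text
    return "".join(parts)


# Hypothetical segments, for illustration only.
segments = [
    (0.0, 2.0, "Welcome to the show."),
    (2.2, 4.0, "Today we talk about"),
    (6.0, 8.0, "speech recognition."),
    (10.0, 12.0, "First, some history."),
    (15.5, 17.0, "Then, the demo."),
]
print(join_paragraphs(segments))
```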

Subtitle formats:

- --format ass: Advanced SubStation Alpha (Aegisub, VLC, mpv, MPC-HC)
- INLINECODE275: Timed lyrics format for music players
- INLINECODE276: Confidence-colored HTML transcript (green/yellow/red per word)
- INLINECODE277: W3C TTML 1.0 (DFXP) broadcast standard (Netflix, Amazon Prime, BBC)
- INLINECODE278: Spreadsheet-ready CSV with header row; RFC 4180 quoting; speaker column when diarized

Transcript tools:

- --search TERM: Find all timestamps where a word/phrase appears; replaces normal output; -o to save
- INLINECODE282: Approximate/partial matching with INLINECODE283
- INLINECODE284: Auto-detect chapter breaks from silence gaps; --chapter-gap SEC (default 8s)
- INLINECODE286: Write chapters to file instead of stdout; INLINECODE287
- INLINECODE288: After --diarize, save each speaker's turns as separate WAV files via ffmpeg

Batch improvements:

- ETA: `[N/total] filename | ETA: Xm Ys` shown before each file in sequential batch; no flag needed
- INLINECODE291: Per-file language override; fnmatch glob patterns; @file.json form
- INLINECODE293: Retry failed files with exponential backoff; failed-file summary at end
- INLINECODE294: Transcribe podcast RSS feeds; --rss-latest N for episode count
- INLINECODE296 / --parallel N / --output-template / --stats-file / INLINECODE300
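
For readers unfamiliar with exponential backoff, the retry behaviour mentioned in this list can be sketched as follows. The `with_retries` helper, the 1-second base delay, and the doubling factor are assumptions for illustration; the script's actual delays are not documented here.

```python
import time


def with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt seconds and try
    again, for at most `retries` retries. The last error is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a task that fails twice before succeeding.
# Delays are recorded instead of slept so the example runs instantly.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "ok"


print(with_retries(flaky, retries=3, sleep=delays.append), delays)
```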

Model & inference:

- distil-large-v3.5 default (replaced distil-large-v3)
- Auto-disables condition_on_previous_text for distil models (prevents repetition loops)
- INLINECODE303 to override; --log-level for library debug output
- INLINECODE305: Custom HuggingFace cache dir; local CTranslate2 model support
- INLINECODE306, --chunk-length, --length-penalty, --repetition-penalty, INLINECODE310
- INLINECODE311, --stream, --progress, --best-of, --patience, INLINECODE316
- INLINECODE317, --prefix, --revision, --suppress-tokens, INLINECODE321

Speaker & quality:

- --speaker-names "Alice,Bob": Replace SPEAKER_1/2 with real names (requires --diarize)
- --filter-hallucinations: Remove music/applause markers, duplicates, "Thank you for watching"
- --burn-in OUTPUT: Burn subtitles into video via ffmpeg
- --keep-temp: Preserve URL-downloaded audio for re-processing
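
As a rough picture of what replacing SPEAKER_1/2 with real names involves, the sketch below maps diarization labels to a comma-separated name list in order. The segment shape and function name are hypothetical, and labels without a supplied name are simply left unchanged, which may not match the skill's behaviour.

```python
def apply_speaker_names(segments, names):
    """Replace SPEAKER_1, SPEAKER_2, ... with the given names, in order."""
    name_list = [n.strip() for n in names.split(",") if n.strip()]
    mapping = {f"SPEAKER_{i}": name for i, name in enumerate(name_list, start=1)}
    renamed = []
    for seg in segments:
        label = seg.get("speaker")
        if label:
            # Labels with no supplied name keep their original value.
            renamed.append({**seg, "speaker": mapping.get(label, label)})
        else:
            renamed.append(dict(seg))
    return renamed


# Hypothetical diarized segments for illustration only.
diarized_segments = [
    {"speaker": "SPEAKER_1", "text": "Hi."},
    {"speaker": "SPEAKER_2", "text": "Hello."},
    {"speaker": "SPEAKER_3", "text": "Hey."},
]
for seg in apply_speaker_names(diarized_segments, "Alice,Bob"):
    print(f"{seg['speaker']}: {seg['text']}")
```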

Setup:

- setup.sh --check: System diagnostic covering GPU, CUDA, Python, ffmpeg, pyannote, and HuggingFace token (completes in ~12s)
- ffmpeg is no longer required for basic transcription (PyAV handles decoding); skill.json now lists ffmpeg under optionalBins

Troubleshooting

- "CUDA not available — using CPU": Install PyTorch with CUDA (see GPU Support above)
- Setup fails: Make sure Python 3.10+ is installed
- Out of memory: Use a smaller model, --compute-type int8, or --batch-size 4
- Slow on CPU: Expected; use a GPU for practical transcription
- Model download fails: Check permissions on ~/.cache/huggingface/
- Diarization model fails: Ensure a HuggingFace token exists and the model agreements are accepted, or pass the token directly with --hf-token hf_xxx
- URL download fails: Check that yt-dlp is installed (pipx install yt-dlp)
- No audio files in batch: Check that file extensions match the supported formats
- Check installed version: Run ./scripts/transcribe --version
- Upgrade faster-whisper: Run ./setup.sh --update (upgrades in place, no full reinstall)
- Hallucinations on silence/music: Try --temperature 0.0 --no-speech-threshold 0.8
- VAD splits speech incorrectly: Tune with --vad-threshold 0.3 (lower) or --min-silence-duration 30
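
A few of the checks above can be automated with the standard library alone. The following is a hypothetical helper, not part of the skill (the bundled setup.sh --check is the supported diagnostic); it only tests the Python version, whether ffmpeg and yt-dlp are on PATH, and whether the default HuggingFace cache directory is writable or not yet created.

```python
import os
import shutil
import sys
from pathlib import Path


def check_environment():
    """Return basic prerequisite checks as a dict of booleans."""
    cache_dir = Path.home() / ".cache" / "huggingface"
    return {
        "python_3_10_plus": sys.version_info >= (3, 10),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "yt_dlp_on_path": shutil.which("yt-dlp") is not None,
        # A missing cache directory is fine: it is created on first download.
        "hf_cache_usable": (not cache_dir.exists()) or os.access(cache_dir, os.W_OK),
    }


for check, passed in check_environment().items():
    print(f"{check}: {'ok' if passed else 'needs attention'}")
```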

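Faster Whisper

Local speech-to-text using faster-whisper, a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).

When to Use

Use this skill when you need to:

- Transcribe audio/video files: meetings, interviews, podcasts, lectures, YouTube videos
- Generate subtitles: SRT, VTT, ASS, LRC, or TTML broadcast-standard subtitles
- Identify speakers: diarization labels who said what (--diarize)
- Transcribe from URLs: YouTube links and direct audio URLs (auto-downloads via yt-dlp)
- Transcribe podcast feeds: --rss fetches and transcribes episodes
- Batch process files: glob patterns, directories, skip-existing support; ETA shown automatically
- Convert speech to text locally: no API costs, works offline (after model download)
- Translate to English: translate any language to English with --translate
- Do multilingual transcription: supports 99+ languages with auto-detection
- Transcribe a batch of files in different languages: --language-map assigns a different language per file
- Transcribe multilingual audio: --multilingual for mixed-language audio
- Transcribe audio with specific terms: use --initial-prompt for jargon-heavy content or any other terms to look out for
- Preprocess noisy audio: --normalize and --denoise before transcription
- Stream output: --stream shows segments as they're transcribed
- Clip time ranges: --clip-timestamps to transcribe specific sections
- Search the transcript: --search term finds all timestamps where a word/phrase appears
- Detect chapters: --detect-chapters finds section breaks from silence gaps
- Export speaker audio: --export-speakers DIR saves each speaker's turns as separate WAV files
- Spreadsheet output: --format csv produces a properly quoted CSV with timestamps

Trigger phrases:
"transcribe this audio", "convert speech to text", "what did they say", "make a transcript",
"audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English",
"find where X is mentioned", "search the transcript", "when did they say", "at what timestamp",
"add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording",
"TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format",
"ASS subtitles", "aegisub format", "Advanced SubStation Alpha", "mpv subtitles",
"LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics",
"HTML transcript", "confidence-colored transcript", "color-coded transcript",
"separate audio per speaker", "export speaker audio", "split by speaker",
"transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed",
"batch process different languages", "per-file language",
"transcribe in multiple formats", "srt and txt at the same time", "output both srt and text",
"remove filler words", "clean up ums and uhs", "strip hesitation sounds", "remove you know and I mean",
"transcribe left channel", "transcribe right channel", "stereo channel", "left channel only",
"wrap subtitle lines", "character limit per line", "max characters per line",
"detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"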

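⚠️ Agent guidance: keep invocations minimal

Core rule: the default command (./scripts/transcribe audio.mp3) is the fastest path; add flags only when the user explicitly asks for that capability.

Transcription:

- Only add --diarize if the user asks who said what, or asks to identify or label speakers
- Only add --format srt/vtt/ass/lrc/ttml if the user asks for subtitles/captions in that format
- Only add --format csv if the user asks for CSV or spreadsheet output
- Only add --word-timestamps if the user needs word-level timestamps
- Only add --initial-prompt if there are domain-specific terms to prime
- Only add --translate if the user wants non-English audio translated to English
- Only add --normalize/--denoise if the user mentions poor audio quality or noise
- Only add --stream if the user wants live/progressive output for long files
- Only add --clip-timestamps if the user wants a specific time range
- Only add --temperature 0.0 if the model is hallucinating on music/silence
- Only add --vad-threshold if VAD is cutting speech too aggressively or including noise
- Only add --min-speakers/--max-speakers when you know the speaker count
- Only add --hf-token if the token is not cached at ~/.cache/huggingface/token
- Only add --max-words-per-line when long segments need better subtitle readability
- Only add --filter-hallucinations if the transcript contains obvious artifacts (music markers, duplicates)
- Only add --merge-sentences if the user asks for sentence-level subtitle cues
- Only add --clean-filler if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)
- Only add --channel left|right if the user mentions stereo tracks, dual-channel recordings, or asks for a specific channel
- Only add --max-chars-per-line N when the user specifies a character limit per subtitle line (e.g., Netflix format, 42 characters per line); takes priority over --max-words-per-line
- Only add --detect-paragraphs if the user asks for paragraph breaks or structured text output; add --paragraph-gap (default 3.0s) only if the user wants a custom gap
- Only add --speaker-names Alice,Bob when the user provides real names to replace SPEAKER_1/2; always requires --diarize
- Only add --hotwords WORDS when the user names specific rare terms that --initial-prompt does not handle well; prefer --initial-prompt for general domain terms
- Only add --prefix TEXT when the user knows the exact words the audio starts with
- Only add --detect-language-only when the user only wants to identify the language, not transcribe
- Only add --stats-file PATH if the user asks for performance stats, RTF, or benchmark info
- Only add --parallel N for large CPU batch jobs; the GPU already handles a single file efficiently, so don't add it for single files or small batches
- Only add --retries N for unreliable inputs (URLs, network files) where transient failures are expected
- Only add --burn-in OUTPUT when the user explicitly asks to embed/burn subtitles into the video; requires ffmpeg and a video file input
- Only add --keep-temp when the user may re-process the same URL, to avoid re-downloading
- Only add --output-template when the user specifies a custom naming pattern in batch mode
- Multi-format output (--format srt,text): only when the user explicitly wants multiple formats in one pass; always pair with -o
- Any word-level feature automatically runs wav2vec2 alignment (~5-10s overhead)
- --diarize adds ~20-30s on top of that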
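
The "start from the default command and add flags only on request" rule can be pictured as a small command builder. This is a hypothetical helper for illustration; only the script path and the flags themselves come from the guidance above, and it covers just three of the listed options.

```python
def build_command(audio, diarize=False, output_format=None, translate=False):
    """Start from the default command and append only the requested flags."""
    command = ["./scripts/transcribe", audio]
    if diarize:
        command.append("--diarize")
    if output_format:
        command += ["--format", output_format]
    if translate:
        command.append("--translate")
    return command


# Default request: no extra flags at all.
print(" ".join(build_command("audio.mp3")))
# The user asked for SRT subtitles with speaker labels.
print(" ".join(build_command("audio.mp3", diarize=True, output_format="srt")))
```

Keeping every option off by default makes the fastest path the one that requires no decisions.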

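Search:

- Only add --search term when the user asks to find/locate/search for a specific word or phrase in the audio
- --search replaces the normal transcript output; it prints only the matching segments with timestamps
- Only add --search-fuzzy when the user mentions approximate/partial matching or typos
- To save search results to a file, use -o results.txt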
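
Transcript search with an optional fuzzy mode can be approximated as below. This is a sketch under assumptions, not the skill's algorithm: exact mode is a case-insensitive substring test, fuzzy mode compares the term against individual words using `difflib.SequenceMatcher`, and the 0.8 threshold is an arbitrary illustrative value.

```python
from difflib import SequenceMatcher


def search_segments(segments, term, fuzzy=False, threshold=0.8):
    """Return (start, text) pairs for segments that mention `term`."""
    needle = term.lower()
    hits = []
    for seg in segments:
        text = seg["text"]
        lowered = text.lower()
        matched = needle in lowered
        if not matched and fuzzy:
            # Fuzzy mode: accept any single word that is similar enough.
            matched = any(
                SequenceMatcher(None, needle, word.strip(".,!?;:")).ratio() >= threshold
                for word in lowered.split()
            )
        if matched:
            hits.append((seg["start"], text.strip()))
    return hits


# Hypothetical segments; the last one contains a misspelling.
search_example = [
    {"start": 0.0, "text": "Welcome to the show."},
    {"start": 12.5, "text": "Kubernetes is hard."},
    {"start": 30.0, "text": "We deploy on kubernets daily."},
]
print(search_segments(search_example, "kubernetes"))
print(search_segments(search_example, "kubernetes", fuzzy=True))
```

In this sketch multi-word phrases are only matched exactly; fuzzy matching applies to single words.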

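Chapter detection:

- Only add --detect-chapters when the user asks for chapters, sections, a table of contents, or where the topic changes
- The default --chapter-gap 8 (8 seconds of silence = new chapter) works for most podcasts/lectures; lower it for dense content
- --chapter-format youtube (default) outputs YouTube-ready timestamps; use json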

