YouTube AnyCaption Summarizer
The YouTube summarizer that still works when captions are broken, missing, or inconsistent.
Outputs: raw markdown transcript + polished markdown summary + session-ready result block.
Unlike caption-only tools, this skill still works when subtitles are missing by falling back to local Whisper transcription.
Generate a raw transcript markdown file and a polished summary markdown file from one or more YouTube videos.
This skill is self-contained. It does not require any other YouTube summarizer skill or prior workflow context.
Best for
- founder videos, operator walkthroughs, and technical explainers
- long tutorial videos that need transcript + implementation summary
- private/internal YouTube uploads that may require cookies
- mixed-caption environments where some videos have CC, some only have auto-captions, and some have no usable subtitles
- batch research workflows where many YouTube links need standardized markdown outputs
- users who want reliable markdown artifacts, not just a one-off chat summary
Why choose this over simpler transcript skills?
- manual CC first, auto-captions second, local Whisper fallback last
- keeps working when subtitle coverage is weak or missing
- supports private/restricted YouTube videos via cookies
- returns durable markdown artifacts, not just chat text
- supports batch processing and session-ready completion reporting
Install dependencies
For a fresh macOS setup, new users should be able to copy-paste the following exactly:
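```bash
brew install yt-dlp ffmpeg whisper-cpp
MODELS_DIR="$HOME/.openclaw/workspace"
MODEL_PATH="$MODELS_DIR/ggml-medium.bin"
mkdir -p "$MODELS_DIR"
if [ ! -f "$MODEL_PATH" ]; then
  curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin" \
    -o "$MODEL_PATH.part" && mv "$MODEL_PATH.part" "$MODEL_PATH"
else
  echo "Model already exists at $MODEL_PATH; leaving it unchanged."
fi
command -v python3 yt-dlp ffmpeg whisper-cli
ls -lh "$MODEL_PATH"
```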
What this does:
- installs yt-dlp, ffmpeg, and whisper-cli
- creates the default models directory used by this skill if it does not already exist: ~/.openclaw/workspace
- downloads the default Whisper model file only if it is missing
- avoids touching ~/.openclaw/openclaw.json or any other OpenClaw config file
- does not delete, replace, or overwrite other files in your existing workspace folder
- verifies that the required binaries and model file are present
If you want to store models elsewhere, pass --models-dir /path/to/models when running the workflow.
Example requests
- “Summarize this YouTube video into markdown.”
- “Generate a transcript and polished summary for this YouTube link.”
- “Process this private YouTube video with my browser cookies.”
- “Batch summarize these YouTube links and give me transcript + summary files.”
- “Use subtitles when available, otherwise transcribe locally.”
- “Create a Chinese summary from this English YouTube video.”
Quick start
Single video
```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID"
```
This creates a dedicated per-video folder, writes the raw transcript markdown, creates the summary placeholder markdown, and prints JSON describing the outputs plus the exact follow-up commands/prompts needed to finish the summary step.
Important: the workflow script alone is not the finished deliverable. The current OpenClaw session must still:
1. infer/backfill the language if the workflow left it as unknown
2. overwrite the placeholder Summary.md with a real polished summary
3. run scripts/complete_youtube_summary.py to validate/finalize the result
Force simplified Chinese summary
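```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --summary-language zh-CN
```
Restricted video with cookies
```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --cookies /path/to/cookies.txt
```
or
```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --cookies-from-browser chrome
```
Batch / queue mode
See references/batch-input-format.md.
```bash
python3 scripts/run_youtube_workflow.py --batch-file ./youtube-urls.txt
```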
Why this skill stands out
This skill is designed to keep working across the messy reality of YouTube:
- if a video has manual closed captions (CC), use them first
- if it only has auto-generated subtitles, use those next
- if it has no usable subtitles at all, fall back to local Whisper transcription
That makes it materially more reliable than caption-only workflows. It works well for caption-rich videos, caption-poor videos, and private/internal uploads where subtitle coverage is inconsistent.
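The fallback order described above can be sketched as a small selection function. This is only an illustration, not the code in scripts/run_youtube_workflow.py: `pick_transcript_source` and `TranscriptSource` are hypothetical names, and the input is assumed to be a metadata dict shaped like the JSON that yt-dlp prints with `--dump-json`, where `subtitles` and `automatic_captions` map language codes to tracks.

```python
from enum import Enum


class TranscriptSource(Enum):
    MANUAL_CC = "manual_cc"
    AUTO_CAPTIONS = "auto_captions"
    WHISPER = "whisper"


def pick_transcript_source(info: dict) -> TranscriptSource:
    """Hypothetical helper: pick the best available transcript source."""
    if info.get("subtitles"):            # manual closed captions exist
        return TranscriptSource.MANUAL_CC
    if info.get("automatic_captions"):   # only auto-generated subtitles exist
        return TranscriptSource.AUTO_CAPTIONS
    return TranscriptSource.WHISPER      # nothing usable: transcribe locally


if __name__ == "__main__":
    print(pick_transcript_source({"automatic_captions": {"en": [{}]}}))
```

A real implementation would likely also check that a track exists in a usable language and format before accepting it; the sketch only captures the priority order.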
Core capabilities:
- fetch YouTube metadata first and derive safe output paths
- support single-video mode and batch / queue mode
- handle manual CC, auto-generated subtitles, or no subtitles via subtitle-first extraction with local Whisper fallback
- support restricted/private videos via cookies or browser-cookie extraction
- normalize noisy transcript text before summarization
- create a placeholder summary file, overwrite it with the final summary, and finalize end-to-end timing
- clean up only known intermediates created by the workflow unless explicitly told otherwise
What this skill produces
For each video, create exactly one dedicated output folder containing these final deliverables:
- SANITIZED_VIDEO_NAME_transcript_raw.md
- SANITIZED_VIDEO_NAME_Summary.md
By default, delete only the known intermediate media, subtitle, and WAV files created by the workflow. Do not wipe unrelated files that may already exist in the per-video folder.
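To make the naming scheme concrete, here is a rough sketch of how a sanitized folder and the two deliverable paths might be derived from a title and video ID. The actual naming rules live in references/detailed-workflow.md and scripts/prepare_video_paths.py and are not reproduced here, so the sanitizing rule, the `max_len` limit, and the folder name format below are all assumptions, and `sanitize_name` and `derive_output_paths` are hypothetical names.

```python
import re
from pathlib import Path


def sanitize_name(title: str, max_len: int = 80) -> str:
    """Assumed rule: keep word characters and '-', collapse everything else to '_'."""
    cleaned = re.sub(r"[^\w\-]+", "_", title)
    return cleaned[:max_len].strip("_") or "untitled"


def derive_output_paths(parent: Path, title: str, video_id: str) -> dict:
    """Build the per-video folder and the two deliverable paths (assumed layout)."""
    name = sanitize_name(title)
    folder = parent / f"{name}_{video_id}"  # assumption: folder name carries the video ID
    return {
        "folder": folder,
        "transcript": folder / f"{name}_transcript_raw.md",
        "summary": folder / f"{name}_Summary.md",
    }


if __name__ == "__main__":
    paths = derive_output_paths(Path.home() / "Downloads", "How to: Build/Ship Fast!", "VIDEO_ID")
    for key, value in paths.items():
        print(key, value)
```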
Required local tools
Verify these tools exist before running the workflow:
- yt-dlp
- ffmpeg
- whisper-cli
- python3
The workflow also requires a supported Whisper ggml model file in the configured models directory.
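A pre-flight check along these lines could confirm the tools and the model file before a run. This is a sketch under assumptions: `missing_tools` and `model_present` are hypothetical helpers, not part of the bundled scripts, and the model file is assumed to be named like the default download (for example ggml-medium.bin).

```python
import shutil
from pathlib import Path

REQUIRED_TOOLS = ("yt-dlp", "ffmpeg", "whisper-cli", "python3")


def missing_tools(tools=REQUIRED_TOOLS) -> list:
    """Return the required binaries that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def model_present(models_dir: Path, model_name: str = "ggml-medium") -> bool:
    """Check for <models_dir>/<model_name>.bin (file naming is an assumption)."""
    return (models_dir / f"{model_name}.bin").is_file()


if __name__ == "__main__":
    print("missing tools:", missing_tools())
    print("model present:", model_present(Path.home() / ".openclaw" / "workspace"))
```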
Bundled scripts
Use these scripts directly:
- scripts/run_youtube_workflow.py: main deterministic workflow for metadata, download/subtitles, transcription, placeholder summary creation, cleanup, and workflow metadata emission
- scripts/backfill_detected_language.py: update transcript_raw.md, Summary.md, and workflow metadata after the current session LLM decides the major transcript language
- scripts/complete_youtube_summary.py: validate that Summary.md is no longer a placeholder, optionally backfill language, compute the final end-to-end timing report for one item, and emit a session-ready result block
- scripts/normalize_transcript_text.py: convert raw timestamped transcript text into cleaner summary input without modifying the raw transcript file
- scripts/finalize_youtube_summary.py: lower-level timing helper used by the completion flow
- scripts/prepare_video_paths.py: derive sanitized folder and output file paths from a title and video ID
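As an illustration of the normalization step, the sketch below strips leading timestamps and consecutive duplicate lines from raw transcript text without touching the source file. It is not the bundled normalizer: the timestamp shape (`[00:00:01.000 --> 00:00:04.000] text`) is an assumption about how the raw transcript looks, and `normalize_transcript` is a hypothetical name.

```python
import re

# Assumed line shape: "[00:00:01.000 --> 00:00:04.000] some text"
_TIMESTAMP = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}[.,]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[.,]\d{3}\]\s*"
)


def normalize_transcript(raw: str) -> str:
    """Strip leading timestamps, drop blank lines and consecutive repeats, join the rest."""
    cleaned = []
    for line in raw.splitlines():
        text = _TIMESTAMP.sub("", line).strip()
        if not text:
            continue
        if cleaned and cleaned[-1] == text:
            continue  # captions often repeat the same line across adjacent cues
        cleaned.append(text)
    return " ".join(cleaned)


if __name__ == "__main__":
    sample = "[00:00:00.000 --> 00:00:02.000] Hello there\n[00:00:02.000 --> 00:00:04.000] Hello there"
    print(normalize_transcript(sample))
```

Because the function returns a new string, the raw transcript file with its timestamps stays intact, which matches the stated behavior of the normalizer.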
Useful references:
- references/detailed-workflow.md: full operational workflow, completion rules, batch guidance, naming rules, and practical notes
- references/summary-template.md: required structure and writing rules for the final Summary.md
- references/session-output-template.md: required user-facing output format to return to the current OpenClaw session after completion
- references/batch-input-format.md: input format for queue / batch processing
Defaults
- Default parent output folder: ~/Downloads
- Default whisper model: ggml-medium
- Supported whisper models: ggml-base, ggml-small, ggml-medium
- Default media mode: audio-only
- Default transcript language: auto-detect if transcription is needed
- Default summary language: source
- Raw transcript keeps timestamps
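One way to picture these defaults is as a small configuration object that rejects unsupported models. This is a sketch only: `WorkflowDefaults` and its field names are hypothetical and are not taken from the bundled scripts, and the meaning given to the `source` summary language is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import Path

SUPPORTED_MODELS = ("ggml-base", "ggml-small", "ggml-medium")


@dataclass(frozen=True)
class WorkflowDefaults:
    parent_output_dir: Path = field(default_factory=lambda: Path.home() / "Downloads")
    whisper_model: str = "ggml-medium"
    media_mode: str = "audio-only"
    transcript_language: str = "auto"  # auto-detect when transcription is needed
    summary_language: str = "source"   # assumed to mean: same language as the video

    def __post_init__(self) -> None:
        if self.whisper_model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported Whisper model: {self.whisper_model}")


if __name__ == "__main__":
    print(WorkflowDefaults())
```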
Public workflow overview
At a high level, the skill does this:
1. fetch metadata first and create safe output paths
2. try manual subtitles, then auto-captions, then local Whisper fallback
3. write SANITIZED_VIDEO_NAME_transcript_raw.md
4. create SANITIZED_VIDEO_NAME_Summary.md as a placeholder
5. have the current OpenClaw session overwrite the placeholder with a real summary
6. run scripts/complete_youtube_summary.py to validate completion, backfill language if needed, and emit a session-ready result block
What counts as completion
For a normal end-to-end request, completion means all of the following are true:
1. the workflow script succeeded
2. if language was initially unknown, the language was backfilled into both markdown files
3. the placeholder summary file was overwritten with a real summary
4. scripts/complete_youtube_summary.py was run successfully
5. the user received the resulting output paths and timing/result status
If the workflow script succeeded but the summary/completion step did not happen yet, describe the state as partial/in-progress rather than complete.
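The distinction between complete and partial can be expressed as a simple state check. The sketch below is an assumption-laden illustration, not the logic in scripts/complete_youtube_summary.py: `completion_state` is a hypothetical function, and `PLACEHOLDER_MARKER` is an invented marker, since the real placeholder text is not shown here. The final criterion (the user actually received the paths and status) cannot be checked from files and is left out.

```python
from pathlib import Path

PLACEHOLDER_MARKER = "<!-- SUMMARY_PLACEHOLDER -->"  # hypothetical marker text


def completion_state(workflow_ok: bool, language: str,
                     summary_path: Path, completion_script_ok: bool) -> str:
    """Return 'failed', 'partial', or 'complete' for one video."""
    if not workflow_ok:
        return "failed"
    text = summary_path.read_text(encoding="utf-8") if summary_path.is_file() else ""
    summary_done = bool(text.strip()) and PLACEHOLDER_MARKER not in text
    if language != "unknown" and summary_done and completion_script_ok:
        return "complete"
    return "partial"  # workflow ran, but summary/completion steps are outstanding


if __name__ == "__main__":
    print(completion_state(True, "unknown", Path("Summary.md"), False))
```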
When to read the deeper references
Read these as needed:
- references/detailed-workflow.md when you need the full implementation contract, batch guidance, naming rules, cleanup rules, timing flow, or debugging details
- references/summary-template.md before writing the final polished Summary.md
- references/session-output-template.md before returning the final user-facing per-video result block
- references/batch-input-format.md when handling batch / queue input
Practical public promise
This skill is optimized for dependable end-to-end output, not just quick transcript extraction:
- raw transcript markdown
- polished summary markdown
- session-ready completion report
YouTube AnyCaption Summarizer
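The YouTube summarizer that still works when captions are broken, missing, or inconsistent.
Outputs: raw markdown transcript + polished markdown summary + session-ready result block.
Unlike caption-only tools, this skill still works when subtitles are missing by falling back to local Whisper transcription.
Generate a raw transcript markdown file and a polished summary markdown file from one or more YouTube videos.
This skill is self-contained. It does not require any other YouTube summarizer skill or prior workflow context.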
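Best for
- founder videos, operator walkthroughs, and technical explainers
- long tutorial videos that need transcript + implementation summary
- private/internal YouTube uploads that may require cookies
- mixed-caption environments where some videos have CC, some only have auto-captions, and some have no usable subtitles
- batch research workflows where many YouTube links need standardized markdown outputs
- users who want reliable markdown artifacts, not just a one-off chat summary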
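Why choose this over simpler transcript skills?
- manual CC first, auto-captions second, local Whisper fallback last
- keeps working when subtitle coverage is weak or missing
- supports private/restricted YouTube videos via cookies
- returns durable markdown artifacts, not just chat text
- supports batch processing and session-ready completion reporting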
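Install dependencies
For a fresh macOS setup, new users can copy and paste the following commands directly:
```bash
brew install yt-dlp ffmpeg whisper-cpp
MODELS_DIR="$HOME/.openclaw/workspace"
MODEL_PATH="$MODELS_DIR/ggml-medium.bin"
mkdir -p "$MODELS_DIR"
if [ ! -f "$MODEL_PATH" ]; then
  curl -L "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin" \
    -o "$MODEL_PATH.part" && mv "$MODEL_PATH.part" "$MODEL_PATH"
else
  echo "Model already exists at $MODEL_PATH; leaving it unchanged."
fi
command -v python3 yt-dlp ffmpeg whisper-cli
ls -lh "$MODEL_PATH"
```
What this does:
- installs yt-dlp, ffmpeg, and whisper-cli
- creates the default models directory used by this skill if it does not already exist: ~/.openclaw/workspace
- downloads the default Whisper model file only if it is missing
- avoids touching ~/.openclaw/openclaw.json or any other OpenClaw config file
- does not delete, replace, or overwrite other files in your existing workspace folder
- verifies that the required binaries and model file are present
If you want to store models elsewhere, pass --models-dir /path/to/models when running the workflow.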
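Example requests
- “Summarize this YouTube video into markdown.”
- “Generate a transcript and polished summary for this YouTube link.”
- “Process this private YouTube video with my browser cookies.”
- “Batch summarize these YouTube links and give me transcript + summary files.”
- “Use subtitles when available, otherwise transcribe locally.”
- “Create a Chinese summary from this English YouTube video.”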
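Quick start
Single video
```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID"
```
This creates a dedicated per-video folder, writes the raw transcript markdown, creates the summary placeholder markdown, and prints JSON describing the outputs plus the exact follow-up commands/prompts needed to finish the summary step.
Important: the workflow script alone is not the finished deliverable. The current OpenClaw session must still:
1. infer/backfill the language if the workflow left it as unknown
2. overwrite the placeholder Summary.md with a real polished summary
3. run scripts/complete_youtube_summary.py to validate/finalize the result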
Force simplified Chinese summary
```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --summary-language zh-CN
```
Restricted video with cookies
```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --cookies /path/to/cookies.txt
```
or
```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --cookies-from-browser chrome
```
Batch / queue mode
See references/batch-input-format.md.
```bash
python3 scripts/run_youtube_workflow.py --batch-file ./youtube-urls.txt
```
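Why this skill stands out
This skill is designed to keep working across the messy reality of YouTube:
- if a video has manual closed captions (CC), use them first
- if it only has auto-generated subtitles, use those next
- if it has no usable subtitles at all, fall back to local Whisper transcription
That makes it more reliable than caption-only workflows. It works well for caption-rich videos, caption-poor videos, and private/internal uploads where subtitle coverage is inconsistent.
Core capabilities:
- fetch YouTube metadata first and derive safe output paths
- support single-video mode and batch / queue mode
- handle manual CC, auto-generated subtitles, or no subtitles via subtitle-first extraction with local Whisper fallback
- support restricted/private videos via cookies or browser-cookie extraction
- normalize noisy transcript text before summarization
- create a placeholder summary file, overwrite it with the final summary, and finalize end-to-end timing
- clean up only known intermediates created by the workflow unless explicitly told otherwise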
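What this skill produces
For each video, create one dedicated output folder containing these final deliverables:
- SANITIZED_VIDEO_NAME_transcript_raw.md
- SANITIZED_VIDEO_NAME_Summary.md
By default, delete only the known intermediate media, subtitle, and WAV files created by the workflow. Do not delete unrelated files that may already exist in the per-video folder.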
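Required local tools
Verify these tools exist before running the workflow:
- yt-dlp
- ffmpeg
- whisper-cli
- python3
The workflow also requires a supported Whisper ggml model file in the configured models directory.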
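Bundled scripts
Use these scripts directly:
- scripts/run_youtube_workflow.py: main deterministic workflow for metadata, download/subtitles, transcription, placeholder summary creation, cleanup, and workflow metadata emission
- scripts/backfill_detected_language.py: update transcript_raw.md, Summary.md, and workflow metadata after the current session LLM decides the major transcript language
- scripts/complete_youtube_summary.py: validate that Summary.md is no longer a placeholder, optionally backfill language, compute the final end-to-end timing report for one item, and emit a session-ready result block
- scripts/normalize_transcript_text.py: convert raw timestamped transcript text into cleaner summary input without modifying the raw transcript file
- scripts/finalize_youtube_summary.py: lower-level timing helper used by the completion flow
- scripts/prepare_video_paths.py: derive sanitized folder and output file paths from a title and video ID
Useful references:
- references/detailed-workflow.md: full operational workflow, completion rules, batch guidance, naming rules, and practical notes
- references/summary-template.md: required structure and writing rules for the final Summary.md
- references/session-output-template.md: required user-facing output format to return to the current OpenClaw session after completion
- references/batch-input-format.md: input format for queue / batch processing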
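Defaults
- Default parent output folder: ~/Downloads
- Default whisper model: ggml-medium
- Supported whisper models: ggml-base, ggml-small, ggml-medium
- Default media mode: audio-only
- Default transcript language: auto-detect if transcription is needed
- Default summary language: source
- Raw transcript keeps timestamps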
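Public workflow overview
At a high level, the skill does this:
1. fetch metadata first and create safe output paths
2. try manual subtitles, then auto-captions, then local Whisper fallback
3. write SANITIZED_VIDEO_NAME_transcript_raw.md
4. create SANITIZED_VIDEO_NAME_Summary.md as a placeholder
5. have the current OpenClaw session overwrite the placeholder with a real summary
6. run scripts/complete_youtube_summary.py to validate completion, backfill language if needed, and emit a session-ready result block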
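What counts as completion
For a normal end-to-end request, completion means all of the following are true:
1. the workflow script succeeded
2. if language was initially unknown, the language was backfilled into both markdown files
3. the placeholder summary file was overwritten with a real summary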