YouTube AnyCaption Summarizer

The YouTube summarizer that still works when captions are broken, missing, or inconsistent.

Outputs: raw markdown transcript + polished markdown summary + session-ready result block.

Unlike caption-only tools, this skill still works when subtitles are missing by falling back to local Whisper transcription.

Generate a raw transcript markdown file and a polished summary markdown file from one or more YouTube videos.

This skill is self-contained. It does not require any other YouTube summarizer skill or prior workflow context.

Best for

- founder videos, operator walkthroughs, and technical explainers
- long tutorial videos that need transcript + implementation summary
- private/internal YouTube uploads that may require cookies
- mixed-caption environments where some videos have CC, some only have auto-captions, and some have no usable subtitles
- batch research workflows where many YouTube links need standardized markdown outputs
- users who want reliable markdown artifacts, not just a one-off chat summary

Why choose this over simpler transcript skills?

- manual CC first, auto-captions second, local Whisper fallback last
- keeps working when subtitle coverage is weak or missing
- supports private/restricted YouTube videos via cookies
- returns durable markdown artifacts, not just chat text
- supports batch processing and session-ready completion reporting

Install dependencies

For a fresh macOS setup, new users should be able to copy-paste the following exactly:

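```bash
# Install the CLI tools (the whisper-cpp formula provides whisper-cli)
brew install yt-dlp ffmpeg whisper-cpp

MODELS_DIR="$HOME/.openclaw/workspace"
MODEL_PATH="$MODELS_DIR/ggml-medium.bin"
mkdir -p "$MODELS_DIR"

# Download to a .part file first so an interrupted download never looks complete
if [ ! -f "$MODEL_PATH" ]; then
  curl -L https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin \
    -o "$MODEL_PATH.part" && mv "$MODEL_PATH.part" "$MODEL_PATH"
else
  echo "Model already exists at $MODEL_PATH; leaving it unchanged."
fi

# Verify the binaries and the model file
command -v python3 yt-dlp ffmpeg whisper-cli
ls -lh "$MODEL_PATH"
```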

What this does:

- installs `yt-dlp`, `ffmpeg`, and `whisper-cli` (provided by the Homebrew `whisper-cpp` formula)
- creates the default models directory used by this skill if it does not already exist: `~/.openclaw/workspace`
- downloads the default Whisper model file only if it is missing
- avoids touching `~/.openclaw/openclaw.json` or any other OpenClaw config file
- does not delete, replace, or overwrite other files in your existing workspace folder
- verifies that the required binaries and model file are present

If you want to store models elsewhere, pass --models-dir /path/to/models when running the workflow.
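The same readiness check can be run from Python before starting the workflow. This is a minimal sketch, not one of the bundled scripts: the function name `missing_requirements` and its `which` parameter are hypothetical, and it assumes the model file is named `<model>.bin` inside the models directory, as in the default setup.

```python
import shutil
from pathlib import Path

# Tools the workflow expects to find on PATH
REQUIRED_TOOLS = ("python3", "yt-dlp", "ffmpeg", "whisper-cli")


def missing_requirements(models_dir, model_name="ggml-medium", which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [f"missing tool: {tool}" for tool in REQUIRED_TOOLS if which(tool) is None]
    # Assumption: the model lives at <models_dir>/<model_name>.bin
    model_path = Path(models_dir).expanduser() / f"{model_name}.bin"
    if not model_path.is_file():
        problems.append(f"missing model file: {model_path}")
    return problems
```

Calling `missing_requirements("~/.openclaw/workspace")` checks the default location; pass a different directory if you run the workflow with `--models-dir`.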

Example requests

- “Summarize this YouTube video into markdown.”
- “Generate a transcript and polished summary for this YouTube link.”
- “Process this private YouTube video with my browser cookies.”
- “Batch summarize these YouTube links and give me transcript + summary files.”
- “Use subtitles when available, otherwise transcribe locally.”
- “Create a Chinese summary from this English YouTube video.”

Quick start

Single video

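```bash
# Quote the URL so the shell does not treat "?" as a glob character
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID"
```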

This creates a dedicated per-video folder, writes the raw transcript markdown, creates the summary placeholder markdown, and prints JSON describing the outputs plus the exact follow-up commands/prompts needed to finish the summary step.

Important: the workflow script alone is not the finished deliverable. The current OpenClaw session must still:

1. infer/backfill the language if the workflow left it as `unknown`
2. overwrite the placeholder `Summary.md` with a real polished summary
3. run `scripts/complete_youtube_summary.py` to validate/finalize the result

Force simplified Chinese summary

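```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --summary-language zh-CN
```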

Restricted video with cookies

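```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --cookies /path/to/cookies.txt
```

or

```bash
python3 scripts/run_youtube_workflow.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --cookies-from-browser chrome
```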

Batch / queue mode

See references/batch-input-format.md.

```bash
python3 scripts/run_youtube_workflow.py --batch-file ./youtube-urls.txt
```
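When the workflow is launched from another Python program, the quick-start invocations can be assembled as an argument list instead of a shell string, which sidesteps URL quoting issues. The helper below is a hypothetical sketch: `build_workflow_command` is not part of the skill, and treating the two cookie options as mutually exclusive, or combining the optional flags with `--batch-file`, are assumptions rather than documented behavior.

```python
import sys

WORKFLOW_SCRIPT = "scripts/run_youtube_workflow.py"


def build_workflow_command(url=None, *, batch_file=None, summary_language=None,
                           cookies=None, cookies_from_browser=None, models_dir=None):
    """Build the argv list for one workflow run (single URL or batch file)."""
    if (url is None) == (batch_file is None):
        raise ValueError("pass exactly one of url or batch_file")
    # Assumption: a cookies file and browser-cookie extraction are alternatives
    if cookies and cookies_from_browser:
        raise ValueError("choose either a cookies file or a browser, not both")

    argv = [sys.executable, WORKFLOW_SCRIPT]
    argv += [url] if url else ["--batch-file", str(batch_file)]
    optional_flags = {
        "--summary-language": summary_language,
        "--cookies": cookies,
        "--cookies-from-browser": cookies_from_browser,
        "--models-dir": models_dir,
    }
    # Append only the flags that were actually supplied
    for flag, value in optional_flags.items():
        if value:
            argv += [flag, str(value)]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True, capture_output=True, text=True)` so the JSON printed by the workflow is available on `stdout`.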

Why this skill stands out

This skill is designed to keep working across the messy reality of YouTube:

- if a video has manual closed captions (CC), use them first
- if it only has auto-generated subtitles, use those next
- if it has no usable subtitles at all, fall back to local Whisper transcription

That makes it materially more reliable than caption-only workflows. It works well for caption-rich videos, caption-poor videos, and private/internal uploads where subtitle coverage is inconsistent.
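The priority order described above amounts to a three-way choice. The sketch below illustrates that decision only; the enum and function names are hypothetical, and the real workflow may apply extra checks for whether a subtitle track is actually usable.

```python
from enum import Enum


class TranscriptSource(Enum):
    MANUAL_CC = "manual_cc"
    AUTO_CAPTIONS = "auto_captions"
    LOCAL_WHISPER = "local_whisper"


def choose_transcript_source(manual_subs, auto_subs):
    """Pick the highest-priority source that is actually available."""
    if manual_subs:
        return TranscriptSource.MANUAL_CC
    if auto_subs:
        return TranscriptSource.AUTO_CAPTIONS
    # Nothing usable: transcribe the audio locally
    return TranscriptSource.LOCAL_WHISPER
```

With yt-dlp metadata loaded into a dict named `info`, the inputs would typically be `info.get("subtitles")` and `info.get("automatic_captions")`.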

Core capabilities:

- fetch YouTube metadata first and derive safe output paths
- support single-video mode and batch / queue mode
- handle manual CC, auto-generated subtitles, or no subtitles via subtitle-first extraction with local Whisper fallback
- support restricted/private videos via cookies or browser-cookie extraction
- normalize noisy transcript text before summarization
- create a placeholder summary file, overwrite it with the final summary, and finalize end-to-end timing
- clean up only known intermediates created by the workflow unless explicitly told otherwise

What this skill produces

For each video, create exactly one dedicated output folder containing these final deliverables:

- `SANITIZED_VIDEO_NAME_transcript_raw.md`
- `SANITIZED_VIDEO_NAME_Summary.md`

By default, delete only the known intermediate media, subtitle, and WAV files created by the workflow. Do not wipe unrelated files that may already exist in the per-video folder.
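To make the naming pattern concrete, here is one way a sanitized name and the two deliverable paths could be derived from a title and video ID. This is an illustrative sketch, not the logic of the bundled path script: the character rules, the 80-character cap, and the `<name>_<video_id>` folder layout are all assumptions. The authoritative naming rules are in `references/detailed-workflow.md`.

```python
import re
from pathlib import Path


def sanitize_name(title, max_length=80):
    """Replace each run of characters other than word chars, '.' and '-' with one underscore."""
    cleaned = re.sub(r"[^\w.-]+", "_", title).strip("._-")
    return cleaned[:max_length].rstrip("._-") or "untitled"


def derive_output_paths(parent, title, video_id):
    """Return the per-video folder and the two final deliverable paths."""
    name = sanitize_name(title)
    # Assumption: the folder name combines the sanitized title and the video ID
    folder = Path(parent).expanduser() / f"{name}_{video_id}"
    return {
        "folder": folder,
        "transcript": folder / f"{name}_transcript_raw.md",
        "summary": folder / f"{name}_Summary.md",
    }
```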

Required local tools

Verify these tools exist before running the workflow:

- `yt-dlp`
- `ffmpeg`
- `whisper-cli`
- `python3`

The workflow also requires a supported Whisper ggml model file in the configured models directory.

Bundled scripts

Use these scripts directly:

- `scripts/run_youtube_workflow.py`: main deterministic workflow for metadata, download/subtitles, transcription, placeholder summary creation, cleanup, and workflow metadata emission
- `scripts/backfill_detected_language.py`: update `transcript_raw.md`, `Summary.md`, and workflow metadata after the current session LLM decides the major transcript language
- `scripts/complete_youtube_summary.py`: validate that `Summary.md` is no longer a placeholder, optionally backfill language, compute the final end-to-end timing report for one item, and emit a session-ready result block
- `scripts/normalize_transcript_text.py`: convert raw timestamped transcript text into cleaner summary input without modifying the raw transcript file
- `scripts/finalize_youtube_summary.py`: lower-level timing helper used by the completion flow
- `scripts/prepare_video_paths.py`: derive sanitized folder and output file paths from a title and video ID
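As an illustration of what transcript normalization involves, the sketch below strips timestamp prefixes, collapses whitespace, and drops immediately repeated lines, which are common in auto-generated captions. It is not the bundled normalization script: the function name is hypothetical, and the `[HH:MM:SS.mmm --> HH:MM:SS.mmm]` prefix is an assumed line format, so adjust the pattern to the actual raw transcript layout.

```python
import re

# Assumed timestamp prefix, e.g. "[00:00:01.000 --> 00:00:04.000] text"
TIMESTAMP_PREFIX = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}[.,]\d{3} --> \d{2}:\d{2}:\d{2}[.,]\d{3}\]\s*"
)


def normalize_transcript(raw_text):
    """Strip timestamps, drop blank lines and immediate repeats, return one paragraph."""
    kept = []
    for line in raw_text.splitlines():
        text = TIMESTAMP_PREFIX.sub("", line.strip())
        text = re.sub(r"\s+", " ", text).strip()
        # Skip empty lines and lines identical to the previous kept line
        if text and (not kept or text != kept[-1]):
            kept.append(text)
    return " ".join(kept)
```

The function returns a new string and never writes to disk, in line with the rule that the raw transcript file stays unmodified.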

Useful references:

- `references/detailed-workflow.md`: full operational workflow, completion rules, batch guidance, naming rules, and practical notes
- `references/summary-template.md`: required structure and writing rules for the final `Summary.md`
- `references/session-output-template.md`: required user-facing output format to return to the current OpenClaw session after completion
- `references/batch-input-format.md`: input format for queue / batch processing

Defaults

- Default parent output folder: `~/Downloads`
- Default whisper model: `ggml-medium`
- Supported whisper models: `ggml-base`, `ggml-small`, `ggml-medium`
- Default media mode: audio-only
- Default transcript language: auto-detect if transcription is needed
- Default summary language: `source`
- Raw transcript keeps timestamps
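For callers that want these defaults in one place, they can be captured in a small settings object. This is a hypothetical sketch: the class and field names are invented for illustration, and `"auto"` is only a stand-in label for automatic language detection. Only the values themselves come from the list above.

```python
from dataclasses import dataclass
from pathlib import Path

SUPPORTED_MODELS = ("ggml-base", "ggml-small", "ggml-medium")


@dataclass(frozen=True)
class WorkflowDefaults:
    output_parent: Path = Path("~/Downloads")  # call .expanduser() before use
    whisper_model: str = "ggml-medium"
    media_mode: str = "audio-only"
    transcript_language: str = "auto"          # stand-in label for auto-detect
    summary_language: str = "source"
    keep_timestamps: bool = True

    def __post_init__(self):
        # Reject model names outside the supported set
        if self.whisper_model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported whisper model: {self.whisper_model}")
```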

Public workflow overview

At a high level, the skill does this:

1. fetch metadata first and create safe output paths
2. try manual subtitles, then auto-captions, then local Whisper fallback
3. write `SANITIZED_VIDEO_NAME_transcript_raw.md`
4. create `SANITIZED_VIDEO_NAME_Summary.md` as a placeholder
5. have the current OpenClaw session overwrite the placeholder with a real summary
6. run `scripts/complete_youtube_summary.py` to validate completion, backfill language if needed, and emit a session-ready result block

What counts as completion

For a normal end-to-end request, completion means all of the following are true:

1. the workflow script succeeded
2. if language was initially unknown, the language was backfilled into both markdown files
3. the placeholder summary file was overwritten with a real summary
4. `scripts/complete_youtube_summary.py` was run successfully
5. the user received the resulting output paths and timing/result status

If the workflow script succeeded but the summary/completion step did not happen yet, describe the state as partial/in-progress rather than complete.
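The completion rules can be read as a simple state check per video. The sketch below is illustrative only: the `ItemState` fields and the `"failed"` label are assumptions introduced here, and the actual validation is performed by `scripts/complete_youtube_summary.py`.

```python
from dataclasses import dataclass


@dataclass
class ItemState:
    workflow_succeeded: bool
    language: str                  # stays "unknown" until backfilled
    summary_is_placeholder: bool
    completion_script_ok: bool
    result_reported: bool


def completion_status(state):
    """Return 'failed', 'partial', or 'complete' for one video."""
    if not state.workflow_succeeded:
        return "failed"
    done = (
        state.language != "unknown"
        and not state.summary_is_placeholder
        and state.completion_script_ok
        and state.result_reported
    )
    # Anything short of all conditions is reported as partial, never complete
    return "complete" if done else "partial"
```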

When to read the deeper references

Read these as needed:

- `references/detailed-workflow.md` when you need the full implementation contract, batch guidance, naming rules, cleanup rules, timing flow, or debugging details
- `references/summary-template.md` before writing the final polished `Summary.md`
- `references/session-output-template.md` before returning the final user-facing per-video result block
- `references/batch-input-format.md` when handling batch / queue input

Practical public promise

This skill is optimized for dependable end-to-end output, not just quick transcript extraction:

- raw transcript markdown
- polished summary markdown
- session-ready completion report
