YouTube Model Feeder
Food for your model.
Stop pausing videos every 30 seconds to take a screenshot, paste it into Obsidian, and write a caption. A 20-minute tutorial shouldn't take an hour to document.
YouTube Model Feeder extracts everything from a YouTube video — timestamped transcript, key frame snapshots, OCR of code and slides, presentation slide detection, and LLM-generated summaries — and packages it into structured knowledge your AI assistant can search, reference, and reason about.
Why This Exists
The problem isn't transcription — ten tools do that. The problem is structured context. When you feed a raw transcript to a model, it has no visual context. It doesn't know what was on screen when the speaker said "as you can see here." It can't read the code in the terminal, the diagram on the slide, or the config file being edited.
YouTube Model Feeder captures all of that. The output isn't just text — it's a knowledge bundle: transcript segments aligned to timestamps, screenshots of every key moment, OCR text from code snippets and slides, and an LLM summary that ties it all together.
Combined with obsidian-semantic-search (also on ClawHub), every video you watch becomes permanently searchable by meaning in your Obsidian vault.
What It Extracts
Full Pipeline
| Step | Tool | What it produces |
|---|---|---|
| Download | yt-dlp | Video + audio + metadata (title, duration, thumbnail) |
| Transcribe | Whisper (Ollama) or YouTube captions | Timestamped transcript segments |
| Frame Extraction | FFmpeg | Key frame snapshots every 5s (configurable) |
| Slide Detection | SSIM analysis (OpenCV) | Identifies presentation slides via structural similarity between frames |
| OCR | Tesseract | Reads code, terminal output, and text from captured frames |
| LLM Summary | Ollama / OpenAI / Anthropic | Structured markdown with sections, code blocks, and key takeaways |
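To make the frame extraction step concrete, here is a minimal sketch of how an FFmpeg command for "one snapshot every 5 seconds" could be assembled. The function name `build_frame_command`, the output file pattern, and the JPEG quality setting are assumptions for illustration; the project's own `snapshot.py` may do this differently.

```python
from pathlib import Path


def build_frame_command(video: Path, out_dir: Path, interval_s: float = 5.0) -> list[str]:
    """Return an ffmpeg argument list that saves one JPEG every `interval_s` seconds.

    Hypothetical helper: the name, file pattern, and quality value are illustrative.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # The fps filter takes a rate in frames per second, so a 5 s interval is 0.2 fps.
    rate = f"fps={1.0 / interval_s:g}"
    return [
        "ffmpeg",
        "-i", str(video),
        "-vf", rate,
        "-q:v", "2",  # JPEG quality scale; lower values mean higher quality
        str(out_dir / "frame_%05d.jpg"),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without ffmpeg.
    print(" ".join(build_frame_command(Path("talk.mp4"), Path("frames"))))
```

Passing the list to `subprocess.run(..., check=True)` would execute it, provided `ffmpeg` is on the `PATH` and the output directory exists.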
Slide Detection (Deep)
Not just frame captures — intelligent slide boundary detection:
1. Layout detection: classifies the video as full-frame, picture-in-picture, or split panel
2. SSIM transition scan: compares consecutive frames for structural changes (threshold: SSIM < 0.85)
3. LLM disambiguation: borderline transitions (0.85–0.93 SSIM) are sent to the LLM for classification
4. Slide grouping: merges transitions into slides with an enforced minimum duration (3s)
5. Final-state capture: saves the last frame of each slide as JPEG
6. OCR extraction: runs Tesseract on each slide image
7. Transcript alignment: maps transcript segments to slide time ranges
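The thresholding, grouping, and alignment steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's `slide_detection.py`: the names `classify`, `group_slides`, and `align` are hypothetical, SSIM scores are assumed to be already computed per frame pair, and treating 0.93 as an exclusive upper bound is a guess.

```python
from dataclasses import dataclass

TRANSITION_BELOW = 0.85   # from the list above: SSIM < 0.85 is a slide change
BORDERLINE_BELOW = 0.93   # 0.85 to 0.93 is ambiguous (upper bound assumed exclusive)
MIN_SLIDE_SECONDS = 3.0   # enforced minimum slide duration


def classify(ssim: float) -> str:
    """Label one frame-to-frame SSIM score."""
    if ssim < TRANSITION_BELOW:
        return "transition"
    if ssim < BORDERLINE_BELOW:
        return "borderline"  # would be handed to the LLM for a decision
    return "same"


@dataclass
class Slide:
    start: float
    end: float


def group_slides(transition_times: list[float], video_end: float,
                 min_duration: float = MIN_SLIDE_SECONDS) -> list[Slide]:
    """Turn transition timestamps into slides, ignoring cuts that arrive too soon."""
    bounds = [0.0]
    for t in sorted(transition_times):
        if 0.0 < t < video_end and t - bounds[-1] >= min_duration:
            bounds.append(t)
    # Fold a too-short final slide into the previous one.
    if len(bounds) > 1 and video_end - bounds[-1] < min_duration:
        bounds.pop()
    bounds.append(video_end)
    return [Slide(a, b) for a, b in zip(bounds, bounds[1:])]


def align(segments: list[tuple[float, float, str]], slides: list[Slide]) -> list[list[str]]:
    """Assign each (start, end, text) segment to the slide containing its start time."""
    per_slide: list[list[str]] = [[] for _ in slides]
    for start, _end, text in segments:
        for i, slide in enumerate(slides):
            if slide.start <= start < slide.end:
                per_slide[i].append(text)
                break
    return per_slide


if __name__ == "__main__":
    slides = group_slides([10.0, 11.0, 30.0], video_end=60.0)
    print(slides)
    print(align([(1.0, 4.0, "intro"), (12.0, 15.0, "code")], slides))
```

Assigning a segment by its start time is the simplest choice; a segment that straddles a slide boundary could instead be assigned by largest overlap.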
Output Formats
| Format | What you get |
|---|---|
| Markdown | Timestamped sections with headings, code blocks, image references |
| HTML | Styled single-page doc with embedded screenshots |
| Obsidian bundle | ZIP export: markdown + images, ready to drop into your vault |
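As a rough picture of what an Obsidian bundle contains, the sketch below zips one markdown note together with its images using only the standard library. The function name `build_bundle` and the `images/` folder layout are assumptions; the actual export may arrange files differently.

```python
import zipfile
from pathlib import Path


def build_bundle(note: Path, images: list[Path], out_zip: Path) -> Path:
    """Write a ZIP holding the note at the root and images under images/.

    Hypothetical layout, chosen only for illustration.
    """
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(note, arcname=note.name)
        for img in images:
            zf.write(img, arcname=f"images/{img.name}")
    return out_zip
```

With this layout, image references inside the note would need to use relative paths such as `images/slide_001.jpg` to resolve after the archive is unpacked into a vault.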
Installation
Prerequisites
```bash
# macOS
brew install ffmpeg tesseract

# Linux
apt install ffmpeg tesseract-ocr
```
Docker Desktop must be running for the full backend.
Start the Stack
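```bash
git clone https://github.com/celstnblacc/youtube-model-feeder.git
cd youtube-model-feeder
docker-compose up -d
```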
This starts 5 services:
| Service | Port | Purpose |
|---|---|---|
| api | 8000 | FastAPI backend + Swagger docs at `/docs` |
| celery_worker | n/a | Background video processing |
| postgres | 5432 | Job tracking, transcripts, documents |
| redis | 6379 | Task queue (Celery broker) |
| web | 3000 | Next.js frontend (optional) |
Verify
Open http://localhost:8000/docs — you should see the Swagger API documentation.
Usage
Via AI Assistant
Extract a video:
"Extract everything from this YouTube video and save it to my vault: https://youtube.com/watch?v=..."
Transcript only:
"Get the timestamped transcript for this video"
Slides and code screenshots:
"Extract all the code screenshots and presentation slides from this tutorial"
Obsidian export:
"Convert this video into an Obsidian note with screenshots and timestamps"
Via API
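```bash
# Submit a video for processing
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"url": "https://youtube.com/watch?v=dQw4w9WgXcQ"}'

# Check job status
curl http://localhost:8000/jobs/{job_id}

# Get the generated document
curl http://localhost:8000/videos/{video_id}
```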
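For scripting, the same calls can be made from Python with the standard library. This is a minimal sketch that assumes `POST /jobs` accepts a JSON body with a `url` field and that both endpoints return JSON. The helper names are hypothetical, and the response is returned as a plain dict because its fields are not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_submit_request(video_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /jobs request for a video URL."""
    body = json.dumps({"url": video_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(video_url: str) -> dict:
    """Send the job and return the decoded JSON response as-is."""
    with urllib.request.urlopen(build_submit_request(video_url), timeout=30) as resp:
        return json.load(resp)


def get_job(job_id: str, base_url: str = BASE_URL) -> dict:
    """Fetch the current state of a job; field names depend on the backend."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=30) as resp:
        return json.load(resp)
```

The Swagger page at `/docs` is the place to check the actual response schema before relying on any particular field.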
Via Web UI
Open http://localhost:3000, paste a YouTube URL, and watch the extraction happen in real time with progress tracking.
LLM Provider Selection
Per-user configuration — choose your summarization engine:
| Provider | Model (default) | Setup | Cost |
|---|---|---|---|
| Ollama (default) | Mistral 7B | Pre-installed locally | Free |
| OpenAI | GPT-4o-mini | Set `OPENAI_API_KEY` | Per-token |
| Anthropic | Claude Sonnet 4.6 | Set `ANTHROPIC_API_KEY` | Per-token |
Configure via the API: PATCH /settings/me with your preferred provider and API key (encrypted at rest with Fernet).
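A request to that endpoint might look like the sketch below. Only the method and path (`PATCH /settings/me`) come from this document. The JSON field names `provider` and `api_key`, the provider identifiers, and the absence of an auth header are all assumptions, so check the Swagger docs for the real schema.

```python
import json
import urllib.request
from typing import Optional

# Assumed identifiers; the backend may spell these differently.
KNOWN_PROVIDERS = {"ollama", "openai", "anthropic"}


def build_settings_patch(provider: str, api_key: Optional[str] = None,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a PATCH /settings/me request."""
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    payload = {"provider": provider}      # hypothetical field name
    if api_key is not None:
        payload["api_key"] = api_key      # hypothetical field name
    return urllib.request.Request(
        f"{base_url}/settings/me",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

Sending it is a matter of passing the returned object to `urllib.request.urlopen`. Because the setting is per-user, a real call will most likely also need whatever authentication the API expects.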
The Knowledge Pipeline
YouTube Model Feeder is designed to work with other ClawHub skills:
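```
YouTube video
  → youtube-model-feeder (extract transcript + screenshots + OCR + summary)
  → Obsidian vault (structured markdown + images)
  → obsidian-semantic-search (index with embeddings, search by meaning)
  → Your AI assistant (full context from every video you've watched)
```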
Every video becomes permanent, searchable knowledge. Not buried in a playlist — indexed and queryable.
Architecture
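```
youtube-model-feeder/
├── backend/                          # FastAPI + Celery (Python 3.12)
│   └── app/
│       ├── services/                 # Core extraction pipeline (6,372 LOC)
│       │   ├── youtube.py            # yt-dlp wrapper
│       │   ├── transcript.py         # Whisper + YouTube captions
│       │   ├── snapshot.py           # FFmpeg frame extraction
│       │   ├── slide_detection.py    # SSIM + OCR + alignment (643 LOC)
│       │   ├── llm.py                # Summarization engine (1,229 LOC)
│       │   └── llm_providers.py      # Ollama/OpenAI/Anthropic abstraction
│       ├── routes/                   # REST API endpoints
│       └── tasks.py                  # Celery background tasks
├── frontend/                         # Next.js 14 + React 18 + Tailwind + Zustand
├── docker-compose.yml                # 5 services
└── migrations/                       # Alembic database migrations
```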
Troubleshooting
| Problem | Fix |
|---|---|
| VirusTotal "suspicious" warning on install | False positive: the skill describes video extraction patterns. Use `clawhub install youtube-model-feeder --force` |
| Docker services won't start | Ensure Docker Desktop is running. Check `docker-compose logs api` for errors |
| Transcription is slow | The first run pulls the Whisper model (~1.5 GB). Subsequent runs are fast. Try YouTube captions first (faster, no model needed) |
| No slides detected | The SSIM threshold may need tuning for your video type. Presentation-style videos work best |
| LLM summary is empty | Check the LLM provider config. The default is Ollama, so ensure Ollama is running with a model pulled |
| FFmpeg not found | `brew install ffmpeg` (macOS) or `apt install ffmpeg` (Linux) |
Links
- Source: https://github.com/celstnblacc/youtube-model-feeder
- Obsidian Semantic Search: https://clawhub.ai/skills/obsidian-semantic-search
- License: MIT-0 (this skill) / Apache 2.0 (source)
Built by celstnblacc — food for your model. 226 tests, 6 extraction stages, 3 LLM providers, Obsidian-ready output.