YouTube Model Feeder

Food for your model.

Stop pausing videos every 30 seconds to screenshot, paste into Obsidian, and caption. A 20-minute tutorial shouldn't take an hour to document.

YouTube Model Feeder extracts everything from a YouTube video — timestamped transcript, key frame snapshots, OCR of code and slides, presentation slide detection, and LLM-generated summaries — and packages it into structured knowledge your AI assistant can search, reference, and reason about.

Why This Exists

The problem isn't transcription — ten tools do that. The problem is structured context. When you feed a raw transcript to a model, it has no visual context. It doesn't know what was on screen when the speaker said "as you can see here." It can't read the code in the terminal, the diagram on the slide, or the config file being edited.

YouTube Model Feeder captures all of that. The output isn't just text — it's a knowledge bundle: transcript segments aligned to timestamps, screenshots of every key moment, OCR text from code snippets and slides, and an LLM summary that ties it all together.

Combined with obsidian-semantic-search (also on ClawHub), every video you watch becomes permanently searchable by meaning in your Obsidian vault.

What It Extracts

Full Pipeline

Step	Tool	What it produces
Download	yt-dlp	Video + audio + metadata (title, duration, thumbnail)
Transcribe

Slide Detection (Deep)

Not just frame captures — intelligent slide boundary detection:

1. Layout detection: classifies the video as full-frame, picture-in-picture, or split panel
2. SSIM transition scan: compares consecutive frames for structural changes (threshold: SSIM < 0.85)
3. LLM disambiguation: borderline transitions (0.85–0.93 SSIM) are sent to the LLM for classification
4. Slide grouping: merges transitions into slides with an enforced minimum duration (3s)
5. Final-state capture: saves the last frame of each slide as JPEG
6. OCR extraction: runs Tesseract on each slide image
7. Transcript alignment: maps transcript segments to slide time ranges
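
The thresholding and grouping steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: it assumes the SSIM scores have already been computed for consecutive frame pairs, treats 0.93 as the exclusive upper bound of the borderline band, and uses a hypothetical `is_transition` callback to stand in for the LLM check.

```python
from dataclasses import dataclass

HARD_CUT = 0.85           # SSIM below this counts as a slide transition
BORDERLINE = 0.93         # 0.85 <= SSIM < 0.93 gets a second opinion (assumed bounds)
MIN_SLIDE_SECONDS = 3.0   # minimum slide duration from the list above


@dataclass
class Slide:
    start: float
    end: float


def classify(ssim: float) -> str:
    """Label one frame-to-frame SSIM score."""
    if ssim < HARD_CUT:
        return "transition"
    if ssim < BORDERLINE:
        return "borderline"
    return "same"


def group_slides(scores, duration, is_transition=lambda t, s: False):
    """Turn (timestamp, ssim) pairs, sorted by time, into slide time ranges.

    is_transition is a hypothetical stand-in for the LLM disambiguation step.
    """
    cuts = []
    for t, s in scores:
        label = classify(s)
        if label == "transition" or (label == "borderline" and is_transition(t, s)):
            cuts.append(t)

    slides, start = [], 0.0
    for cut in cuts:
        # Skip cuts that would produce a slide shorter than the minimum duration.
        if cut - start >= MIN_SLIDE_SECONDS:
            slides.append(Slide(start, cut))
            start = cut
    if duration > start:
        # The trailing slide is kept as-is, even if it is short.
        slides.append(Slide(start, duration))
    return slides
```

Calling `group_slides([(5.0, 0.50), (6.0, 0.80)], 30.0)` yields two slides, because the cut at 6.0 would create a one-second slide and is dropped.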

Output Formats

Format	What you get
Markdown	Timestamped sections with headings, code blocks, image references
HTML	Styled single-page doc with embedded screenshots
Obsidian bundle	ZIP export: markdown + images, ready to drop into your vault
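
To make the Markdown format concrete, here is a rough sketch of how one timestamped section might be rendered. The heading layout, function names, and parameters are assumptions for illustration only; the real exporter may structure its output differently.

```python
def format_timestamp(seconds: float) -> str:
    """Render a position in the video as HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_section(title, start_seconds, transcript,
                   image_path=None, code=None, code_lang=""):
    """Build one Markdown section: heading, transcript, image reference, code block."""
    lines = [f"## [{format_timestamp(start_seconds)}] {title}", "",
             transcript.strip(), ""]
    if image_path:
        lines += [f"![{title}]({image_path})", ""]
    if code:
        fence = "`" * 3  # built at runtime so this example can sit inside a fence
        lines += [f"{fence}{code_lang}", code.rstrip(), fence, ""]
    return "\n".join(lines)
```

For example, `render_section("Intro", 65, "Welcome.", image_path="img/0001.jpg")` produces a section headed `## [00:01:05] Intro` followed by the transcript text and an image reference.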

Installation

Prerequisites

    # macOS
    brew install ffmpeg tesseract

    # Linux
    apt install ffmpeg tesseract-ocr

Docker Desktop must be running for the full backend.

Start the Stack

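    git clone https://github.com/celstnblacc/youtube-model-feeder.git
    cd youtube-model-feeder
    docker-compose up -d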

This starts 5 services:

Service	Port	Purpose
api	8000	FastAPI backend + Swagger docs at /docs
celery_worker	none	Background video processing
postgres	5432	Job tracking, transcripts, documents
redis	6379	Task queue (Celery broker)
web	3000	Next.js frontend (optional)

Verify

Open http://localhost:8000/docs — you should see the Swagger API documentation.

Usage

Via AI Assistant

Extract a video:

"Extract everything from this YouTube video and save it to my vault: https://youtube.com/watch?v=..."

Transcript only:

"Get the timestamped transcript for this video"

Slides and code screenshots:

"Extract all the code screenshots and presentation slides from this tutorial"

Obsidian export:

"Convert this video into an Obsidian note with screenshots and timestamps"

Via API

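    # Submit a video for processing
    curl -X POST http://localhost:8000/jobs \
      -H "Content-Type: application/json" \
      -d '{"url": "https://youtube.com/watch?v=dQw4w9WgXcQ"}'

    # Check job status
    curl http://localhost:8000/jobs/{job_id}

    # Get the generated document
    curl http://localhost:8000/videos/{video_id}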
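
If you would rather script the API than use curl, the same calls can be built with Python's standard library. This sketch covers `POST /jobs` and `GET /jobs/{job_id}` on the local stack at port 8000; the helper names are hypothetical, and the shape of the JSON responses is not documented here, so the code returns the parsed body without assuming any field names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # the api service from the stack


def build_submit_request(video_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare POST /jobs with the video URL as a JSON body."""
    body = json.dumps({"url": video_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_status_request(job_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare GET /jobs/{job_id}."""
    return urllib.request.Request(f"{base_url}/jobs/{job_id}", method="GET")


def send(request: urllib.request.Request):
    """Send a prepared request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the stack running, `send(build_submit_request("https://youtube.com/watch?v=..."))` submits a job; inspect the returned object to find the job identifier to pass to `build_status_request`.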

Via Web UI

Open http://localhost:3000, paste a YouTube URL, and watch the extraction happen in real time with progress tracking.

LLM Provider Selection

Per-user configuration — choose your summarization engine:

Provider	Model (default)	Setup	Cost
Ollama (default)	Mistral 7B	Pre-installed locally	Free
OpenAI

Configure via the API: PATCH /settings/me with your preferred provider and API key (encrypted at rest with Fernet).
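
As a rough illustration of that call, the request could be prepared like this. The JSON field names `provider` and `api_key` are guesses, not confirmed by the API, and any authentication the endpoint requires is left out; check the Swagger docs for the real schema.

```python
import json
import urllib.request


def build_settings_request(provider, api_key=None, base_url="http://localhost:8000"):
    """Prepare PATCH /settings/me. Field names are hypothetical."""
    payload = {"provider": provider}
    if api_key is not None:
        payload["api_key"] = api_key  # sent over the wire; the server encrypts it at rest
    return urllib.request.Request(
        f"{base_url}/settings/me",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

Pass the result to `urllib.request.urlopen` to apply the change.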

The Knowledge Pipeline

YouTube Model Feeder is designed to work with other ClawHub skills:

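    YouTube video
      → youtube-model-feeder (extract transcript + screenshots + OCR + summary)
      → Obsidian vault (structured markdown + images)
      → obsidian-semantic-search (index with embeddings, search by meaning)
      → Your AI assistant (full context for every video you've watched)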

Every video becomes permanent, searchable knowledge. Not buried in a playlist — indexed and queryable.

Architecture

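    youtube-model-feeder/
    ├── backend/                          # FastAPI + Celery (Python 3.12)
    │   ├── app/
    │   │   ├── services/                 # Core extraction pipeline (6,372 LOC)
    │   │   │   ├── youtube.py            # yt-dlp wrapper
    │   │   │   ├── transcript.py         # Whisper + YouTube captions
    │   │   │   ├── snapshot.py           # FFmpeg frame extraction
    │   │   │   ├── slide_detection.py    # SSIM + OCR + alignment (643 LOC)
    │   │   │   ├── llm.py                # Summarization engine (1,229 LOC)
    │   │   │   └── llm_providers.py      # Ollama/OpenAI/Anthropic abstraction
    │   │   ├── routes/                   # REST API endpoints
    │   │   └── tasks.py                  # Celery background tasks
    ├── frontend/                         # Next.js 14 + React 18 + Tailwind + Zustand
    ├── docker-compose.yml                # 5 services
    └── migrations/                       # Alembic database migrations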

Troubleshooting

Problem	Fix
VirusTotal "suspicious" warning on install	False positive: the skill describes video extraction patterns. Use clawhub install youtube-model-feeder --force
Docker services won't start	Ensure Docker Desktop is running. Check docker-compose logs api for errors
Transcription is slow	First run pulls the Whisper model (~1.5 GB). Subsequent runs are fast. Try YouTube captions first (faster, no model needed)
No slides detected	SSIM threshold may need tuning for your video type. Presentation-style videos work best
LLM summary is empty	Check LLM provider config. Default is Ollama, so ensure Ollama is running with a model pulled
FFmpeg not found	brew install ffmpeg (macOS) or apt install ffmpeg (Linux)
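
Several of these problems come down to a missing system tool. A quick preflight check, sketched here with only the standard library, reports which of the prerequisites are not on your PATH. The tool list is an assumption based on the prerequisites in this document.

```python
import shutil

# Tools named in the prerequisites; adjust to match your setup.
REQUIRED_TOOLS = ("ffmpeg", "tesseract", "docker")


def find_missing(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Run `print(find_missing())` before starting the stack; an empty list means all three tools were found.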

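YouTube Model Feeder

Food for your model.

Stop pausing videos every 30 seconds to screenshot, paste into Obsidian, and caption. A 20-minute tutorial shouldn't take an hour to document.

YouTube Model Feeder extracts everything from a YouTube video (timestamped transcript, key frame snapshots, OCR of code and slides, presentation slide detection, and LLM-generated summaries) and packages it into structured knowledge your AI assistant can search, reference, and reason about.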

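Why This Exists

The problem isn't transcription; ten tools can do that. The problem is structured context. When you feed a raw transcript to a model, it has no visual context. It doesn't know what was on screen when the speaker said "as you can see here." It can't read the code in the terminal, the diagram on the slide, or the config file being edited.

YouTube Model Feeder captures all of that. The output isn't just text. It's a knowledge bundle: transcript segments aligned to timestamps, screenshots of every key moment, OCR text from code snippets and slides, and an LLM summary that ties it all together.

Combined with obsidian-semantic-search (also on ClawHub), every video you watch becomes permanently searchable by meaning in your Obsidian vault.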

What It Extracts

Full Pipeline

Step	Tool	What it produces
Download	yt-dlp	Video + audio + metadata (title, duration, thumbnail)
Transcribe

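Slide Detection (Deep)

Not just frame captures, but intelligent slide boundary detection:

1. Layout detection: classifies the video as full-frame, picture-in-picture, or split panel
2. SSIM transition scan: compares consecutive frames for structural changes (threshold: SSIM < 0.85)
3. LLM disambiguation: borderline transitions (0.85–0.93 SSIM) are sent to the LLM for classification
4. Slide grouping: merges transitions into slides with an enforced minimum duration (3s)
5. Final-state capture: saves the last frame of each slide as JPEG
6. OCR extraction: runs Tesseract on each slide image
7. Transcript alignment: maps transcript segments to slide time ranges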
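
The final step, transcript alignment, can be illustrated with a small overlap-based sketch. This is one plausible approach under stated assumptions, not necessarily how the project does it: each transcript segment is assigned to the slide whose time range it overlaps the most, and segments that overlap no slide are dropped.

```python
def align(segments, slides):
    """Map transcript text to slides by greatest time overlap.

    segments: list of (start, end, text) tuples, in seconds
    slides:   list of (start, end) tuples, in seconds
    Returns {slide_index: [text, ...]}.
    """
    aligned = {i: [] for i in range(len(slides))}
    for seg_start, seg_end, text in segments:
        best, best_overlap = None, 0.0
        for i, (slide_start, slide_end) in enumerate(slides):
            overlap = min(seg_end, slide_end) - max(seg_start, slide_start)
            if overlap > best_overlap:
                best, best_overlap = i, overlap
        if best is not None:
            aligned[best].append(text)
    return aligned
```

A segment that straddles a slide boundary lands on whichever side holds more of it, which keeps a sentence spoken during a transition attached to a single slide.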

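Output Formats

Format	What you get
Markdown	Timestamped sections with headings, code blocks, image references
HTML	Styled single-page doc with embedded screenshots
Obsidian bundle	ZIP export: markdown + images, ready to drop into your vault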

Installation

Prerequisites

    # macOS
    brew install ffmpeg tesseract

    # Linux
    apt install ffmpeg tesseract-ocr

Docker Desktop must be running for the full backend.

Start the Stack

    git clone https://github.com/celstnblacc/youtube-model-feeder.git
    cd youtube-model-feeder
    docker-compose up -d

This starts 5 services:

Service	Port	Purpose
api	8000	FastAPI backend + Swagger docs at /docs
celery_worker	none	Background video processing
postgres	5432	Job tracking, transcripts, documents
redis	6379	Task queue (Celery broker)
web	3000	Next.js frontend (optional)

Verify

Open http://localhost:8000/docs. You should see the Swagger API documentation.

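Usage

Via AI Assistant

Extract a video:

"Extract everything from this YouTube video and save it to my vault: https://youtube.com/watch?v=..."

Transcript only:

"Get the timestamped transcript for this video"

Slides and code screenshots:

"Extract all the code screenshots and presentation slides from this tutorial"

Obsidian export:

"Convert this video into an Obsidian note with screenshots and timestamps"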

Via API

    # Submit a video for processing
    curl -X POST http://localhost:8000/jobs \
      -H "Content-Type: application/json" \
      -d '{"url": "https://youtube.com/watch?v=dQw4w9WgXcQ"}'

    # Check job status
    curl http://localhost:8000/jobs/{job_id}

    # Get the generated document
    curl http://localhost:8000/videos/{video_id}

Via Web UI

Open http://localhost:3000, paste a YouTube URL, and watch the extraction happen in real time with progress tracking.

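LLM Provider Selection

Per-user configuration: choose your summarization engine.

Provider	Model (default)	Setup	Cost
Ollama (default)	Mistral 7B	Pre-installed locally	Free
OpenAI

Configure via the API: PATCH /settings/me with your preferred provider and API key (encrypted at rest with Fernet).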

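The Knowledge Pipeline

YouTube Model Feeder is designed to work with other ClawHub skills:

    YouTube video
      → youtube-model-feeder (extract transcript + screenshots + OCR + summary)
      → Obsidian vault (structured markdown + images)
      → obsidian-semantic-search (index with embeddings, search by meaning)
      → Your AI assistant (full context for every video you've watched)

Every video becomes permanent, searchable knowledge. Not buried in a playlist, but indexed and queryable.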

Architecture

    youtube-model-feeder/
    ├── backend/                          # FastAPI + Celery (Python 3.12)
    │   ├── app/
    │   │   ├── services/                 # Core extraction pipeline (6,372 LOC)
    │   │   │   ├── youtube.py            # yt-dlp wrapper
    │   │   │   ├── transcript.py         # Whisper + YouTube captions
    │   │   │   ├── snapshot.py           # FFmpeg frame extraction
    │   │   │   ├── slide_detection.py    # SSIM + OCR + alignment (643 LOC)
    │   │   │   ├── llm.py                # Summarization engine (1,229 LOC)
    │   │   │   └── llm_providers.py      # Ollama/OpenAI/Anthropic abstraction
    │   │   ├── routes/                   # REST API endpoints
    │   │   └── tasks.py                  # Celery background tasks
    ├── frontend/                         # Next.js 14 + React 18 + Tailwind + Zustand
    ├── docker-compose.yml                # 5 services
    └── migrations/                       # Alembic database migrations

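Troubleshooting

Problem	Fix
VirusTotal "suspicious" warning on install	False positive: the skill describes video extraction patterns. Use clawhub install youtube-model-feeder --force
Docker services won't start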

Links

- Source code: https://github.com/celstnblacc/youtube-model-feeder
- Obsidian Semantic Search: https://clawhub.ai/skills/obsidian-semantic-search
- License: MIT-0 (this skill) / Apache 2.0 (source code)

*Built by celstnblacc. Food for your model. 226 tests, 6 extraction stages, 3 LLM providers,
