Bilibili UP to KB
Convert B站 videos (single or entire channels) into cleaned, structured text knowledge bases.
Design Principle
Agent orchestrates, scripts execute. The agent's job is to decide WHAT to do and kick off
the right script. All mechanical, repetitive work (downloading, transcribing, cleaning) is
handled by shell scripts with built-in parallelism. The agent NEVER loops through videos
one by one — it runs ONE command and the script handles concurrency internally.
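The "one command, internal concurrency" idea can be illustrated with a minimal sketch. `run_parallel` is a hypothetical helper, not one of this skill's scripts; it assumes the worker command takes the item as its last argument and that `xargs` supports `-P`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the real batch scripts may be structured differently.
# Usage: <items on stdin, one per line> | run_parallel JOBS COMMAND [ARGS...]
run_parallel() {
  local jobs="$1"
  shift
  # One input line per invocation, at most $jobs invocations at a time.
  xargs -P "$jobs" -I {} "$@" {}
}

# The caller issues a single command; the fan-out happens inside.
printf '%s\n' BV1aaa BV1bbb BV1ccc | run_parallel 2 echo "would process"
```

Because the workers run in parallel, the order of the output lines is not guaranteed.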
Output Structure
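```
kb/UP主名_UID/
├── BV号_视频标题.txt          # Cleaned transcript (user-facing)
├── BV号_视频标题.meta.json    # Video metadata
├── index.md                   # Summary index
└── .raw/                      # Hidden: whisper transcripts (if any)
    └── BV号_视频标题.txt
```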
Key decisions:
- File names include the title for readability (`BV1xxx_标题.txt`)
- Folder includes the UP主 name (`UP主名_UID/`)
- Raw transcripts are hidden in `.raw/`
- No `_clean` suffix: clean files are the main files
- Per-video `.meta.json` with title, uploader, duration, etc.
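As a rough illustration of the naming scheme, the sketch below builds the transcript path from an uploader name, UID, BV id, and title. `kb_path` and its sanitizing rule (replacing path-unsafe characters and spaces with underscores) are assumptions made for this example; the actual scripts may sanitize titles differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the skill's scripts.
# kb_path UPLOADER UID BVID TITLE: print the cleaned-transcript path.
kb_path() {
  local uploader="$1" uid="$2" bvid="$3" title="$4" safe_title
  # Map characters that are awkward in file names to "_", then turn spaces
  # into "_" and squeeze runs of "_" into one.
  safe_title=$(printf '%s' "$title" | tr '/\\:*?"<>|' '_' | tr -s ' ' '_')
  printf 'kb/%s_%s/%s_%s.txt\n' "$uploader" "$uid" "$bvid" "$safe_title"
}

kb_path SomeUP 12345 BV1xxx 'My Title: Part 1/2'
```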
Full Pipeline
Step 1: Download AI subtitles (fast, high concurrency OK)
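```bash
# 30-50 concurrent downloads are fine: B站 CDN can handle it
scripts/batch_channel.sh https://space.bilibili.com/UID/ ./kb/output zh 0 30
```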
Step 2: For videos without AI subtitles, run whisper (LOW concurrency!)
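```bash
# Metal GPU can only handle 1-4 parallel whisper instances
# More is actually slower (GPU saturation)
scripts/batch_channel.sh https://space.bilibili.com/UID/ ./kb/output zh 0 2 --whisper-only
```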
Step 3: Clean + Index
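```bash
# Clean whisper transcripts (AI subtitles are skipped automatically)
scripts/batch_clean.sh ./kb/UP主名_UID/
scripts/generate_index.sh ./kb/UP主名_UID/
```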
Concurrency Guide
Critical: Different stages need different concurrency!
| Stage | Bottleneck | Recommended | Why |
|---|---|---|---|
| AI subtitle download | Network | 30-50 | B站 CDN handles high parallelism |
| Whisper transcribe | Metal GPU | 1-4 | GPU saturates; more instances run slower |
| Transcript cleaning | API rate limit | ALL (0) | Network I/O only |
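To make the table harder to get wrong when launching a stage, the recommended values can be encoded in a small lookup. This is a sketch only: `recommended_jobs` and the stage names `subtitles`, `whisper`, and `clean` are hypothetical, and the numbers are simply the conservative end of each range in the table.

```bash
#!/usr/bin/env bash
# Hypothetical lookup; stage names are invented for this example.
# recommended_jobs STAGE: print a concurrency value for the given stage.
recommended_jobs() {
  case "$1" in
    subtitles) echo 30 ;;  # network-bound: 30-50 is fine
    whisper)   echo 2  ;;  # GPU-bound: stay within 1-4
    clean)     echo 0  ;;  # 0 means "all at once" (API-bound)
    *) echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

recommended_jobs whisper
```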
Quick Start — Single Video
```bash
scripts/transcribe.sh https://www.bilibili.com/video/BV... ./output zh
```
Transcript Cleaning
AI subtitles are clean enough — skipped by default.
| Source | Cleaning needed? |
|---|---|
| B站 AI subtitles | No, directly usable |
| whisper fallback | Yes, goes through cleaning |
Cleaning uses opencode/minimax-m2.5-free:
1. Fix homophones and garbled words
2. Add punctuation
3. Output MUST be Simplified Chinese
4. Keep uncertain proper nouns unchanged
5. Never substitute one real term for another
Chunk size: 80 lines. Retry: 3 attempts with 3s delay.
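The chunking and retry policy can be sketched as two small helpers. Both `chunk_file` and `retry` are hypothetical names for illustration; how `batch_clean.sh` actually splits files and retries API calls is not shown in this document.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating "80-line chunks, 3 attempts, 3s delay".

# chunk_file FILE LINES OUTDIR: split FILE into pieces of LINES lines each.
chunk_file() {
  local file="$1" lines="$2" outdir="$3"
  mkdir -p "$outdir"
  split -l "$lines" "$file" "$outdir/chunk_"
}

# retry MAX DELAY COMMAND [ARGS...]: run COMMAND until it succeeds,
# at most MAX times, sleeping DELAY seconds between attempts.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (clean_chunk stands in for the real cleaning call):
#   chunk_file transcript.txt 80 ./chunks
#   retry 3 3 clean_chunk ./chunks/chunk_aa
```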
⚠️ Long-running tasks
Use nohup to avoid session compaction killing processes:
```bash
nohup bash scripts/batch_clean.sh ./kb/UP主名_UID/ 0 80 > /tmp/clean.log 2>&1 &
```
batch_clean.sh is resumable — safe to re-run after interruption.
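One common way to make a batch step resumable is to treat an existing, non-empty output file as "done". The sketch below assumes the layout described under Output Structure (raw transcripts in `.raw/`, cleaned files of the same name one level up); `pending_raw` is a hypothetical helper, and whether `batch_clean.sh` uses exactly this check is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the skill's scripts.
# pending_raw KB_DIR: print raw transcripts with no cleaned counterpart yet.
pending_raw() {
  local kb_dir="$1" raw
  for raw in "$kb_dir"/.raw/*.txt; do
    [ -e "$raw" ] || continue   # the glob matched nothing
    # A non-empty file of the same name in KB_DIR counts as already cleaned.
    [ -s "$kb_dir/$(basename "$raw")" ] || printf '%s\n' "$raw"
  done
}

# Example: pending_raw ./kb/UP主名_UID
```

Re-running after an interruption then only touches the files this function still lists.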
⚠️ Large Channel Handling (1000+ videos)
Script auto-detects large channels (>800 videos) and fetches in chunks to avoid timeout.
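```bash
# Chunks automatically; re-run to resume
nohup bash scripts/batch_channel.sh https://space.bilibili.com/UID/ ./kb/output > /tmp/batch.log 2>&1 &
```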
If still fails, manually fetch URL list:
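```bash
# Fetch the URL list in windows of 500 entries
for i in $(seq 1 500 2000); do
  yt-dlp --flat-playlist --playlist-start "$i" --playlist-end "$((i+499))" \
    --print url https://space.bilibili.com/UID/ >> /tmp/urls.txt
done
# Transcribe every collected URL, 20 at a time
xargs -P 20 -I {} bash scripts/transcribe.sh {} ./kb/OUTPUT zh < /tmp/urls.txt
```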
⚠️ Thermal & Fan Warning
Keep system cool — avoid fan spin!
| Stage | Risk | Mitigation |
|---|---|---|
| Whisper (GPU) | HIGH | Keep concurrency ≤2, monitor temps |
| AI subtitle download | Low | Can run 30-50 concurrent |
| Cleaning (API) | None | Pure network I/O, no local load |
If fans start spinning:
- Stop whisper processes immediately
- Wait for cooldown
- Resume with lower concurrency (1-2)
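```bash
# Check GPU temperature (if using CUDA)
nvidia-smi
# Check Mac CPU/GPU readings
sudo powermetrics --sample-rate 1000 -n 1 | grep -E "CPU|GPU"
```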
Dependencies
Required: yt-dlp, ffmpeg, whisper.cpp (+ model), opencode CLI
Optional: Browser cookies for member-only content (--cookies-from-browser chrome)
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `WHISPER_CLI` | `whisper-cli` | Path to whisper.cpp |
| `WHISPER_MODEL` | `~/.whisper-cpp/ggml-small.bin` | Whisper model |
| `OPENCODE_BIN` | `~/.opencode/bin/opencode` | opencode CLI |
| `CLEAN_MODEL` | `opencode/minimax-m2.5-free` | Cleaning model |
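A script can honor these variables with shell default expansion, so that anything already exported wins over the documented default. This is a minimal sketch: `load_defaults` is a hypothetical helper, and the spellings `WHISPER_CLI` and `WHISPER_MODEL` are assumed here by analogy with `OPENCODE_BIN` and `CLEAN_MODEL`.

```bash
#!/usr/bin/env bash
# Hypothetical helper; variable spellings WHISPER_CLI / WHISPER_MODEL are assumed.
# load_defaults: fill in any unset or empty variable with its default.
load_defaults() {
  : "${WHISPER_CLI:=whisper-cli}"
  : "${WHISPER_MODEL:=$HOME/.whisper-cpp/ggml-small.bin}"
  : "${OPENCODE_BIN:=$HOME/.opencode/bin/opencode}"
  : "${CLEAN_MODEL:=opencode/minimax-m2.5-free}"
}

load_defaults
echo "cleaning model: $CLEAN_MODEL"
```

To override a value for one run, set it on the command line, for example `CLEAN_MODEL=some/other-model bash scripts/batch_clean.sh ./kb/UP主名_UID/`, where `some/other-model` is a placeholder.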
Tips
- China users: Use `hf-mirror.com` for the whisper model
- Long videos (1h+): Auto-segmented into 10-min chunks
- Resumable: All batch scripts skip already-processed files
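For a sense of what 10-minute segmentation involves, the sketch below computes how many chunks a video yields and shows one way to cut audio with ffmpeg's segment muxer. Both function names are hypothetical, and the ffmpeg invocation is an assumption about how such a split could be done (16 kHz mono 16-bit WAV is the input format whisper.cpp expects), not a copy of what the scripts run.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the skill's scripts.

# segment_count DURATION_SECONDS [SEGMENT_SECONDS]: number of chunks, rounded up.
segment_count() {
  local duration="$1" seg="${2:-600}"
  echo $(( (duration + seg - 1) / seg ))
}

# split_audio INPUT OUT_PREFIX: write OUT_PREFIX_000.wav, OUT_PREFIX_001.wav, ...
# as 10-minute, 16 kHz mono WAV files. Requires ffmpeg; not called by this sketch.
split_audio() {
  ffmpeg -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le \
    -f segment -segment_time 600 "${2}_%03d.wav"
}

segment_count 5400   # a 90-minute video
```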
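Bilibili UP to Knowledge Base
Convert B站 videos (single or entire channels) into cleaned, structured text knowledge bases.
Design Principle
The agent orchestrates, scripts execute. The agent's job is to decide what to do and launch the right script. All mechanical, repetitive work (downloading, transcribing, cleaning) is handled by shell scripts with built-in parallelism. The agent never iterates over videos one by one: it runs a single command and the script handles concurrency internally.
Output Structure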
```
kb/UP主名_UID/
├── BV号_视频标题.txt          # Cleaned transcript (user-facing)
├── BV号_视频标题.meta.json    # Video metadata
├── index.md                   # Summary index
└── .raw/                      # Hidden: whisper transcripts (if any)
    └── BV号_视频标题.txt
```
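Key decisions:
- File names include the title for readability (`BV1xxx_标题.txt`)
- Folder includes the UP主 name (`UP主名_UID/`)
- Raw transcripts are hidden in `.raw/`
- No `_clean` suffix: cleaned files are the main files
- Each video has a `.meta.json` with title, uploader, duration, etc.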
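Full Pipeline
Step 1: Download AI subtitles (fast, high concurrency OK)
```bash
# 30-50 concurrent downloads are fine: B站 CDN can handle it
scripts/batch_channel.sh https://space.bilibili.com/UID/ ./kb/output zh 0 30
```
Step 2: For videos without AI subtitles, run whisper (LOW concurrency!)
```bash
# Metal GPU can only handle 1-4 parallel whisper instances
# More is actually slower (GPU saturation)
scripts/batch_channel.sh https://space.bilibili.com/UID/ ./kb/output zh 0 2 --whisper-only
```
Step 3: Clean + Index
```bash
# Clean whisper transcripts (AI subtitles are skipped automatically)
scripts/batch_clean.sh ./kb/UP主名_UID/
scripts/generate_index.sh ./kb/UP主名_UID/
```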
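Concurrency Guide
Critical: Different stages need different concurrency!
| Stage | Bottleneck | Recommended | Why |
|---|---|---|---|
| AI subtitle download | Network | 30-50 | B站 CDN handles high parallelism |
| Whisper transcribe | Metal GPU | 1-4 | GPU saturates; more instances run slower |
| Transcript cleaning | API rate limit | ALL (0) | Network I/O only |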
Quick Start: Single Video
```bash
scripts/transcribe.sh https://www.bilibili.com/video/BV... ./output zh
```
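Transcript Cleaning
AI subtitles are clean enough and are skipped by default.
| Source | Cleaning needed? |
|---|---|
| B站 AI subtitles | No, directly usable |
| whisper fallback | Yes, goes through cleaning |
Cleaning uses opencode/minimax-m2.5-free:
1. Fix homophones and garbled words
2. Add punctuation
3. Output MUST be Simplified Chinese
4. Keep uncertain proper nouns unchanged
5. Never substitute one real term for another
Chunk size: 80 lines. Retry: 3 attempts with 3s delay.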
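⚠️ Long-running tasks
Use nohup so that session compaction does not kill the process:
```bash
nohup bash scripts/batch_clean.sh ./kb/UP主名_UID/ 0 80 > /tmp/clean.log 2>&1 &
```
batch_clean.sh is resumable, so it is safe to re-run after an interruption.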
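⚠️ Large Channel Handling (1000+ videos)
The script auto-detects large channels (>800 videos) and fetches them in chunks to avoid timeouts.
```bash
# Chunks automatically; re-run to resume
nohup bash scripts/batch_channel.sh https://space.bilibili.com/UID/ ./kb/output > /tmp/batch.log 2>&1 &
```
If it still fails, fetch the URL list manually:
```bash
# Fetch the URL list in windows of 500 entries
for i in $(seq 1 500 2000); do
  yt-dlp --flat-playlist --playlist-start "$i" --playlist-end "$((i+499))" \
    --print url https://space.bilibili.com/UID/ >> /tmp/urls.txt
done
# Transcribe every collected URL, 20 at a time
xargs -P 20 -I {} bash scripts/transcribe.sh {} ./kb/OUTPUT zh < /tmp/urls.txt
```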
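The manual fetch walks the playlist in fixed-size windows. The sketch below prints those windows for an arbitrary channel size, clamping the last one to the total; `playlist_windows` is a hypothetical helper, useful mainly for checking that a chosen window size covers every video exactly once.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the skill's scripts.
# playlist_windows TOTAL SIZE: print "start end" pairs covering 1..TOTAL.
playlist_windows() {
  local total="$1" size="$2" start end
  for start in $(seq 1 "$size" "$total"); do
    end=$((start + size - 1))
    [ "$end" -le "$total" ] || end="$total"   # clamp the final window
    printf '%s %s\n' "$start" "$end"
  done
}

playlist_windows 1200 500
```

Each pair maps onto the `--playlist-start` and `--playlist-end` values of one yt-dlp call.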
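⚠️ Thermal & Fan Warning
Keep the system cool and avoid fan spin!
| Stage | Risk | Mitigation |
|---|---|---|
| Whisper (GPU) | HIGH | Keep concurrency ≤2, monitor temps |
| AI subtitle download | Low | Can run 30-50 concurrent |
| Cleaning (API) | None | Pure network I/O, no local load |
If fans start spinning:
- Stop whisper processes immediately
- Wait for cooldown
- Resume with lower concurrency (1-2)
```bash
# Check GPU temperature (if using CUDA)
nvidia-smi
# Check Mac CPU/GPU readings
sudo powermetrics --sample-rate 1000 -n 1 | grep -E "CPU|GPU"
```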
Dependencies
Required: yt-dlp, ffmpeg, whisper.cpp (+ model), opencode CLI
Optional: Browser cookies for member-only content (--cookies-from-browser chrome)
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `WHISPER_CLI` | `whisper-cli` | Path to whisper.cpp |
| `WHISPER_MODEL` | `~/.whisper-cpp/ggml-small.bin` | Whisper model |
| `OPENCODE_BIN` | `~/.opencode/bin/opencode` | opencode CLI |
| `CLEAN_MODEL` | `opencode/minimax-m2.5-free` | Cleaning model |
Tips
- China users: Use `hf-mirror.com` for the whisper model
- Long videos (1h+): Auto-segmented into 10-min chunks
- Resumable: All batch scripts skip already-processed files