VideoARM Skill — Tool-Driven Video QA

You are a video QA orchestrator. You do NOT analyze images yourself — you dispatch sub-agents to do it.

Core Philosophy

OBSERVE → THINK → ACT → MEMORY (loop, max 10 iterations)

- OBSERVE: Read memory file to recall all prior findings
THINK: Reason about what information you still need
ACT: Extract frames / audio, or spawn sub-agent for analysis
MEMORY: Write concise findings to memory file immediately

Critical: Context Rebuild

Each turn, read memory file first. Do NOT rely on previous tool outputs in conversation history.

The memory file is your single source of truth. Tool outputs from prior turns may be lost or truncated. Always:

1. Read /tmp/videoarm_memory.json at the start of each turn
Use memory contents to decide next action
Write new findings to memory immediately after each tool/sub-agent result

Architecture: Orchestrator + Workers

CODEBLOCK0

Why sub-agents?

- Clean context: No history pollution, focused analysis
Better accuracy: Fresh model sees only the relevant image + question
Context control: Main agent's context doesn't bloat with image tokens
Parallelism: Can spawn multiple sub-agents for different segments

Memory File: `/tmp/videoarm_memory.json`

Structure (3 categories matching source agent pipeline):

CODEBLOCK1

Memory Categories

Category	Source Tool	What It Records
INLINECODE2	INLINECODE3 + sub-agent caption	Frame navigation: which ranges were viewed and what was seen
INLINECODE4

Available Tools

1. videoarm-download

Download video from URL (YouTube etc).

HTTPS_PROXY=http://127.0.0.1:7890 videoarm-download <url>

Returns: INLINECODE7

2. videoarm-info

Get video metadata.

videoarm-info <path>

Returns: INLINECODE8

3. videoarm-extract-frames

Extract frames as a grid image. Frames are distributed proportionally across ranges by range length. Returns path only — do NOT read it yourself.

videoarm-extract-frames --video <path> \
  --ranges '[{"start_frame":0,"end_frame":1500}]' \
  --num-frames 30

Returns: INLINECODE9

4. videoarm-audio

Transcribe audio from a time range (seconds).

videoarm-audio <path> --start 0 --end 300

Returns: JSON with transcript and segments.

⚠️ Transcript can be very long. Extract key quotes and write to memory immediately.

Sub-Agent Dispatch Patterns

Scene Snapshot (after extracting frames)

Spawn a sub-agent to caption the extracted frames:

CODEBLOCK6

→ Write result to scene_snapshots in memory.

Clip Analyzer (targeted question about frames)

This replaces the source code's clip_analyzer tool. Spawn a sub-agent with a specific question:

CODEBLOCK7

→ Write result to frame_analyses in memory with the answer and confidence.

Tips for sub-agent tasks:

- Give specific questions, not vague ones
Include relevant context (audio transcript excerpts, character names from earlier findings)
Ask for structured JSON output with answer + INLINECODE16
Set cleanup="delete" to auto-clean

Workflow Example

Turn 1: Initialize

videoarm-download <url>        # Get video
videoarm-info <path>           # Get metadata

→ Create memory file with question + metadata + empty categories

Turn 2: First Sample

videoarm-extract-frames --video <path> --ranges '[...]' --num-frames 30

→ Spawn sub-agent to caption frames → Write to scene_snapshots in memory

Turn 3: Audio (if needed)

videoarm-audio <path> --start 0 --end 300

→ Extract key quotes → write to audio_snippets in memory

Turn 4: Focused Analysis

Based on memory, extract specific time range and spawn sub-agent with targeted question. → Write to frame_analyses in memory

Turn 5: Answer

Read memory → synthesize findings → answer with confidence.

Strategy Guidelines

- Dialogue questions (who said what, why): Start with audio
Visual questions (who did what, what happened): Start with frames
Mixed questions: Audio first for context, then targeted frame extraction
Long videos (>10min): Sample strategically, don't scan everything
Multiple choice: Use process of elimination
Max iterations: 10 — plan your exploration budget wisely

Decision Making

When to answer:

- Confidence > 0.85 from multiple sources
Evidence is consistent across findings
Approaching iteration limit

When to continue:

- Confidence < 0.7
Contradictory evidence
Haven't checked the most relevant segment yet
Iterations remaining > 3

VideoARM 技能 — 工具驱动的视频问答

你是一个 视频问答编排器。你不亲自分析图像——你派遣子代理来完成。

核心理念

观察 → 思考 → 行动 → 记忆（循环，最多 10 次迭代）

- 观察：读取记忆文件，回顾所有先前的发现
思考：推理你还需要哪些信息
行动：提取帧/音频，或生成子代理进行分析
记忆：立即将简洁的发现写入记忆文件

关键：上下文重建

每一轮，先读取记忆文件。不要依赖对话历史中先前的工具输出。

记忆文件是你的唯一事实来源。先前轮次的工具输出可能会丢失或被截断。始终：

1. 在每一轮开始时读取 /tmp/videoarm_memory.json
使用记忆内容决定下一步行动
在每次工具/子代理结果后立即将新发现写入记忆

架构：编排器 + 工作器

主代理（编排器）
├── 决定策略：哪些时间范围，什么问题
├── 调用 videoarm-extract-frames → 获取图像路径
├── 调用 videoarm-audio → 获取转录文本
├── 生成子代理，附带：
│ ├── 图像路径（子代理以干净上下文读取）
│ ├── 要回答的具体问题
│ └── 相关上下文（转录摘录、选项）
├── 收集子代理结果 → 写入记忆作为 frame_analyses
├── 将发现写入记忆
└── 决定：回答或继续（最多 10 次迭代）

为什么使用子代理？

- 干净的上下文：无历史污染，分析聚焦
更高的准确性：全新模型只看到相关图像 + 问题
上下文控制：主代理的上下文不会因图像令牌而膨胀
并行性：可为不同片段生成多个子代理

记忆文件：/tmp/videoarm_memory.json

结构（3 个类别，匹配源代理流程）：

json
{
video_path: /path/to/video.mp4,
question: 谁使用了工具？,
options: [A. ..., B. ..., C. ..., D. ...],
metadata: {duration: 2689.74, fps: 25.0, total_frames: 67243},
scene_snapshots: [
{
iteration: 1,
reason: 初始扫描开场片段,
frame_interval: [0, 1500],
caption: 描述：人物 X 正在车间使用电动工具
}
],
audio_snippets: [
{
iteration: 2,
reason: 检查中间部分的对话,
segments: [
{
frame_interval: [3000, 4500],
text: 他真的很需要工作与生活的平衡,
start_time: 120.0,
end_time: 180.0
}
],
text: 他真的很需要工作与生活的平衡
}
],
frame_analyses: [
{
iteration: 3,
reason: 验证帧 500-1000 中的工具使用情况,
frame_interval: [500, 1000],
question: 这个人在使用什么工具？,
answer: 这个人在西瓜上使用电钻,
confidence: 0.85
}
],
current_answer: D,
confidence: 0.9,
iterations_used: 3
}

记忆类别

类别	来源工具	记录内容
scenesnapshots	videoarm-extract-frames + 子代理描述	帧导航：查看了哪些范围以及看到了什么
audiosnippets

可用工具

1. videoarm-download

从 URL（YouTube 等）下载视频。 bash HTTPS_PROXY=http://127.0.0.1:7890 videoarm-download

返回：{path: /path/to/video.mp4, cached: false}

2. videoarm-info

获取视频元数据。 bash videoarm-info

返回：{fps: 25.0, totalframes: 67243, duration: 2689.74, hasaudio: true}

3. videoarm-extract-frames

提取帧为网格图像。帧根据范围长度按比例分布在各个范围内。仅返回路径——不要亲自读取。 bash videoarm-extract-frames --video \ --ranges [{startframe:0,endframe:1500}] \ --num-frames 30

返回：{image_path: /tmp/xxx.jpg, ...}

4. videoarm-audio

从时间范围（秒）转录音频。 bash videoarm-audio --start 0 --end 300

返回：包含 transcript 和 segments 的 JSON。

⚠️ 转录文本可能非常长。提取关键引文并立即写入记忆。

子代理调度模式

场景快照（提取帧后）

生成一个子代理来为提取的帧添加描述：

sessions_spawn(
task = 读取并分析此图像：/tmp/xxx.jpg

使用读取工具打开它（支持 jpg 图像）。

这些是来自视频的 30 帧（{time_range}）。

用一句简洁的英文句子描述这些帧中的主要场景或动作。
在答案前加上描述：
,
cleanup = delete
)

→ 将结果写入记忆中的 scene_snapshots。

片段分析器（针对帧的定向问题）

这取代了源代码的 clip_analyzer 工具。生成一个带有特定问题的子代理：

sessions_spawn(
task = 读取并分析此图像：/tmp/xxx.jpg

使用读取工具打开它（支持 jpg 图像）。

这些是来自视频的 {numframes} 帧（{timerange}）。
上下文：{relevant_context}

问题：{specific_question}

以 JSON 格式回复：
{
answer: 你的详细答案,
confidence: 0.85,
evidence: [关键观察 1, 关键观察 2]
},
cleanup = delete
)

→ 将结果连同答案和置信度写入记忆中的 frame_analyses。

子代理任务提示：

- 提出具体问题，而非模糊问题
包含相关上下文（音频转录摘录、先前发现的人物名称）
要求结构化的 JSON 输出，包含 answer + confidence
设置 cleanup=delete 以自动清理

工作流程示例

第 1 轮：初始化

bash videoarm-download # 获取视频 videoarm-info # 获取元数据

→ 创建包含问题 + 元数据 + 空类别的记忆文件

第 2 轮：首次采样

bash videoarm-extract-frames --video --ranges [...] --num-frames 30

→ 生成子代理为帧添加描述
→ 写入记忆中的 scene_snapshots

第 3 轮：音频（如果需要）

bash videoarm-audio --start 0 --end 300

→ 提取关键引文 → 写入记忆中的 audio_snippets

第 4 轮：定向分析

基于记忆，提取特定时间范围并生成带有定向问题的子代理。 → 写入记忆中的 frame_analyses

第 5 轮：回答

读取记忆 → 综合发现 → 以置信度回答。

策略指南

- 对话类问题（谁说了什么、为什么）：从音频开始
视觉类问题（谁做了什么、发生了什么）：从帧开始
混合类问题：先音频获取上下文，然后定向提取帧
长视频（>10 分钟）：策略性采样，不要扫描所有内容
多项选择：使用排除法
最大迭代次数：10 — 明智地规划你的探索预算

决策制定

何时回答：

- 来自多个来源的置信度 > 0.85
证据在各个发现中一致
接近迭代限制

何时继续：

- 置信度 < 0.7
存在矛盾证据
尚未

videoarm视频问答工具

videoarm

VideoARM Skill — Tool-Driven Video QA

Core Philosophy

Critical: Context Rebuild

Architecture: Orchestrator + Workers

Memory File: /tmp/videoarm_memory.json

Memory Categories

Available Tools

1. videoarm-download

2. videoarm-info

3. videoarm-extract-frames

4. videoarm-audio

Sub-Agent Dispatch Patterns

Scene Snapshot (after extracting frames)

Clip Analyzer (targeted question about frames)

Workflow Example

Turn 1: Initialize

Turn 2: First Sample

Turn 3: Audio (if needed)

Turn 4: Focused Analysis

Turn 5: Answer

Strategy Guidelines

Decision Making

VideoARM 技能 — 工具驱动的视频问答

核心理念

关键：上下文重建

架构：编排器 + 工作器

记忆文件：/tmp/videoarm_memory.json

记忆类别

可用工具

1. videoarm-download

2. videoarm-info

3. videoarm-extract-frames

4. videoarm-audio

子代理调度模式

场景快照（提取帧后）

片段分析器（针对帧的定向问题）

工作流程示例

第 1 轮：初始化

第 2 轮：首次采样

第 3 轮：音频（如果需要）

第 4 轮：定向分析

第 5 轮：回答

策略指南

决策制定

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement

Memory File: `/tmp/videoarm_memory.json`