Getting Started

Welcome! I'm your AI Gemini video analyst — ready to watch, understand, and extract insights from your footage so you don't have to. Drop in your video and tell me what you'd like to know — let's get started!

Try saying:

- "Summarize the key points discussed in this product demo video"
- "Identify and timestamp every scene change in this footage"
- "What emotions or reactions are visible in the people on screen throughout this clip?"

Quick Start Setup

This skill connects to NemoVideo's cloud backend. On first use, set up the connection automatically and let the user know ("Connecting to NemoVideo...").

Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:

- Read ~/.config/nemovideo/client_id or generate a UUID and save it there
- POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with X-Client-Id header
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, 7-day expiry)

Session: POST /api/tasks/me/with-session/nemo_agent at the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
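
The setup steps above can be sketched in Python as follows. This is a minimal sketch, not the skill's actual implementation: it only builds the two requests, without sending them, so the header and body shapes are visible. The helper names (`load_or_create_client_id`, `anonymous_token_request`, `session_request`) and the `Content-Type: application/json` header are assumptions; the URLs, the `X-Client-Id` header, and the `task_name` body come from the steps above.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"
CLIENT_ID_PATH = Path.home() / ".config" / "nemovideo" / "client_id"


def load_or_create_client_id(path: Path = CLIENT_ID_PATH) -> str:
    """Return the saved client ID, or generate a UUID and save it."""
    if path.exists():
        saved = path.read_text().strip()
        if saved:
            return saved
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id)
    return client_id


def anonymous_token_request(client_id: str) -> Request:
    """Build (but do not send) the anonymous-token request."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def session_request(token: str) -> Request:
    """Build (but do not send) the session-creation request."""
    body = json.dumps({"task_name": "project"}).encode("utf-8")
    return Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumption: JSON body
        },
    )
```

Sending either request with `urllib.request.urlopen` and reading `data.token` from the parsed JSON would complete the flow. When `NEMO_TOKEN` is already set in the environment, only `session_request` is needed.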

Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.

Let Google Gemini See and Understand Your Videos

Most video tools can trim, cut, or apply effects — but they can't actually understand what's happening on screen. The ai-gemini skill changes that by running your video through Google Gemini's advanced multimodal reasoning engine, turning raw footage into structured, meaningful information you can actually use.

Whether you're a marketer trying to pull key messages from a product demo, a researcher cataloging interview footage, or a content creator looking for highlight moments, ai-gemini gives you a smart assistant that watches the video for you and reports back in plain language. Ask it to summarize the content, identify speakers, describe visual scenes, or flag specific moments — it handles all of it naturally.

This skill is built for people who work with video at scale or simply want to stop wasting time on manual review. Instead of watching a 45-minute recording to find one quote, let ai-gemini surface it in seconds. It's not just transcription — it's genuine video comprehension powered by one of the most capable AI models available today.

Routing Your Gemini Video Requests

Every request you send is parsed for intent and automatically routed to the appropriate Gemini multimodal endpoint — whether you're asking for scene breakdowns, transcript extraction, object detection, or sentiment analysis across a video.

| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
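
The routing table above could be implemented as a simple keyword lookup. The sketch below is one possible reading, not the skill's actual router: case-insensitive substring matching, the `route` function, and the action names (`export`, `credits`, `state`, `upload`, `sse`) are all assumptions made for illustration.

```python
from typing import Tuple

# (keywords, action, skip_sse) rows, in the order the table lists them.
ROUTES = [
    (("export", "导出", "download", "send me the video"), "export", True),
    (("credits", "积分", "balance", "余额"), "credits", True),
    (("status", "状态", "show tracks"), "state", True),
    (("upload", "上传"), "upload", True),
]


def route(message: str, has_file: bool = False) -> Tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    if has_file:
        # A file attachment always counts as an upload request.
        return "upload", True
    text = message.lower()
    for keywords, action, skip_sse in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, skip_sse
    # Everything else (generate, edit, add BGM...) goes through SSE.
    return "sse", False
```

Substring matching is deliberately crude; a message such as "don't export yet" would still route to export, so a real router would likely need a stricter intent check.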

NemoVideo API Under the Hood

The NemoVideo backend acts as the orchestration layer between ClawHub and Google Gemini's multimodal models, handling video ingestion, chunking, and prompt forwarding so Gemini can process frames and audio streams at scale. Authentication tokens, session context, and credit allocation are all managed server-side through NemoVideo's infrastructure.

Skill attribution — read from this file's YAML frontmatter at runtime:

- X-Skill-Source: ai-gemini
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
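
A minimal sketch of how the required headers might be assembled, assuming the install path is known to the caller. The function names `detect_platform` and `build_headers` are hypothetical; the header names and the path-to-platform mapping are taken from the text above.

```python
import os
from typing import Dict


def detect_platform(install_path: str) -> str:
    """Map the skill's install path to a platform name."""
    path = os.path.expanduser(install_path)
    if path.startswith(os.path.expanduser("~/.clawhub/")):
        return "clawhub"
    if path.startswith(os.path.expanduser("~/.cursor/skills/")):
        return "cursor"
    return "unknown"


def build_headers(token: str, source: str, version: str, install_path: str) -> Dict[str, str]:
    """Return the auth and attribution headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": source,
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place makes it harder to forget an attribution header, which the text says causes export to fail with 402.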

API base: https://mega-api-prod.nemovideo.ai

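Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=ai-gemini&skill_version=1.0.0&skill_source=<skill_source>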

Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
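
The message body and the reading of the event stream can be sketched as below. The body mirrors the JSON shown above. The parser follows the generic server-sent events framing (`data:` lines grouped until a blank line) and makes no assumption about what NemoVideo puts inside each event. Both function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def build_message(session_id: str, text: str) -> bytes:
    """Encode the /run_sse request body for one user message."""
    payload = {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }
    return json.dumps(payload).encode("utf-8")


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a text/event-stream."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data joins with "\n".
            yield "\n".join(buffer)
            buffer = []
        # Comment lines (": ...") and other fields are ignored here.
```

Because the stream can stay open for up to 15 minutes, the caller would normally feed `iter_sse_data` from a streaming response rather than a fully buffered one.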

Upload: POST /api/upload-video/nemo_agent/me/<sid>, either as a multipart file (-F "files=@/path") or by URL with body {"urls":["<url>"],"source_type":"url"}

Credits: GET /api/credits/balance/simple returns available, frozen, total

Session state: GET /api/state/nemo_agent/me/<sid>/latest; key fields: data.state.draft, data.state.video_infos, data.state.generated_media

Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
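
The export flow above amounts to one POST followed by a polling loop. The sketch below keeps the HTTP call out of the loop by accepting a `fetch_status` callable, so it can be exercised without a network. Several details are assumptions: that `<ts>` is an epoch timestamp in seconds, that `status` and `output.url` sit at the top level of the poll response, and the `max_polls` cap (the source gives no upper bound).

```python
import time
from typing import Any, Callable, Dict, Optional


def build_render_body(session_id: str, draft: Dict[str, Any],
                      now: Optional[float] = None) -> Dict[str, Any]:
    """Build the POST /api/render/proxy/lambda body."""
    ts = int(time.time() if now is None else now)  # assumption: epoch seconds
    return {
        "id": f"render_{ts}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(fetch_status: Callable[[str], Dict[str, Any]],
                render_id: str,
                interval: float = 30.0,
                max_polls: int = 60,  # hypothetical cap, not from the source
                sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until status is "completed", then return the download URL."""
    for _ in range(max_polls):
        result = fetch_status(render_id)
        if result.get("status") == "completed":
            return result["output"]["url"]
        sleep(interval)
    raise TimeoutError(f"render {render_id} not completed after {max_polls} polls")
```

In real use, `fetch_status` would issue GET /api/render/proxy/lambda/<id> with the auth and attribution headers and return the parsed JSON.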

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

SSE Event Handling

Event	Action
Text response	Apply GUI translation (§4), present to user
Tool call/result

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
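
One way to handle the silent case is to fall back to session state whenever the stream produced no text. This is a sketch under assumptions: `resolve_reply` is a hypothetical helper, `fetch_state` stands in for GET /api/state/nemo_agent/me/<sid>/latest, and the fallback only reports the current track count. It does not compare against the previous draft, which a real "was the edit applied" check would need.

```python
from typing import Any, Callable, Dict, Iterable


def resolve_reply(text_parts: Iterable[str],
                  fetch_state: Callable[[], Dict[str, Any]]) -> str:
    """Return the streamed text, or a state-based summary if there is none."""
    text = "".join(text_parts).strip()
    if text:
        return text
    # No text in the SSE stream: poll session state instead.
    state = fetch_state()
    draft = state.get("data", {}).get("state", {}).get("draft") or {}
    tracks = draft.get("t", [])  # t = tracks, per the draft field mapping
    return f"No text reply from the backend; the draft currently has {len(tracks)} track(s)."
```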

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

Backend says	You do
"click [button]" / "点击"	Execute via API
"open [panel]" / "打开"

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

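Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)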
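
The draft field mapping can be turned into a plain-language timeline summary along these lines. The exact nesting of the draft is not documented here, so the sketch assumes `t` is a list of track objects, each with a `tt` type code and an `sg` list of segments that each carry a duration `d` in milliseconds. `summarize_draft` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Track type codes from the draft field mapping.
TRACK_TYPES = {0: "Video", 1: "Audio", 7: "Text"}


def summarize_draft(draft: Dict[str, Any]) -> List[str]:
    """Return one summary line per track, preceded by a header line."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "Unknown")
        # Sum segment durations (ms) to get the track length.
        total_ms = sum(seg.get("d", 0) for seg in track.get("sg", []))
        lines.append(f"{index}. {kind}: {total_ms / 1000:g}s")
    return lines
```

Unknown type codes are labeled rather than dropped, so a track type missing from the mapping still shows up in the summary.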

Error Handling

Code	Meaning	Action
0	Success	Continue
1001

Common Workflows

One of the most popular uses of the ai-gemini skill is post-meeting analysis. Upload a recorded team call or client meeting and ask Gemini to extract action items, decisions made, and unresolved questions — saving hours of note-taking and follow-up.

Content repurposing is another high-value workflow. Feed a long-form video like a webinar or tutorial into ai-gemini and ask it to identify the top five quotable moments or generate a blog-ready outline based on what was covered. This turns a single video asset into multiple content pieces effortlessly.

For educators and trainers, ai-gemini excels at reviewing instructional video content. Ask it to flag sections where a concept was explained unclearly, or generate a comprehension quiz based on what was taught. It reads visual context too, so diagrams and on-screen text are factored into its responses — not just the audio track.

FAQ

What kinds of questions can I ask about my video? You can ask nearly anything — from 'What is this video about?' to very specific queries like 'At what point does the presenter mention pricing?' or 'Describe the background setting in each scene.' Gemini understands both visual and audio content together.

Does ai-gemini work on videos without spoken dialogue? Yes. Since Gemini is multimodal, it analyzes visual content independently of audio. Silent videos, screen recordings, and footage with background music can all be processed and described meaningfully.

How long can the video be? Performance is best on videos up to 30 minutes, though longer files in supported formats (mp4, mov, avi, webm, mkv) can be processed. For very long recordings, consider splitting into segments for faster and more focused results.

Can it detect specific people or objects? Gemini can describe people, objects, and environments based on visual appearance and context, though it does not perform biometric identification by name unless the person is introduced verbally or via on-screen text.

