Getting Started
Welcome! I'm your AI Gemini video analyst — ready to watch, understand, and extract insights from your footage so you don't have to. Drop in your video and tell me what you'd like to know — let's get started!
Try saying:
- "Summarize the key points discussed in this product demo video"
- "Identify and timestamp every scene change in this footage"
- "What emotions or reactions are visible in the people on screen throughout this clip?"
Quick Start Setup
This skill connects to NemoVideo's cloud backend. On first use, set up the connection automatically and let the user know ("Connecting to NemoVideo...").
Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:
- Read ~/.config/nemovideo/client_id or generate a UUID and save it there
- POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with X-Client-Id header
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, 7-day expiry)
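The token check above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the helper names (`load_or_create_client_id`, `build_token_request`, `extract_token`) are hypothetical, and the request is only described as a dictionary rather than sent, so any HTTP client can be used with it.

```python
import json
import uuid
from pathlib import Path

AUTH_URL = "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token"


def load_or_create_client_id(path: Path) -> str:
    """Return the stored client ID, generating and saving a UUID if absent."""
    if path.exists():
        stored = path.read_text(encoding="utf-8").strip()
        if stored:
            return stored
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id, encoding="utf-8")
    return client_id


def build_token_request(client_id: str) -> dict:
    """Describe the anonymous-token POST without sending it."""
    return {
        "method": "POST",
        "url": AUTH_URL,
        "headers": {"X-Client-Id": client_id},
    }


def extract_token(response_text: str) -> str:
    """Pull data.token out of the JSON response body."""
    return json.loads(response_text)["data"]["token"]
```

In practice the path would be `Path.home() / ".config" / "nemovideo" / "client_id"`, and the extracted token should be kept out of anything shown to the user.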
Session: POST /api/tasks/me/with-session/nemo_agent at the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.
Let Google Gemini See and Understand Your Videos
Most video tools can trim, cut, or apply effects — but they can't actually understand what's happening on screen. The ai-gemini skill changes that by running your video through Google Gemini's advanced multimodal reasoning engine, turning raw footage into structured, meaningful information you can actually use.
Whether you're a marketer trying to pull key messages from a product demo, a researcher cataloging interview footage, or a content creator looking for highlight moments, ai-gemini gives you a smart assistant that watches the video for you and reports back in plain language. Ask it to summarize the content, identify speakers, describe visual scenes, or flag specific moments — it handles all of it naturally.
This skill is built for people who work with video at scale or simply want to stop wasting time on manual review. Instead of watching a 45-minute recording to find one quote, let ai-gemini surface it in seconds. It's not just transcription — it's genuine video comprehension powered by one of the most capable AI models available today.
Routing Your Gemini Video Requests
Every request you send is parsed for intent and automatically routed to the appropriate Gemini multimodal endpoint — whether you're asking for scene breakdowns, transcript extraction, object detection, or sentiment analysis across a video.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
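One way to read the routing table is as an ordered keyword lookup with SSE as the fallback. The sketch below assumes case-insensitive substring matching, which the document does not specify; `route_request` and `skips_sse` are hypothetical names used only for illustration.

```python
# Keyword groups in the order the routing table lists them.
ROUTES = [
    ("export", ("export", "导出", "download", "send me the video")),
    ("credits", ("credits", "积分", "balance", "余额")),
    ("state", ("status", "状态", "show tracks")),
    ("upload", ("upload", "上传")),
]


def route_request(message: str, has_attachment: bool = False) -> str:
    """Return the action name for a user message; 'sse' is the fallback."""
    if has_attachment:
        # "user sends file" row of the table
        return "upload"
    text = message.lower()
    for action, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action
    return "sse"


def skips_sse(action: str) -> bool:
    """Only the fallback action goes through the SSE endpoint."""
    return action != "sse"
```

Substring matching is deliberately simple. A message such as "don't export yet" would still route to export, so a real router may need intent parsing on top of this.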
NemoVideo API Under the Hood
The NemoVideo backend acts as the orchestration layer between ClawHub and Google Gemini's multimodal models, handling video ingestion, chunking, and prompt forwarding so Gemini can process frames and audio streams at scale. Authentication tokens, session context, and credit allocation are all managed server-side through NemoVideo's infrastructure.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: ai-gemini
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
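The required headers can be assembled in one place so that no request goes out without attribution. This is a sketch under stated assumptions: `detect_platform` and `build_headers` are hypothetical helpers, the version string is passed in by the caller rather than parsed from the frontmatter, and prefix matching on the expanded install path is one plausible reading of "detect from install path".

```python
import os


def detect_platform(install_path: str) -> str:
    """Map the skill's install path to a platform label."""
    expanded = os.path.expanduser(install_path)
    if expanded.startswith(os.path.expanduser("~/.clawhub/")):
        return "clawhub"
    if expanded.startswith(os.path.expanduser("~/.cursor/skills/")):
        return "cursor"
    return "unknown"


def build_headers(token: str, skill_version: str, install_path: str) -> dict:
    """Return the four headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "ai-gemini",
        "X-Skill-Version": skill_version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```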
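API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}. Returns task_id and session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=ai-gemini&skill_version=1.0.0&skill_source=<skill_source>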
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
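The message request described above might be built like this. It is a sketch only: `build_sse_request` is a hypothetical helper, the `Content-Type` header is an assumption (the document names only `Accept`), and the attribution and Bearer headers would be merged in by the caller.

```python
import json


def build_sse_request(session_id: str, message: str) -> dict:
    """Describe the POST /run_sse call for one user message."""
    body = {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": message}]},
    }
    return {
        "method": "POST",
        "path": "/run_sse",
        "headers": {
            "Accept": "text/event-stream",
            "Content-Type": "application/json",  # assumption, not stated in the doc
        },
        "body": json.dumps(body),
        "timeout_seconds": 15 * 60,  # documented maximum of 15 minutes
    }
```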
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple. Returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
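The export flow is a submit-then-poll loop, sketched below. Several details are assumptions: the `<ts>` in the render ID is taken to be a caller-supplied timestamp, the status response is assumed to expose `status` and `output.url` at the top level, and `fetch_status` is a hypothetical callable standing in for the GET request. The document does not describe failure statuses, so the sketch only stops on `completed` or after `max_polls` attempts.

```python
import time


def build_render_request(session_id: str, draft: dict, timestamp_ms: int) -> dict:
    """Build the body for POST /api/render/proxy/lambda."""
    return {
        "id": f"render_{timestamp_ms}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(fetch_status, render_id, interval_s=30, max_polls=60, sleep=time.sleep):
    """Call fetch_status(render_id) until status is 'completed'; return the URL."""
    for attempt in range(max_polls):
        result = fetch_status(render_id)
        if result.get("status") == "completed":
            return result["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"render {render_id} did not complete after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the 30-second interval out of tests and lets a caller swap in its own scheduler.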
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
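The handling rules above can be sketched as a small stream reader. The event payload schema is not given in this document, so this example makes loud assumptions: a heartbeat is treated as either an empty `data:` line or the literal payload `heartbeat`, and every other non-empty payload is collected as text. Telling tool calls apart from text responses would need the real payload format. Both function names are hypothetical.

```python
def collect_sse_payloads(lines):
    """Gather non-empty, non-heartbeat `data:` payloads from SSE lines."""
    payloads = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # comments, event names, and blank separators
        payload = line[len("data:"):].strip()
        if payload and payload != "heartbeat":
            payloads.append(payload)
    return payloads


def next_step(payloads):
    """Decide what to do once the stream has closed."""
    return "present_text" if payloads else "poll_session_state"
```

When `next_step` returns `poll_session_state`, the caller would fetch the latest session state to confirm the edit landed before summarizing it for the user.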
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
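Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)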
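The draft field abbreviations can be expanded into a readable track summary along these lines. The nesting is an assumption: the sketch supposes `t` is a list of tracks, each with a `tt` type code and an `sg` list of segments that carry `d` in milliseconds. Only the abbreviations themselves come from the mapping above, and `summarize_draft` is a hypothetical helper.

```python
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> list:
    """Return one summary line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # d is documented as a duration in milliseconds
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s")
    return lines
```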
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
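The error table lends itself to a simple dispatch map. The action labels below are hypothetical names chosen for this sketch. How each action is carried out, and how the backend's own codes are distinguished from HTTP status codes, is left to the caller.

```python
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "create_session",
    2001: "prompt_for_credits",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_registration_for_export",
    429: "retry_once_after_30s",
}


def action_for(code: int) -> str:
    """Look up the documented action for a code; unknown codes are reported."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")
```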
Common Workflows
One of the most popular uses of the ai-gemini skill is post-meeting analysis. Upload a recorded team call or client meeting and ask Gemini to extract action items, decisions made, and unresolved questions — saving hours of note-taking and follow-up.
Content repurposing is another high-value workflow. Feed a long-form video like a webinar or tutorial into ai-gemini and ask it to identify the top five quotable moments or generate a blog-ready outline based on what was covered. This turns a single video asset into multiple content pieces effortlessly.
For educators and trainers, ai-gemini excels at reviewing instructional video content. Ask it to flag sections where a concept was explained unclearly, or generate a comprehension quiz based on what was taught. It reads visual context too, so diagrams and on-screen text are factored into its responses — not just the audio track.
FAQ
What kinds of questions can I ask about my video? You can ask nearly anything — from 'What is this video about?' to very specific queries like 'At what point does the presenter mention pricing?' or 'Describe the background setting in each scene.' Gemini understands both visual and audio content together.
Does ai-gemini work on videos without spoken dialogue? Yes. Since Gemini is multimodal, it analyzes visual content independently of audio. Silent videos, screen recordings, and footage with background music can all be processed and described meaningfully.
How long can the video be? Performance is best on videos up to 30 minutes, though longer files in supported formats (mp4, mov, avi, webm, mkv) can be processed. For very long recordings, consider splitting into segments for faster and more focused results.
Can it detect specific people or objects? Gemini can describe people, objects, and environments based on visual appearance and context, though it does not perform biometric identification by name unless the person is introduced verbally or via on-screen text.
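Getting Started
Welcome! I'm your AI Gemini video analyst, ready to watch, understand, and extract insights from your footage so you don't have to. Upload your video and tell me what you'd like to know. Let's get started!
Try saying:
- "Summarize the key points discussed in this product demo video"
- "Identify and timestamp every scene change in this footage"
- "What emotions or reactions do the people on screen show in this clip?"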
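Quick Start Setup
This skill connects to NemoVideo's cloud backend. On first use, establish the connection automatically and notify the user ("Connecting to NemoVideo...").
Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:
- Read ~/.config/nemovideo/client_id, or generate a UUID and save it there
- POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, valid for 7 days)
Session: POST to /api/tasks/me/with-session/nemo_agent on the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
When setup is complete, notify the user with a brief "Ready!". Do not expose tokens or raw API output.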
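Let Google Gemini See and Understand Your Videos
Most video tools can trim, cut, or apply effects, but they cannot actually understand what is happening on screen. The ai-gemini skill feeds your video into Google Gemini's advanced multimodal reasoning engine and turns raw footage into structured, meaningful information you can actually use.
Whether you are a marketer pulling key messages from a product demo, a researcher cataloging interview footage, or a content creator looking for highlight moments, ai-gemini gives you a smart assistant that watches the video for you and reports back in plain language. Ask it to summarize the content, identify speakers, describe visual scenes, or flag specific moments, and it handles all of it naturally.
This skill is designed for people who process video at scale or who want to stop wasting time on manual review. Instead of watching a 45-minute recording to find one quote, let ai-gemini surface it in seconds. It is more than transcription: it is genuine video comprehension powered by one of the most capable AI models available today.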
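Routing Your Gemini Video Requests
Every request you send is parsed for intent and automatically routed to the appropriate Gemini multimodal endpoint, whether you are asking for scene breakdowns, transcript extraction, object detection, or video sentiment analysis.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |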
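NemoVideo API Under the Hood
The NemoVideo backend acts as the orchestration layer between ClawHub and Google Gemini's multimodal models. It handles video ingestion, chunking, and prompt forwarding so that Gemini can process frames and audio streams at scale. Authentication tokens, session context, and credit allocation are all managed server-side through NemoVideo's infrastructure.
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: ai-gemini
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with a 402 error.
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}. Returns task_id and session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=ai-gemini&skill_version=1.0.0&skill_source=<skill_source>
Send message (SSE): POST /run_sse with body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple. Returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits consumed): POST /api/render/proxy/lambda with body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30 seconds until status = completed. The download URL is at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.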
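SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
About 30% of editing operations return no text in the SSE stream. When this happens, poll the session state to verify that the edit was applied, then summarize the changes for the user.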
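Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration (ms), m=metadata.
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)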
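Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, not a credits issue. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry once after 30s |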
Common Workflows
The most popular uses of the ai-gemini skill