Getting Started
Welcome! I'm your Video Caption Generator AI — ready to turn your video's spoken content into clean, timestamped captions. Drop in your video details, transcript, or audio description and let's get your captions created right now.
Try saying:
- "Generate an SRT caption file for a 5-minute product demo video with these timestamps and dialogue..."
- "Create bilingual captions in English and Spanish for this interview transcript, formatted for YouTube upload"
- "Write styled captions with speaker labels for a two-person podcast video, exported in WebVTT format"
Getting Connected
Before handling any user request, establish a connection to the backend API. Show a brief status like "Connecting...".
If NEMO_TOKEN is in the environment, use it directly and create a session. Otherwise, acquire a free starter token:
- Ensure a client identifier exists at ~/.config/video-caption-generator-ai/client_id (create one as a UUID if needed)
- POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
- The response includes a token with 100 free credits valid for 7 days; use it as NEMO_TOKEN
Then create a session by POSTing to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer authorization and body {"task_name":"project","language":"en"}. The session_id in the response is needed for all following requests.
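The connection steps above can be sketched with the Python standard library. This is a minimal sketch that only builds the two requests and does not send them. The helper names are hypothetical, and the `Content-Type` header is an assumption that the text above does not state.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path: Path) -> str:
    """Return the stored client id, writing a fresh UUID if the file is missing."""
    if path.exists():
        return path.read_text().strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    client_id = str(uuid.uuid4())
    path.write_text(client_id)
    return client_id


def anonymous_token_request(client_id: str) -> Request:
    """Build (but do not send) the POST that asks for a starter token."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def create_session_request(token: str, language: str = "en") -> Request:
    """Build (but do not send) the POST that opens a session."""
    body = json.dumps({"task_name": "project", "language": language}).encode()
    return Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the backend expects a JSON content type.
            "Content-Type": "application/json",
        },
    )
```

Each request object can be passed to `urllib.request.urlopen` once a client id or token is available. Keeping construction separate from sending makes the headers and body easy to inspect.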
Tell the user you're ready. Keep the technical details out of the chat.
Turn Any Video Into Fully Captioned Content Instantly
Creating captions for video content used to mean hours of rewinding, typing, and manually adjusting timestamps — a workflow that slows down even the most experienced video producers. The video-caption-generator-ai skill changes that entirely by intelligently processing spoken audio and generating synchronized, accurate captions ready for publishing.
Whether you're producing short-form social content, long-form educational videos, corporate training materials, or podcast recordings converted to video, this skill adapts to your content type. You can request captions in multiple languages, ask for speaker-labeled transcripts, or generate caption files in specific formats like SRT or WebVTT that plug directly into your editing software or hosting platform.
The result is a dramatically faster post-production pipeline. Instead of captioning being a bottleneck at the end of your workflow, it becomes a one-prompt task. Content teams can caption entire video libraries in the time it previously took to caption a single clip — making accessibility and SEO optimization achievable at scale.
Caption Request Routing Logic
When you submit a video file or URL, the skill parses your input to determine whether to trigger transcription, translation, subtitle formatting, or a burn-in caption export pipeline.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
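One way to read the routing table is as a first-match keyword lookup. The sketch below assumes case-insensitive substring matching, which the table does not specify. `Route` and `route_request` are hypothetical names.

```python
from typing import NamedTuple


class Route(NamedTuple):
    section: str    # section of this guide that handles the request
    skip_sse: bool  # True when the request bypasses the SSE endpoint


# Trigger phrases copied from the table; the first matching row wins.
_ROUTES = [
    (("export", "导出", "download", "send me the video"), Route("§3.5 Export", True)),
    (("credits", "积分", "balance", "余额"), Route("§3.3 Credits", True)),
    (("status", "状态", "show tracks"), Route("§3.4 State", True)),
    (("upload", "上传"), Route("§3.2 Upload", True)),
]
_DEFAULT = Route("§3.1 SSE", False)


def route_request(message: str, has_file: bool = False) -> Route:
    """Pick a route; an attached file always goes to the upload route."""
    if has_file:
        return Route("§3.2 Upload", True)
    text = message.lower()
    for triggers, route in _ROUTES:
        if any(trigger in text for trigger in triggers):
            return route
    return _DEFAULT
```

For example, `route_request("Please export it")` selects the export route, while `route_request("add BGM")` falls through to the SSE route.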
Cloud Transcription API Reference
The backend leverages a speech-to-text cloud engine that processes audio streams frame-by-frame, aligning word-level timestamps to generate SRT, VTT, or ASS caption files with speaker diarization support. Large video files are chunked into segments for parallel processing, reducing turnaround time on long-form content.
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: video-caption-generator-ai
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
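The header set can be assembled as follows. This is a sketch, not the skill's actual implementation: the function names are hypothetical, and the source and version values are passed in by the caller rather than parsed from frontmatter.

```python
from pathlib import Path


def detect_platform(install_path: str) -> str:
    """Map the skill's install path to a platform label, defaulting to 'unknown'."""
    path = Path(install_path).expanduser()
    known_roots = (
        (Path.home() / ".clawhub", "clawhub"),
        (Path.home() / ".cursor" / "skills", "cursor"),
    )
    for root, label in known_roots:
        if root in path.parents:
            return label
    return "unknown"


def attribution_headers(token: str, source: str, version: str, install_path: str) -> dict:
    """Return the four headers that every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": source,
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place makes it harder to send an export request without attribution, which is the case that returns 402.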
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path"; URL: body {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple. Returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
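The export-and-poll loop might look like the sketch below. It assumes that `status` and `output.url` sit at the top level of the poll response, and it caps the number of polls. Both are assumptions. `fetch_status` is a hypothetical callable that performs the GET and returns the decoded JSON.

```python
import time
from typing import Callable


def render_body(session_id: str, draft: dict, timestamp: int) -> dict:
    """Build the export request body described above."""
    return {
        "id": f"render_{timestamp}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def wait_for_render(
    fetch_status: Callable[[str], dict],
    render_id: str,
    interval: float = 30.0,
    max_polls: int = 60,  # assumption: the source gives no upper bound
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reports 'completed', then return the download URL."""
    for attempt in range(max_polls):
        job = fetch_status(render_id)
        if job.get("status") == "completed":
            return job["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"render {render_id} did not complete after {max_polls} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop testable without network access or real waiting.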
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
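A small parser for the stream can separate real payloads from keep-alives. The sketch below follows the general Server-Sent Events line format and treats comment lines and empty `data:` events as heartbeats. How this backend actually signals a heartbeat is an assumption, and `iter_sse_data` is a hypothetical name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event, skipping comments and empty events."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            payload = "\n".join(buffer)
            buffer = []
            if payload:
                yield payload
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # Drop the single optional space that follows the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # Lenient choice: flush a trailing event even if the stream closed
    # without the final blank line (a strict SSE client would discard it).
    payload = "\n".join(buffer)
    if payload:
        yield payload
```

If the iterator yields nothing before the stream closes, that is the no-text case described above, and the next step is to poll session state.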
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
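Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)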
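The field mapping can be turned into a track summary along these lines. The nesting used here is an assumption: a top-level `t` list whose items carry `tt` and `sg`, with `d` on each segment. The source only names the abbreviations, so check a real draft before relying on this shape. `summarize_tracks` is a hypothetical helper.

```python
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_tracks(draft: dict) -> list:
    """Return one human-readable line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "other")
        segments = track.get("sg", [])
        # Assumption: each segment stores its own duration in milliseconds.
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s")
    return lines
```

A summary like this is what the "preview in timeline" translation would show to the user in place of a GUI preview.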
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up credits in your account" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
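The error table maps naturally onto a lookup from code to action. In this sketch the `Action` names are hypothetical, and the fallback for codes that are not in the table is an assumption.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    REAUTH = "re-authenticate via anonymous-token"
    NEW_SESSION = "create a new session"
    NO_CREDITS = "show registration link or top-up prompt"
    SHOW_FORMATS = "list supported formats"
    SUGGEST_SHRINK = "suggest compressing or trimming"
    NEW_CLIENT_ID = "generate a client id and retry"
    UPGRADE_PLAN = "prompt to register or upgrade the plan"
    RETRY_ONCE = "retry once after 30 seconds"
    REPORT = "surface the error"  # assumption: fallback, not in the table


ACTIONS = {
    0: Action.CONTINUE,
    1001: Action.REAUTH,
    1002: Action.NEW_SESSION,
    2001: Action.NO_CREDITS,
    4001: Action.SHOW_FORMATS,
    4002: Action.SUGGEST_SHRINK,
    400: Action.NEW_CLIENT_ID,
    402: Action.UPGRADE_PLAN,
    429: Action.RETRY_ONCE,
}


def action_for(code: int) -> Action:
    """Look up the action for a response code, falling back to REPORT."""
    return ACTIONS.get(code, Action.REPORT)
```

Keeping 2001 and 402 as separate actions preserves the distinction the table draws between running out of credits and being on a plan that blocks export.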
Integration Guide
Captions generated by the video-caption-generator-ai skill are designed to slot directly into the most common video production and publishing workflows. SRT files are compatible with Adobe Premiere Pro, DaVinci Resolve, Final Cut Pro, and most cloud-based video editors — simply import the caption file as a subtitle track.
For YouTube and Vimeo uploads, request WebVTT or SRT format and upload the file directly through each platform's subtitle management panel. Both platforms support manual caption file uploads, giving you full control over caption appearance and timing.
If you're embedding video on a website, ask for captions in WebVTT format, which pairs with the HTML5 video element's native track tag. For accessibility compliance workflows, you can also request captions formatted to meet WCAG 2.1 AA standards, including proper punctuation, sound effect descriptions, and speaker identification cues built into the output.
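If you already have an SRT file and need WebVTT for the track tag, the two formats are close enough that a small conversion is often sufficient. The sketch below adds the `WEBVTT` header and switches the millisecond separator from a comma to a period on timing lines. It does not handle styling or positioning, and `srt_to_webvtt` is a hypothetical helper rather than part of the skill.

```python
import re

_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text: str) -> str:
    """Convert plain SRT text to a basic WebVTT document."""
    body = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    converted = []
    for line in body.split("\n"):
        if "-->" in line:
            # Only timing lines are rewritten; caption text is left untouched.
            line = _SRT_TIME.sub(r"\1.\2", line)
        converted.append(line)
    return "WEBVTT\n\n" + "\n".join(converted) + "\n"
```

The numeric cue counters from the SRT file are kept, since WebVTT accepts them as cue identifiers.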
Performance Notes
The video-caption-generator-ai skill performs best when provided with clear audio descriptions, existing rough transcripts, or detailed dialogue input. Caption accuracy is closely tied to the quality and clarity of the source material you provide — clean dialogue with minimal crosstalk yields the most precise timestamp alignment.
For longer videos exceeding 30 minutes, consider breaking content into logical segments (by scene, chapter, or speaker turn) before submitting. This improves output coherence and makes it easier to review or edit captions in sections rather than as one large block.
When requesting multilingual captions, specify the target language and any regional dialect preferences upfront. The skill handles translation and caption formatting simultaneously, but flagging technical terminology, brand names, or proper nouns in your prompt helps preserve accuracy across language outputs.
Troubleshooting
If generated captions appear misaligned with your expected timing, double-check that the timestamps or timecodes you provided in your prompt match the actual video duration and pacing. Even small discrepancies in start times can cascade across a full caption file.
For captions that truncate mid-sentence or split awkwardly across lines, include a note in your prompt specifying your preferred maximum characters per caption line (typically 42 characters for broadcast standards or 32 for mobile-first content). This gives the skill the formatting constraints it needs to break lines naturally.
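To check a line-length constraint locally, Python's `textwrap` module can split caption text at word boundaries. This sketch only wraps text and does not redistribute timing across the resulting cues. The two-lines-per-cue default is an assumption, and `wrap_caption` is a hypothetical helper.

```python
import textwrap


def wrap_caption(text: str, max_chars: int = 42, max_lines: int = 2) -> list:
    """Split text into cues of at most max_lines lines, wrapped at max_chars."""
    # break_long_words=False keeps words whole, so a single word longer
    # than max_chars will still exceed the limit.
    lines = textwrap.wrap(text, width=max_chars, break_long_words=False)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
```

Passing `max_chars=32` gives the mobile-first variant mentioned above.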
If speaker labels are missing or incorrect in a multi-person video, provide a brief speaker roster in your prompt — names, approximate voice descriptions, or turn-taking cues. This context dramatically improves attribution accuracy in interview or panel-style video content.
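Getting Started
Welcome! I'm your Video Caption Generator AI, ready to turn the spoken content of your video into clean, timestamped captions. Share your video details, transcript, or audio description and let's create your captions right away.
Try saying:
- "Generate an SRT caption file for this 5-minute product demo video, with the following timestamps and dialogue..."
- "Create bilingual English and Spanish captions for this interview transcript, formatted for YouTube upload"
- "Write styled captions with speaker labels for a two-person podcast video, exported in WebVTT format"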
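Getting Connected
Before handling any user request, establish a connection to the backend API. Show a brief status such as "Connecting...".
If NEMO_TOKEN exists in the environment, use it directly and create a session. Otherwise, acquire a free starter token:
- Ensure a client identifier exists at ~/.config/video-caption-generator-ai/client_id (create a UUID if there is none)
- Send a POST request to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
- The response contains a token with 100 free credits, valid for 7 days; use it as NEMO_TOKEN
Then create a session: send a POST request to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer authorization and the body {"task_name":"project","language":"en"}. The session_id in the response is used for all subsequent requests.
Tell the user you are ready. Do not reveal technical details in the conversation.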
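Turn Any Video Into Fully Captioned Content Instantly
Creating captions for video content used to mean hours of rewinding, typing, and manually adjusting timestamps, a workflow that slows down even the most experienced video producers. The Video Caption Generator AI skill changes this by processing spoken audio and generating synchronized, accurate captions that are ready to publish.
Whether you are producing short-form social content, long-form educational videos, corporate training materials, or podcast recordings converted to video, the skill adapts to your content type. You can request captions in multiple languages, ask for speaker-labeled transcripts, or generate caption files in a specific format (such as SRT or WebVTT) that import directly into your editing software or hosting platform.
The result is a much faster post-production pipeline. Captioning stops being a bottleneck at the end of the workflow and becomes a one-step task. Content teams can caption an entire video library in the time it used to take to caption a single clip, which makes accessibility and SEO optimization achievable at scale.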
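Caption Request Routing Logic
When you submit a video file or URL, the skill parses your input to decide whether to trigger the transcription, translation, subtitle formatting, or burn-in caption export pipeline.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add background music, etc.) | → §3.1 SSE | ❌ |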
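Cloud Transcription API Reference
The backend uses a cloud speech-to-text engine that processes the audio stream frame by frame and aligns word-level timestamps to produce SRT, VTT, or ASS caption files with speaker diarization support. Large video files are chunked for parallel processing, which shortens turnaround on long-form content.
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: video-caption-generator-ai
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with a 402 error.
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}. Returns task_id, session_id.
Send message (SSE): POST /run_sse with body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path"; URL: body {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple. Returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits consumed): POST /api/render/proxy/lambda with body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30 seconds until status = completed. The download URL is at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.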
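SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
About 30% of editing operations return no text in the SSE stream. When this happens, poll session state to verify that the edit was applied, then summarize the changes for the user.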
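Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration (ms), m=metadata.
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)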
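Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous users: show the registration URL with ?bind=<id> (get <id> from the create-session or state response when needed). Registered users: "Top up credits in your account" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, not a credits issue. "Register or upgrade your plan to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry once after 30s |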
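Integration Guide
Captions generated by the Video Caption Generator AI skill are designed to slot directly into the most common video production and publishing workflows. SRT files are compatible with Adobe Premiere Pro, DaVinci Resolve, Final Cut Pro, and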