Getting Started

Welcome! I'm your Video Caption Generator AI — ready to turn your video's spoken content into clean, timestamped captions. Drop in your video details, transcript, or audio description and let's get your captions created right now.

Try saying:

- "Generate an SRT caption file for a 5-minute product demo video with these timestamps and dialogue..."
- "Create bilingual captions in English and Spanish for this interview transcript, formatted for YouTube upload"
- "Write styled captions with speaker labels for a two-person podcast video, exported in WebVTT format"

Getting Connected

Before handling any user request, establish a connection to the backend API. Show a brief status like "Connecting...".

If NEMO_TOKEN is in the environment, use it directly and create a session. Otherwise, acquire a free starter token:

- Ensure a client identifier exists at ~/.config/video-caption-generator-ai/client_id (create one as a UUID if needed)
- POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
- The response includes a token with 100 free credits valid for 7 days — use it as NEMO_TOKEN

Then create a session by POSTing to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer authorization and body {"task_name":"project","language":"en"}. The session_id in the response is needed for all following requests.
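The connection steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the skill's actual client. The helper names (`ensure_client_id`, `anonymous_token_request`, `create_session_request`) are hypothetical, the `Content-Type: application/json` header is an assumption, and the sketch only builds the requests because the exact shape of the responses is not documented here.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"
DEFAULT_ID_PATH = Path.home() / ".config" / "video-caption-generator-ai" / "client_id"


def ensure_client_id(path: Path = DEFAULT_ID_PATH) -> str:
    """Return the stored client identifier, creating a UUID on first use."""
    if path.exists():
        stored = path.read_text().strip()
        if stored:
            return stored
    path.parent.mkdir(parents=True, exist_ok=True)
    client_id = str(uuid.uuid4())
    path.write_text(client_id)
    return client_id


def anonymous_token_request(client_id: str) -> Request:
    """Build (but do not send) the POST that asks for a free starter token."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def create_session_request(token: str, language: str = "en") -> Request:
    """Build (but do not send) the session-creation POST."""
    body = json.dumps({"task_name": "project", "language": language}).encode()
    return Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed, not stated in the text
        },
    )
```

Each request object can be passed to `urllib.request.urlopen` when you are ready to send it; the `session_id` then has to be read from whatever JSON the backend returns.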

Tell the user you're ready. Keep the technical details out of the chat.

Turn Any Video Into Fully Captioned Content Instantly

Creating captions for video content used to mean hours of rewinding, typing, and manually adjusting timestamps — a workflow that slows down even the most experienced video producers. The video-caption-generator-ai skill changes that entirely by intelligently processing spoken audio and generating synchronized, accurate captions ready for publishing.

Whether you're producing short-form social content, long-form educational videos, corporate training materials, or podcast recordings converted to video, this skill adapts to your content type. You can request captions in multiple languages, ask for speaker-labeled transcripts, or generate caption files in specific formats like SRT or WebVTT that plug directly into your editing software or hosting platform.

The result is a dramatically faster post-production pipeline. Instead of captioning being a bottleneck at the end of your workflow, it becomes a one-prompt task. Content teams can caption entire video libraries in the time it previously took to caption a single clip — making accessibility and SEO optimization achievable at scale.

Caption Request Routing Logic

When you submit a video file or URL, the skill parses your input to decide which pipeline to trigger: transcription, translation, subtitle formatting, or burn-in caption export.

| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
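One way to read the routing table is as a keyword lookup with an SSE fallback. The sketch below assumes case-insensitive substring matching and checks the rows in table order; the real matching rules are not specified, and `Route` and `route_request` are hypothetical names.

```python
from typing import NamedTuple


class Route(NamedTuple):
    section: str
    skip_sse: bool


# Keyword groups copied from the routing table, checked in table order.
KEYWORD_ROUTES = [
    (("export", "导出", "download", "send me the video"), Route("3.5 Export", True)),
    (("credits", "积分", "balance", "余额"), Route("3.3 Credits", True)),
    (("status", "状态", "show tracks"), Route("3.4 State", True)),
    (("upload", "上传"), Route("3.2 Upload", True)),
]
DEFAULT_ROUTE = Route("3.1 SSE", False)


def route_request(message: str, has_file: bool = False) -> Route:
    """Pick a route for a user message; an attached file always means upload."""
    if has_file:
        return Route("3.2 Upload", True)
    text = message.lower()
    for keywords, route in KEYWORD_ROUTES:
        if any(keyword in text for keyword in keywords):
            return route
    return DEFAULT_ROUTE
```

Substring matching is deliberately loose, so a production router would probably need stricter intent detection to avoid false positives.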

Cloud Transcription API Reference

The backend leverages a speech-to-text cloud engine that processes audio streams frame-by-frame, aligning word-level timestamps to generate SRT, VTT, or ASS caption files with speaker diarization support. Large video files are chunked into segments for parallel processing, reducing turnaround time on long-form content.

Skill attribution — read from this file's YAML frontmatter at runtime:

- X-Skill-Source: video-caption-generator-ai
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
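The header requirements above might be assembled like this. It is a sketch under assumptions: the version string is passed in by the caller (reading it from the YAML frontmatter is left out), platform detection is a plain substring check on the install path, and all function names are hypothetical.

```python
from typing import Dict, List

SKILL_SOURCE = "video-caption-generator-ai"
ATTRIBUTION_HEADERS = ("X-Skill-Source", "X-Skill-Version", "X-Skill-Platform")


def detect_platform(install_path: str) -> str:
    """Map an install path to a platform name, defaulting to 'unknown'."""
    path = install_path.replace("\\", "/") + "/"
    if "/.clawhub/" in path:
        return "clawhub"
    if "/.cursor/skills/" in path:
        return "cursor"
    return "unknown"


def build_headers(token: str, version: str, install_path: str) -> Dict[str, str]:
    """Return the authorization and attribution headers for one request."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": SKILL_SOURCE,
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }


def missing_attribution(headers: Dict[str, str]) -> List[str]:
    """List attribution headers that are absent or empty."""
    return [name for name in ATTRIBUTION_HEADERS if not headers.get(name)]
```

Calling `missing_attribution` before an export is a cheap way to catch the condition that would otherwise surface as a 402.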

API base: https://mega-api-prod.nemovideo.ai

Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.

Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
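A minimal sketch of the message body and of reading the event stream follows. The body mirrors the fields listed above. The parser handles only generic `data:` lines as defined by the server-sent events format, because the structure of each event payload is not documented here; `build_message_body` and `iter_sse_data` are hypothetical names.

```python
from typing import Dict, Iterable, Iterator, List


def build_message_body(session_id: str, text: str) -> Dict:
    """Return the JSON body for POST /run_sse."""
    return {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a text/event-stream."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comments are ignored.
    if buffer:
        # Lenient: flush a trailing event that lacks a final blank line.
        yield "\n".join(buffer)
```

Each yielded string can then be passed to `json.loads` if the backend sends JSON payloads, which is likely but not confirmed by this document.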

Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}

Credits: GET /api/credits/balance/simple — returns available, frozen, total

Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media

Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
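The export flow above amounts to submitting a job and then polling it. In the sketch below the HTTP call is injected as `fetch_status`, so the loop can be exercised without a network. Treating `<ts>` as a millisecond timestamp, the `max_polls` cap, and the function names are all assumptions; failure statuses are not documented, so only `completed` is handled.

```python
import time
from typing import Callable, Dict


def build_render_job(session_id: str, draft: Dict, timestamp_ms: int) -> Dict:
    """Return the JSON body for POST /api/render/proxy/lambda."""
    return {
        "id": f"render_{timestamp_ms}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def wait_for_render(
    fetch_status: Callable[[], Dict],
    interval_s: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reports 'completed', then return output.url."""
    for attempt in range(max_polls):
        job = fetch_status()
        if job.get("status") == "completed":
            return job["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"render not completed after {max_polls} polls")
```

In practice `fetch_status` would wrap a GET to /api/render/proxy/lambda/<id> with the headers described earlier.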

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

SSE Event Handling

| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | |

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | |

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

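Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)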
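The draft field mapping can be turned into a short timeline summary for the user. The sketch below assumes a particular nesting that this document does not confirm: a top-level `t` list of tracks, each with a `tt` type and an `sg` list of segments, and each segment carrying its own `d` in milliseconds. `summarize_draft` is a hypothetical helper.

```python
from typing import Dict, List

# Track type codes from the draft field mapping.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: Dict) -> List[str]:
    """Return human-readable lines describing each track in a draft."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        label = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # Assumption: d is a per-segment duration in milliseconds.
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {label}: {len(segments)} segment(s), {total_ms / 1000:g}s"
        )
    return lines
```

A summary like this is useful when an edit returns no text in the SSE stream and the state has to be described from the polled draft.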

Error Handling

| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | | |

Integration Guide

Captions generated by the video-caption-generator-ai skill are designed to slot directly into the most common video production and publishing workflows. SRT files are compatible with Adobe Premiere Pro, DaVinci Resolve, Final Cut Pro, and most cloud-based video editors — simply import the caption file as a subtitle track.

For YouTube and Vimeo uploads, request WebVTT or SRT format and upload the file directly through each platform's subtitle management panel. Both platforms support manual caption file uploads, giving you full control over caption appearance and timing.

If you're embedding video on a website, ask for captions in WebVTT format, which pairs with the HTML5 video element's native track tag. For accessibility compliance workflows, you can also request captions formatted to meet WCAG 2.1 AA standards, including proper punctuation, sound effect descriptions, and speaker identification cues built into the output.
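If you already have an SRT file and need WebVTT for a web embed, the two formats are close enough that a small conversion usually works. The sketch below adds the required `WEBVTT` header and switches the millisecond separator on timing lines from a comma to a period; it does not handle styling or positioning, and `srt_to_webvtt` is a hypothetical helper rather than part of the skill.

```python
import re

# Matches an SRT timestamp such as 00:01:02,345.
SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text: str) -> str:
    """Convert basic SRT caption text into WebVTT text."""
    out = ["WEBVTT", ""]
    for line in srt_text.replace("\r\n", "\n").strip().split("\n"):
        if "-->" in line:
            # Only timing lines are rewritten; dialogue is left untouched.
            line = SRT_TIMESTAMP.sub(r"\1.\2", line)
        out.append(line)
    return "\n".join(out) + "\n"
```

The numeric cue counters from the SRT file are kept, since WebVTT accepts them as optional cue identifiers.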

Performance Notes

The video-caption-generator-ai skill performs best when provided with clear audio descriptions, existing rough transcripts, or detailed dialogue input. Caption accuracy is closely tied to the quality and clarity of the source material you provide — clean dialogue with minimal crosstalk yields the most precise timestamp alignment.

For longer videos exceeding 30 minutes, consider breaking content into logical segments (by scene, chapter, or speaker turn) before submitting. This improves output coherence and makes it easier to review or edit captions in sections rather than as one large block.

When requesting multilingual captions, specify the target language and any regional dialect preferences upfront. The skill handles translation and caption formatting simultaneously, but flagging technical terminology, brand names, or proper nouns in your prompt helps preserve accuracy across language outputs.

Troubleshooting

If generated captions appear misaligned with your expected timing, double-check that the timestamps or timecodes you provided in your prompt match the actual video duration and pacing. Even small discrepancies in start times can cascade across a full caption file.

For captions that truncate mid-sentence or split awkwardly across lines, include a note in your prompt specifying your preferred maximum characters per caption line (typically 42 characters for broadcast standards or 32 for mobile-first content). This gives the skill the formatting constraints it needs to break lines naturally.
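To check a line-length constraint yourself, or to re-wrap captions after the fact, the standard library's `textwrap` is enough for a first pass. The sketch below wraps text to a maximum line width and groups the lines into cues; the two-line cue limit is a common convention assumed here, not something this document specifies, and `wrap_caption` is a hypothetical helper.

```python
import textwrap
from typing import List


def wrap_caption(text: str, max_chars: int = 42, max_lines: int = 2) -> List[List[str]]:
    """Split text into cues of at most max_lines lines of max_chars each."""
    lines = textwrap.wrap(text, width=max_chars)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
```

For mobile-first content, call it with `max_chars=32`. Wrapping happens at word boundaries, so timing for the resulting cues still has to be assigned separately.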

If speaker labels are missing or incorrect in a multi-person video, provide a brief speaker roster in your prompt — names, approximate voice descriptions, or turn-taking cues. This context dramatically improves attribution accuracy in interview or panel-style video content.

