Getting Started
Welcome! Ready to add captions to your video without the hassle or the cost? Drop your video file here and let's generate accurate, time-synced subtitles for you — just tell me your preferences and we'll get started.
Try saying:
- - "Add captions to this mp4 interview video and burn them into the footage"
- "Generate a subtitle file for my webinar recording so I can upload it to YouTube"
- "Create captions for this short social media clip and style them with bold white text"
Automatic Setup
On first interaction, connect to the NemoVideo API before doing anything else. Show a brief status like "Setting things up...".
Token: If NEMO_TOKEN environment variable is already set, use it and skip to Session below.
Free token: Read or create a UUID at ~/.config/nemovideo/client_id. Then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id: <your-uuid>. The response field data.token becomes your NEMO_TOKEN (100 credits, 7-day expiry). If the token has expired, request a new one with the same Client-Id.
Session: POST to the same host at /api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Save session_id from the response.
Confirm to the user you're connected and ready. Don't print tokens or raw JSON.
Turn Any Video Into a Captioned, Accessible Experience
Getting captions onto your videos used to mean hiring a transcriptionist, wrestling with auto-generated gibberish, or paying monthly fees for a captioning platform. This skill changes that entirely. Upload your video, and it analyzes the audio track to produce accurate, time-aligned captions that match what's actually being said — no manual syncing required.
Whether you're posting a tutorial on YouTube, sharing a product demo on LinkedIn, or making a classroom lecture more accessible, captions dramatically increase how many people engage with your content. Studies consistently show that a large portion of viewers watch videos on mute, especially on mobile. Captions keep those viewers watching.
This skill supports a wide range of formats including mp4, mov, avi, webm, and mkv, so you don't need to convert your files before uploading. You can choose to have captions embedded directly into the video or exported as a standalone subtitle file for flexibility across different platforms and players.
Caption Request Routing Logic
When you submit a video for auto-subtitling, your request is routed through intent detection that identifies caption style, language, and burn-in preferences before dispatching to the appropriate transcription pipeline.
| User says... | Action | Skip SSE? |
|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" |
→ §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
NemoVideo API Reference
The NemoVideo backend powers all subtitle generation by running speech-to-text transcription against your video's audio track, then time-stamping and formatting captions into SRT, VTT, or hardcoded formats. Free-tier requests are processed through a shared queue, so render times may vary based on video length and current load.
Skill attribution — read from this file's YAML frontmatter at runtime:
- -
X-Skill-Source: INLINECODE9 - INLINECODE10 : from frontmatter INLINECODE11
- INLINECODE12 : detect from install path (
~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
API base: INLINECODE22
Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id. After creating a session, give the user a link: INLINECODE27
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: INLINECODE33
Credits: GET /api/credits/balance/simple — returns available, frozen, INLINECODE37
Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, INLINECODE41
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result |
Process internally, don't forward |
|
heartbeat / empty
data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" |
Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
CODEBLOCK0
Error Handling
| Code | Meaning | Action |
|---|
| 0 | Success | Continue |
| 1001 |
Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with
?bind=<id> (get
<id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
Use Cases
The video-caption-generator-free skill covers a surprisingly wide range of real-world needs. Social media creators use it to caption Instagram Reels, TikToks, and YouTube Shorts so their content performs well even when autoplay is muted. Educators and trainers use it to make recorded lectures and onboarding videos compliant with accessibility standards like WCAG and ADA guidelines.
Marketers add captions to product demos and testimonial videos to increase watch time and conversion rates on landing pages. Podcasters who repurpose audio content into video clips rely on this skill to add readable dialogue without a separate editing workflow.
For multilingual teams, the exported subtitle file can also serve as a starting point for translation, making it easier to localize content for international audiences without starting the transcription process from scratch.
Performance Notes
Caption accuracy depends heavily on audio clarity. Videos with a single speaker, minimal background noise, and clear enunciation will produce the most accurate results. If your video has heavy background music, overlapping voices, or strong accents, you may want to review the generated captions and make minor edits before publishing.
Processing time scales with video length. Short clips under five minutes are typically processed quickly, while longer recordings like webinars or full-length courses may take more time. For best results, avoid uploading files with corrupted audio tracks or heavily compressed codecs, as these can reduce transcription quality.
Supported formats include mp4, mov, avi, webm, and mkv. If your file is in a less common format, converting it to mp4 before uploading will generally yield the smoothest experience.
开始使用
欢迎!准备好为您的视频添加字幕,无需麻烦或花费?在此处拖放您的视频文件,让我们为您生成准确、时间同步的字幕——只需告诉我您的偏好,我们即可开始。
尝试说:
- - 为这个mp4采访视频添加字幕并将其嵌入画面中
- 为我的网络研讨会录制生成字幕文件,以便上传到YouTube
- 为这个社交媒体短视频创建字幕,并使用粗体白色文字设置样式
自动设置
首次交互时,先连接到NemoVideo API,然后再执行其他操作。显示简短状态,如正在设置...
令牌:如果已设置NEMO_TOKEN环境变量,则使用它并跳至下面的会话。
免费令牌:在~/.config/nemovideo/clientid处读取或创建UUID。然后向https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token发送POST请求,附带标头X-Client-Id: 。响应字段data.token即为您的NEMOTOKEN(100积分,7天有效期)。如果令牌已过期,使用相同的Client-Id请求新令牌。
会话:向同一主机的/api/tasks/me/with-session/nemoagent发送POST请求,使用Bearer认证,请求体为{taskname:project}。保存响应中的session_id。
向用户确认已连接并准备就绪。不要打印令牌或原始JSON。
将任何视频转变为带字幕的无障碍体验
为视频添加字幕过去意味着雇佣转录员、处理自动生成的乱码,或为字幕平台支付月费。这项技能彻底改变了这一点。上传您的视频,它会分析音轨,生成准确、时间对齐的字幕,匹配实际所说的内容——无需手动同步。
无论您是在YouTube上发布教程、在LinkedIn上分享产品演示,还是让课堂讲座更易于访问,字幕都能显著增加与您内容互动的人数。研究一致表明,很大一部分观众在静音状态下观看视频,尤其是在移动设备上。字幕能让这些观众继续观看。
此技能支持多种格式,包括mp4、mov、avi、webm和mkv,因此您无需在上传前转换文件。您可以选择将字幕直接嵌入视频中,或导出为独立的字幕文件,以便在不同平台和播放器上灵活使用。
字幕请求路由逻辑
当您提交视频进行自动字幕生成时,您的请求会通过意图检测进行路由,该检测会识别字幕样式、语言和嵌入偏好,然后分派到相应的转录管道。
| 用户说... | 操作 | 跳过SSE? |
|---|
| export / 导出 / download / send me the video | → §3.5 导出 | ✅ |
| credits / 积分 / balance / 余额 |
→ §3.3 积分 | ✅ |
| status / 状态 / show tracks | → §3.4 状态 | ✅ |
| upload / 上传 / 用户发送文件 | → §3.2 上传 | ✅ |
| 其他所有内容(生成、编辑、添加背景音乐等) | → §3.1 SSE | ❌ |
NemoVideo API参考
NemoVideo后端通过针对视频音轨运行语音转文字转录,然后将字幕时间戳化和格式化为SRT、VTT或硬编码格式,为所有字幕生成提供支持。免费层请求通过共享队列处理,因此渲染时间可能因视频长度和当前负载而异。
技能归属——运行时从此文件的YAML前置元数据中读取:
- - X-Skill-Source:video-caption-generator-free
- X-Skill-Version:来自前置元数据version
- X-Skill-Platform:从安装路径检测(~/.clawhub/ → clawhub,~/.cursor/skills/ → cursor,否则为unknown)
所有请求必须包含:Authorization: Bearer 、X-Skill-Source、X-Skill-Version、X-Skill-Platform。缺少归属标头将导致导出失败,返回402错误。
API基础地址:https://mega-api-prod.nemovideo.ai
创建会话:POST /api/tasks/me/with-session/nemoagent — 请求体{taskname:project,language:} — 返回taskid、sessionid。创建会话后,给用户一个链接:https://nemovideo.com/workspace/claim?token=$TOKEN&task=id>&session=id>&skillname=video-caption-generator-free&skillversion=1.0.0&skill_source=
发送消息(SSE):POST /runsse — 请求体{appname:nemoagent,userid:me,sessionid:,newmessage:{parts:[{text:}]}},附带Accept: text/event-stream。最大超时时间:15分钟。
上传:POST /api/upload-video/nemoagent/me/ — 文件:multipart -F files=@/path,或URL:{urls:[],sourcetype:url}
积分:GET /api/credits/balance/simple — 返回available、frozen、total
会话状态:GET /api/state/nemoagent/me//latest — 关键字段:data.state.draft、data.state.videoinfos、data.state.generated_media
导出(免费,无积分消耗):POST /api/render/proxy/lambda — 请求体{id:render_,sessionId:,draft:,output:{format:mp4,quality:high}}。每30秒轮询GET /api/render/proxy/lambda/,直到status = completed。下载URL位于output.url。
支持的格式:mp4、mov、avi、webm、mkv、jpg、png、gif、webp、mp3、wav、m4a、aac。
SSE事件处理
| 事件 | 操作 |
|---|
| 文本响应 | 应用GUI翻译(§4),呈现给用户 |
| 工具调用/结果 |
内部处理,不转发 |
| heartbeat / 空data: | 继续等待。每2分钟:⏳ 仍在处理... |
| 流关闭 | 处理最终响应 |
约30%的编辑操作在SSE流中不返回文本。发生这种情况时:轮询会话状态以验证编辑已应用,然后向用户总结更改。
后端响应翻译
后端假定存在GUI。将这些翻译为API操作:
| 后端说 | 您做 |
|---|
| click [button] / 点击 | 通过API执行 |
| open [panel] / 打开 |
查询会话状态 |
| drag/drop / 拖拽 | 通过SSE发送编辑 |
| preview in timeline | 显示轨道摘要 |
| Export button / 导出 | 执行导出工作流 |
草稿字段映射:t=轨道,tt=轨道类型(0=视频,1=音频,7=文本),sg=片段,d=持续时间(毫秒),m=元数据。
时间线(3条轨道):1. 视频:城市延时摄影(0-10秒)2. 背景音乐:Lo-fi(0-10秒,35%)3. 标题:都市梦想(0-3秒)
错误处理
令牌错误/过期 | 通过anonymous-token重新认证(令牌7天后过期) |
| 1002 | 会话未找到 | 新建会话 §3.0 |
| 2001 | 无积分 | 匿名用户:显示注册URL,附带?bind=
(需要时从create-session或state响应获取)。已注册用户:请在nemovideo.ai充值 |
| 4001 | 不支持的文件 | 显示支持的格式 |
| 4002 | 文件过大 | 建议压缩/裁剪 |
| 400 | 缺少X-Client-Id | 生成Client-Id并重试(参见§1) |
| 402 | 免费计划导出被阻止 | 订阅层级问题,非积分问题。请在nemovideo.ai注册以解锁导出功能。 |
| 429 | 速率限制(1个令牌/客户端/7天) | 30秒后重试一次 |
使用场景
视频字幕生成器免费技能