Getting Started
Welcome! I'm here to add accurate, well-timed subtitles to your videos automatically — just share your file and tell me how you'd like the captions styled or formatted. Ready to get started?
Try saying:
- "Add burned-in subtitles to this mp4 interview video with white bold text and a semi-transparent black background"
- "Generate a downloadable SRT subtitle file for my webinar recording so I can upload it to YouTube"
- "Create word-by-word highlighted captions for this language learning video in a large, easy-to-read font"
Automatic Setup
On first interaction, connect to the NemoVideo API before doing anything else. Show a brief status like "Setting things up...".
Token: If NEMO_TOKEN environment variable is already set, use it and skip to Session below.
Free token: Read or create a UUID at ~/.config/nemovideo/client_id. Then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id: <your-uuid>. The response field data.token becomes your NEMO_TOKEN (100 credits, 7-day expiry). If the token has expired, request a new one with the same Client-Id.
Session: POST to the same host at /api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Save session_id from the response.
Confirm to the user you're connected and ready. Don't print tokens or raw JSON.
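The setup steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the skill's actual implementation: the function names, the Content-Type header, and the injectable `send` parameter are illustrative choices, and only the URLs, the X-Client-Id header, the request body, and the data.token field come from the text above.

```python
import json
import os
import uuid
import urllib.request
from pathlib import Path

API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path: Path) -> str:
    """Return the UUID stored at `path`, creating the file on first use."""
    if path.exists():
        stored = path.read_text().strip()
        if stored:
            return stored
    path.parent.mkdir(parents=True, exist_ok=True)
    client_id = str(uuid.uuid4())
    path.write_text(client_id)
    return client_id


def anonymous_token_request(client_id: str) -> urllib.request.Request:
    """Build (but do not send) the anonymous-token request."""
    return urllib.request.Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def session_request(token: str) -> urllib.request.Request:
    """Build (but do not send) the create-session request."""
    body = json.dumps({"task_name": "project"}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


def resolve_token(send=urllib.request.urlopen) -> str:
    """Prefer NEMO_TOKEN from the environment; otherwise request a free token."""
    token = os.environ.get("NEMO_TOKEN")
    if token:
        return token
    client_id = load_or_create_client_id(Path.home() / ".config/nemovideo/client_id")
    with send(anonymous_token_request(client_id)) as response:
        return json.load(response)["data"]["token"]
```

Building the requests separately from sending them keeps the token out of logs and makes the flow easy to check without touching the network.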
Turn Any Video Into a Captioned, Accessible Experience
Getting subtitles onto a video used to mean hours of rewinding, typing, and manually syncing text to speech. This skill changes that entirely. Upload your video, describe what you need — burned-in captions, a downloadable subtitle file, specific formatting — and the automatic subtitle generator takes it from there, producing accurate, well-timed text that follows your speakers naturally.
This isn't a one-size-fits-all caption drop. You can request specific styles like bold white text with a dark background for social media, clean minimal captions for corporate presentations, or even word-by-word highlights for language learning content. The skill reads dialogue pacing and handles overlapping speech, pauses, and fast-talking segments with care.
Whether you're making a YouTube tutorial more accessible, adding captions to a product walkthrough for international viewers, or preparing a documentary for broadcast compliance, this tool fits into your real workflow. It works with mp4, mov, avi, webm, and mkv files, so you're not locked into reformatting before you even start.
Routing Subtitle Generation Requests
Every captioning request — whether you're transcribing dialogue, syncing timestamps, or exporting SRT/VTT files — is parsed and routed to the appropriate NemoVideo subtitle pipeline based on detected language, video length, and output format.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
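One way to express the routing table in code is a keyword lookup like the sketch below. The substring matching, the `route` function, and the action labels are assumptions made for illustration; the source only lists the trigger phrases and their targets, not how matching is performed.

```python
# Trigger phrases from the routing table, checked in table order.
ROUTES = [
    ("export", ("export", "导出", "download", "send me the video")),
    ("credits", ("credits", "积分", "balance", "余额")),
    ("state", ("status", "状态", "show tracks")),
    ("upload", ("upload", "上传")),
]


def route(message: str, has_attachment: bool = False) -> tuple[str, bool]:
    """Return (action, skip_sse) for a user message.

    Assumption: a case-insensitive substring match is good enough here.
    """
    if has_attachment:
        return "upload", True  # "user sends file" row
    text = message.lower()
    for action, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, True
    return "sse", False  # everything else goes through the SSE endpoint
```

For example, `route("please export the video")` yields `("export", True)`, while a styling request with no trigger phrase falls through to `("sse", False)`. Plain substring matching can misfire on longer sentences, so a real router would likely need stricter matching.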
NemoVideo Subtitle API Reference
The NemoVideo backend uses an ASR (Automatic Speech Recognition) engine combined with frame-accurate timestamp alignment to generate word-level captions, then packages them into your requested subtitle format. Requests are processed asynchronously, so longer videos queue through the transcription pipeline before returning a finalized caption file.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: automatic-subtitle-generator
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
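The header requirements above might be assembled as in the following sketch. The helper names `detect_platform` and `build_headers` are hypothetical, and reading the version from the YAML frontmatter is left to the caller; only the header names and the path-to-platform mapping come from the text.

```python
from pathlib import Path


def detect_platform(install_path: str) -> str:
    """Map the skill's install path to a platform label."""
    path = Path(install_path).expanduser()
    home = Path.home()
    for marker, platform in ((".clawhub", "clawhub"), (".cursor/skills", "cursor")):
        try:
            path.relative_to(home / marker)  # raises ValueError if not inside
        except ValueError:
            continue
        return platform
    return "unknown"


def build_headers(token: str, source: str, version: str, install_path: str) -> dict:
    """Return the four headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": source,
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Centralizing header construction in one function makes it harder to send a request without attribution, which is the condition that causes export to fail with 402.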
API base: https://mega-api-prod.nemovideo.ai
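Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=automatic-subtitle-generator&skill_version=1.0.0&skill_source=<platform>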
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> with either a file (multipart -F "files=@/path") or a URL body: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest; key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
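The export-and-poll flow described above could look like the sketch below. The `fetch_status` callable, the `max_polls` cap, and the assumption that status and output.url sit at the top level of the polled response are illustrative; the source specifies only the request body, the 30-second interval, and the completed status.

```python
import time
from typing import Callable


def build_export_body(session_id: str, draft: dict, timestamp: int) -> dict:
    """Build the render request body; `timestamp` fills the <ts> placeholder."""
    return {
        "id": f"render_{timestamp}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def wait_for_render(
    fetch_status: Callable[[], dict],
    interval_s: float = 30.0,
    max_polls: int = 60,  # assumption: the source gives no upper bound
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the render reports completed, then return the download URL."""
    for _ in range(max_polls):
        payload = fetch_status()
        if payload.get("status") == "completed":
            return payload["output"]["url"]
        sleep(interval_s)
    raise TimeoutError("render did not complete within the polling budget")
```

Here `fetch_status` would wrap a GET to /api/render/proxy/lambda/<id>. Injecting it, along with `sleep`, keeps the loop testable without a live render.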
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
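The event handling and the no-text fallback can be sketched as below. The event payload shape ({"content": {"parts": [{"text": ...}]}}) is an assumption that mirrors the request body, since the response schema is not documented here; `collect_text`, `handle_stream`, and `poll_state` are hypothetical names. The two-minute progress notice is omitted for brevity.

```python
import json
from typing import Callable, Iterable


def collect_text(lines: Iterable[str]) -> list:
    """Collect user-facing text from SSE lines, skipping everything else."""
    texts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # heartbeat, comment, or blank separator: keep waiting
        payload = line[len("data:"):].strip()
        if not payload:
            continue  # empty data: line
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        # ASSUMPTION: text lives under content.parts[].text.
        parts = (event.get("content") or {}).get("parts") or []
        for part in parts:
            if isinstance(part, dict) and "text" in part:
                texts.append(part["text"])  # tool calls carry no text, so they are not forwarded
    return texts


def handle_stream(lines: Iterable[str], poll_state: Callable[[], dict]):
    """Return ("text", message) or, when the stream had no text, ("state", state)."""
    texts = collect_text(lines)
    if texts:
        return "text", "\n".join(texts)
    return "state", poll_state()  # verify the edit via session state instead
```

When the result is `("state", ...)`, the caller would summarize the changes from the returned session state rather than reporting an empty response.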
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
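Example track summary:
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)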
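A track summary could be derived from the draft's short field names along these lines. The nesting is an assumption: this sketch supposes that t is a list of tracks, each with a tt type code and an sg list of segments carrying d in milliseconds. The source gives only the key meanings, so the real draft may be laid out differently.

```python
# Track type codes from the draft field mapping.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> str:
    """Render a short, human-readable timeline summary from a draft."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # ASSUMPTION: each segment stores its duration in milliseconds under "d".
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s")
    return "\n".join(lines)
```

A summary like this is what the user would see in place of the backend's "preview in timeline" instruction.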
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
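The error table maps naturally onto a lookup plus a single-retry wrapper, as in this sketch. The action labels and the `call` signature (a function returning a code and a payload) are hypothetical; only the codes and the retry-once-after-30s rule come from the table.

```python
import time
from typing import Callable, Tuple

# Codes from the error table mapped to illustrative action labels.
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "create_session",
    2001: "show_credit_options",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_registration",
    429: "retry_once_after_30s",
}


def next_action(code: int) -> str:
    """Look up the action for a response code; unknown codes are reported."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")


def call_with_rate_limit_retry(
    call: Callable[[], Tuple[int, dict]],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Run `call`; on 429, wait 30 seconds and retry exactly once."""
    code, payload = call()
    if code == 429:
        sleep(30)
        code, payload = call()
    return code, payload
```

Keeping 402 and 2001 as separate actions reflects the table's distinction between a subscription-tier block and an empty credit balance.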
FAQ
What video formats does this skill support? The automatic subtitle generator works with mp4, mov, avi, webm, and mkv files. If your video is in another format, convert it to mp4 first for the most reliable results.
Can I get subtitles in a language other than English? Yes — specify the spoken language in your prompt and the skill will transcribe and caption accordingly. You can also request translated subtitles if you want captions in a different language than the one spoken.
What's the difference between burned-in captions and an SRT file? Burned-in captions are permanently embedded into the video image — viewers always see them. An SRT file is a separate subtitle file that platforms like YouTube or VLC can toggle on or off. Mention which you need in your prompt.
Can I customize the caption font, size, or position? Yes. Describe your preferred style in your prompt — for example, 'yellow sans-serif font, bottom-center, no background box' — and the skill will apply it to the output video.
Troubleshooting
Subtitles are out of sync with the audio: This usually happens when the video has a variable frame rate (common in screen recordings or mobile clips). Try converting your file to a fixed frame rate mp4 before uploading, or mention in your prompt that the source is a screen recording so the skill can adjust its timing approach.
Captions are missing words or cutting off mid-sentence: Heavy background music, overlapping speakers, or strong accents can reduce transcription accuracy. You can improve results by specifying the language or accent in your prompt (e.g., 'Australian English speaker' or 'video has background music — prioritize foreground voice'). For critical content, you can also request a raw transcript first to review before final subtitle generation.
The output file isn't playing correctly: Make sure the file format you're requesting matches your playback platform. For YouTube, SRT files work best. For direct video sharing, request burned-in captions as an mp4. If you're getting a blank output, confirm your original file isn't corrupted by testing playback before uploading.