Getting Started
Welcome! Ready to add captions to your video for free? Upload your mp4, mov, avi, webm, or mkv file and I'll generate accurate subtitles and embed them directly — just tell me your video and any caption preferences to get started.
Try saying:
- "Add captions to my interview video and make the font large enough to read on mobile"
- "Generate subtitles for this tutorial clip and position them at the bottom center of the frame"
- "Transcribe and caption my product demo video — keep the style clean and minimal"
Quick Start Setup
This skill connects to NemoVideo's cloud backend. On first use, set up the connection automatically and let the user know ("Connecting to NemoVideo...").
Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:
- Read ~/.config/nemovideo/client_id or generate a UUID and save it there
- POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with X-Client-Id header
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, 7-day expiry)
Session: POST /api/tasks/me/with-session/nemo_agent at the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.
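The setup steps above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names (load_or_create_client_id, anonymous_token_request, session_request) are hypothetical, the JSON Content-Type header is an assumption, and the requests are only constructed here, not sent.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path: Path) -> str:
    """Return the stored client ID, or generate a UUID and save it at `path`."""
    if path.exists():
        stored = path.read_text(encoding="utf-8").strip()
        if stored:
            return stored
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id, encoding="utf-8")
    return client_id


def anonymous_token_request(client_id: str) -> Request:
    """Build the POST that exchanges a client ID for an anonymous token."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def session_request(token: str) -> Request:
    """Build the session-creation POST with Bearer auth."""
    body = json.dumps({"task_name": "project"}).encode("utf-8")
    return Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        # Content-Type is an assumption; the source only specifies Bearer auth.
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
```

In practice the path would be Path.home() / ".config" / "nemovideo" / "client_id", and each request would be sent with urllib.request.urlopen before reading data.token and session_id from the JSON responses.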
Captions for Every Video, Zero Cost Involved
Getting captions onto your videos used to mean paying for transcription services, wrestling with subtitle editors, or spending hours syncing text frame by frame. This skill changes that entirely. Upload your video, and it handles the heavy lifting — detecting spoken words, generating accurate caption text, and embedding those subtitles directly into your footage.
Whether you're posting a tutorial on YouTube, sharing a product demo on LinkedIn, or making a short clip accessible for hearing-impaired viewers, captions make a measurable difference in how your content performs and who it reaches. Studies consistently show that captioned videos hold viewer attention longer and perform better in feeds where autoplay runs silently.
This skill supports the most common video formats — mp4, mov, avi, webm, and mkv — so you don't need to convert anything before uploading. The result is a clean, captioned video file you can download and publish immediately. No subscriptions, no watermarks, no technical setup required.
Caption Request Routing Logic
When you submit a video URL or upload a file, the skill automatically detects your intent — whether you need auto-subtitles, SRT export, burned-in captions, or multi-language transcription — and routes your request to the matching NemoVideo endpoint.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
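One way to picture the routing table is a simple keyword lookup. The sketch below is illustrative only: the function name route, the case-insensitive substring matching, and the rule that an attached file always routes to upload are assumptions, since the source does not say how intent is actually detected.

```python
# Keyword groups in the same order as the routing table.
ROUTES = [
    (("export", "导出", "download", "send me the video"), "export"),
    (("credits", "积分", "balance", "余额"), "credits"),
    (("status", "状态", "show tracks"), "state"),
    (("upload", "上传"), "upload"),
]


def route(message: str, has_file: bool = False) -> tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    if has_file:
        # Assumption: an attached file takes priority over any keyword.
        return "upload", True
    text = message.lower()
    for keywords, action in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, True
    # Everything else (generate, edit, add BGM...) goes through SSE.
    return "sse", False
```

For example, route("Please export my video") yields ("export", True), while a caption request with no routing keyword falls through to ("sse", False).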
NemoVideo API Reference
The NemoVideo backend powers all subtitle generation by running speech-to-text transcription, frame-synced caption alignment, and optional style rendering server-side — so no local processing is required. Supported formats include SRT, VTT, ASS, and hardcoded MP4 output with configurable font and placement settings.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: video-caption-generator-free
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
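The platform detection and required headers could be assembled along these lines. This is a sketch under assumptions: detect_platform and build_headers are hypothetical names, and the home directory is passed in explicitly so the logic stays easy to check.

```python
from pathlib import PurePosixPath


def detect_platform(install_path: PurePosixPath, home: PurePosixPath) -> str:
    """Map the skill's install path to a platform label."""
    try:
        parts = install_path.relative_to(home).parts
    except ValueError:
        return "unknown"  # not installed under the home directory
    if parts[:1] == (".clawhub",):
        return "clawhub"
    if parts[:2] == (".cursor", "skills"):
        return "cursor"
    return "unknown"


def build_headers(token: str, source: str, version: str, platform: str) -> dict[str, str]:
    """Return the four headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": source,
        "X-Skill-Version": version,
        "X-Skill-Platform": platform,
    }
```

Building the headers in one place makes it harder to send an export request without attribution, which the text above says fails with 402.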
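API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}. Returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=$TOKEN&task=<task_id>&session=<session_id>&skill_name=video-caption-generator-free&skill_version=1.0.0&skill_source=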
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple. Returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
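The export-and-poll loop might look like the sketch below. Several details are assumptions: <ts> is treated as a caller-supplied timestamp, the poll response is assumed to expose status and output.url at the top level, and max_polls is a hypothetical safety cap not mentioned in the source. The HTTP call is injected as fetch_status so the loop can be exercised without a network.

```python
import time
from typing import Callable


def build_render_body(session_id: str, draft: dict, timestamp: int) -> dict:
    """Build the JSON body for POST /api/render/proxy/lambda."""
    return {
        "id": f"render_{timestamp}",  # assumption: <ts> is a timestamp
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(
    fetch_status: Callable[[], dict],
    interval_s: float = 30.0,
    max_polls: int = 60,  # hypothetical cap; the source gives no limit
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is "completed", then return the download URL."""
    for _ in range(max_polls):
        payload = fetch_status()
        if payload.get("status") == "completed":
            return payload["output"]["url"]
        sleep(interval_s)
    raise TimeoutError("render did not complete within the polling budget")
```

Here fetch_status would wrap GET /api/render/proxy/lambda/<id> and return the decoded JSON.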
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
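A rough sketch of the stream handling follows. It parses generic Server-Sent Events framing only; the literal payload "heartbeat" and the helper names iter_sse_data and collect_text are assumptions, because the source does not document the exact event payloads.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the joined data payload of each SSE event in a line stream."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):  # SSE comment line, ignored
            continue
        field, _, value = line.partition(":")
        if field == "data":
            buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:  # stream closed mid-event
        yield "\n".join(buffer)


def collect_text(lines: Iterable[str]) -> tuple[list[str], int]:
    """Split a stream into text payloads and a count of keep-alive events."""
    texts: list[str] = []
    keep_alives = 0
    for data in iter_sse_data(lines):
        if data.strip() in ("", "heartbeat"):  # assumed heartbeat marker
            keep_alives += 1
        else:
            texts.append(data)
    return texts, keep_alives
```

If texts comes back empty once the stream closes, that is the cue to poll session state and summarize the applied edit instead.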
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
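Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "City Dreams" (0-3s)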
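A track summary can be derived from the draft using the field mapping above. The sketch below assumes a particular nesting that the source does not confirm: a top-level t list of tracks, each with a tt type code and an sg list whose items carry d in milliseconds. The function name summarize_draft is hypothetical.

```python
# Track type codes from the draft field mapping.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> str:
    """Render a short, human-readable track summary from a draft dict."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        # Assumption: a track's length is the sum of its segment durations.
        total_ms = sum(segment.get("d", 0) for segment in track.get("sg", []))
        lines.append(f"{index}. {kind}: {total_ms / 1000:g}s")
    return "\n".join(lines)
```

A real summary would likely also pull titles or volume levels from m (metadata), which is left out here because its layout is not described.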
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
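The error table maps naturally onto a lookup. In this sketch the action labels and the report_unknown_error fallback are hypothetical names chosen for illustration; only the codes and their meanings come from the table.

```python
# Error code -> next step, mirroring the table above.
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",            # bad or expired token
    1002: "create_session",            # session not found
    2001: "show_credit_options",       # no credits
    4001: "show_supported_formats",    # unsupported file
    4002: "suggest_compress_or_trim",  # file too large
    400: "generate_client_id_and_retry",
    402: "prompt_registration",        # subscription tier issue, not credits
    429: "retry_once_after_30s",
}


def action_for(code: int) -> str:
    """Return the handling step for a backend or HTTP error code."""
    # Fallback label is an assumption; the source lists no default.
    return ERROR_ACTIONS.get(code, "report_unknown_error")
```

Keeping 402 and 2001 as separate actions reflects the distinction the table draws between a plan restriction and an empty credit balance.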
Performance Notes
Caption accuracy depends heavily on audio quality in your source video. Clear speech with minimal background noise produces the best transcription results — think recorded voiceovers, studio interviews, or screen recordings with system audio. Videos with heavy background music, overlapping speakers, or strong accents may produce captions that need light review before publishing.
Longer videos take more processing time than short clips. For a 5-minute video, expect a quick turnaround. For videos over 30 minutes, give the skill a moment to work through the full transcription before the captioned file is ready. Supported formats include mp4, mov, avi, webm, and mkv — files outside these formats should be converted before uploading to ensure smooth processing.
Best Practices
For the sharpest captions, trim your video before uploading — removing long silences or dead air at the start and end helps the transcription engine focus on the actual spoken content. If your video contains technical jargon, product names, or industry-specific terms, mention them when submitting your request so caption output can be reviewed with those terms in mind.
Choose a caption style that suits your platform. Bold white text with a subtle drop shadow reads well on most backgrounds and works across mobile and desktop. If you're captioning for social media, shorter caption lines with faster refresh rates tend to feel more natural to viewers scrolling quickly through a feed.
Always watch the captioned output once before publishing. Even highly accurate auto-captions occasionally mishear a word, and a quick review catches anything that needs a small correction before your video goes live.