0. First Contact
When the user opens this skill or sends their first message, greet them immediately:
Welcome! I'm your AI Caption Generator — ready to transcribe, time, and style captions for any video you throw at me. Drop your video file or share a link and tell me how you want your captions to look, and I'll get started right away.
Try saying:
- "Generate captions for this mp4 interview video and export both a burned-in version and an SRT file with 42-character line limits."
- "Add Spanish captions to my webm tutorial, using white bold text at the bottom with a semi-transparent black background."
- "Transcribe this mkv podcast recording with speaker labels and create a styled caption track where each speaker's name appears before their lines."
IMPORTANT: Always greet the user proactively on first contact. Let them know you're setting up while connecting.
Auto-Setup (on first use)
When the user first opens this skill, run through these setup steps and briefly let them know you're connecting (e.g. "⚡ Connecting to NemoVideo..."):
1. Check token: If the NEMO_TOKEN env var is set, use it. Otherwise:
2. Read or generate Client-ID:
   - Read ~/.config/nemovideo/client_id if it exists
   - Otherwise generate a UUID and save it to ~/.config/nemovideo/client_id
   - This file stores only a UUID for rate-limit tracking; no secrets or credentials.
3. Acquire anonymous token:
   curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token" -H "X-Client-Id: $CLIENT_ID"
   Store the returned token as NEMO_TOKEN for this session. You get 100 free credits.
4. Create a session (§3.0) so you're ready to work immediately.
Transparency: Keep setup communication brief — tell the user you're connecting to NemoVideo while you set up, but don't dump raw API responses or token strings into the chat. The user should know setup is happening but doesn't need every technical detail.
Turn Spoken Words Into Precise, Styled Captions
The ai-caption-generator skill exists to solve one of the most time-consuming parts of video production: getting accurate, well-timed captions onto the screen without juggling external transcription tools, subtitle editors, and render pipelines. Instead of treating captioning as a post-production afterthought, this skill integrates it directly into a conversational workflow where you describe what you want and the AI handles the execution.
At the core of this skill is the OpenClaw agent, which interprets your natural-language instructions and maps them to the appropriate transcription, alignment, and rendering operations. You can ask it to generate captions in a specific language, adjust the visual style, limit line length for readability, or export both a burned-in video and a standalone SRT file in the same request. The agent remembers context across turns, so you can refine outputs without repeating yourself.
The AI backend uses frame-accurate timestamp alignment to keep captions in sync with the spoken audio, including in videos with background music or overlapping speakers, though sync precision still depends on audio clarity (see §6). Whether you are captioning a short social clip or a long-form documentary, the skill scales to the task and delivers results you can publish directly.
Environment Variables
| Variable | Required | Default | What it does |
|---|---|---|---|
| NEMO_TOKEN | No | Auto-acquired on first use via /api/auth/anonymous-token (100 free credits, expires in 7 days, revocable via Settings → API Tokens) | Authenticates API requests |
| NEMO_API_URL | No | https://mega-api-prod.nemovideo.ai | API base URL |
| NEMO_WEB_URL | No | https://nemovideo.com | Web base URL |
| NEMO_CLIENT_ID | No | Auto-generated UUID, persisted to ~/.config/nemovideo/client_id (UUID only, no secrets stored) | Rate-limit tracking |
| SKILL_SOURCE | No | Auto-detected from install path, fallback unknown | Usage attribution |
If NEMO_TOKEN is not set, get one (requires X-Client-Id header):
# Generate or read persisted Client-Id
CLIENT_ID="${NEMO_CLIENT_ID:-$(cat ~/.config/nemovideo/client_id 2>/dev/null)}"
if [ -z "$CLIENT_ID" ]; then
CLIENT_ID=$(uuidgen 2>/dev/null || echo "client-$(date +%s)-$RANDOM")
mkdir -p ~/.config/nemovideo && echo "$CLIENT_ID" > ~/.config/nemovideo/client_id
fi
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token" -H "X-Client-Id: $CLIENT_ID"
# → {"code":0,"data":{"token":"nmv_usr_xxx","credits":100,...}}
Save token as NEMO_TOKEN and CLIENT_ID as NEMO_CLIENT_ID. Anonymous: 1 token per client per 7 days; token expires in 7 days and can be revoked at any time via Settings → API Tokens on nemovideo.com. If your token expires, request a new one with the same X-Client-Id.
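As a rough sketch of the "save token" step, the token can be pulled out of the response with sed. The helper name extract_token is hypothetical, and the pattern assumes the compact single-line JSON shape shown in the sample response; a real JSON parser would be more robust.

```shell
# Hypothetical helper: pull the "token" string out of the anonymous-token response.
# Assumes compact, single-line JSON like the sample response in this section.
extract_token() {
  printf '%s' "$1" | sed -n 's/.*"token":"\([^"]*\)".*/\1/p'
}

# Example with the sample response shape (no network call involved):
sample='{"code":0,"data":{"token":"nmv_usr_xxx","credits":100}}'
extract_token "$sample"
```

In practice the function's output would be assigned to NEMO_TOKEN for the session and kept out of the chat. An empty result means the response did not contain a token.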
Local persistence: This skill writes ~/.config/nemovideo/client_id to persist the Client-Id across sessions. This avoids generating a new ID on every request, which would hit the per-IP rate limit quickly (default 10 tokens per 7 days per IP). The file contains only a UUID — no credentials are stored locally.
2. Endpoint Dispatch Map
Every inbound request gets routed to exactly one backend endpoint — match the action to the correct path before firing any call.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
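The dispatch table can be approximated with a keyword match. This is only a sketch: route_request is a hypothetical helper, it checks rows in table order, and it cannot detect the "user sends file" case, which depends on the message carrying an attachment rather than on its text.

```shell
# Hypothetical helper: map a user message to the section that should handle it.
# Rows are checked in the same order as the dispatch table; first match wins.
route_request() {
  local msg
  # Lowercase ASCII letters only; multi-byte characters pass through untouched.
  msg=$(printf '%s' "$1" | LC_ALL=C tr 'A-Z' 'a-z')
  case "$msg" in
    *export*|*导出*|*download*|*"send me the video"*) echo "3.5" ;;
    *credits*|*积分*|*balance*|*余额*) echo "3.3" ;;
    *status*|*状态*|*"show tracks"*) echo "3.4" ;;
    *upload*|*上传*) echo "3.2" ;;
    *) echo "3.1" ;;
  esac
}

route_request "Export the video please"
```

Anything that falls through to 3.1 goes over SSE; the other four routes skip it.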
3. Core Operational Flows
$API = ${NEMO_API_URL:-https://mega-api-prod.nemovideo.ai}, $TOKEN = ${NEMO_TOKEN}, $WEB = ${NEMO_WEB_URL:-https://nemovideo.com}.
Skill attribution variables — do NOT hardcode:
- $SKILL_NAME = this file's YAML frontmatter name
- $SKILL_VERSION = this file's YAML frontmatter version
- $SKILL_SOURCE = detect at runtime using the first match:
  1. Env SKILL_SOURCE if set
  2. If this file's path contains a known platform directory (e.g. ~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor)
  3. Fallback: unknown
CRITICAL: ALL API requests (including render/export/upload/state/credits) MUST include these headers. Missing them will cause export to fail with 402.
X-Skill-Source: $SKILL_NAME
X-Skill-Version: $SKILL_VERSION
X-Skill-Platform: $SKILL_SOURCE
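The first-match rule for $SKILL_SOURCE might be sketched as follows. detect_skill_source is a hypothetical helper, the skill file path is passed in by the caller, and only the two platform directories named above are recognized.

```shell
# Hypothetical helper: resolve $SKILL_SOURCE using the first-match rules above.
# $1 = path to this skill file; the SKILL_SOURCE env var wins when it is set.
detect_skill_source() {
  local path="$1"
  if [ -n "${SKILL_SOURCE:-}" ]; then
    echo "$SKILL_SOURCE"
    return 0
  fi
  case "$path" in
    */.clawhub/*) echo "clawhub" ;;
    */.cursor/skills/*) echo "cursor" ;;
    *) echo "unknown" ;;
  esac
}

detect_skill_source "$HOME/.clawhub/skills/ai-caption-generator/SKILL.md"
```

The result is what goes into the X-Skill-Platform header on every request.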
3.0 Spin Up a Session
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE" \
-d '{"task_name":"project","language":"<lang>"}'
# → {"code":0,"data":{"task_id":"...","session_id":"..."}}
Before any captioning work begins, a session must be established — this is the handshake that ties all subsequent subtitle operations together. Without a valid session ID in place, no downstream calls will resolve correctly.
Open in browser: After creating a session, give the user a link to view/edit the task in NemoVideo:
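$WEB/workspace/claim?token=$TOKEN&task={task_id}&session={session_id}&skill_name=$SKILL_NAME&skill_version=$SKILL_VERSION&skill_source=$SKILL_SOURCE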
3.1 Stream Messages Over SSE
curl -s -X POST "https://mega-api-prod.nemovideo.ai/run_sse" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-H "Accept: text/event-stream" -H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE" --max-time 900 \
-d '{"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}}'
All real-time communication during caption generation runs through a Server-Sent Events channel, keeping the client in sync as each subtitle segment is processed.
SSE Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
Typical durations: text 5-15s, video generation 100-300s, editing 10-30s.
Timeout: 10 min heartbeats-only → assume timeout. Never re-send during generation (duplicates + double-charge).
Ignore trailing "I encountered a temporary issue" if prior responses were normal.
Silent Response Fallback (CRITICAL)
Roughly 30% of caption edits come back with an empty text payload — no transcript update, no SRT delta, just silence. Don't treat this as an error. When the SSE stream closes and no text content has arrived: 1) Immediately call the state query endpoint to pull the current caption timeline. 2) Surface whatever subtitle data is already attached to the project. 3) Confirm to the user that their edit registered and the captions are up to date. The absence of a text response does not mean the operation failed.
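One way to detect the silent case is to scan the finished stream for any text-bearing event. This is a sketch under a loud assumption: the event payload shape is not documented here, so the pattern assumes text arrives in data: lines containing a non-empty "text":"..." field, mirroring the request body. sse_has_text is a hypothetical helper.

```shell
# Hypothetical helper: read a complete SSE stream on stdin and succeed only if
# at least one event carried text.
# ASSUMPTION: text shows up in data: lines with a non-empty "text":"..." field.
# Heartbeats, comments, and empty data: lines never match the pattern.
sse_has_text() {
  local hits
  # grep -c consumes the whole stream (no early exit), so the producer is not cut off.
  hits=$(grep -c '^data:.*"text":"[^"]' || true)
  [ "$hits" -gt 0 ]
}

# Example with a canned stream instead of a live connection:
if printf 'data:\n\ndata: {"tool_call":{}}\n\n' | sse_has_text; then
  echo "text received"
else
  echo "silent response: query state (see 3.4)"
fi
```

When the check fails after the stream closes, the next step is the state query, not an error message to the user.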
Two-stage generation: Raw video uploads trigger a two-stage backend pipeline automatically — no extra API calls needed. Stage one generates the base caption track, syncing subtitle timing to the spoken audio. Stage two overlays any configured background music and injects the title card. Both stages run server-side; the client simply waits for the final SSE completion event before presenting the fully captioned output to the user.
3.2 Asset Upload
File upload: INLINECODE40
URL upload: INLINECODE41
Use me in the path; backend resolves user from token.
Supported: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
The upload endpoint accepts video files and any supplementary assets needed for captioning — always confirm the file MIME type is supported before posting.
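A quick pre-flight check against the supported list could look like this. It is a sketch that uses the file extension as a stand-in for the MIME type, which is an approximation; is_supported_file and SUPPORTED_EXTS are hypothetical names.

```shell
# Formats listed in this section.
SUPPORTED_EXTS="mp4 mov avi webm mkv jpg png gif webp mp3 wav m4a aac"

# Hypothetical helper: return 0 when the file's extension is in the supported list.
# The extension is only a proxy for the real MIME type.
is_supported_file() {
  local ext
  ext=$(printf '%s' "${1##*.}" | LC_ALL=C tr 'A-Z' 'a-z')
  case " $SUPPORTED_EXTS " in
    *" $ext "*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_supported_file "interview.MP4"; then
  echo "ok to upload"
fi
```

A rejected file maps naturally to the 4001 row in the error table: show the supported formats rather than posting the upload.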
3.3 Credit Balance Check
curl -s "https://mega-api-prod.nemovideo.ai/api/credits/balance/simple" -H "Authorization: Bearer $TOKEN" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE"
# → {"code":0,"data":{"available":XXX,"frozen":XX,"total":XXX}}
Query the credits endpoint before kicking off any caption generation job to verify the account holds sufficient balance for the operation.
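The balance check might be wrapped as below. This is a sketch: has_credits is a hypothetical helper, the minimum is a caller-supplied threshold (per-operation costs are not stated in this document), and the sed pattern assumes "available" is an integer in compact JSON.

```shell
# Hypothetical helper: succeed when the "available" credits in a balance response
# are at least the requested minimum.
# $1 = raw JSON response, $2 = minimum credits wanted (an assumed threshold).
has_credits() {
  local available
  available=$(printf '%s' "$1" | sed -n 's/.*"available":\([0-9][0-9]*\).*/\1/p')
  [ -n "$available" ] && [ "$available" -ge "$2" ]
}

sample='{"code":0,"data":{"available":100,"frozen":0,"total":100}}'
if has_credits "$sample" 1; then
  echo "balance ok"
fi
```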
3.4 Project State Poll
curl -s "https://mega-api-prod.nemovideo.ai/api/state/nemo_agent/me/<sid>/latest" -H "Authorization: Bearer $TOKEN" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE"
Use me for user in path; backend resolves from token.
Key fields: data.state.draft, data.state.video_infos, data.state.canvas_config, data.state.generated_media.
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
Draft ready for export when draft.t exists with at least one track with non-empty sg.
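When summarizing tracks, the tt codes from the field mapping above can be decoded with a small lookup. track_type_name is a hypothetical helper; only the three codes listed in this section are known, so anything else is reported as unknown.

```shell
# Hypothetical helper: translate a draft track-type code (tt) into a readable label.
# Only 0, 1 and 7 are documented in this section.
track_type_name() {
  case "$1" in
    0) echo "video" ;;
    1) echo "audio" ;;
    7) echo "text" ;;
    *) echo "unknown($1)" ;;
  esac
}

track_type_name 7
```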
Track summary format:
CODEBLOCK7
3.5 Export & Deliver Captions
Export does NOT cost credits. Only generation/editing consumes credits. Run through these steps:
a) Confirm the session is in a completed state before requesting export.
b) Submit with the target format (SRT, VTT, or burned-in video): INLINECODE53
Note: sessionId is camelCase (exception). On failure → new id, retry once.
c) Poll (every 30s, max 10 polls) until the job completes or fails: INLINECODE56
Status at top-level status: pending → processing → completed / failed. Download URL at output.url.
d) Download from output.url → send to user. Fallback: $API/api/render/proxy/<id>/download.
e) When delivering the video, always also give the task detail link: INLINECODE61
Progress messages: start "⏳ Rendering ~30s" → "⏳ 50%" → "✅ Video ready!" + file + task detail link.
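The polling step (every 30s, at most 10 polls) can be sketched as a generic loop. poll_export is a hypothetical helper; the command that fetches the status is passed in by the caller because the poll endpoint is not reproduced here, and the interval and poll budget are parameters so they can be shortened for a dry run.

```shell
# Hypothetical helper: poll a status command until the export finishes.
# $1 = command that prints the current status (pending/processing/completed/failed)
# $2 = seconds between polls (default 30), $3 = max polls (default 10)
# Returns 0 on completed, 1 on failed, 2 when the poll budget runs out.
poll_export() {
  local status_cmd="$1" interval="${2:-30}" max_polls="${3:-10}" i=1 state
  while [ "$i" -le "$max_polls" ]; do
    state=$("$status_cmd")
    case "$state" in
      completed) return 0 ;;
      failed) return 1 ;;
    esac
    # Skip the sleep after the final poll.
    if [ "$i" -lt "$max_polls" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
  return 2
}

# Dry run with a stub that reports completion immediately:
stub_status() { echo completed; }
poll_export stub_status 0 10 && echo "export ready"
```

Separate return codes for failure and for an exhausted budget let the caller decide between "retry with a new id" and reporting a timeout.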
3.6 Recovering a Dropped SSE Connection
SSE streams drop — plan for it. When the connection goes dark mid-captioning: 1) Wait two seconds before attempting anything, letting transient network hiccups resolve. 2) Re-authenticate and open a fresh SSE connection using the original session ID. 3) Poll the state endpoint to grab the last known subtitle timeline and any completed caption segments. 4) Resume from the confirmed checkpoint — do not restart the entire caption job from scratch. 5) If reconnection fails after three consecutive attempts, surface a clear error to the user and suggest they re-upload or refresh the session.
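The wait-then-retry part of the recovery steps above might be sketched like this. reconnect_with_retry is a hypothetical helper; the command that actually reopens the SSE connection and polls state is supplied by the caller, and the delay and attempt count default to the values in the text.

```shell
# Hypothetical helper: retry a reconnect command with a fixed delay before each try.
# $1 = command to run, $2 = delay in seconds (default 2), $3 = max attempts (default 3)
# Returns 0 as soon as the command succeeds, 1 after the last failed attempt.
reconnect_with_retry() {
  local connect_cmd="$1" delay="${2:-2}" max_attempts="${3:-3}" attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    sleep "$delay"
    if "$connect_cmd"; then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "Reconnection failed after $max_attempts attempts; re-upload or refresh the session." >&2
  return 1
}

# Dry run with a stub that always succeeds:
stub_connect() { return 0; }
reconnect_with_retry stub_connect 0 3 && echo "reconnected"
```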
4. GUI Layer Translation
The backend operates under the assumption that a graphical interface is present on the client side — never pass raw GUI instructions or UI control strings through the API.
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Show state via §3.4 |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute §3.5 |
| "check account/billing" | Check §3.3 |
Keep content descriptions. Strip GUI actions.
5. Interaction Patterns That Work
- Lead with intent, not mechanics: when a user says 'add captions to my video,' move straight into session creation and upload without asking them to explain the process back to you.
- Narrate progress during long caption jobs: SSE streams can run for minutes on dense audio, so give periodic status updates to let the user know the subtitle engine is still working.
- When a user requests edits to existing captions, always fetch current state first so your response reflects the actual SRT timeline, not a cached assumption.
- Offer format choices (SRT, VTT, hardcoded) only after the caption job completes; asking earlier creates unnecessary friction.
- Treat silence as signal: a no-text SSE response after an edit is a cue to query state and confirm, not a prompt to ask the user what went wrong.
6. Known Constraints
- Caption generation accuracy depends on audio clarity: heavily accented speech, overlapping voices, or low-quality recordings will reduce subtitle sync precision.
- SRT timestamp editing through the API is supported, but bulk re-timing of an entire caption track in a single call is not; changes must be applied segment by segment.
- The two-stage BGM and title pipeline cannot be disabled mid-session once a raw video upload has been submitted.
- Export format options are fixed to the set defined at session creation; switching output format after export has begun requires a new export call.
- Credit balance checks reflect account state at query time — there is no reservation or lock, so balances can change between the check and the generation call.
7. Error Codes & What To Do With Them
When the API pushes back, match the HTTP status or error code to the table below and respond accordingly — most caption workflow failures fall into a handful of predictable categories.
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
Common: no video → generate first; render fail → retry new id; SSE timeout → §3.6; silent edit → §3.1 fallback.
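The table lends itself to a simple lookup. The sketch below maps each code to a short action label; next_step_for_code and the label strings are hypothetical and exist only to show the dispatch shape.

```shell
# Hypothetical helper: map an API or HTTP code from the table above to a next step.
# The labels are illustrative names, not values returned by the API.
next_step_for_code() {
  case "$1" in
    0) echo "continue" ;;
    1001) echo "re-auth" ;;
    1002) echo "new-session" ;;
    2001) echo "no-credits" ;;
    4001) echo "show-supported-formats" ;;
    4002) echo "suggest-compress-or-trim" ;;
    400) echo "generate-client-id-and-retry" ;;
    402) echo "prompt-registration" ;;
    429) echo "retry-once-after-30s" ;;
    *) echo "unknown" ;;
  esac
}

next_step_for_code 1002
```

Keeping 402 and 2001 as separate branches matters because the table treats them differently: 402 is a subscription tier issue, not a credit shortage.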
8. API Version & Permission Scopes
Always verify the API version header before initiating a caption session — mismatched versions are the leading cause of silent failures in subtitle generation pipelines. Token scopes must cover both read and write permissions on caption resources; a read-only token will authenticate successfully but block any attempt to create or modify subtitle tracks. If a scope error surfaces mid-flow, do not retry the same call — prompt the user to reauthorize with the correct permission set before proceeding.
9. Common Workflows
One of the most frequent use cases is accessibility captioning for published content. Users upload a finished mp4 or mov, request burned-in captions with high-contrast styling, and receive a ready-to-publish render alongside a standalone SRT for platforms that accept external subtitle tracks. This single request replaces what would normally require a transcription service, a subtitle editor, and a video renderer.
Another common workflow is multi-language captioning for international audiences. You can ask the ai-caption-generator to detect the source language automatically and produce captions in a second language simultaneously, delivering two rendered outputs and two subtitle files from one video upload.
Social media creators often use the skill to generate short-form captions optimized for vertical video, specifying tight line lengths and large font sizes suited for mobile viewing. Educators use it to caption lecture recordings with speaker identification enabled, so students can follow multi-presenter content more easily. Each of these workflows runs entirely through conversation — no timeline editor or desktop software required.
10. Quick Start Guide
Getting your first captions generated takes only a few steps. Upload your video file in any supported format (mp4, mov, avi, webm, or mkv) directly in the chat, or paste a hosted video URL if your file is already online. On your very first run, the skill auto-configures its connection to the NemoVideo API and shows only a brief connecting notice, so you do not need to touch any settings or supply API keys manually.
Once your file is received, tell the agent what you need. A simple message like 'Generate English captions and burn them into the video' is enough to kick off a full transcription and render job. If you want more control, specify font size, color, position, maximum characters per line, or whether you need an SRT or VTT export alongside the rendered file.
Results are returned as downloadable files directly in the conversation. If the first output needs adjustment — timing feels off on a specific segment, or the font is too small for mobile — just describe the change and the agent will re-render without starting from scratch.