0. First Contact
When the user opens this skill or sends their first message, greet them immediately:
Welcome! I'm here to automatically generate accurate subtitles for your videos using the auto subtitle generator online — just share your video file or a link, tell me your preferred language and caption style, and I'll have your subtitled video or SRT file ready in minutes. Ready to caption your first video
Try saying:
- - "Generate English subtitles for this MP4 interview and give me an SRT file I can upload to YouTube."
- "Create burned-in captions for my MOV tutorial video with white bold text and a semi-transparent black background."
- "Transcribe this Spanish webinar in WebM format and produce bilingual subtitles in both Spanish and English."
IMPORTANT: Always greet the user proactively on first contact. Let them know you're setting up while connecting. Always greet the user proactively on first contact.
Auto-Setup
When the user first interacts, set up the connection:
- 1. Check token: If
NEMO_TOKEN env var is set, use it. Otherwise: - Read or generate Client-ID:
- Read
~/.config/nemovideo/client_id if it exists
- Otherwise generate a UUID, save it to
~/.config/nemovideo/client_id
- 3. Acquire anonymous token:
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token" -H "X-Client-Id: $CLIENT_ID"
Store the returned
token as
NEMO_TOKEN for this session. You get 100 free credits.
- 4. Create a session (§3.0) so you're ready to work immediately.
Let the user know briefly: "Setting things up… ready!" then proceed with their request.
Turn Any Spoken Word Into On-Screen Text Instantly
The auto-subtitle-generator-online skill exists to solve one of the most time-consuming parts of video production: getting accurate, well-timed captions onto your footage without spending hours in a subtitle editor. Whether you're publishing a YouTube tutorial, a corporate training module, or a social media reel, readable subtitles dramatically increase watch time and accessibility compliance.
This skill operates through a conversational editing model, meaning you don't fill out forms or wrestle with timelines. Instead, you describe what you need — the language, caption placement, style preferences, or whether you want a burned-in subtitle track or a separate SRT file — and the skill handles the rest. You can refine results by simply telling it what to adjust, just like messaging a colleague.
Behind the scenes, the OpenClaw agent orchestrates a pipeline that sends your video's audio through a high-accuracy speech-to-text engine, aligns transcript segments to precise timestamps, applies formatting rules, and packages the output in your chosen format. The agent also handles edge cases like overlapping speech, background noise, and fast-paced dialogue to keep captions readable and synchronized.
Environment Variables
| Variable | Required | Default |
|---|
| INLINECODE5 | No | Auto-generated (100 free credits, expires in 7 days, revocable via Settings → API Tokens) |
| INLINECODE6 |
No |
https://mega-api-prod.nemovideo.ai |
|
NEMO_WEB_URL | No |
https://nemovideo.com |
|
NEMO_CLIENT_ID | No | Auto-generated UUID, persisted to
~/.config/nemovideo/client_id (UUID only, no secrets) |
|
SKILL_SOURCE | No | Auto-detected from install path, fallback
unknown |
If NEMO_TOKEN is not set, get one (requires X-Client-Id header):
# Generate or read persisted Client-Id
CLIENT_ID="${NEMO_CLIENT_ID:-$(cat ~/.config/nemovideo/client_id 2>/dev/null)}"
if [ -z "$CLIENT_ID" ]; then
CLIENT_ID=$(uuidgen 2>/dev/null || echo "client-$(date +%s)-$RANDOM")
mkdir -p ~/.config/nemovideo & echo "$CLIENT_ID" > ~/.config/nemovideo/client_id
fi
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token" -H "X-Client-Id: $CLIENT_ID"
# → {"code":0,"data":{"token":"nmv_usr_xxx","credits":100,...}}
Save
token as
NEMO_TOKEN,
CLIENT_ID as
NEMO_CLIENT_ID. Anonymous: 1 token per client per 7 days; token expires in 7 days and can be revoked at any time via
Settings → API Tokens on nemovideo.com. If your token expires, request a new one with the same
X-Client-Id.
Local persistence: This skill writes ~/.config/nemovideo/client_id to persist the Client-Id across sessions. This avoids generating a new ID on every request, which would hit the per-IP rate limit quickly (default 10 tokens per 7 days per IP). The file contains only a UUID — no credentials are stored locally.
2. Endpoint Directory & Routing Logic
Every caption-related request maps to a specific route — consult the table below to match each action to its correct endpoint and method.
| User says... | Action | Skip SSE |
|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" |
→ §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
3. Operational Flows for Subtitle Generation
$API = ${NEMO_API_URL:-https://mega-api-prod.nemovideo.ai}, $TOKEN = ${NEMO_TOKEN}, $WEB = ${NEMO_WEB_URL:-https://nemovideo.com}.
Skill attribution variables — do NOT hardcode:
- -
$SKILL_NAME = this file's YAML frontmatter INLINECODE26 - INLINECODE27 = this file's YAML frontmatter INLINECODE28
- INLINECODE29 = detect at runtime using the first match:
1. Env
SKILL_SOURCE if set
2. If this file's path contains a known platform directory (e.g.
~/.clawhub/ →
clawhub,
~/.cursor/skills/ →
cursor)
3. Fallback: INLINECODE35
CRITICAL: ALL API requests (including render/export/upload/state/credits) MUST include these headers. Missing them will cause export to fail with 402.
CODEBLOCK2
3.0 Initializing a Caption Session
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE" \
-d '{"task_name":"project","language":"<lang>"}'
# → {"code":0,"data":{"task_id":"...","session_id":"..."}}
Before any transcript or timing data can be exchanged, a session must be established — this handshake creates the persistent context the backend uses to track your caption job. Without it, no downstream calls will resolve correctly.
Open in browser: After creating a session, give the user a link to view/edit the task in NemoVideo:
INLINECODE36
3.1 Streaming Messages Over SSE
curl -s -X POST "https://mega-api-prod.nemovideo.ai/run_sse" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-H "Accept: text/event-stream" -H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE" --max-time 900 \
-d '{"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}}'
All real-time caption events — including SRT chunk delivery, timing confirmations, and status updates — travel through a Server-Sent Events channel tied to the active session.
SSE Handling
| Event | Action |
|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result |
Process internally, don't forward |
|
heartbeat / empty
data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
Typical durations: text 5-15s, video generation 100-300s, editing 10-30s.
Timeout: 10 min heartbeats-only → assume timeout. Never re-send during generation (duplicates + double-charge).
Ignore trailing "I encountered a temporary issue" if prior responses were normal.
Silent Response Fallback (CRITICAL)
Roughly 30% of subtitle edit operations come back with zero text in the response body — no caption lines, no confirmation string, nothing. This is expected behavior, not a fault. When it happens: first, do not retry the request or flag it as failed; second, call the Query State endpoint to pull the current caption timeline directly; third, surface whatever SRT data that endpoint returns as the authoritative result; fourth, continue the conversation normally using that state as ground truth.
Two-stage generation: Raw video processed through the auto-subtitle pipeline triggers a two-stage backend sequence automatically. Stage one delivers the base caption track — the raw transcript synced to the video timeline. Stage two fires without any prompt from the client: the backend appends background music and a generated title card to the captioned output. Both stages must complete before the final SRT and media package are ready for export. Poll or listen on the SSE channel until both stage signals arrive.
3.2 Pushing Media Files for Captioning
File upload: INLINECODE39
URL upload: INLINECODE40
Use me in the path; backend resolves user from token.
Supported: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
The upload endpoint accepts video and audio source files that the subtitle engine will transcribe and sync captions against.
3.3 Checking Available Caption Credits
curl -s "https://mega-api-prod.nemovideo.ai/api/credits/balance/simple" -H "Authorization: Bearer $TOKEN" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE"
# → {"code":0,"data":{"available":XXX,"frozen":XX,"total":XXX}}
Query the credits endpoint before launching any long transcription job to confirm sufficient balance exists for the operation.
3.4 Fetching Current Job & Caption State
curl -s "https://mega-api-prod.nemovideo.ai/api/state/nemo_agent/me/<sid>/latest" -H "Authorization: Bearer $TOKEN" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE"
Use
me for user in path; backend resolves from token.
Key fields:
data.state.draft,
data.state.video_infos,
data.state.canvas_config,
data.state.generated_media.
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
Draft ready for export when draft.t exists with at least one track with non-empty sg.
Track summary format:
CODEBLOCK7
3.5 Exporting & Delivering the Final Caption Package
Export does NOT cost credits. Only generation/editing consumes credits.
Exporting a finished subtitle file — whether SRT, VTT, or burned-in video — consumes zero credits. The sequence runs as follows: (a) confirm both pipeline stages have completed via state query; (b) call the export endpoint with the desired caption format specified in the payload; (c) receive the signed delivery URL in the response; (d) stream or download the captioned asset from that URL; (e) acknowledge delivery back to the session so the backend can mark the job closed.
b) Submit: INLINECODE52
Note: sessionId is camelCase (exception). On failure → new id, retry once.
c) Poll (every 30s, max 10 polls): INLINECODE55
Status at top-level status: pending → processing → completed / failed. Download URL at output.url.
d) Download from output.url → send to user. Fallback: https://mega-api-prod.nemovideo.ai/api/render/proxy/<id>/download.
e) When delivering the video, always also give the task detail link: INLINECODE60
Progress messages: start "⏳ Rendering ~30s" → "⏳ 50%" → "✅ Video ready!" + file + task detail link.
3.6 Reconnecting After an SSE Drop
SSE connections drop — plan for it. Recovery follows five steps: 1) Detect the disconnect event on the client side immediately rather than waiting for a timeout. 2) Do not create a new session; the existing session ID remains valid. 3) Re-open the SSE channel using the same session ID and the Last-Event-ID header set to the most recently received event identifier. 4) Query the current caption state to backfill any SRT chunks or timing events that arrived during the gap. 5) Resume normal streaming from the last confirmed caption timestamp, discarding nothing from the pre-disconnect transcript.
4. GUI Layer Translation Rules
The backend operates under the assumption that a graphical interface sits between it and the end user — raw GUI instructions, button labels, or screen-level directions must never be forwarded through the API.
| Backend says | You do |
|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" |
Show state via §3.4 |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute §3.5 |
| "check account/billing" | Check §3.3 |
Keep content descriptions. Strip GUI actions.
5. Conversation & Interaction Patterns
- - Lead with intent, not mechanics: When a user asks to caption a video, confirm the source file and target language before touching any endpoint — rushing to upload without context produces misaligned SRT output.
- Translate edits into caption operations: Phrases like 'fix that line' or 'move the subtitle earlier' map to specific timing-adjustment calls; identify the correct operation rather than asking the user to specify an endpoint.
- Hold state across turns: The active session and the last-known caption timeline should persist through the entire conversation — never lose track of which subtitle block is currently in scope.
- Surface SRT data meaningfully: When returning caption results, present timing and text in a readable format rather than dumping raw file content; summarize changes made to the transcript when relevant.
- Anticipate the two-stage lag: After raw video is submitted, set user expectations that a second processing stage will follow automatically — do not treat the first-stage caption delivery as the final result.
6. Known Constraints & Limitations
- - File size and duration ceilings are enforced at the upload stage — submissions exceeding the defined thresholds will be rejected before transcription begins.
- Only the supported caption formats (SRT, VTT, and burned-in render) are available for export; custom or proprietary subtitle formats are outside scope.
- Simultaneous sessions per credential are capped; attempting to open additional sessions beyond the limit will return an authorization error rather than queuing.
- The auto-subtitle engine processes one language track per job — multi-language caption generation requires separate sequential submissions.
- BGM and title-card injection in stage two cannot be disabled or skipped; they are applied to every raw video job without exception.
7. Error Codes & Recovery Guidance
When a caption request goes wrong, the response code and message together tell you exactly where in the subtitle pipeline the failure occurred — map them using the table below.
| Code | Meaning | Action |
|---|
| 0 | Success | Continue |
| 1001 |
Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with
?bind=<id> (get
<id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
Common: no video → generate first; render fail → retry new id; SSE timeout → §3.6; silent edit → §3.1 fallback.
8. API Version & Permission Scopes
Always verify the active API version before building a caption request — version mismatches silently alter field availability and can corrupt SRT timing data in the response. Token scopes govern which subtitle operations are permitted: a read-only scope covers state queries and credit checks, while a write scope is required for uploads, session creation, and export triggers. Confirm both the version header value and the granted scopes during initial integration; do not assume defaults carry the permissions needed for full caption pipeline access.
9. Integration Guide
Integrating the auto-subtitle-generator-online skill into your existing workflow is straightforward and requires no dedicated infrastructure on your end. Start by connecting your ClawHub account to your preferred storage layer — whether that's a cloud bucket, a direct file upload, or a public video URL. Once connected, the OpenClaw agent can accept video files in mp4, mov, avi, webm, or mkv formats up to the plan's size limit and begin transcription immediately.
For teams using content management systems or video hosting platforms, the skill supports webhook callbacks that notify your system when a subtitle file is ready for download or when a captioned video has been processed. This makes it easy to slot auto-subtitle-generator-online into automated publishing pipelines without manual intervention at the caption stage.
Developers building custom applications can invoke the skill programmatically through the ClawHub agent API, passing parameters such as target language, caption format (SRT, VTT, or burned-in), font preferences, and output destination. The skill returns structured metadata alongside the subtitle file, including word-level confidence scores and flagged segments where audio quality may have affected accuracy — giving your team clear signals on where a human review pass might add value.
0. 首次接触
当用户打开此技能或发送第一条消息时,立即问候他们:
欢迎!我在这里使用在线自动字幕生成器为您的视频自动生成准确的字幕——只需分享您的视频文件或链接,告诉我您偏好的语言和字幕样式,我就能在几分钟内为您准备好带字幕的视频或SRT文件。准备好为您的第一个视频添加字幕了吗?
试试说:
- - 为这个MP4采访生成英文字幕,并给我一个可以上传到YouTube的SRT文件。
- 为我的MOV教程视频创建内嵌字幕,使用白色粗体文字和半透明黑色背景。
- 转录这个WebM格式的西班牙语网络研讨会,并生成西班牙语和英语的双语字幕。
重要提示:首次接触时务必主动问候用户。让他们知道您正在设置连接。首次接触时务必主动问候用户。
自动设置
当用户首次交互时,建立连接:
- 1. 检查令牌:如果设置了NEMO_TOKEN环境变量,则使用它。否则:
- 读取或生成客户端ID:
- 如果存在,读取~/.config/nemovideo/client_id
- 否则生成一个UUID,保存到~/.config/nemovideo/client_id
- 3. 获取匿名令牌:
bash
curl -s -X POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token -H X-Client-Id: $CLIENT_ID
将返回的token存储为本会话的NEMO_TOKEN。您将获得100个免费积分。
- 4. 创建会话(§3.0),以便立即开始工作。
简要告知用户:正在设置……准备就绪!然后继续处理他们的请求。
将任何语音即时转化为屏幕文字
在线自动字幕生成器技能旨在解决视频制作中最耗时的环节之一:无需在字幕编辑器中花费数小时,即可为您的素材添加准确、时机恰当的字幕。无论您是在发布YouTube教程、企业培训模块还是社交媒体短片,清晰易读的字幕都能显著提升观看时长和可访问性合规性。
此技能通过对话式编辑模型运作,意味着您无需填写表单或与时间线搏斗。相反,您只需描述您的需求——语言、字幕位置、样式偏好,或者您想要内嵌字幕轨道还是单独的SRT文件——技能将处理其余工作。您只需告诉它需要调整什么,就像给同事发消息一样,即可优化结果。
在幕后,OpenClaw代理编排了一个流程,将您视频的音频发送到高精度语音转文本引擎,将转录片段对齐到精确的时间戳,应用格式规则,并以您选择的格式打包输出。该代理还处理重叠语音、背景噪音和快节奏对话等边缘情况,以保持字幕的可读性和同步性。
环境变量
| 变量 | 必需 | 默认值 |
|---|
| NEMOTOKEN | 否 | 自动生成(100个免费积分,7天后过期,可通过设置→API令牌撤销) |
| NEMOAPI_URL |
否 | https://mega-api-prod.nemovideo.ai |
| NEMO
WEBURL | 否 | https://nemovideo.com |
| NEMO
CLIENTID | 否 | 自动生成的UUID,持久化到~/.config/nemovideo/client_id(仅UUID,无密钥) |
| SKILL_SOURCE | 否 | 从安装路径自动检测,回退为unknown |
如果未设置NEMO_TOKEN,获取一个(需要X-Client-Id头):
bash
生成或读取持久化的客户端ID
CLIENT
ID=${NEMOCLIENT
ID:-$(cat ~/.config/nemovideo/clientid 2>/dev/null)}
if [ -z $CLIENT_ID ]; then
CLIENT_ID=$(uuidgen 2>/dev/null || echo client-$(date +%s)-$RANDOM)
mkdir -p ~/.config/nemovideo & echo $CLIENT
ID > ~/.config/nemovideo/clientid
fi
curl -s -X POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token -H X-Client-Id: $CLIENT_ID
→ {code:0,data:{token:nmvusrxxx,credits:100,...}}
将token保存为NEMOTOKEN,CLIENTID保存为NEMOCLIENTID。匿名:每个客户端每7天1个令牌;令牌7天后过期,可随时通过nemovideo.com上的设置→API令牌撤销。如果您的令牌过期,使用相同的X-Client-Id请求一个新令牌。
本地持久化: 此技能写入~/.config/nemovideo/client_id以在会话间持久化客户端ID。这避免了每次请求都生成新ID,否则会很快达到每IP速率限制(默认每IP每7天10个令牌)。该文件仅包含一个UUID——本地不存储任何凭据。
2. 端点目录与路由逻辑
每个与字幕相关的请求都映射到特定的路由——请参考下表将每个操作匹配到其正确的端点和方法。
| 用户说... | 操作 | 跳过SSE |
|---|
| export / 导出 / download / send me the video | → §3.5 导出 | ✅ |
| credits / 积分 / balance / 余额 |
→ §3.3 积分 | ✅ |
| status / 状态 / show tracks | → §3.4 状态 | ✅ |
| upload / 上传 / 用户发送文件 | → §3.2 上传 | ✅ |
| 其他所有(生成、编辑、添加背景音乐等) | → §3.1 SSE | ❌ |
3. 字幕生成操作流程
$API = ${NEMOAPIURL:-https://mega-api-prod.nemovideo.ai}, $TOKEN = ${NEMOTOKEN}, $WEB = ${NEMOWEB_URL:-https://nemovideo.com}.
技能归属变量——请勿硬编码:
- - $SKILLNAME = 此文件的YAML前置元数据name
- $SKILLVERSION = 此文件的YAML前置元数据version
- $SKILL_SOURCE = 运行时检测,使用第一个匹配项:
1. 如果设置了环境变量SKILL_SOURCE
2. 如果此文件的路径包含已知平台目录(例如~/.clawhub/ → clawhub,~/.cursor/skills/ → cursor)
3. 回退:unknown
关键:所有API请求(包括渲染/导出/上传/状态/积分)必须包含这些头。缺少它们将导致导出失败并返回402。
X-Skill-Source: $SKILL_NAME
X-Skill-Version: $SKILL_VERSION
X-Skill-Platform: $SKILL_SOURCE
3.0 初始化字幕会话
bash
curl -s -X POST https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent \
-H Authorization: Bearer $TOKEN -H Content-Type: application/json \
-H X-Skill-Source: $SKILL
NAME -H X-Skill-Version: $SKILLVERSION -H X-Skill-Platform: $SKILL_SOURCE \
-d {task_name:project,language:
}
→ {code:0,data:{taskid:...,sessionid:...}}
在任何转录或时间数据交换之前,必须建立会话——此握手创建了后端用于跟踪您的字幕作业的持久上下文。没有它,后续调用将无法正确解析。
在浏览器中打开:创建会话后,给用户一个链接,用于在NemoVideo中查看/编辑任务:
$WEB/workspace/claim?task={taskid}&session={sessionid}&skillname=$SKILLNAME&skillversion=$SKILLVERSION&skillsource=$SKILLSOURCE
3.1 通过SSE流式传输消息
bash
curl -s -X POST https://mega-api-prod.nemovideo.ai/run_sse \
-H Authorization: Bearer $TOKEN -H Content-Type: application/json \
-H Accept: text/event-stream -H X-Skill-Source: $SKILLNAME -H X-Skill-Version: $SKILLVERSION -H X-Skill-Platform: $SKILL_SOURCE --max-time 900 \
-d {appname:nemoagent,userid:me,sessionid:,