Getting Started
Welcome! I'm here to help you generate accurate, time-synced subtitles from your video's audio track. Upload your video file and tell me your preferred subtitle format or any specific requirements — let's get your captions ready to go!
Try saying:
- "Generate subtitles for this mp4 interview video and export them as an SRT file"
- "Create captions for my webinar recording — the speaker has a slight accent so please be extra careful with accuracy"
- "I have a 45-minute mkv documentary — can you produce a VTT subtitle file with line breaks kept under 42 characters?"
Quick Start Setup
This skill connects to NemoVideo's cloud backend. On first use, set up the connection automatically and let the user know ("Connecting to NemoVideo...").
Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:
- Read ~/.config/nemovideo/client_id, or generate a UUID and save it there.
- POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header.
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, 7-day expiry).
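The token steps above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the helper names are hypothetical, the request is built but not sent, and the response is assumed to be JSON with the token at data.token as described above.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"
DEFAULT_CLIENT_ID_PATH = Path.home() / ".config" / "nemovideo" / "client_id"


def load_or_create_client_id(path: Path = DEFAULT_CLIENT_ID_PATH) -> str:
    """Return the stored client id, or generate a UUID and save it there."""
    if path.exists():
        stored = path.read_text().strip()
        if stored:
            return stored
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id)
    return client_id


def build_anonymous_token_request(client_id: str) -> Request:
    """Build (but do not send) the POST that requests an anonymous token."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def extract_token(response_body: str) -> str:
    """Pull data.token out of the JSON response body (assumed shape)."""
    return json.loads(response_body)["data"]["token"]
```

Passing the built request to urllib.request.urlopen would perform the call; keeping construction separate from sending makes the logic easy to check offline.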
Session: POST /api/tasks/me/with-session/nemo_agent at the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.
Turn Every Word Spoken Into Readable, Synced Subtitles
Whether you're publishing a YouTube tutorial, captioning a corporate training video, or making a documentary accessible to deaf and hard-of-hearing audiences, getting subtitles right matters. This skill listens to the audio in your video file and converts every spoken word into a properly timed subtitle file — no manual typing, no tedious timestamp adjustments, and no expensive transcription services required.
The audio-to-subtitle-generator works by analyzing the speech track in your uploaded video, segmenting it into readable lines, and attaching precise start and end timestamps to each segment. The result is a subtitle file you can drop directly into your video editor, upload to YouTube or Vimeo, or embed into your website player.
This is especially valuable for multilingual teams, solo creators working at scale, or anyone who needs to repurpose recorded content across multiple formats. Instead of spending hours scrubbing through a timeline, you get a complete subtitle draft in a fraction of the time — ready to review, edit if needed, and publish with confidence.
Routing Your Transcription Requests
Each subtitle generation request is parsed for audio source, language preference, and caption format, then routed to the appropriate transcription pipeline automatically.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
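One way to read the routing table is as a keyword lookup with an SSE fallback. The sketch below assumes case-insensitive substring matching and first-match precedence in table order; neither is stated in the table, and the action names are hypothetical labels.

```python
# Keyword groups copied from the routing table; action labels are illustrative.
ROUTES = [
    (("export", "导出", "download", "send me the video"), "export"),
    (("credits", "积分", "balance", "余额"), "credits"),
    (("status", "状态", "show tracks"), "state"),
    (("upload", "上传"), "upload"),
]


def route(message: str, has_file: bool = False) -> tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    if has_file:
        # "user sends file" row: an attachment always means upload.
        return "upload", True
    text = message.lower()
    for keywords, action in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, True
    # Everything else (generate, edit, add BGM...) goes through SSE.
    return "sse", False
```

Plain substring matching will misroute mixed requests such as "generate subtitles and export them", so a real router would likely need intent detection rather than keywords alone.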
NemoVideo API Reference
The NemoVideo backend handles speech-to-text processing by analyzing audio waveforms, detecting speaker segments, and outputting time-coded subtitle tracks in SRT, VTT, or plain text formats. Requests are authenticated via bearer token and processed asynchronously, with subtitle files returned once the transcription job completes.
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: audio-to-subtitle-generator
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
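A minimal sketch of assembling the required headers. The function names are hypothetical, and matching the install path by substring is an assumption about how "detect from install path" might be done.

```python
def detect_platform(install_path: str) -> str:
    """Map the skill's install path to an X-Skill-Platform value."""
    normalized = install_path.replace("\\", "/")
    if "/.clawhub/" in normalized:
        return "clawhub"
    if "/.cursor/skills/" in normalized:
        return "cursor"
    return "unknown"


def build_headers(token: str, skill_source: str, skill_version: str,
                  install_path: str) -> dict[str, str]:
    """Return the four headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": skill_source,
        "X-Skill-Version": skill_version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place makes it harder to omit an attribution header, which the text above says causes export to fail with 402.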
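API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id and session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=audio-to-subtitle-generator&skill_version=1.0.0&skill_source=<platform>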
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
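The message body can be built with a small helper. This is a sketch: the function and constant names are hypothetical, and the Content-Type header is an assumption not stated in the reference.

```python
import json

# Accept is given in the reference; Content-Type is an assumption.
SSE_HEADERS = {"Accept": "text/event-stream", "Content-Type": "application/json"}
SSE_TIMEOUT_SECONDS = 15 * 60  # "Max timeout: 15 minutes"


def build_message_body(session_id: str, text: str) -> dict:
    """Return the JSON body for POST /run_sse."""
    return {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }


def encode_message(session_id: str, text: str) -> bytes:
    """Serialize the body as UTF-8 JSON, ready to send as request data."""
    return json.dumps(build_message_body(session_id, text)).encode("utf-8")
```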
Upload: POST /api/upload-video/nemo_agent/me/<sid> with a multipart file (-F "files=@/path") or a URL body: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple; returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest; key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
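The export-and-poll flow might look like the sketch below. Assumptions are labeled in the comments: the timestamp unit for the render id, the poll cap, and the status response exposing status and output.url at the top level. The HTTP call is injected as fetch_status so the loop can be exercised without a network.

```python
import time
from typing import Callable, Optional


def build_export_body(session_id: str, draft: dict,
                      timestamp: Optional[int] = None) -> dict:
    """Return the JSON body for POST /api/render/proxy/lambda."""
    # Assumption: <ts> is a Unix timestamp in seconds.
    ts = int(time.time()) if timestamp is None else timestamp
    return {
        "id": f"render_{ts}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_until_completed(fetch_status: Callable[[str], dict], render_id: str,
                         interval_s: float = 30.0, max_polls: int = 60,
                         sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll every interval_s seconds and return the download URL.

    fetch_status(render_id) stands in for GET /api/render/proxy/lambda/<id>.
    max_polls is a hypothetical safety cap, not a documented limit.
    """
    for _ in range(max_polls):
        result = fetch_status(render_id)
        if result.get("status") == "completed":
            return result["output"]["url"]
        sleep(interval_s)
    raise TimeoutError(f"render {render_id} did not complete in time")
```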
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
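A small filter for the stream can separate real payloads from keep-alives. This sketch assumes standard SSE framing (one data: line per event) and assumes a heartbeat arrives either as an empty data: line or as the literal payload heartbeat; the actual event shape is not documented here.

```python
from typing import Iterable, Iterator


def iter_data_payloads(lines: Iterable[str]) -> Iterator[str]:
    """Yield non-empty data payloads, skipping heartbeats and other lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            # Blank separators, ":" comments, and event/id fields are ignored.
            continue
        payload = line[len("data:"):].strip()
        if payload and payload != "heartbeat":  # assumed heartbeat shapes
            yield payload
```

If the stream closes and none of the yielded payloads contains text for the user, that is the case where polling session state is called for. Multi-line data events are not joined in this simplified version.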
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
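Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)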
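The draft field mapping can be turned into a track summary along these lines. The nesting is an assumption: the sketch supposes draft["t"] is a list of tracks, each with a tt type code and an sg list of segments that each carry a duration d in milliseconds. Only the key names come from the mapping above.

```python
# Track type codes from the field mapping; other codes are reported as unknown.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_tracks(draft: dict) -> list[str]:
    """Return one human-readable line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # Assumption: each segment stores its own duration under "d" (ms).
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s"
        )
    return lines
```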
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
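The error table lends itself to a lookup that returns the next step for a given code. The action labels below are hypothetical names for the table's Action column, and the fallback for unlisted codes is an assumption.

```python
# Codes from the error table mapped to illustrative action labels.
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",            # bad or expired token
    1002: "create_session",            # session not found
    2001: "show_credit_options",       # registration URL or top-up prompt
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_registration",        # subscription tier issue, not credits
    429: "retry_once_after_30s",
}


def action_for(code: int) -> str:
    """Return the action label for a code; unlisted codes get a generic label."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")
```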
Best Practices
For the most accurate subtitle output, start with the cleanest audio possible. Videos with minimal background noise, consistent microphone placement, and clear speech will produce subtitles that need little to no manual correction after generation.
If your video features technical jargon, brand names, or industry-specific terminology, mention key terms upfront so they can be handled with greater care during transcription. This is particularly useful for medical, legal, or technology-focused content where a misheard word can change meaning significantly.
Keep subtitle line lengths readable — aim for no more than two lines on screen at a time and avoid breaking sentences mid-thought when possible. When reviewing your generated subtitles, pay special attention to speaker transitions and moments with overlapping dialogue, as these are the most common areas where timing may need a small manual nudge before publishing.
Quick Start Guide
Getting started with the audio-to-subtitle-generator is straightforward. Begin by uploading your video file in one of the supported formats: mp4, mov, avi, webm, or mkv. Once uploaded, specify your preferred output format — SRT is the most universally compatible, while VTT works best for web-based players and HTML5 video.
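For readers comparing the two formats: SRT and VTT cue timings differ mainly in the millisecond separator (a comma in SRT, a period in WebVTT). The sketch below illustrates that difference with hypothetical helper names; it does not describe how the backend produces its files.

```python
def format_timestamp(ms: int, fmt: str = "srt") -> str:
    """Format milliseconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    separator = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def srt_cue(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Render one numbered SRT cue block."""
    timing = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
    return f"{index}\n{timing}\n{text}\n"
```

A VTT file additionally starts with a WEBVTT header line, and its cue numbers are optional.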
If your video contains multiple speakers, mention that upfront so subtitles can be segmented clearly between voices. You can also specify a maximum characters-per-line limit if your platform has display constraints — 42 characters per line is a common broadcast standard.
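If you want to check or enforce a characters-per-line limit yourself after download, Python's standard textwrap module is enough for a first pass. This is a sketch with a hypothetical function name; it wraps on whitespace and groups lines into cues of at most two lines, without regard to sentence boundaries or timing.

```python
import textwrap


def chunk_cue_text(text: str, max_chars: int = 42, max_lines: int = 2) -> list[str]:
    """Wrap text to max_chars per line and group lines into cue-sized chunks."""
    lines = textwrap.wrap(text, width=max_chars)
    return [
        "\n".join(lines[start:start + max_lines])
        for start in range(0, len(lines), max_lines)
    ]
```

Each returned chunk would still need its own start and end time, so this helps with layout review more than with re-timing.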
Once processing is complete, you'll receive your subtitle file ready for download. You can import it directly into Adobe Premiere Pro, DaVinci Resolve, Final Cut Pro, or upload it alongside your video on YouTube, Vimeo, or any streaming platform that accepts external caption files.
Use Cases
The audio-to-subtitle-generator serves a wide range of real-world workflows. Content creators on YouTube and TikTok use it to add captions that boost watch time and reach viewers who watch without sound — a habit that now represents over 85% of mobile video consumption.
Educators and e-learning developers rely on it to make course videos ADA and WCAG compliant, ensuring students with hearing impairments have full access to lecture content. Legal and medical professionals use it to transcribe recorded depositions, patient consultations, or training sessions where accuracy and timestamping are critical for documentation.
Journalists and podcast producers convert recorded interviews into subtitle files that double as searchable transcripts. Corporate communications teams use it to caption internal town halls, product demos, and onboarding videos — making content reusable across global teams regardless of language or hearing ability.
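Getting Started
Welcome! I'm here to help you generate accurate, time-synced subtitles from your video's audio track. Upload your video file and tell me your preferred subtitle format or any special requirements, and let's get your subtitles ready!
Try saying:
- "Generate subtitles for this MP4 interview video and export them as an SRT file"
- "Create subtitles for my webinar recording; the speaker has a slight accent, so please pay special attention to accuracy"
- "I have a 45-minute MKV documentary. Can you generate a VTT subtitle file with no more than 42 characters per line?"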
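Quick Start Setup
This skill connects to NemoVideo's cloud backend. On first use, establish the connection automatically and notify the user ("Connecting to NemoVideo...").
Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:
- Read ~/.config/nemovideo/client_id, or generate a UUID and save it there
- POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, valid for 7 days)
Session: POST /api/tasks/me/with-session/nemo_agent at the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
When setup is complete, notify the user with a brief "Ready!". Don't expose tokens or raw API output.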
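Turn Every Word Spoken Into Readable, Synced Subtitles
Whether you're publishing a YouTube tutorial, captioning a corporate training video, or making a documentary accessible to deaf and hard-of-hearing audiences, getting subtitles right matters. This skill listens to the audio in your video file and converts every spoken word into a precisely timed subtitle file, with no manual typing, no tedious timestamp adjustments, and no expensive transcription services.
The audio-to-subtitle-generator analyzes the speech track in your uploaded video, splits it into readable lines, and attaches precise start and end timestamps to each segment. The resulting subtitle file can be imported directly into a video editor, uploaded to YouTube or Vimeo, or embedded in a website player.
This is especially valuable for multilingual teams, solo creators working at scale, or anyone who needs to reuse recorded content across multiple formats. Instead of spending hours scrubbing back and forth through a timeline, you get a complete subtitle draft in a fraction of the time, ready to review, edit if needed, and publish with confidence.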
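Routing Your Transcription Requests
Each subtitle generation request is parsed for audio source, language preference, and subtitle format, then routed automatically to the appropriate transcription pipeline.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add background music, etc.) | → §3.1 SSE | ❌ |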
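NemoVideo API Reference
The NemoVideo backend handles speech-to-text by analyzing audio waveforms, detecting speaker segments, and outputting time-coded subtitle tracks in SRT, VTT, or plain text format. Requests are authenticated with a Bearer token and processed asynchronously; the subtitle file is returned once the transcription job completes.
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: audio-to-subtitle-generator
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with a 402 error.
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id and session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=audio-to-subtitle-generator&skill_version=1.0.0&skill_source=<platform>
Send message (SSE): POST /run_sse with body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> with a multipart file (-F "files=@/path") or a URL body: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple; returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest; key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda with body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30 seconds until status = completed. The download URL is at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.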
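SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 minutes: "⏳ Still working..." |
| Stream closes | Process final response |
About 30% of editing operations return no text in the SSE stream. When this happens, poll session state to verify the edit was applied, then summarize the changes for the user.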
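Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration (ms), m=metadata.
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)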
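Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, not a credits issue. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry once after 30 seconds |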
Best Practices
For the most accurate subtitle output,