Getting Started

Welcome! I'm here to help you generate accurate, time-synced subtitles from your video's audio track. Upload your video file and tell me your preferred subtitle format or any specific requirements — let's get your captions ready to go!

Try saying:

- "Generate subtitles for this mp4 interview video and export them as an SRT file"
- "Create captions for my webinar recording. The speaker has a slight accent, so please be extra careful with accuracy"
- "I have a 45-minute mkv documentary. Can you produce a VTT subtitle file with lines kept under 42 characters?"

Quick Start Setup

This skill connects to NemoVideo's cloud backend. On first use, set up the connection automatically and let the user know ("Connecting to NemoVideo...").

Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:

- Read ~/.config/nemovideo/client_id, or generate a UUID and save it there
- POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with an X-Client-Id header
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, 7-day expiry)
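The token steps above can be sketched roughly as follows. This is a minimal sketch, not the skill's actual implementation: the function names are hypothetical, the empty POST body is an assumption (the text does not describe a payload), and the request is only built here, not sent.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path: Path) -> str:
    """Return the stored client ID, or generate a UUID and save it."""
    if path.exists():
        stored = path.read_text(encoding="utf-8").strip()
        if stored:
            return stored
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id, encoding="utf-8")
    return client_id


def build_anonymous_token_request(client_id: str) -> Request:
    """Build (but do not send) the POST that asks for an anonymous token."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        data=b"",  # assumption: no request payload is required
        headers={"X-Client-Id": client_id},
        method="POST",
    )


def extract_token(response_body: str) -> str:
    """Read data.token from the JSON response body."""
    return json.loads(response_body)["data"]["token"]
```

In practice the request would be passed to `urllib.request.urlopen`, and the whole exchange skipped when `NEMO_TOKEN` is already present in the environment.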

Session: POST /api/tasks/me/with-session/nemo_agent at the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.

Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.

Turn Every Word Spoken Into Readable, Synced Subtitles

Whether you're publishing a YouTube tutorial, captioning a corporate training video, or making a documentary accessible to deaf and hard-of-hearing audiences, getting subtitles right matters. This skill listens to the audio in your video file and converts every spoken word into a properly timed subtitle file — no manual typing, no tedious timestamp adjustments, and no expensive transcription services required.

The audio-to-subtitle-generator works by analyzing the speech track in your uploaded video, segmenting it into readable lines, and attaching precise start and end timestamps to each segment. The result is a subtitle file you can drop directly into your video editor, upload to YouTube or Vimeo, or embed into your website player.

This is especially valuable for multilingual teams, solo creators working at scale, or anyone who needs to repurpose recorded content across multiple formats. Instead of spending hours scrubbing through a timeline, you get a complete subtitle draft in a fraction of the time — ready to review, edit if needed, and publish with confidence.

Routing Your Transcription Requests

Each subtitle generation request is parsed for audio source, language preference, and caption format, then routed to the appropriate transcription pipeline automatically.

| User says... | Action | Skip SSE? |
| --- | --- | --- |
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
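One way to picture the routing table is a keyword lookup. The sketch below is illustrative only: the `route` function, the action labels, and the plain substring matching are assumptions, and real intent detection would need more care (a request such as "generate subtitles and export them as SRT" contains "export" and would be routed to Export by this naive version).

```python
# Keyword groups in table order: (keywords, action, skip_sse)
ROUTES = [
    (("export", "导出", "download", "send me the video"), "export", True),
    (("credits", "积分", "balance", "余额"), "credits", True),
    (("status", "状态", "show tracks"), "state", True),
    (("upload", "上传"), "upload", True),
]


def route(message: str, has_attachment: bool = False) -> tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    if has_attachment:  # "user sends file" row of the table
        return "upload", True
    text = message.lower()
    for keywords, action, skip_sse in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, skip_sse
    return "sse", False  # everything else goes through the SSE endpoint
```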

NemoVideo API Reference

The NemoVideo backend handles speech-to-text processing by analyzing audio waveforms, detecting speaker segments, and outputting time-coded subtitle tracks in SRT, VTT, or plain text formats. Requests are authenticated via bearer token and processed asynchronously, with subtitle files returned once the transcription job completes.

Skill attribution — read from this file's YAML frontmatter at runtime:

- X-Skill-Source: audio-to-subtitle-generator
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
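The required headers could be assembled along these lines. This is a hedged sketch: `detect_platform` and `build_headers` are hypothetical helper names, the path check is a simple substring test rather than a full path resolution, and the version is expected to be read from the frontmatter by the caller.

```python
def detect_platform(install_path: str) -> str:
    """Map the skill's install path to a platform label."""
    path = install_path.replace("\\", "/")
    if "/.clawhub/" in path:
        return "clawhub"
    if "/.cursor/skills/" in path:
        return "cursor"
    return "unknown"


def build_headers(token: str, version: str, install_path: str) -> dict[str, str]:
    """Return the four headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "audio-to-subtitle-generator",
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```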

API base: https://mega-api-prod.nemovideo.ai

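Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=audio-to-subtitle-generator&skill_version=1.0.0&skill_source=<skill_source>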

Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
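A sketch of the message body and a stream reader follows. The body mirrors the fields listed above; the parser follows generic text/event-stream framing (`data:` lines terminated by a blank line). That each event carries a JSON payload is an assumption, and both function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def build_run_sse_body(session_id: str, message: str) -> dict:
    """Build the JSON body for POST /run_sse."""
    return {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": message}]},
    }


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded payload per event from an iterable of stream lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The event-stream format drops a single leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield json.loads("\n".join(buffer))  # assumption: payloads are JSON
            buffer = []
        # Comment lines (":...") and other fields are ignored in this sketch.
    if buffer:  # stream ended without a trailing blank line
        yield json.loads("\n".join(buffer))
```

The caller would still need to enforce the 15-minute ceiling, for example through the HTTP client's read timeout.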

Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path"; URL: {"urls":["<url>"],"source_type":"url"}

Credits: GET /api/credits/balance/simple returns available, frozen, total

Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media

Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
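The export-and-poll cycle might look like the sketch below. It assumes `status` and `output.url` sit at the top level of the poll response. The `max_polls` cap and the injected `fetch_status` and `sleep` callables are hypothetical additions so the loop can be exercised without a network. Failure states are not handled because the text does not name any.

```python
import time
from typing import Callable


def build_render_body(session_id: str, draft: dict, timestamp: int) -> dict:
    """Build the JSON body for POST /api/render/proxy/lambda."""
    return {
        "id": f"render_{timestamp}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(
    fetch_status: Callable[[], dict],
    interval_s: float = 30.0,
    max_polls: int = 60,  # hypothetical cap, not from the API description
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is "completed", then return the download URL."""
    for _ in range(max_polls):
        job = fetch_status()
        if job.get("status") == "completed":
            return job["output"]["url"]
        sleep(interval_s)
    raise TimeoutError("render did not complete within the polling budget")
```

Here `fetch_status` would wrap a GET to /api/render/proxy/lambda/<id> and return the decoded JSON.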

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

SSE Event Handling

| Event | Action |
| --- | --- |
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | |

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

| Backend says | You do |
| --- | --- |
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | |

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

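Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)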
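The draft field mapping can be applied when summarizing session state for the user, for instance after an edit that returned no text. The sketch below assumes a particular nesting that the text does not spell out: tracks as a list under `t`, each with a `tt` type code and a list of segments under `sg`, each segment carrying its own `d`. `summarize_draft` is a hypothetical helper.

```python
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> list[str]:
    """Return one human-readable line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # Assumption: each segment stores its duration in milliseconds under "d".
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s"
        )
    return lines
```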

Error Handling

| Code | Meaning | Action |
| --- | --- | --- |
| 0 | Success | Continue |
| 1001 | | |

Best Practices

For the most accurate subtitle output, start with the cleanest audio possible. Videos with minimal background noise, consistent microphone placement, and clear speech will produce subtitles that need little to no manual correction after generation.

If your video features technical jargon, brand names, or industry-specific terminology, mention key terms upfront so they can be handled with greater care during transcription. This is particularly useful for medical, legal, or technology-focused content where a misheard word can change meaning significantly.

Keep subtitle line lengths readable — aim for no more than two lines on screen at a time and avoid breaking sentences mid-thought when possible. When reviewing your generated subtitles, pay special attention to speaker transitions and moments with overlapping dialogue, as these are the most common areas where timing may need a small manual nudge before publishing.

Quick Start Guide

Getting started with the audio-to-subtitle-generator is straightforward. Begin by uploading your video file in one of the supported formats: mp4, mov, avi, webm, or mkv. Once uploaded, specify your preferred output format — SRT is the most universally compatible, while VTT works best for web-based players and HTML5 video.
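To make the difference between the two formats concrete, here is a small sketch that serializes the same cues as SRT and as VTT. The `Cue` type and function names are illustrative and say nothing about how the backend produces its files. The format details are standard: SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def format_timestamp(ms: int, separator: str) -> str:
    """Render milliseconds as HH:MM:SS<separator>mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Serialize cues as SubRip: index line, timing line, text, blank line."""
    blocks = []
    for number, cue in enumerate(cues, start=1):
        start = format_timestamp(cue.start_ms, ",")
        end = format_timestamp(cue.end_ms, ",")
        blocks.append(f"{number}\n{start} --> {end}\n{cue.text}\n")
    return "\n".join(blocks)


def to_vtt(cues: list[Cue]) -> str:
    """Serialize cues as WebVTT: header, then timing line and text per cue."""
    blocks = ["WEBVTT\n"]
    for cue in cues:
        start = format_timestamp(cue.start_ms, ".")
        end = format_timestamp(cue.end_ms, ".")
        blocks.append(f"{start} --> {end}\n{cue.text}\n")
    return "\n".join(blocks)
```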

If your video contains multiple speakers, mention that upfront so subtitles can be segmented clearly between voices. You can also specify a maximum characters-per-line limit if your platform has display constraints — 42 characters per line is a common broadcast standard.
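A characters-per-line limit can also be checked or applied locally when reviewing a generated file. The sketch below uses Python's standard `textwrap` module; `wrap_caption` is a hypothetical helper and handles text only, so any cue it splits would still need its timing divided by hand. Words longer than the limit are left intact rather than broken.

```python
import textwrap


def wrap_caption(text: str, max_chars: int = 42, max_lines: int = 2) -> list[list[str]]:
    """Split caption text into groups of at most max_lines wrapped lines."""
    lines = textwrap.wrap(text, width=max_chars, break_long_words=False)
    # Each group of lines is what would appear on screen at one time.
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
```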

Once processing is complete, you'll receive your subtitle file ready for download. You can import it directly into Adobe Premiere Pro, DaVinci Resolve, Final Cut Pro, or upload it alongside your video on YouTube, Vimeo, or any streaming platform that accepts external caption files.

Use Cases

The audio-to-subtitle-generator serves a wide range of real-world workflows. Content creators on YouTube and TikTok use it to add captions that boost watch time and reach viewers who watch without sound — a habit that now represents over 85% of mobile video consumption.

Educators and e-learning developers rely on it to make course videos ADA and WCAG compliant, ensuring students with hearing impairments have full access to lecture content. Legal and medical professionals use it to transcribe recorded depositions, patient consultations, or training sessions where accuracy and timestamping are critical for documentation.

Journalists and podcast producers convert recorded interviews into subtitle files that double as searchable transcripts. Corporate communications teams use it to caption internal town halls, product demos, and onboarding videos — making content reusable across global teams regardless of language or hearing ability.

