Getting Started

Welcome to the Video Caption Generator AI (A/B Legacy) — paste a video link or upload your file and I'll generate accurate, synced captions ready to export. Drop your footage here to get started.

Try saying:

- "Generate captions for this 8-minute tutorial video and export them as an SRT file with accurate timestamps"
- "My interview footage has two speakers with some background noise. Can you transcribe and caption it with speaker labels?"
- "Create captions for this Instagram Reel and format them as short 2-3 word bursts for on-screen display"

First-Time Connection

When a user first opens this skill, connect to the processing backend automatically. Briefly let them know (e.g. "Setting up...").

Authentication: Check if NEMO_TOKEN is set in the environment. If it is, skip to step 2.

1. Obtain a free token: Generate a random UUID as the client identifier. POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the header X-Client-Id set to that UUID. The response field data.token is your NEMO_TOKEN, which comes with 100 free credits and is valid for 7 days.
2. Create a session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Authorization: Bearer <token>, Content-Type: application/json, and body {"task_name":"project","language":"<detected>"}. Store the returned session_id for all subsequent requests.

Keep setup communication brief. Don't display raw API responses or token values to the user.
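
The two setup steps can be sketched roughly as follows, using only the Python standard library. The helper names (build_token_request, build_session_request, send_json, connect) are hypothetical, and the location of session_id in the session reply is an assumption, since the text only says it is "returned".

```python
import json
import os
import uuid
import urllib.request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def build_token_request(client_id: str) -> urllib.request.Request:
    """Step 1: POST for an anonymous token; X-Client-Id carries the UUID."""
    return urllib.request.Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def build_session_request(token: str, language: str) -> urllib.request.Request:
    """Step 2: POST that creates a session for the given token."""
    body = json.dumps({"task_name": "project", "language": language})
    return urllib.request.Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def send_json(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def connect(language: str = "en") -> tuple:
    """Reuse NEMO_TOKEN if it is set, otherwise fetch one, then open a session."""
    token = os.environ.get("NEMO_TOKEN")
    if not token:
        reply = send_json(build_token_request(str(uuid.uuid4())))
        token = reply["data"]["token"]
    session = send_json(build_session_request(token, language))
    # Assumption: session_id sits at the top level of the reply.
    return token, session.get("session_id")
```

Neither the token nor the raw replies are printed, in line with the note above about keeping setup output away from the user.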

Captions That Actually Sync With Your Audio

Getting captions right is harder than it looks. Automated tools often mis-time lines, split sentences at awkward breaks, or completely fumble proper nouns and technical terms. The video-caption-generator-ai-ab-old skill was built specifically to address those gaps, using a model variant that was fine-tuned through A/B testing rounds to improve sync accuracy and readability across a wide range of video types.

Whether you're working with a talking-head interview, a screen recording with voiceover, or a fast-paced social video with background music, this skill processes the audio layer and produces captions that align tightly with spoken words. It handles filler words gracefully, doesn't hallucinate lines, and respects natural pauses when deciding where to break caption blocks.

The output is practical and portable — you get caption text with timestamps that can be dropped into editing timelines, uploaded as subtitle files, or reformatted for accessibility compliance. No post-processing gymnastics required. If you've ever spent an hour fixing auto-generated captions by hand, this is the workflow you've been looking for.

Caption Request Routing Logic

Every caption request you submit gets parsed by the A/B Legacy dispatcher, which evaluates your footage metadata and queues it to the appropriate transcription pipeline based on language model version and frame rate compatibility.

| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
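
One way to read the routing table is as a keyword lookup with an SSE fallback. The sketch below is only an illustration: the route function, the action names, and the use of case-insensitive substring matching are all assumptions, not the dispatcher's actual logic.

```python
from typing import Tuple

# (keywords, action) pairs in the same order as the table rows.
DIRECT_ROUTES = [
    (("export", "导出", "download", "send me the video"), "export"),
    (("credits", "积分", "balance", "余额"), "credits"),
    (("status", "状态", "show tracks"), "state"),
    (("upload", "上传"), "upload"),
]


def route(message: str, has_file: bool = False) -> Tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    if has_file:
        # "user sends file" row: handled as an upload, no SSE call.
        return "upload", True
    text = message.lower()
    for keywords, action in DIRECT_ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, True
    # Everything else (generate, edit, add BGM...) goes through SSE.
    return "sse", False
```

Substring matching is deliberately naive. A request such as "generate captions and export them as SRT" would match the export row first, so a real implementation would likely need intent detection rather than keyword hits.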

Legacy API Backend Reference

The V2 legacy cloud processor handles all caption generation jobs through a distributed speech-to-text layer that timestamps each caption block at the frame level, syncing dialogue boundaries to your original timecode. Batch caption exports and SRT/VTT outputs are rendered server-side before being pushed back to your session workspace.

Skill attribution — read from this file's YAML frontmatter at runtime:

- X-Skill-Source: video-caption-generator-ai-ab-old
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
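
A minimal sketch of assembling these headers is shown below. The function names are hypothetical, and the version and source values are passed in by the caller because they come from the frontmatter at runtime. Platform detection is reduced to a substring check on the install path, which is an assumption about how "detect from install path" could be done.

```python
def detect_platform(install_path: str) -> str:
    """Map an install path to the X-Skill-Platform value."""
    normalized = install_path.replace("\\", "/") + "/"
    if "/.clawhub/" in normalized:
        return "clawhub"
    if "/.cursor/skills/" in normalized:
        return "cursor"
    return "unknown"


def attribution_headers(token: str, source: str, version: str, install_path: str) -> dict:
    """Headers that every request is expected to carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": source,
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place makes it harder to send a request without attribution, which matters because a missing header surfaces later as a 402 on export.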

API base: https://mega-api-prod.nemovideo.ai

Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.

Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
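
The request body and the stream handling might look like the sketch below. The body follows the shape given above. The parser assumes the backend emits standard Server-Sent Events framing ("data:" lines separated by blank lines); the payload schema inside each event is not documented here, so the raw data strings are returned untouched. Both function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def build_message_body(session_id: str, text: str) -> bytes:
    """JSON body for POST /run_sse."""
    payload = {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }
    return json.dumps(payload).encode("utf-8")


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event from decoded SSE lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line terminates the current event.
            yield "\n".join(buffer)
            buffer = []
        # Comment lines (":...") and other fields are ignored in this sketch.
    if buffer:
        # Lenient: also emit a trailing event that lacks a final blank line.
        yield "\n".join(buffer)
```

The request itself should set Accept: text/event-stream and a client timeout of up to 15 minutes, as stated above.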

Upload: POST /api/upload-video/nemo_agent/me/<sid>. Send a file as multipart (-F "files=@/path"), or a URL as the body {"urls":["<url>"],"source_type":"url"}.

Credits: GET /api/credits/balance/simple returns available, frozen, and total.

Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media.

Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
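
The export flow is a submit-then-poll loop, sketched below with the HTTP call injected as a function so the loop can be exercised without a network. The names build_render_body and wait_for_render, the max_wait_s cap, and the assumption that status and output.url sit at the top level of the poll reply are all illustrative.

```python
import time
from typing import Callable


def build_render_body(session_id: str, draft: dict, ts: int) -> dict:
    """Body for POST /api/render/proxy/lambda; ts fills the render_<ts> id."""
    return {
        "id": f"render_{ts}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def wait_for_render(
    fetch_status: Callable[[], dict],
    interval_s: float = 30.0,
    max_wait_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is "completed", then return the download URL."""
    waited = 0.0
    while True:
        reply = fetch_status()
        if reply.get("status") == "completed":
            return reply["output"]["url"]
        if waited >= max_wait_s:
            # Hypothetical safety cap; the source gives no render timeout.
            raise TimeoutError("render did not complete in time")
        sleep(interval_s)
        waited += interval_s
```

In practice fetch_status would wrap GET /api/render/proxy/lambda/<id> with the attribution headers attached.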

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

SSE Event Handling

| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | |

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | |

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

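Timeline (3 tracks): 1. Video: city timelapse (0-10s) 2. BGM: Lo-fi (0-10s, 35%) 3. Title: "City Dreams" (0-3s)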
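
As a rough illustration of the draft field mapping, the sketch below turns a draft into one summary line per track. The nesting is an assumption: it treats t as a list of track objects, each with a tt type code and an sg list of segments that carry d in milliseconds. The real draft layout may differ, and summarize_draft is a hypothetical helper.

```python
# Track type codes from the field mapping.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> list:
    """Return one human-readable line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # d is in milliseconds; sum the segments and report seconds.
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s"
        )
    return lines
```

A summary like this is useful after silent edits, where the session state is the only evidence of what changed.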

Error Handling

| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | | |

Tips and Tricks

For best results with video-caption-generator-ai-ab-old, upload audio that's as clean as possible — even a basic noise reduction pass before submitting will noticeably improve transcription accuracy on clips recorded in echoey rooms or outdoors.

If your video contains technical jargon, product names, or uncommon proper nouns, include a short glossary or word list in your prompt. The skill will prioritize those spellings over its default phonetic guesses, which cuts down on correction time significantly.

When working with multi-speaker content, specify whether you want speaker labels in the output. The A/B legacy model handles overlapping dialogue better when it knows to look for turn-taking patterns, so flagging this upfront produces cleaner results than asking for labels after the fact.

For social-format captions — short punchy lines meant to appear as on-screen text — ask for a maximum of 4-5 words per caption block. The skill will re-segment the transcript to fit that constraint rather than just splitting the standard output.

Performance Notes

The video-caption-generator-ai-ab-old variant performs best on English-language content with a single primary speaker, where caption timestamps typically land within 200–400ms of the spoken word. For multilingual or code-switched audio, expect slightly longer processing time, because the model identifies language boundaries before transcribing each segment.

Very long videos (anything over 45 minutes) may benefit from being split into chapters or segments before submission. The skill can process longer files, but chunking the input often produces more consistent caption quality across the full runtime and avoids degradation toward the end of a long audio track.

Background music at moderate volume is handled well, but bass-heavy soundtracks or audio where music volume matches speech volume will reduce accuracy. If your footage has this issue, a quick vocal isolation step beforehand is worth the extra minute. Supported caption export formats are SRT, VTT, and plain text with inline timestamps.
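
For readers who post-process the output themselves, the sketch below shows how timed caption blocks map onto the SRT layout (a running index, a "start --> end" line with comma-separated milliseconds, then the text). The input shape of (start_seconds, end_seconds, text) tuples is an assumption made for the example, not the skill's actual output format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(blocks) -> str:
    """Render (start, end, text) tuples as SRT entries separated by blank lines."""
    entries = []
    for number, (start, end, text) in enumerate(blocks, start=1):
        entries.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(entries)
```

WebVTT is close to this layout: the file starts with a WEBVTT header line and timestamps use a period instead of a comma before the milliseconds.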
