Getting Started
LipSync Video AI is ready. Upload your video and audio, or describe what you need synced.
Try saying:
- "sync this voiceover to the speaker"
- "replace the audio and match lip movements"
- "dub this clip with my recording"
Initial Setup
On first run, the skill connects to the processing backend and shows a brief "Getting ready..." message.
Token: Check for NEMO_TOKEN in the environment. If present, go straight to session setup.
- Grab a free token: Generate a UUID client identifier. POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with an X-Client-Id header set to your UUID. The response's data.token is your auth token (100 credits, good for 7 days).
- Start session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project","language":"<lang>"}. Save the session_id for later calls.
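The two setup calls above can be sketched with the Python standard library. This is a minimal illustration, not the skill's actual implementation: the function names, the empty POST body on the token call, the Content-Type header, and the 30-second timeout are assumptions. The attribution headers described under Backend Architecture are left out for brevity.

```python
import json
import os
import uuid
import urllib.request

BASE_URL = "https://mega-api-prod.nemovideo.ai"


def anonymous_token_request(client_id: str) -> urllib.request.Request:
    """Build the POST that trades a client UUID for an anonymous token."""
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/anonymous-token",
        data=b"",  # assumption: the endpoint needs no request body
        headers={"X-Client-Id": client_id},
        method="POST",
    )


def session_request(token: str, language: str) -> urllib.request.Request:
    """Build the POST that opens a session for the given language."""
    body = json.dumps({"task_name": "project", "language": language})
    return urllib.request.Request(
        f"{BASE_URL}/api/tasks/me/with-session/nemo_agent",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumption
        },
        method="POST",
    )


def get_token() -> str:
    """Prefer NEMO_TOKEN from the environment, else fetch an anonymous token."""
    token = os.environ.get("NEMO_TOKEN")
    if token:
        return token
    req = anonymous_token_request(str(uuid.uuid4()))
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["data"]["token"]
```

Keeping request construction separate from sending makes it easy to inspect headers and bodies without touching the network.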
Raw JSON and tokens stay hidden from the user.
Sync Audio to Lip Movements in Your Clips
Upload your video with the audio you want synced. Cloud GPUs do the heavy lifting — no local processing.
Here is how it works in practice: I had a training video where the speaker's mic died halfway through. I recorded a clean voiceover separately, uploaded both files, typed "sync the new audio to match the speaker's mouth movements", and got a clean result in about 75 seconds. Output is 1080p MP4.
Pro tip: shorter clips give tighter sync. If you have a long video, consider breaking it into segments first.
Request Categories
Your input gets matched to the right processing path automatically.
| You type... | Goes to... | Uses SSE? |
|---|---|---|
| "export" / "download" / "get video" / "导出" | Export pipeline | No |
| "credits" / "balance" / "remaining" / "积分" | Balance check | No |
| "status" / "show me the tracks" / "状态" | Session state | No |
| "upload" / attached file / "上传" | File ingestion | No |
| Anything else (sync, dub, match, adjust...) | SSE processing | Yes |
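One way to picture the routing table is a keyword lookup that falls through to SSE. The sketch below is only an illustration: the source does not say how matching is done, so case-insensitive substring matching, the route names, and the function name `classify` are all assumptions.

```python
from typing import Tuple

# (route name, trigger phrases) in the same order as the table above.
ROUTES = [
    ("export", ("export", "download", "get video", "导出")),
    ("balance", ("credits", "balance", "remaining", "积分")),
    ("state", ("status", "show me the tracks", "状态")),
    ("upload", ("upload", "上传")),
]


def classify(message: str, has_attachment: bool = False) -> Tuple[str, bool]:
    """Return (route, uses_sse) for a user message."""
    if has_attachment:
        return ("upload", False)
    text = message.lower()
    for route, phrases in ROUTES:
        if any(phrase in text for phrase in phrases):
            return (route, False)
    # Anything else (sync, dub, match, adjust...) goes through SSE.
    return ("sse", True)
```

For example, `classify("dub this clip with my recording")` falls through every keyword list and returns the SSE route.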
Backend Architecture
Files go to a GPU farm for processing. Output is encoded at 8Mbps for 1080p. Lip sync boundaries are frame-level accurate.
Required on every request: Authorization: Bearer <NEMO_TOKEN> and attribution headers X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution means export fails with 402.
Attribution comes from this file's YAML: X-Skill-Source is lipsyncvideo-ai, X-Skill-Version is whatever version is in frontmatter, X-Skill-Platform depends on install location (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, otherwise unknown).
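The header rules above could be assembled roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the install directory and home directory are passed in explicitly rather than discovered, and the version string is supplied by the caller instead of being parsed from the frontmatter.

```python
from pathlib import Path
from typing import Dict


def detect_platform(skill_dir: Path, home: Path) -> str:
    """Map the install location to an X-Skill-Platform value."""
    for subdir, name in ((".clawhub", "clawhub"), (".cursor/skills", "cursor")):
        root = home / subdir
        if skill_dir == root or root in skill_dir.parents:
            return name
    return "unknown"


def build_headers(token: str, version: str, platform: str) -> Dict[str, str]:
    """Headers the source says are required on every request."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "lipsyncvideo-ai",
        "X-Skill-Version": version,
        "X-Skill-Platform": platform,
    }
```

In real use the second argument would typically be `Path.home()`.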
Root URL: https://mega-api-prod.nemovideo.ai
New session: POST /api/tasks/me/with-session/nemo_agent with {"task_name":"project","language":"<lang>"}. Returns task_id, session_id.
SSE message: POST /run_sse with {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and Accept: text/event-stream. Cap: 15 min.
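As an illustration of the SSE message call, the request might be built like this. The function name, the `extra_headers` parameter (a place to pass the attribution headers), and the Content-Type header are assumptions. The 15-minute cap comes from the source, which does not say how the backend enforces it, so it appears here only as a constant.

```python
import json
import urllib.request
from typing import Dict, Optional

BASE_URL = "https://mega-api-prod.nemovideo.ai"
SSE_CAP_SECONDS = 15 * 60  # cap stated in the source


def run_sse_request(
    token: str,
    session_id: str,
    text: str,
    extra_headers: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build the POST /run_sse request for one user message."""
    body = {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "text/event-stream",
        "Content-Type": "application/json",  # assumption
    }
    headers.update(extra_headers or {})
    return urllib.request.Request(
        f"{BASE_URL}/run_sse",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```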
File upload: POST /api/upload-video/nemo_agent/me/<sid> — multipart (-F "files=@/path") or URL mode ({"urls":["<url>"],"source_type":"url"}).
Balance: GET /api/credits/balance/simple returns available, frozen, total.
State: GET /api/state/nemo_agent/me/<sid>/latest — check data.state.draft, data.state.video_infos, data.state.generated_media.
Export (free): POST /api/render/proxy/lambda with {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s. Done when status = completed. File at output.url.
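The polling step can be sketched as a small loop. Assumptions are labeled: the function name is hypothetical, `status` and `output.url` are read from the top level of the response (the source does not show the full response shape), and the `max_polls` bound is an added safety limit, not something the source specifies. The caller supplies `fetch`, a function that performs the GET and returns the parsed JSON.

```python
import time
from typing import Callable, Optional


def wait_for_render(
    fetch: Callable[[], dict],
    interval_s: float = 30.0,
    max_polls: int = 40,  # assumption: give up after this many checks
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until status is 'completed'; return output.url or None."""
    for attempt in range(max_polls):
        result = fetch()
        if result.get("status") == "completed":
            return result.get("output", {}).get("url")
        if attempt < max_polls - 1:
            sleep(interval_s)
    return None
```

Passing `sleep` as a parameter keeps the loop testable without real 30-second waits.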
Handles: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
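A client could check the extension against this list before uploading, then build the URL-mode body. The sketch below is illustrative only: the helper names are hypothetical, and judging file type by extension is an assumption about how a client might pre-filter, not a description of what the backend checks.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable
from urllib.parse import urlparse

SUPPORTED = {
    "mp4", "mov", "avi", "webm", "mkv",   # video
    "jpg", "png", "gif", "webp",          # image
    "mp3", "wav", "m4a", "aac",           # audio
}


def is_supported(name_or_url: str) -> bool:
    """True if the file name or URL path ends in a listed extension."""
    path = urlparse(name_or_url).path or name_or_url
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED


def url_upload_body(urls: Iterable[str]) -> Dict[str, object]:
    """Body for the URL mode of the upload endpoint."""
    return {"urls": list(urls), "source_type": "url"}
```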
Errors
| Code | Means | Fix |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad token | Re-authenticate via anonymous-token endpoint |
| 1002 | No session | Make a new one |
| 2001 | No credits left | Anonymous: share registration link with ?bind=. Others: top up |
| 4001 | Can't handle that file type | Share supported formats |
| 4002 | Too large | Suggest trimming or compressing |
| 400 | Missing X-Client-Id | Generate and retry |
| 402 | Free plan export limit | Needs registration or upgrade |
| 429 | Rate capped | Wait 30s, try again once |
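The table lends itself to a lookup plus a single retry for 429. This is a sketch with assumed names: the action labels and both function names are hypothetical, and it treats the application codes and HTTP statuses as one integer space for simplicity.

```python
import time
from typing import Callable

ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "new_session",
    2001: "out_of_credits",
    4001: "share_supported_formats",
    4002: "suggest_trim_or_compress",
    400: "generate_client_id_and_retry",
    402: "needs_registration_or_upgrade",
    429: "wait_and_retry_once",
}


def action_for(code: int) -> str:
    """Look up the fix for a code; unknown codes are reported as-is."""
    return ACTIONS.get(code, "report_unknown_error")


def send_with_rate_limit_retry(
    send: Callable[[], int],
    wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send(); on 429 wait 30s and try exactly once more."""
    code = send()
    if code == 429:
        sleep(wait_s)
        code = send()
    return code
```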
Converting GUI Instructions
Backend outputs reference a visual interface. Convert them:
| Backend output | Your action |
|---|---|
| "click [X]" / "点击" | Invoke the API equivalent |
| "open [panel]" / "打开" | Read session state |
| "drag/drop" / "拖拽" | Post edit through SSE |
| "preview in timeline" | Output track listing |
| "Export button" / "导出" | Start export sequence |
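A rough sketch of this translation is an ordered rule list, with the most specific phrases checked first so that "click the Export button" resolves to an export rather than a generic click. The rule order, the action labels, and plain substring matching are assumptions. Substring matching is deliberately crude and would also fire on words that merely contain a trigger.

```python
from typing import Optional

# Most specific rules first; each rule is (trigger phrases, action label).
GUI_RULES = [
    (("export button", "导出"), "start_export"),
    (("preview in timeline",), "list_tracks"),
    (("drag", "drop", "拖拽"), "post_edit_via_sse"),
    (("open", "打开"), "read_session_state"),
    (("click", "点击"), "call_api_equivalent"),
]


def gui_action(backend_text: str) -> Optional[str]:
    """Return the action for a GUI-style instruction, or None if none applies."""
    text = backend_text.lower()
    for phrases, action in GUI_RULES:
        if any(phrase in text for phrase in phrases):
            return action
    return None
```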
How SSE Works
Forward text events to user (after GUI translation). Absorb tool calls. Heartbeat and empty data lines = still processing. Every 2 minutes of quiet, say "Hang on, still processing..."
About 30% of edit ops return no text. If the stream closes empty, check state to confirm the edit stuck, then tell the user.
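The stream handling described above can be illustrated with a small line reader. It follows the standard Server-Sent Events framing (`data:` fields, comment lines starting with `:`). The assumptions are that heartbeats arrive as comment lines or empty `data:` lines and that each event fits on one `data:` line. The shape of the JSON inside each payload is not documented here, so payloads are returned as raw strings.

```python
from typing import Iterable, List, Tuple


def read_sse(lines: Iterable[str]) -> Tuple[List[str], int]:
    """Split a stream into data payloads and a heartbeat count."""
    payloads: List[str] = []
    heartbeats = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            heartbeats += 1  # SSE comment line, treated as a heartbeat
        elif line.startswith("data:"):
            data = line[len("data:"):].strip()
            if data:
                payloads.append(data)
            else:
                heartbeats += 1  # empty data line: still processing
        # Blank separator lines and other fields are ignored.
    return payloads, heartbeats


def needs_state_check(payloads: List[str]) -> bool:
    """An empty stream means the edit must be confirmed via session state."""
    return not payloads
```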
Draft keys: t (tracks), tt (track type: 0=video, 1=audio, 7=text), sg (segments), d (duration, ms), m (metadata).
Timeline (2 tracks):
1. Video: interview clip (0-45s)
2. Audio: dubbed voiceover (0-45s)
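To show how the draft keys might turn into a track listing, here is a minimal sketch. The nesting is an assumption, since the source only names the keys: `t` is taken to be a list of tracks, each with a `tt` type and an `sg` list of segments, and each segment carries its duration `d` in milliseconds. The function name is hypothetical.

```python
TRACK_TYPES = {0: "Video", 1: "Audio", 7: "Text"}


def summarize_draft(draft: dict) -> str:
    """Render a short track listing from a draft (assumed nesting)."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "Unknown")
        # Sum segment durations (ms) and report seconds.
        total_ms = sum(seg.get("d", 0) for seg in track.get("sg", []))
        lines.append(f"{index}. {kind}: {total_ms / 1000:g}s")
    return "\n".join(lines)
```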
Common Workflows
Basic lip sync: Upload video + audio, ask for sync. Done.
Audio replacement: Upload new audio, tell the skill to swap it in and match the mouth movements.
Multi-speaker: Works best when speakers take turns. For overlapping speech, split into separate segments first.
FAQ
How accurate is the sync? Frame-level for clear speech. Mumbling or fast-talking may be slightly off.
What audio formats? MP3, WAV, M4A, AAC all work.
File size limit? 500MB. Compress if you're over.
Cost? First 100 operations free. No signup required.