Getting Started
Welcome — you're one step away from perfectly synced speech in your video. Whether you're dubbing, localizing, or animating a face to match new audio, drop your video and audio details below and let's get started.
Try saying:
- "I have a 60-second product demo video in English and I want to dub it into French with matching lip sync. The French audio file is ready to upload."
- "Can you animate this headshot photo of our CEO to speak a 30-second welcome message I recorded? I need it to look like a real talking-head video."
- "My recorded voiceover doesn't match the on-screen presenter's mouth movements because we re-recorded the script after filming — can you re-sync the lips to the new audio?"
Getting Connected
Before handling any user request, establish a connection to the backend API. Show a brief status like "Connecting...".
If NEMO_TOKEN is in the environment, use it directly and create a session. Otherwise, acquire a free starter token:
- Generate a UUID as the client identifier
- POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
- The response includes a token with 100 free credits valid for 7 days; use it as NEMO_TOKEN
Then create a session by POSTing to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer authorization and body {"task_name":"project","language":"en"}. The session_id in the response is needed for all following requests.
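The connection steps above could be sketched roughly as follows. This is a minimal sketch, not the skill's actual client: the helper names (anonymous_token_request, session_request, send) are hypothetical, the Content-Type header is an assumption, and the exact position of the token and session_id fields inside the JSON replies is not documented here, so the code stops at describing the two requests.

```python
import json
import os
import uuid
import urllib.request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def anonymous_token_request(client_id: str) -> dict:
    """Describe the POST that asks for a free starter token."""
    return {
        "url": f"{API_BASE}/api/auth/anonymous-token",
        "headers": {"X-Client-Id": client_id},
        "body": None,
    }


def session_request(token: str, language: str = "en") -> dict:
    """Describe the POST that creates a session for an existing token."""
    return {
        "url": f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumption, not stated in this document
        },
        "body": {"task_name": "project", "language": language},
    }


def send(request: dict) -> dict:
    """POST a described request and decode the JSON reply (performs network I/O)."""
    body = request["body"]
    data = json.dumps(body).encode() if body is not None else b""
    req = urllib.request.Request(
        request["url"], data=data, headers=request["headers"], method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Prefer a token from the environment; otherwise a fresh UUID identifies this client.
existing_token = os.environ.get("NEMO_TOKEN")
client_id = str(uuid.uuid4())
first_request = (
    session_request(existing_token)
    if existing_token
    else anonymous_token_request(client_id)
)
```

Keeping the request description separate from send makes the flow easy to inspect before any network call is made. A real client would also merge in the attribution headers that every request requires.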
Tell the user you're ready. Keep the technical details out of the chat.
Make Any Face Speak Any Words, Instantly
Lip Sync AI Video takes the friction out of one of video production's most tedious challenges: getting a person's mouth movements to match the audio they're supposed to be saying. Whether you're dubbing a product explainer into Spanish, animating a spokesperson photo, or fixing a mismatch between recorded audio and on-camera delivery, this skill handles the frame-level alignment automatically.
The underlying process analyzes facial landmarks in each frame of your video, then regenerates the mouth region to match the phonetic rhythm and shape of your target audio. The result is a natural, fluid lip movement that holds up under normal viewing conditions — no uncanny valley, no obvious patching.
This skill is built for practical production workflows. Marketers use it to localize ad campaigns without recasting talent. Educators use it to update course videos when scripts change. Podcasters and YouTubers use it to animate static profile images into engaging talking avatars. Whatever your use case, the goal is the same: believable speech-to-face synchronization with minimal manual effort.
Routing Your Lip Sync Requests
Each request — whether you're syncing a dubbed audio track, swapping dialogue, or animating a still face — gets parsed for target video, source audio, and face region before being dispatched to the appropriate lip sync pipeline.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
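One way to express the routing table in code is a keyword lookup with an SSE fallback. This is only a sketch: the route function, the action names, and the use of case-insensitive substring matching are assumptions for illustration, and the backend may classify requests differently.

```python
# Keyword groups copied from the routing table; order decides ties.
ROUTES = [
    ("export", ("export", "导出", "download", "send me the video")),
    ("credits", ("credits", "积分", "balance", "余额")),
    ("state", ("status", "状态", "show tracks")),
    ("upload", ("upload", "上传")),
]

# Actions that are served by a direct API call instead of the SSE stream.
SKIPS_SSE = {"export", "credits", "state", "upload"}


def route(message: str, has_attachment: bool = False) -> str:
    """Return the action for a user message; anything unmatched goes to SSE."""
    if has_attachment:
        # "user sends file" always means an upload, whatever the text says.
        return "upload"
    text = message.lower()
    for action, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action
    return "sse"
```

For example, route("What's my balance?") returns "credits", while route("add BGM to the intro") falls through to "sse". Substring matching is deliberately crude; a message such as "don't export yet" would still route to export, so a production router would need something stricter.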
Cloud Rendering API Reference
Lip sync processing runs on a GPU-accelerated cloud backend that handles facial landmark detection, mouth region isolation, and frame-by-frame phoneme-to-viseme rendering entirely server-side. You never need local compute — the API accepts your video and audio assets, queues the synthesis job, and streams back the composited output once rendering completes.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: lip-sync-ai-video
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
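The header requirements above might be assembled like this. It is a sketch under stated assumptions: detect_platform and build_headers are hypothetical helper names, the skill source and version are passed in by the caller (who reads them from the frontmatter), and the path check is a plain substring test on the install path.

```python
def detect_platform(install_path: str) -> str:
    """Map an install path to a platform name, defaulting to 'unknown'."""
    normalized = install_path.replace("\\", "/")
    if "/.clawhub/" in normalized:
        return "clawhub"
    if "/.cursor/skills/" in normalized:
        return "cursor"
    return "unknown"


def build_headers(token: str, skill_source: str, skill_version: str,
                  install_path: str) -> dict:
    """Return the four headers that every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": skill_source,
        "X-Skill-Version": skill_version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place makes it harder to send a request without attribution, which is the condition that leads to the 402 on export.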
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
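As a small illustration, the message body described above can be built with a helper like the one below. The function and constant names are hypothetical; only the field names, the path, the Accept header, and the 15-minute ceiling come from this reference.

```python
RUN_SSE_PATH = "/run_sse"
RUN_SSE_HEADERS = {"Accept": "text/event-stream"}
RUN_SSE_TIMEOUT_SECONDS = 15 * 60  # documented maximum wait for one message


def build_run_sse_body(session_id: str, text: str) -> dict:
    """Return the JSON body for one user message sent over SSE."""
    return {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }
```

The body is returned as a plain dict so the caller can serialize it with json.dumps and attach the authorization and attribution headers alongside RUN_SSE_HEADERS.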
Upload: POST /api/upload-video/nemo_agent/me/<sid>, with the file as multipart (-F "files=@/path") or a URL body: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple, which returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest, with key fields data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
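The export-and-poll cycle described above could look something like this. Treat it as a sketch: build_export_body and poll_render are hypothetical names, the millisecond timestamp in the render id is an assumption (the reference only shows render_<ts>), and the status reply is assumed to be a JSON object with status and output.url at the top level. Failure statuses are not documented here, so the loop only recognizes completed and otherwise gives up after max_polls.

```python
import time
from typing import Callable, Optional


def build_export_body(session_id: str, draft: dict,
                      timestamp_ms: Optional[int] = None) -> dict:
    """Return the JSON body for POST /api/render/proxy/lambda."""
    ts = int(time.time() * 1000) if timestamp_ms is None else timestamp_ms
    return {
        "id": f"render_{ts}",  # assumption: <ts> is a millisecond timestamp
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(fetch_status: Callable[[], dict], interval_s: float = 30.0,
                max_polls: int = 60,
                sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_status until the job completes, then return the download URL."""
    for attempt in range(max_polls):
        job = fetch_status()
        if job.get("status") == "completed":
            return job["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"render not completed after {max_polls} polls")
```

Here fetch_status stands for whatever function performs GET /api/render/proxy/lambda/<id> and returns the decoded JSON. Injecting it, along with sleep, keeps the loop testable without network access or real waiting.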
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
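The event handling above can be sketched with a small parser for the standard text/event-stream framing (data: lines, blank-line event boundaries, comment lines starting with a colon). The function name iter_sse_data is hypothetical, and treating an event named heartbeat as a keep-alive is an assumption, since this document does not specify how the backend frames its heartbeats.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event, skipping heartbeats and empty data."""
    event, data = "", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        if line == "":
            # A blank line closes the current event.
            payload = "\n".join(data)
            if payload.strip() and event != "heartbeat":
                yield payload
            event, data = "", []
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
    # Lenient choice: flush an unterminated final event when the stream closes.
    payload = "\n".join(data)
    if payload.strip() and event != "heartbeat":
        yield payload
```

If the stream closes and this generator produced nothing, that corresponds to the no-text case: poll the session state to confirm the edit before reporting back to the user.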
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
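Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: Urban Dreams (0-3s)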
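The draft field mapping can be turned into a short track summary along these lines. The nesting is an assumption made for illustration: the sketch treats t as a list of tracks, each with a tt type code and an sg list of segments that carry d in milliseconds, and it adds segment durations as if segments neither overlap nor leave gaps. The summarize_draft name is hypothetical.

```python
TRACK_TYPES = {0: "Video", 1: "Audio", 7: "Text"}


def summarize_draft(draft: dict) -> list:
    """Return human-readable summary lines for the tracks in a draft."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        type_code = track.get("tt")
        kind = TRACK_TYPES.get(type_code, f"Type {type_code}")
        segments = track.get("sg", [])
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s"
        )
    return lines
```

Unknown type codes are reported by number instead of being dropped, so a summary never silently hides a track.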
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up credits in your account" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
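A compact way to apply the error table is a lookup from code to a named action, plus a helper for the single 429 retry. The action labels and function names here are hypothetical; only the codes and the retry rule come from the table.

```python
from typing import Optional

ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "create_session",
    2001: "prompt_for_credits",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_register_or_upgrade",
    429: "retry_once_after_30s",
}


def action_for(code: int) -> str:
    """Return the action label for a response code, or a generic fallback."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")


def retry_delay(code: int, retries_so_far: int) -> Optional[float]:
    """Return seconds to wait before retrying, or None when no retry is due."""
    if code == 429 and retries_so_far == 0:
        return 30.0
    return None
```

Note that 402 and 2001 map to different actions on purpose: the first is a plan restriction and the second is a credit shortage, so the user-facing message differs.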
Performance Notes
Lip sync quality is directly tied to the quality of your input files. Videos with a clear, front-facing or near-front-facing subject produce the most accurate mouth reconstruction — profiles beyond roughly 45 degrees will show reduced fidelity in the generated lip region. Lighting consistency across frames also matters; heavy flickering or extreme shadows near the mouth area can introduce artifacts.
For audio, clean mono or stereo recordings with minimal background noise yield the tightest phoneme mapping. Heavily compressed audio (low-bitrate MP3s) or recordings with significant reverb may cause slight timing drift on fast-speech segments. WAV or high-bitrate AAC files are preferred.
Processing time scales with video length and resolution. A 1080p, 2-minute clip typically completes in under 4 minutes. For batch localization jobs — syncing the same video to multiple language tracks — queuing them sequentially is more stable than simultaneous submissions.
Quick Start Guide
Getting your first lip-sync-ai-video result takes three inputs: your source video, your target audio, and any preferences about output format or quality level.
Step 1 — Prepare your video. Export or locate your source clip. MP4 with H.264 encoding works best. Trim it to only the segment that needs syncing — don't include long silent intros or outros, as these add processing time without benefit.
Step 2 — Prepare your audio. Your replacement audio should match the intended duration of the video segment. If the new audio is longer or shorter than the original, let us know whether you want the video stretched, trimmed, or padded with silence.
Step 3 — Submit and specify. Share your files or links and describe the subject (single speaker, multiple speakers, animated avatar, etc.). Mention the target language if it differs from the source, and flag any sections where sync accuracy is especially critical, such as close-up shots.
You'll receive a download link to your synced output along with a brief quality summary.
Best Practices
For the cleanest lip-sync-ai-video output, shoot or select source footage where the speaker's mouth is clearly visible and unobstructed — no hands near the face, no large mustaches covering the upper lip, and no motion blur from fast head movements.
When dubbing into another language, account for duration differences. Romance languages tend to run 15–20% longer than English for the same content. Either have your voice actor record a timed version matched to the original clip length, or allow for slight video speed adjustment in your brief.
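To make the duration guidance concrete, the arithmetic can be sketched as below. The function names are hypothetical, and the 15% and 20% bounds are simply the rule of thumb quoted above, not measured values for any particular script.

```python
def expected_dub_range(source_seconds: float, low: float = 0.15,
                       high: float = 0.20) -> tuple:
    """Return the (shortest, longest) expected duration of the dubbed audio."""
    return (source_seconds * (1 + low), source_seconds * (1 + high))


def video_speed_factor(video_seconds: float, audio_seconds: float) -> float:
    """Return the playback speed that makes the video last as long as the audio."""
    return video_seconds / audio_seconds


# A 60-second English clip dubbed into a Romance language.
shortest, longest = expected_dub_range(60.0)
slowest_speed = video_speed_factor(60.0, longest)
```

For a 60-second clip the dubbed audio would run roughly 69 to 72 seconds, and fitting the longest case without trimming means playing the video at about 0.83x. A slowdown that large is usually visible, which is why a timed re-record is often the better option.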
If you're animating a still photo rather than a video, use a high-resolution image (at least 512×512px) with neutral expression and even lighting. Images where the subject is already mid-expression or laughing produce less natural results than a relaxed, closed-mouth or slightly open neutral pose.
For professional deliverables, always do a final review at 1x playback speed before publishing. Sync that looks slightly off at 2x speed is usually imperceptible at normal speed, but a real mismatch on a stressed syllable in a close-up will be noticeable — flag those moments for a targeted re-render.