Getting Started
Welcome! Ready to sync your audio perfectly to on-screen mouth movements? Upload your video and audio files and tell me what you're working on — let's build your lip-sync video.
Try saying:
- "I have a talking-head video in English and a dubbed Spanish voiceover. Sync the audio so the mouth movements match the new track."
- "Generate a lip-sync video from my animated character clip and this recorded dialogue file, matching phonemes to the mouth shape keyframes."
- "My corporate explainer video has a replaced voiceover that's slightly ahead of the speaker — fix the sync so it lines up naturally."
Quick Start Setup
This skill connects to a cloud processing backend. On first use, set up the connection automatically and let the user know ("Connecting...").
Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:
- Generate a UUID as client identifier
- POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with X-Client-Id header
- Extract data.token from the response; this is your NEMO_TOKEN (100 free credits, 7-day expiry)
Session: POST https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.
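The setup steps above can be sketched in Python roughly as follows. This is a minimal sketch using only the standard library. The helper names (`build_anonymous_token_request`, `build_session_request`, `resolve_token`) and the injectable `send` callable are hypothetical, and the response is assumed to carry the token at `data.token` as described above.

```python
import json
import os
import uuid
import urllib.request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def build_anonymous_token_request(client_id: str) -> urllib.request.Request:
    """Build (but do not send) the anonymous-token request."""
    return urllib.request.Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def build_session_request(token: str) -> urllib.request.Request:
    """Build (but do not send) the session-creation request."""
    body = json.dumps({"task_name": "project"}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def _send_json(request: urllib.request.Request) -> dict:
    """Send a request and parse the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def resolve_token(send=_send_json) -> str:
    """Return NEMO_TOKEN from the environment, else fetch an anonymous token.

    `send` is a hypothetical hook (Request -> parsed JSON) so the network
    call can be swapped out; the `data.token` path is taken from the text.
    """
    token = os.environ.get("NEMO_TOKEN")
    if token:
        return token
    response = send(build_anonymous_token_request(str(uuid.uuid4())))
    return response["data"]["token"]
```

Keeping request construction separate from sending makes it easier to avoid logging tokens or raw API output by accident.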
Make Every Word Land on the Right Frame
Lip syncing used to mean hours of manual frame-stepping, nudging audio clips by milliseconds, and still ending up with a result that felt slightly off. This skill changes that by doing the heavy analytical work for you — detecting facial landmarks, reading phoneme timing from your audio track, and generating a synchronized output where speech and mouth movement feel genuinely connected.
Whether you're dubbing a tutorial into Spanish, animating a talking character for a short film, or replacing a voiceover in a corporate video without going back to the studio, this skill handles the alignment logic so you can focus on the creative side. It works with pre-recorded video clips and separate audio files, matching them together based on actual speech patterns rather than simple waveform peaks.
The result is a lip-sync video that holds up under close viewing: no rubbery mouth delays, no audio that races ahead of the speaker. Creators working in social content, e-learning, animation, and localization have used this to cut sync time from hours to minutes while maintaining a professional finish.
Routing Sync Requests Accurately
When you submit a lip sync job, your request is parsed for target audio track, source video, and phoneme alignment preferences before being dispatched to the appropriate processing pipeline.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
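The routing table above could be expressed as a small lookup. This is a sketch only: the function name `route` is hypothetical, and plain case-insensitive substring matching is an assumption, since the table does not say how requests are actually matched.

```python
from typing import Tuple

# Keyword groups and their actions, copied from the routing table above.
ROUTES = [
    (("export", "导出", "download", "send me the video"), "export"),
    (("credits", "积分", "balance", "余额"), "credits"),
    (("status", "状态", "show tracks"), "state"),
    (("upload", "上传"), "upload"),
]


def route(message: str, has_file: bool = False) -> Tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    if has_file:
        # A file sent by the user always goes to the upload flow.
        return ("upload", True)
    text = message.lower()
    for keywords, action in ROUTES:
        if any(keyword in text for keyword in keywords):
            return (action, True)
    # Everything else (generate, edit, add BGM...) goes through SSE.
    return ("sse", False)
```

For example, `route("add BGM to the intro")` falls through to the SSE path, while `route("余额")` resolves to the credits lookup without opening a stream.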
Cloud Backend API Reference
The cloud processing backend handles frame-level phoneme detection and viseme mapping in real time, syncing jaw, lip, and cheek keyframes to audio waveforms at the millisecond level. All render jobs are queued through a distributed worker system that prioritizes frame-perfect alignment over raw speed.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: lip-sync-video
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
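A sketch of assembling the required headers might look like this. The function names are hypothetical, and matching the install path by directory segment (rather than resolving the home directory) is a simplifying assumption.

```python
def detect_platform(install_path: str) -> str:
    """Map the skill's install path to a platform label.

    Assumption: a path containing the segment is treated as a match,
    without checking that it sits directly under the home directory.
    """
    path = install_path.replace("\\", "/")
    if "/.clawhub/" in path:
        return "clawhub"
    if "/.cursor/skills/" in path:
        return "cursor"
    return "unknown"


def build_headers(token: str, skill_source: str, skill_version: str,
                  install_path: str) -> dict:
    """Build the headers that every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": skill_source,
        "X-Skill-Version": skill_version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place means the attribution fields cannot be left off an export request by accident, which is the case that returns 402.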
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path". URL: body {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple, returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest, key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
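The export-and-poll flow described above can be sketched as follows. Several details are assumptions: `<ts>` is taken to be a Unix timestamp in seconds, `status` and `output.url` are assumed to sit at the top level of the poll response, `max_polls` is a hypothetical safety cap, and `fetch_status` is a hypothetical callable that performs the GET and returns parsed JSON.

```python
import time


def build_export_payload(session_id: str, draft: dict, now=None) -> dict:
    """Build the export request body; `draft` comes from the session state."""
    ts = int(now if now is not None else time.time())
    return {
        "id": f"render_{ts}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(fetch_status, render_id: str, interval_s: float = 30.0,
                max_polls: int = 60, sleep=time.sleep) -> str:
    """Poll until status == 'completed' and return the download URL.

    `fetch_status` (render_id -> dict) is a hypothetical hook for
    GET /api/render/proxy/lambda/<id>.
    """
    for _ in range(max_polls):
        result = fetch_status(render_id)
        if result.get("status") == "completed":
            return result["output"]["url"]
        sleep(interval_s)
    raise TimeoutError(f"render {render_id} did not complete after {max_polls} polls")
```

Failure statuses are not documented here, so this sketch only distinguishes "completed" from "not yet" and gives up after the poll cap.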
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
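The event handling above, including the no-text fallback, might be sketched like this. The payload schema is not documented here, so `extract_text` is a hypothetical callable that returns user-facing text for a payload (or None for tool calls and results). Treating comment lines and empty `data:` lines as heartbeats is also an assumption.

```python
from typing import Callable, Iterable, List, Optional


def collect_sse_data(lines: Iterable[str]) -> List[str]:
    """Collect non-empty `data:` payloads from a stream of SSE lines.

    Blank lines, comment lines (starting with ':'), and empty `data:`
    lines are skipped as keep-alives.
    """
    payloads = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                payloads.append(payload)
    return payloads


def needs_state_poll(payloads: List[str],
                     extract_text: Callable[[str], Optional[str]]) -> bool:
    """True when the stream closed without any user-facing text."""
    return not any(extract_text(payload) for payload in payloads)
```

When `needs_state_poll` returns True, the next step is to fetch the session state and summarize what changed, as described above.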
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
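Example track summary: Timeline (3 tracks): 1. Video: city timelapse (0-10s) 2. BGM: Lo-fi (0-10s, 35%) 3. Title: "Urban Dreams" (0-3s)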
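Turning a draft into a track summary could look roughly like the sketch below. The nesting is an assumption: tracks are taken to be a list under `t`, each with a `tt` type and a list of segments under `sg`, and each segment is assumed to carry its duration in `d`. Only the short field names themselves come from the mapping above.

```python
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> list:
    """Return one summary line per track in the draft.

    Assumed shape: {"t": [{"tt": 0, "sg": [{"d": 10000}, ...]}, ...]}
    """
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # d is in milliseconds; sum the segment durations for the track.
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:.1f}s"
        )
    return lines
```

If the real draft nests these fields differently, only the lookups inside the loop need to change.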
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up credits in your account" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
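The error table above maps naturally onto a lookup. In this sketch the action labels are hypothetical names for the behaviors in the table, and `registration_url` is a placeholder parameter because the actual registration URL is not given here.

```python
# Hypothetical action labels for the codes listed in the table above.
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "create_session",
    2001: "show_credit_options",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_register_or_upgrade",
    429: "retry_once_after_30s",
}


def action_for(code: int) -> str:
    """Return the action label for an error code, or a generic fallback."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")


def no_credits_message(is_anonymous: bool, registration_url: str = "",
                       bind_id: str = "") -> str:
    """Build the user-facing message for code 2001."""
    if is_anonymous:
        return f"Register to continue: {registration_url}?bind={bind_id}"
    return "Top up credits in your account"
```

Keeping 402 and 2001 as separate actions reflects the distinction in the table: 402 is a subscription-tier issue, not a credit shortage.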
Common Workflows
Multilingual dubbing: Record or commission a translated voiceover at the same approximate pacing as the original. Feed both the original video and the new audio into the skill. The skill will retime phoneme boundaries in the dubbed track to match visible mouth movements without altering pitch or tone — preserving the natural sound of the new language.
Animation lip sync: Export your character animation as a video file with a silent or reference audio track. Submit it alongside the final recorded dialogue. The skill maps vowel and consonant sounds to frame ranges. You can use the output directly as a finished render, or import the timing data back into your animation software.
Voiceover replacement for localization: When marketing teams need the same video in multiple regions, record each regional voiceover separately and run each one through the skill against the same master video. This produces region-specific lip-sync video outputs from a single source asset, with no reshoots and no re-editing of the base timeline.
Podcast or interview cleanup: If a recorded interview has audio sync drift caused by encoding lag, submit the drifted video and the clean isolated audio track to realign the speaker's mouth movements to the corrected audio.
Troubleshooting
Audio and video lengths don't match: If your dubbed audio track is longer or shorter than the original video, the skill will flag a duration mismatch. Trim or time-stretch your audio to roughly match the video length before submitting — a difference of more than 10% can reduce sync accuracy significantly.
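A quick pre-submission check for the 10% guideline might look like this. It is a sketch that assumes the difference is measured relative to the video length, which the text does not state explicitly; both function names are hypothetical.

```python
def duration_mismatch_ratio(video_s: float, audio_s: float) -> float:
    """Return the audio/video length difference as a fraction of video length."""
    if video_s <= 0:
        raise ValueError("video duration must be positive")
    return abs(audio_s - video_s) / video_s


def exceeds_sync_tolerance(video_s: float, audio_s: float,
                           tolerance: float = 0.10) -> bool:
    """True when the mismatch is larger than the tolerance (default 10%)."""
    return duration_mismatch_ratio(video_s, audio_s) > tolerance
```

For a 60-second video, a 69-second voiceover is 15% longer and should be trimmed or time-stretched first, while a 63-second one is within the guideline.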
Mouth movements not detected: This usually happens when the speaker's face is partially obscured, shot at a steep side angle, or the video resolution is too low. For best results, use footage where the speaker's lips are clearly visible and the video is at least 480p. Strong backlighting can also confuse facial landmark detection — try brightening the input clip if detection fails.
Sync drifts mid-video: Long videos with variable speech pacing sometimes show drift past the two-minute mark. Break the video into segments at natural pause points (scene cuts, pauses between sentences) and sync each segment independently, then reassemble. This maintains accuracy across longer content without degrading toward the end.
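Choosing where to cut a long video can be sketched as a greedy pass over known pause points. The 120-second default mirrors the two-minute mark mentioned above but is an assumption, as is the function name `plan_segments`. The pause timestamps themselves would come from your own editing or silence detection.

```python
from typing import List, Sequence, Tuple


def plan_segments(duration_s: float, pause_points_s: Sequence[float],
                  max_len_s: float = 120.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s] at pause points so segments stay near max_len_s.

    Greedy rule: cut at the latest pause that keeps the current segment
    within max_len_s; if no pause falls in that window, use the next
    later pause, or stop cutting when none is left.
    """
    pauses = sorted(p for p in pause_points_s if 0 < p < duration_s)
    cuts = []
    start = 0.0
    while duration_s - start > max_len_s:
        in_window = [p for p in pauses if start < p <= start + max_len_s]
        if in_window:
            cut = in_window[-1]
        else:
            later = [p for p in pauses if p > start]
            if not later:
                break
            cut = later[0]
        cuts.append(cut)
        start = cut
    bounds = [0.0] + cuts + [float(duration_s)]
    return list(zip(bounds[:-1], bounds[1:]))
```

Each returned `(start, end)` pair can then be synced independently and the results reassembled in order.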
Integration Guide
Connecting your media files: The skill accepts video files in MP4, MOV, and WebM formats, and audio in MP3, WAV, or AAC. For the cleanest sync results, export your audio at the same sample rate as the original video's embedded audio track — typically 44.1kHz or 48kHz. Mismatched sample rates are a common source of subtle timing errors.
Working with ClawHub pipelines: You can chain this skill with the ClawHub video trimmer or audio normalizer before running lip sync, ensuring your inputs are clean and level-matched before the alignment step runs. After sync, route the output to a caption generator or export node depending on your delivery format.
Batch processing: If you're localizing a video series into multiple languages, structure your inputs as paired sets — one video file mapped to one audio file per language — and submit them as a batch job. Label each pair clearly so outputs are returned with matching filenames. Avoid mixing different source videos in the same batch to prevent cross-mapping errors.
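One way to structure the paired sets described above is shown below. This sketch only organizes inputs and labels; how a batch is actually submitted is not specified here, and the `build_batch` name, the dictionary keys, and the output naming pattern are all hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def build_batch(video_path: str, audio_by_language: Dict[str, str]) -> List[dict]:
    """Pair one source video with one audio file per language.

    Using a single video per batch avoids cross-mapping between sources.
    """
    video = Path(video_path)
    return [
        {
            "language": language,
            "video": str(video),
            "audio": str(audio),
            # Label each pair so the output can be matched to its inputs.
            "output_name": f"{video.stem}_{language}{video.suffix}",
        }
        for language, audio in sorted(audio_by_language.items())
    ]
```

For a series, call this once per episode so that each batch contains exactly one source video.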
Output formats: The skill returns a synchronized video file in the same container format as your input. If you need frame-level timing metadata (for use in animation or subtitle workflows), request a JSON timing export alongside the video output when submitting your task.