Getting Started
Welcome to your free lip sync AI workspace — where out-of-sync audio and awkward dubbed footage become a thing of the past. Share your video and audio files to get started, or describe your lip sync challenge and I'll walk you through it.
Try saying:
- "Sync mouth movements to new audio"
- "Fix drifting lip sync in video"
- "Remap dubbed dialogue to face"
Automatic Setup
On first interaction, connect to the processing API before doing anything else. Show a brief status like "Setting things up...".
Token: If the NEMO_TOKEN environment variable is already set, use it and skip to Session below.
Free token: Generate a UUID as client identifier, then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id: <uuid>. The response field data.token becomes your NEMO_TOKEN (100 credits, 7-day expiry).
Session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Save session_id from the response.
Confirm to the user you're connected and ready. Don't print tokens or raw JSON.
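The setup steps above can be sketched in Python using only the standard library. This is a minimal sketch rather than the skill's actual client: the helper names (`build_token_request`, `build_session_request`, `send_json`, `connect`) and the `Content-Type` header are assumptions. The source does not say where `session_id` sits in the response, so `connect` returns the whole session reply and leaves that lookup to the caller.

```python
import json
import os
import uuid
import urllib.request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def build_token_request(client_id: str) -> urllib.request.Request:
    """POST request for an anonymous token, identified by X-Client-Id."""
    return urllib.request.Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def build_session_request(token: str) -> urllib.request.Request:
    """POST request that creates a session using Bearer auth."""
    body = json.dumps({"task_name": "project"}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed, not stated in the source
        },
    )


def send_json(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def connect() -> tuple[str, dict]:
    """Return (token, session reply), reusing NEMO_TOKEN when it is set."""
    token = os.environ.get("NEMO_TOKEN")
    if not token:
        reply = send_json(build_token_request(str(uuid.uuid4())))
        token = reply["data"]["token"]
    return token, send_json(build_session_request(token))
```

Keeping request construction separate from sending makes it possible to inspect the requests without touching the network, and keeps the token out of any printed output.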
Sync Every Word Without Touching the Timeline
Getting lip sync right has traditionally required expensive software, a skilled editor, and hours of frame-by-frame adjustments. This free-lip-sync-ai skill changes that entirely — just bring your video and your audio, and let the AI handle the alignment automatically.
The skill works by detecting facial landmarks in your footage, analyzing the phoneme structure of your audio track, and remapping mouth movements so they match what's being said. Whether you're dubbing a foreign-language film, syncing a voiceover to an animated character, or fixing a recording where audio and video drifted apart, the result is a natural, believable performance that holds up under scrutiny.
This is built for creators who move fast and can't afford to lose hours on technical corrections. Social media producers, indie filmmakers, YouTube educators, and localization teams will find this skill especially useful. No prior video editing knowledge is required — describe your project, share your files, and get synced output ready to publish.
Routing Sync Requests Intelligently
When you submit an audio track, your lip sync request is parsed for phoneme timing, speaker count, and output format before being dispatched to the optimal processing node.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
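The routing table can be expressed as a small keyword lookup. The sketch below is one possible reading: the source lists trigger phrases but does not say how they are matched, so case-insensitive substring matching, the first-match-wins order, and the names `ROUTES` and `route` are all assumptions.

```python
ROUTES = [
    (("export", "导出", "download", "send me the video"), "export"),
    (("credits", "积分", "balance", "余额"), "credits"),
    (("status", "状态", "show tracks"), "state"),
    (("upload", "上传"), "upload"),
]


def route(message: str, has_file: bool = False) -> tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    if has_file:
        # A file sent by the user always goes to the upload path.
        return "upload", True
    text = message.lower()
    for keywords, action in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, True
    # Anything unmatched (generate, edit, add BGM...) goes through SSE.
    return "sse", False
```

For example, `route("What's my balance?")` returns `("credits", True)`, while `route("Sync mouth movements to new audio")` falls through to `("sse", False)`.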
Cloud Processing API Reference
The backend runs a phoneme-to-viseme mapping pipeline that analyzes your audio waveform in real time, generating frame-accurate mouth shape sequences at up to 60fps. All rendering jobs are queued through a distributed cloud engine, so heavy files won't stall your session.
Skill attribution (read from this file's YAML frontmatter at runtime):
- X-Skill-Source: free-lip-sync-ai
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
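A hedged sketch of assembling the four required headers follows. The function names are hypothetical, and platform detection is simplified to a substring check on the install path rather than a strict comparison against the home directory.

```python
def detect_platform(install_path: str) -> str:
    """Map the skill's install path to a platform name (simplified check)."""
    path = install_path.replace("\\", "/")
    if "/.clawhub/" in path:
        return "clawhub"
    if "/.cursor/skills/" in path:
        return "cursor"
    return "unknown"


def build_headers(token: str, source: str, version: str, install_path: str) -> dict:
    """Return the auth and attribution headers that every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": source,
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Here `source` and `version` would come from the skill's frontmatter. Building the headers in one place makes it harder to send a request without the attribution headers.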
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}. Returns task_id and session_id.
Send message (SSE): POST /run_sse with body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and header Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. Send a file as multipart (-F "files=@/path") or a URL as {"urls":["<url>"],"source_type":"url"}.
Credits: GET /api/credits/balance/simple. Returns available, frozen, and total.
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media.
Export (free, no credits): POST /api/render/proxy/lambda with body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. The download URL is at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
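The export call and its polling loop might look like the sketch below. Several details are assumptions: `<ts>` is taken to be a Unix timestamp in seconds, `status` and `output.url` are read from the top level of the poll reply, and the `max_polls` budget is hypothetical. The status fetch is injected as a callable so the loop is independent of any HTTP library.

```python
import time
from typing import Callable, Optional


def build_export_body(session_id: str, draft: dict, now: Optional[float] = None) -> dict:
    """Build the render request body; the id embeds a timestamp (assumed seconds)."""
    timestamp = int(now if now is not None else time.time())
    return {
        "id": f"render_{timestamp}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def wait_for_render(
    fetch_status: Callable[[], dict],
    interval: float = 30.0,
    max_polls: int = 60,  # hypothetical budget, not from the source
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is 'completed', then return the download URL."""
    for attempt in range(max_polls):
        reply = fetch_status()
        if reply.get("status") == "completed":
            return reply["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError("render did not complete within the polling budget")
```

In practice `fetch_status` would wrap a GET to /api/render/proxy/lambda/<id>. The source does not describe failure statuses, so this sketch only distinguishes completed from not yet completed.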
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
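One way to handle the stream, including the no-text case, is sketched below. The parser follows the general server-sent events line format. The JSON schema of each event is not documented in the source, so text extraction is left to a caller-supplied `extract_text` function, and all names here are hypothetical.

```python
import json
from typing import Callable, Iterable, Iterator, Optional


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event, skipping heartbeats and empty data."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "":
            # A blank line ends the current event.
            payload = "\n".join(buffer)
            buffer = []
            if payload.strip():
                yield payload
        # Any other line (for example ": heartbeat" comments) is ignored.
    payload = "\n".join(buffer)
    if payload.strip():
        yield payload


def read_stream(
    lines: Iterable[str],
    extract_text: Callable[[dict], Optional[str]],
) -> dict:
    """Collect user-facing text and flag streams that produced none."""
    texts = []
    for payload in iter_sse_data(lines):
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        text = extract_text(event)
        if text:
            texts.append(text)
    return {"text": "\n".join(texts), "needs_state_poll": not texts}
```

When `needs_state_poll` is true, the caller would fetch the session state, confirm the edit landed, and summarize the change for the user instead of relaying stream text.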
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
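Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)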
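To illustrate the field mapping, the sketch below turns a draft into a short track summary. The nesting is an assumption, since the source gives only the abbreviations: each entry in `t` is taken to hold `tt` and a list `sg`, and each segment is taken to carry its own `d` in milliseconds. Start offsets and metadata are ignored.

```python
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> list[str]:
    """Return one summary line per track, with total segment duration in seconds."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(
            f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:.1f}s"
        )
    return lines
```

A draft with one 10-second video segment and one 3-second text segment would yield two lines, one per track, which can then be shown in place of a timeline preview.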
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up credits in your account" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
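The error table lends itself to a lookup plus a single-retry helper for the 429 case. In this sketch the action labels are hypothetical names for the table's actions, and `call` is an assumed callable that returns a `(code, payload)` pair.

```python
import time
from typing import Any, Callable

ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "create_session",
    2001: "show_credit_options",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_register_or_upgrade",
    429: "retry_once_after_30s",
}


def next_action(code: int) -> str:
    """Map a response code to the action label from the error table."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")


def call_with_single_retry(
    call: Callable[[], tuple[int, Any]],
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, Any]:
    """Run call(); on 429, wait 30 seconds and try exactly once more."""
    code, payload = call()
    if code == 429:
        sleep(30)
        code, payload = call()
    return code, payload
```

Capping the retry at one attempt mirrors the table: a second 429 is returned to the caller rather than looping.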
Integration Guide
Getting free-lip-sync-ai into your existing production workflow is straightforward. For individual creators, the simplest path is to export your video and audio as separate files from your editing timeline — most NLEs like DaVinci Resolve, Premiere Pro, or Final Cut Pro support this natively — then pass both files to this skill for sync processing before reimporting the result.
For teams handling batch localization work, you can describe multiple video-audio pairs in a single session and receive queued processing instructions. The skill supports common video formats including MP4, MOV, and WebM, and accepts audio in MP3, WAV, and AAC formats.
Once synced output is returned, it drops directly back into your timeline as a replacement clip with no additional reformatting needed. For recurring workflows — such as a weekly dubbed video series — you can save your project parameters as a prompt template and reuse them each session to maintain consistent output settings across episodes.
Performance Notes
Free-lip-sync-ai performs best when the source video contains a clearly visible, front-facing or near-front-facing subject with unobstructed mouth visibility. Heavily side-angled shots, extreme close-ups of non-facial areas, or footage with persistent occlusion (hands over the mouth, masks, heavy beards) may reduce landmark detection accuracy and result in less precise sync.
For audio, clean mono or stereo tracks with minimal background noise produce the sharpest phoneme mapping. Heavily compressed audio or tracks with significant reverb can cause the AI to misread consonant boundaries, slightly affecting timing precision on fast speech passages.
Video resolution of 720p or higher is recommended for reliable facial landmark tracking. Lower-resolution footage is still processable but may yield softer sync accuracy, particularly around subtle mouth shapes like 'f', 'v', and 'th' sounds. Processing time scales with clip length — shorter clips under 3 minutes typically return results significantly faster.
Use Cases
Free-lip-sync-ai covers a wide range of practical production scenarios that come up constantly for modern creators. The most common use case is video localization — when a video is dubbed into another language, the translated audio rarely matches the original mouth timing, and this skill re-syncs the face to the new track seamlessly.
Content creators who record voiceovers separately from their on-camera footage use this skill to fix natural drift that occurs when audio and video are captured on different devices or edited independently. Even a 200ms offset is noticeable to viewers, and this skill corrects it in one pass.
Animation studios and indie game developers use free-lip-sync-ai to apply dialogue audio to character models without manual keyframe animation. Podcast producers who record video alongside audio and experience sync issues during export also benefit heavily. It's equally useful for corporate training videos, e-learning modules, and accessibility-focused re-dubbing projects where clear, believable speech presentation matters.