Getting Started
Hey there! Ready to pull the audio from your video? Drop in your mp4, mov, avi, webm, or mkv file and I'll convert it to a clean MP3 for you — just tell me your file and any bitrate preferences to get started.
Try saying:
- "Convert this mp4 interview recording to an MP3 at 192kbps for my podcast episode."
- "Extract the audio from my webinar.mkv file and save it as a high-quality MP3."
- "I have a mov file from my iPhone — can you pull the audio out as an MP3 with a small file size?"
Getting Connected
Before handling any user request, establish a connection to the NemoVideo backend. Show a brief status like "Connecting...".
If NEMO_TOKEN is in the environment, use it directly and create a session. Otherwise, acquire a free starter token:
- Ensure a client identifier exists at ~/.config/nemovideo/client_id (create one as a UUID if needed)
- POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the X-Client-Id header
- The response includes a token with 100 free credits valid for 7 days — use it as NEMO_TOKEN
Then create a session by POSTing to /api/tasks/me/with-session/nemo_agent with Bearer authorization and body {"task_name":"project","language":"en"}. The session_id in the response is needed for all following requests.
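The connection steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the skill's actual implementation: the helper names are hypothetical, the requests are only built (never sent), the session path is assumed to live on the same host as the token endpoint, and the Content-Type header is an assumption. The attribution headers required elsewhere in this document are left out for brevity.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

# Assumption: session paths are served from the same host as the token endpoint.
API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path: Path) -> str:
    """Return the stored client ID, creating a fresh UUID if none exists."""
    if path.exists():
        existing = path.read_text().strip()
        if existing:
            return existing
    path.parent.mkdir(parents=True, exist_ok=True)
    client_id = str(uuid.uuid4())
    path.write_text(client_id)
    return client_id


def anonymous_token_request(client_id: str) -> Request:
    """Build (but do not send) the POST that asks for a starter token."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def create_session_request(token: str, language: str = "en") -> Request:
    """Build (but do not send) the POST that opens a session."""
    body = json.dumps({"task_name": "project", "language": language}).encode()
    return Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumption, not stated above
        },
    )
```

In practice the client ID path would be `Path.home() / ".config" / "nemovideo" / "client_id"`, and each request would be sent with `urllib.request.urlopen` or an HTTP client of your choice.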
Tell the user you're ready. Keep the technical details out of the chat.
Pull Studio-Quality Audio Straight From Your Video
You shot the perfect interview, recorded a live session, or captured a webinar — but now you need just the audio. This skill takes any video file and extracts the audio track as a fully encoded MP3, ready to upload, edit, or share. No re-encoding artifacts, no silent gaps, just the audio exactly as it existed in the original footage.
Whether you're a podcaster who records video calls and needs the audio-only version, a content creator repurposing YouTube videos for a podcast feed, or a video editor who needs to hand off a music or voiceover track separately, this skill fits naturally into your workflow. Drop in your file, specify your preferences, and get your MP3 back fast.
Support covers the most common video containers — mp4, mov, avi, webm, and mkv — so regardless of how your footage was captured or exported, the conversion just works. Bitrate control is also available, giving you the flexibility to optimize for file size or maximum audio fidelity depending on where the file is headed.
Routing Your Conversion Requests
When you drop a video file or URL into the chat, the skill parses your intent and routes the extraction job to the appropriate FFmpeg pipeline based on format, bitrate preferences, and any codec flags you specify.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
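One way to express the routing table is a small keyword matcher. The sketch below is illustrative only: the function name, the substring matching, and the precedence (keywords are checked top to bottom, then attachments) are assumptions, since the table does not say how conflicts are resolved.

```python
from typing import Tuple

# Keyword groups taken from the routing table; order defines precedence (an assumption).
ROUTES = [
    ("export", ("export", "导出", "download", "send me the video")),
    ("credits", ("credits", "积分", "balance", "余额")),
    ("state", ("status", "状态", "show tracks")),
    ("upload", ("upload", "上传")),
]


def route(message: str, has_attachment: bool = False) -> Tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    text = message.lower()
    for action, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, True
    if has_attachment:
        # "user sends file" row: an attached file with no keyword means upload.
        return "upload", True
    # Everything else goes through the SSE message endpoint.
    return "sse", False
```

Plain substring matching is deliberately naive; a production router would likely need intent parsing to avoid false positives such as "don't export yet".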
NemoVideo API Reference
The NemoVideo backend spins up a containerized FFmpeg instance that demuxes your source container — whether MKV, MP4, MOV, or AVI — strips the video stream, and re-encodes or losslessly extracts the audio track to MP3 at your target bitrate (up to 320kbps CBR). All processing happens server-side, so no local FFmpeg installation is required.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: ffmpeg-video-to-mp3
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
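A minimal sketch of assembling the required headers might look like this. The function names are hypothetical, and platform detection is reduced to a loose substring check on the install path rather than a full path resolution.

```python
from typing import Dict


def detect_platform(install_path: str) -> str:
    """Map an install path to a platform label (loose substring check)."""
    normalized = install_path.replace("\\", "/")
    if "/.clawhub/" in normalized:
        return "clawhub"
    if "/.cursor/skills/" in normalized:
        return "cursor"
    return "unknown"


def build_headers(token: str, skill_source: str, skill_version: str,
                  install_path: str) -> Dict[str, str]:
    """Return the four headers every request is expected to carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": skill_source,
        "X-Skill-Version": skill_version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place makes it harder to forget an attribution header on a single call, which matters because a missing header surfaces only later as a 402 on export.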
API base: https://mega-api-prod.nemovideo.ai
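Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=$TOKEN&task=<task_id>&session=<session_id>&skill_name=ffmpeg-video-to-mp3&skill_version=1.0.0&skill_source=<platform>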
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple — returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
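The export flow described above (submit a render, then poll every 30 seconds) can be sketched as below. Several details are assumptions: the helper names, the `max_polls` safety limit, the format of the `<ts>` value, and the response shape with `status` and `output.url` as top-level fields. The HTTP call is injected as `fetch_status` so the loop can be exercised without a network.

```python
import time
from typing import Any, Callable, Dict


def build_export_body(session_id: str, draft: Dict[str, Any], ts: int) -> Dict[str, Any]:
    """Build the render request body; `ts` is any timestamp used to form the id."""
    return {
        "id": f"render_{ts}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_until_completed(
    fetch_status: Callable[[str], Dict[str, Any]],
    render_id: str,
    interval_s: float = 30.0,
    max_polls: int = 60,  # assumption: the document gives no upper bound
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the render status until it is completed and return the download URL."""
    for attempt in range(max_polls):
        result = fetch_status(render_id)
        if result.get("status") == "completed":
            return result["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"render {render_id} did not complete after {max_polls} polls")
```

Here `fetch_status` would wrap `GET /api/render/proxy/lambda/<id>` and return the decoded JSON. Failure statuses are not documented above, so this sketch simply keeps polling until the limit is reached.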
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
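The message body and the "keep waiting" rule can be sketched as follows. The payload shape comes from the API reference; everything else is an assumption, in particular how a heartbeat appears on the wire (here: an SSE comment line, an `event: heartbeat` line, or a `data:` line with no payload) and the helper names. The JSON structure of real events is not documented here, so payloads are not decoded.

```python
from typing import Any, Dict


def build_sse_payload(session_id: str, text: str) -> Dict[str, Any]:
    """Build the body for POST /run_sse."""
    return {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }


def classify_sse_line(line: str) -> str:
    """Classify one raw stream line as 'payload', 'keep_waiting', or 'ignore'."""
    stripped = line.strip()
    if stripped.startswith(":"):
        return "keep_waiting"  # SSE comment, commonly used as a keep-alive
    if stripped.startswith("event:"):
        name = stripped[len("event:"):].strip()
        return "keep_waiting" if name == "heartbeat" else "ignore"
    if stripped.startswith("data:"):
        payload = stripped[len("data:"):].strip()
        return "payload" if payload else "keep_waiting"
    return "ignore"  # blank separators, id:, retry:, and anything unrecognized


def due_for_notice(last_notice_at: float, now: float, interval_s: float = 120.0) -> bool:
    """True when it is time to show another 'Still working...' message."""
    return now - last_notice_at >= interval_s
```

If the stream closes and no `payload` line carried any text, that is the cue to poll the session state and summarize what changed.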
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
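    Timeline (3 tracks):
    1. Video: city timelapse (0-10s)
    2. BGM: Lo-fi (0-10s, 35%)
    3. Title: "Urban Dreams" (0-3s)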
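To show how the short field names might be turned into a track summary, here is a hedged sketch. Only the key names (t, tt, sg, d) come from the mapping above; the nesting is an assumption (tracks in a list under t, each track holding a list of segments under sg, each segment carrying its own d in milliseconds), and the function name is hypothetical.

```python
from typing import Any, Dict, List

TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: Dict[str, Any]) -> List[str]:
    """Return one human-readable line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # Assumption: each segment stores its duration in milliseconds under "d".
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s")
    return lines
```

For a draft with one 10-second video segment, `summarize_draft` would yield a line such as `1. video: 1 segment(s), 10s`.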
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
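The error table maps naturally onto a lookup. The sketch below is one possible shape: the action labels and the function name are hypothetical, and the fallback for unlisted codes is an assumption, since the table does not cover them.

```python
# Codes with a single, unconditional action.
SIMPLE_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "create_session",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_registration",
    429: "retry_once_after_30s",
}


def action_for(code: int, anonymous: bool = True) -> str:
    """Return the action label for a backend error code."""
    if code == 2001:
        # No credits: anonymous users register, registered users top up.
        return "show_registration_url" if anonymous else "prompt_top_up"
    # Assumption: anything not in the table is surfaced as a generic error.
    return SIMPLE_ACTIONS.get(code, "report_error")
```

Keeping 402 and 2001 as distinct labels mirrors the table's point that a blocked export is a subscription issue rather than a credit shortage.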
Use Cases
Podcast Production: Many podcasters record their guest interviews over video calls. Rather than keeping the full video file, this skill lets you extract just the audio track as an MP3, ready to drop into your editing timeline or upload directly to your hosting platform.
Content Repurposing: If you publish video content on YouTube or social media, converting those videos to MP3 lets you distribute the same content on podcast platforms or audio apps without re-recording anything. One shoot, multiple formats.
Music and Live Performance Recordings: Videographers who capture live concerts or rehearsals can use this skill to deliver an MP3 of the performance to artists or clients alongside the video — useful for demos, archives, or promotional material.
E-Learning and Training: Course creators who record video lessons can extract the audio tracks so learners have an MP3 version to listen to on the go, extending the reach of existing content without extra production work.
FAQ
What video formats are supported? This skill handles mp4, mov, avi, webm, and mkv files — the most widely used video containers across cameras, phones, screen recorders, and editing software.
Can I control the MP3 bitrate? Yes. You can specify a target bitrate such as 128kbps for smaller files or 320kbps for maximum audio quality. If you don't specify, a sensible default is applied automatically.
Will the audio quality degrade during conversion? MP3 is a lossy format, so unless the source audio track is already MP3, it is re-encoded and some loss is unavoidable. At higher bitrates the difference is usually hard to hear, and as long as the source file has a decent audio track, the resulting MP3 will reflect that quality closely.
What if my video has multiple audio tracks? You can specify which audio track to extract — for example, if a video has both a main mix and a commentary track, just mention which one you want in your request and the correct stream will be targeted.
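To make the bitrate trade-off from the FAQ concrete: for constant-bitrate MP3, file size is roughly bitrate times duration divided by 8. The helper below is a hypothetical illustration of that arithmetic; it ignores ID3 tags and frame padding, so real files will differ slightly.

```python
def estimate_mp3_bytes(bitrate_kbps: int, duration_s: float) -> int:
    """Approximate CBR MP3 size in bytes: kilobits per second * seconds / 8."""
    return int(bitrate_kbps * 1000 * duration_s / 8)


# One hour of audio at the two bitrates mentioned in the FAQ.
one_hour = 3600
small = estimate_mp3_bytes(128, one_hour)   # 57,600,000 bytes, about 57.6 MB
large = estimate_mp3_bytes(320, one_hour)   # 144,000,000 bytes, about 144 MB
```

So an hour-long recording is about 2.5 times larger at 320kbps than at 128kbps, which is the main consideration when choosing between small files and maximum fidelity.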