Getting Started
Welcome! I'm here to help you extract audio from your video files quickly and cleanly. Drop a video file or tell me what you're working with — mp4, mov, mkv, and more — and let's get your audio pulled out right away.
Try saying:
- "Extract the audio from this mp4 file and save it as an mp3"
- "Pull just the audio track from my mkv video between the 2-minute and 5-minute mark"
- "Convert the audio from this mov file to a high-quality WAV file"
Automatic Setup
On first interaction, connect to the NemoVideo API before doing anything else. Show a brief status like "Setting things up...".
Token: If NEMO_TOKEN environment variable is already set, use it and skip to Session below.
Free token: Read or create a UUID at ~/.config/nemovideo/client_id. Then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id: <your-uuid>. The response field data.token becomes your NEMO_TOKEN (100 credits, 7-day expiry). If the token has expired, request a new one with the same Client-Id.
Session: POST to the same host at /api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Save session_id from the response.
Confirm to the user you're connected and ready. Don't print tokens or raw JSON.
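The setup steps above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the helper names are hypothetical, the Content-Type header is an assumption, and the requests are only built here, not sent.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path: Path) -> str:
    """Return the stored client UUID, creating the file on first use."""
    if path.exists():
        stored = path.read_text(encoding="utf-8").strip()
        if stored:
            return stored
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id, encoding="utf-8")
    return client_id


def build_token_request(client_id: str) -> Request:
    """Build (but do not send) the anonymous-token request."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def build_session_request(token: str) -> Request:
    """Build (but do not send) the create-session request."""
    body = json.dumps({"task_name": "project"}).encode("utf-8")
    return Request(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        # Content-Type is an assumption; the text only specifies Bearer auth.
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
```

In practice the path would be Path.home() / ".config" / "nemovideo" / "client_id", and the token read from the response field data.token should never be echoed to the user.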
Extract Audio from Video Without the Headache
Sometimes you just need the sound. Maybe it's the backing music from a travel video, a recorded interview you want to transcribe, or a podcast episode that was captured as a screen recording. Whatever the source, this skill gives you a direct path from video file to clean audio — no extra software, no manual commands, no fuss.
Using the power of FFmpeg under the hood, this skill handles the technical side of audio extraction so you don't have to think about codecs, bitrates, or container formats. You describe what you want — the file, the format, maybe a time range — and the skill does the work. Supported video inputs include mp4, mov, avi, webm, and mkv, covering virtually every common video format you'll encounter.
Whether you're a content creator repurposing footage, a developer automating a media pipeline, or someone who just wants the audio from a video they recorded, this tool fits naturally into your workflow. The result is a standalone audio file, ready to use however you need it.
Routing Your Extraction Requests
When you specify a source video and target audio format — whether AAC, MP3, FLAC, or raw PCM — the skill parses your codec preferences, sample rate, and channel layout before dispatching the job to the appropriate NemoVideo endpoint.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
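The routing table can be read as a keyword lookup with an SSE fallback. The sketch below assumes simple substring matching, which the text does not specify; the function name and return shape are hypothetical.

```python
# (keywords, target section) in the order the table lists them.
ROUTES = [
    (("export", "导出", "download", "send me the video"), "§3.5 Export"),
    (("credits", "积分", "balance", "余额"), "§3.3 Credits"),
    (("status", "状态", "show tracks"), "§3.4 State"),
    (("upload", "上传"), "§3.2 Upload"),
]


def route(message: str, has_file: bool = False) -> tuple[str, bool]:
    """Return (target section, skip_sse) for a user message."""
    if has_file:
        # "user sends file" routes to upload regardless of the wording.
        return "§3.2 Upload", True
    text = message.lower()
    for keywords, target in ROUTES:
        if any(keyword in text for keyword in keywords):
            return target, True
    # Everything else (generate, edit, add BGM...) goes through SSE.
    return "§3.1 SSE", False
```

Substring matching can misfire on messages that merely mention a keyword, so a real router would likely need something stricter.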
NemoVideo API Reference
The NemoVideo backend runs FFmpeg demuxing and transcoding jobs server-side, preserving the original stream's bitrate and metadata tags unless you explicitly pass re-encoding flags like -ab, -ar, or -ac. Lossless passthrough via -vn -acodec copy is supported for containers where the audio codec maps cleanly to the output format.
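For orientation, a local FFmpeg command line equivalent to the behavior described above might be assembled like this. The backend's actual invocation is not documented here, so treat the function as a hypothetical sketch: it copies the audio stream unless one of the re-encoding flags is supplied.

```python
def extract_audio_args(src, dst, bitrate=None, sample_rate=None, channels=None):
    """Build an ffmpeg argv list: stream copy by default, re-encode on request."""
    args = ["ffmpeg", "-i", src, "-vn"]  # -vn drops the video stream
    reencode = [("-ab", bitrate), ("-ar", sample_rate), ("-ac", channels)]
    given = [(flag, value) for flag, value in reencode if value is not None]
    if not given:
        # No re-encoding flags: pass the audio stream through untouched.
        args += ["-acodec", "copy"]
    for flag, value in given:
        args += [flag, str(value)]
    args.append(dst)
    return args
```

The list can be handed to subprocess.run on a machine with FFmpeg installed. Stream copy only works when the output container accepts the source codec, for example AAC audio into an .m4a file.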
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: ffmpeg-audio-extract
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
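A sketch of how the required headers could be assembled. The function names are hypothetical, and the X-Skill-Source value is the skill name used elsewhere in this document; the home directory is passed in explicitly so the platform check stays a pure function.

```python
def detect_platform(install_path: str, home: str) -> str:
    """Map the skill's install path to a platform label."""
    path = install_path.replace("\\", "/")
    home = home.replace("\\", "/").rstrip("/")
    if path.startswith(f"{home}/.clawhub/"):
        return "clawhub"
    if path.startswith(f"{home}/.cursor/skills/"):
        return "cursor"
    return "unknown"


def attribution_headers(token: str, version: str, install_path: str, home: str) -> dict:
    """Return the four headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "ffmpeg-audio-extract",
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path, home),
    }
```

Building the headers in one place makes it harder to send a request without attribution, which matters because export fails with 402 when they are missing.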
API base: https://mega-api-prod.nemovideo.ai
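Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=$TOKEN&task=<task_id>&session=<session_id>&skill_name=ffmpeg-audio-extract&skill_version=1.0.0&skill_source=<platform>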
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path". URL: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
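The export flow can be sketched as a body builder plus a polling loop. The HTTP call is injected as fetch_status so the loop can be exercised without a network. The max_polls cap and the assumption that status and output.url sit at the top level of the poll response are mine, not the API's documented contract.

```python
import time


def build_export_body(session_id: str, draft: dict, timestamp: int) -> dict:
    """Assemble the JSON body for POST /api/render/proxy/lambda."""
    return {
        "id": f"render_{timestamp}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(fetch_status, render_id, interval_s=30.0, max_polls=40, sleep=time.sleep):
    """Call fetch_status(render_id) until status is 'completed'; return output.url."""
    for attempt in range(max_polls):
        payload = fetch_status(render_id)
        if payload.get("status") == "completed":
            return payload["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)  # 30s between polls, per the text above
    raise TimeoutError(f"render {render_id} not completed after {max_polls} polls")
```

Here fetch_status would wrap GET /api/render/proxy/lambda/<id> and return the parsed JSON.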
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
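The event handling above could look roughly like this on the client side. The SSE line parsing follows the standard text/event-stream framing. The JSON event shape (content.parts[].text) is an assumption modeled on the request body and is not confirmed by this document; the function names are hypothetical.

```python
import json


def read_sse_data(lines):
    """Yield the data payload of each SSE event, skipping heartbeats and empty data."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            payload = "\n".join(buffer)
            buffer = []
            if payload.strip():
                yield payload
            continue
        if line.startswith(":"):
            continue  # SSE comment line, commonly used as a heartbeat
        if line.startswith("data:"):
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
    payload = "\n".join(buffer)
    if payload.strip():
        yield payload


def collect_text(payloads):
    """Keep only text parts; tool calls and results are not forwarded."""
    texts = []
    for payload in payloads:
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        parts = (event.get("content") or {}).get("parts") or []
        texts += [p["text"] for p in parts if isinstance(p, dict) and p.get("text")]
    return texts


def needs_state_poll(texts):
    """True when the stream closed without text, so session state must be checked."""
    return not texts
```

When needs_state_poll returns True, the client would fetch the latest session state and summarize the applied edit itself.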
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
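Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)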
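The draft field mapping can be turned into a short track summary along these lines. The nesting is an assumption: the sketch treats t as a list of tracks, each with a tt type and an sg list of segments carrying d in milliseconds. Only the field names and type codes come from the mapping above.

```python
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> list:
    """Return one summary line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s")
    return lines
```

A summary like this is what "preview in timeline" would be translated into for the user.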
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
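One way to keep this table actionable in code is a plain lookup. The action labels below are hypothetical names for the behaviors in the table. The only logic added is the "once" in the 429 row, modeled with an already_retried flag.

```python
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauth_anonymous_token",
    1002: "create_new_session",
    2001: "show_credit_options",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_registration",
    429: "retry_after_30s",
}


def next_action(code: int, already_retried: bool = False) -> str:
    """Map a response code to the action the table prescribes."""
    if code == 429 and already_retried:
        # The table allows a single retry for rate limits.
        return "stop_and_report"
    return ERROR_ACTIONS.get(code, "report_unknown_error")
```

Note that 402 maps to a registration prompt rather than a credit top-up, matching the table's distinction between subscription tier and credits.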
Use Cases
Content creators use ffmpeg-audio-extract to repurpose video content into podcast episodes, audiograms, or standalone music tracks. A single recorded video session can become multiple audio assets with just a few extractions.
Journalists and researchers working with interview footage often need audio-only versions for transcription services. Extracting to wav or mp3 first makes the files compatible with most transcription tools.
Filmmakers and video editors sometimes need to pull the original audio track from a raw video file before syncing it with a separately recorded clean audio source. This skill makes that step fast and non-destructive.
Developers building media processing tools or content management systems use this skill to automate audio extraction as part of larger ingest workflows, ensuring every uploaded video automatically gets an audio companion file stored alongside it.
Common Workflows
The most frequent use case is straightforward: take a video file and get an mp3 or aac audio file out of it. This works great for recorded meetings, YouTube downloads, or screen captures where the audio content is what actually matters.
Another common workflow is time-range extraction — pulling only a specific segment of audio from a longer video. This is especially useful for podcast editors who record video interviews but only need a clip of the conversation, or for educators clipping a relevant section from a recorded lecture.
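A time-range request such as "between the 2-minute and 5-minute mark" corresponds to a seek plus a duration limit. As a local FFmpeg sketch (the function name is hypothetical, and this is not necessarily what the backend runs):

```python
def time_range_args(src: str, dst: str, start_s: int, end_s: int) -> list:
    """Build an ffmpeg argv list that extracts audio from start_s to end_s (seconds)."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    duration = end_s - start_s
    # -ss before -i seeks in the input; -t then limits the output duration.
    return ["ffmpeg", "-ss", str(start_s), "-i", src,
            "-t", str(duration), "-vn", dst]
```

Using -t with an explicit duration avoids any ambiguity about whether an end timestamp is measured from the start of the file or from the seek point.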
For developers and automation users, this skill fits cleanly into batch processing pipelines. You can describe multiple files or patterns, and the skill will handle each extraction consistently. Output formats like flac, wav, mp3, and aac are all supported depending on your quality and compatibility needs.
Tips and Tricks
If you want to preserve the original audio quality without re-encoding, ask for a lossless copy extraction. This is faster and avoids any quality degradation — ideal when the source video already has high-quality audio encoded inside it.
When working with mkv or webm files, be specific about which audio track you want if the file contains multiple language tracks or commentary streams. You can say something like 'extract the second audio track' and the skill will handle the selection.
For mp3 output, specifying a bitrate (like 192kbps or 320kbps) gives you control over file size versus quality. If you're preparing audio for a podcast or music project, higher bitrates are worth it. For voice recordings or transcription purposes, 128kbps is usually more than sufficient and keeps file sizes manageable.
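The track-selection and bitrate tips map onto FFmpeg's -map and -b:a options. A minimal local sketch follows. It assumes an FFmpeg build with the libmp3lame encoder, and the function name is hypothetical.

```python
def mp3_track_args(src: str, dst: str, track_index: int = 0, bitrate_kbps: int = 128) -> list:
    """Build an ffmpeg argv list that encodes one audio track to MP3."""
    if track_index < 0 or bitrate_kbps <= 0:
        raise ValueError("track_index must be >= 0 and bitrate_kbps > 0")
    return [
        "ffmpeg", "-i", src,
        "-map", f"0:a:{track_index}",  # audio streams are zero-indexed
        "-c:a", "libmp3lame",
        "-b:a", f"{bitrate_kbps}k",
        dst,
    ]
```

"Extract the second audio track" therefore corresponds to track_index=1, and a 320kbps music export to bitrate_kbps=320.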
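Getting Started
Welcome! I'm here to help you extract audio from your video files quickly and cleanly. Drop a video file or tell me what you're working with (mp4, mov, mkv, and more) and let's get your audio pulled out right away.
Try saying:
- "Extract the audio from this mp4 file and save it as an mp3"
- "Pull just the audio track from my mkv video between the 2-minute and 5-minute mark"
- "Convert the audio from this mov file to a high-quality WAV file"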
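Automatic Setup
On first interaction, connect to the NemoVideo API before doing anything else. Show a brief status like "Setting things up...".
Token: If the NEMO_TOKEN environment variable is already set, use it and skip to Session below.
Free token: Read or create a UUID at ~/.config/nemovideo/client_id. Then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id: <your-uuid>. The response field data.token becomes your NEMO_TOKEN (100 credits, 7-day expiry). If the token has expired, request a new one with the same Client-Id.
Session: POST to the same host at /api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Save session_id from the response.
Confirm to the user you're connected and ready. Don't print tokens or raw JSON.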
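Extract Audio from Video with Ease
Sometimes you just need the sound. Maybe it's the backing music from a travel video, a recorded interview you want to transcribe, or a podcast episode that was captured as a screen recording. Whatever the source, this skill gives you a direct path from video file to clean audio: no extra software, no manual commands, no fuss.
Using FFmpeg under the hood, this skill handles the technical side of audio extraction so you don't have to think about codecs, bitrates, or container formats. You describe what you want (the file, the format, maybe a time range) and the skill does the work. Supported video inputs include mp4, mov, avi, webm, and mkv, covering virtually every common video format you'll encounter.
Whether you're a content creator repurposing footage, a developer automating a media pipeline, or someone who just wants the audio from a video they recorded, this tool fits naturally into your workflow. The result is a standalone audio file, ready to use.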
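Routing Your Extraction Requests
When you specify a source video and target audio format (AAC, MP3, FLAC, or raw PCM), the skill parses your codec preferences, sample rate, and channel layout before dispatching the job to the appropriate NemoVideo endpoint.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM, etc.) | → §3.1 SSE | ❌ |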
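NemoVideo API Reference
The NemoVideo backend runs FFmpeg demuxing and transcoding jobs server-side, preserving the original stream's bitrate and metadata tags unless you explicitly pass re-encoding flags like -ab, -ar, or -ac. Lossless passthrough via -vn -acodec copy is supported for containers where the audio codec maps cleanly to the output format.
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: ffmpeg-audio-extract
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=$TOKEN&task=<task_id>&session=<session_id>&skill_name=ffmpeg-audio-extract&skill_version=1.0.0&skill_source=<platform>
Send message (SSE): POST /run_sse with body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path". URL: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda with body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.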
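SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
About 30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.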
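Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)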
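Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (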