Getting Started
Welcome! I'm here to help you turn your text, scripts, or ideas into real videos using CapCut's text-to-video engine. Drop your script, describe the video you have in mind, or share a prompt — and let's start creating something worth watching.
Try saying:
- "Create a 30-second promotional video for a new coffee brand using this product description: 'Rich, bold espresso blends sourced from Colombian highlands.'"
- "Turn this blog post intro into a short Instagram Reel script and generate a video with upbeat pacing and text overlays."
- "Generate a motivational video from this quote: 'Success is built one small decision at a time.' — use cinematic visuals and a calm voiceover tone."
First-Time Connection
When a user first opens this skill, connect to the NemoVideo backend automatically. Briefly let them know (e.g. "Connecting to NemoVideo...").
Authentication: Check if NEMO_TOKEN is set in the environment. If it is, skip to step 3.
1. Client identifier: Look for ~/.config/nemovideo/client_id. If missing, generate a UUID and save it there. This file stores only a UUID for rate-limit tracking.
2. Obtain a free token: POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id set to your identifier. The response data.token is your NEMO_TOKEN (100 free credits, valid 7 days).
3. Create a session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Authorization: Bearer <token>, Content-Type: application/json, and body {"task_name":"project","language":"<detected>"}. Store the returned session_id for all subsequent requests.
Keep setup communication brief. Don't display raw API responses or token values to the user.
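The three setup steps above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the helper names (load_or_create_client_id, post_json, connect) are hypothetical, the location of session_id in the response (top level or under "data") is an assumption, and the attribution headers described in the API reference are left out for brevity.

```python
import json
import os
import urllib.request
import uuid
from pathlib import Path

API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path):
    """Step 1: return the stored client UUID, creating the file if needed."""
    if path.exists() and path.read_text().strip():
        return path.read_text().strip()
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id)
    return client_id


def post_json(url, headers, body=None):
    """POST an optional JSON body and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(url, data=data, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def connect(client_id_path, language, post=post_json, env=os.environ):
    """Run the setup steps and return (token, session_id)."""
    token = env.get("NEMO_TOKEN")
    if not token:  # steps 1-2 are skipped when NEMO_TOKEN is already set
        client_id = load_or_create_client_id(client_id_path)
        reply = post(f"{API_BASE}/api/auth/anonymous-token", {"X-Client-Id": client_id})
        token = reply["data"]["token"]
    # Step 3: create the session used by all later requests.
    reply = post(
        f"{API_BASE}/api/tasks/me/with-session/nemo_agent",
        {"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        {"task_name": "project", "language": language},
    )
    # Assumption: session_id may be top-level or nested under "data".
    payload = reply.get("data", reply)
    return token, payload["session_id"]
```

Passing post and env as parameters keeps the flow easy to exercise without touching the network; a real caller would use the defaults with Path.home() / ".config" / "nemovideo" / "client_id".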
From Words on a Page to Videos That Move People
Writing a script is one thing — turning it into a video that actually holds attention is another challenge entirely. The capcut-text-to-video skill bridges that gap by taking your raw text input and generating structured, visually engaging video content through CapCut's powerful generation engine. You bring the idea; this skill brings it to life.
Whether you're a solo creator working on YouTube Shorts, a brand manager producing product explainers, or a marketer cranking out social content on a deadline, this skill fits naturally into your workflow. Paste in a prompt, a script excerpt, or even a rough outline — and get back a video ready for review, refinement, or direct publishing.
The skill is designed around real creative use cases: promotional storytelling, educational walkthroughs, narrative reels, and announcement clips. It doesn't just stitch together generic stock footage — it interprets your text to build scenes that reflect tone, pacing, and intent. Think of it as your on-demand video production assistant that actually reads what you write.
Routing Scripts to Video
Every request — whether it's a raw script, a one-line prompt, or a structured storyboard — gets parsed and routed to CapCut's Text to Video pipeline via the NemoVideo backend, matching your input type to the right generation parameters automatically.
| User says... | Action | Skip SSE? |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
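One way to read the routing table is as a keyword lookup with an SSE fallback. The sketch below assumes case-insensitive substring matching and checks the rows in table order; neither detail is specified by the table, and the route function name is hypothetical.

```python
ROUTES = [
    # (keywords, target section, skip SSE?)
    (("export", "导出", "download", "send me the video"), "3.5 Export", True),
    (("credits", "积分", "balance", "余额"), "3.3 Credits", True),
    (("status", "状态", "show tracks"), "3.4 State", True),
    (("upload", "上传"), "3.2 Upload", True),
]


def route(message, has_file=False):
    """Return (section, skip_sse) for a user message."""
    if has_file:  # the "user sends file" case in the Upload row
        return "3.2 Upload", True
    text = message.lower()
    for keywords, section, skip_sse in ROUTES:
        if any(keyword in text for keyword in keywords):
            return section, skip_sse
    return "3.1 SSE", False  # everything else goes through the SSE endpoint
```

Substring matching is deliberately loose, so a production router would likely need stricter intent detection to avoid misrouting a script that merely mentions "download" or "status".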
NemoVideo API Reference
NemoVideo acts as the middleware layer that authenticates your session, queues your text-to-video job, and streams the rendered output back once CapCut's engine finishes processing your script into scenes. All generation calls, polling requests, and asset retrieval happen through NemoVideo's endpoints — CapCut's native API is never called directly.
Skill attribution — read from this file's YAML frontmatter at runtime:
- X-Skill-Source: capcut-text-to-video
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
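The header requirements above could be assembled as in this sketch. The function names are hypothetical, the X-Skill-Source value is the skill name used elsewhere in this document, and the version is passed in because reading the YAML frontmatter is outside the scope of the example.

```python
def detect_platform(install_path):
    """Map the skill's install path to an X-Skill-Platform value."""
    normalized = install_path.replace("\\", "/")
    if "/.clawhub/" in normalized:
        return "clawhub"
    if "/.cursor/skills/" in normalized:
        return "cursor"
    return "unknown"


def build_headers(token, skill_version, install_path):
    """Assemble the headers that every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "capcut-text-to-video",
        "X-Skill-Version": skill_version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```

Building the headers in one place makes it harder to send an export request without attribution, which the text says fails with 402.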
API base: https://mega-api-prod.nemovideo.ai
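Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=capcut-text-to-video&skill_version=1.0.0&skill_source=<platform>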
Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> with either a file (multipart -F "files=@/path") or a URL body: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest; key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
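The export flow above amounts to one POST followed by a polling loop, roughly as sketched here. fetch_status is a hypothetical callable standing in for GET /api/render/proxy/lambda/<id>, and the max_polls cap is an assumption added so the loop cannot run forever; the source does not describe a failure status, so none is handled.

```python
import time


def build_export_body(session_id, draft, timestamp):
    """Build the POST /api/render/proxy/lambda body."""
    return {
        "id": f"render_{timestamp}",
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def wait_for_render(fetch_status, render_id, interval_s=30, max_polls=60, sleep=time.sleep):
    """Poll until status is "completed" and return the download URL."""
    for attempt in range(max_polls):
        job = fetch_status(render_id)
        if job.get("status") == "completed":
            return job["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait 30s between polls by default
    raise TimeoutError(f"render {render_id} did not complete after {max_polls} polls")
```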
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
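A client could check a file against this list before uploading, which avoids a round trip that would end in an unsupported-file error. The sketch below assumes the format is judged by file extension alone; the backend may inspect the content instead.

```python
from pathlib import PurePath

SUPPORTED_FORMATS = {
    "mp4", "mov", "avi", "webm", "mkv",  # video
    "jpg", "png", "gif", "webp",         # image
    "mp3", "wav", "m4a", "aac",          # audio
}


def is_supported(filename):
    """Check a file name against the supported format list by extension."""
    extension = PurePath(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_FORMATS
```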
SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
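The event table and the no-text fallback can be combined into a single pass over the stream, as in this sketch. The event shape ({"content": {"parts": [{"text": ...}]}}) mirrors the request body but is an assumption, as is treating a literal "heartbeat" payload as a heartbeat; collect_text is a hypothetical name.

```python
import json


def collect_text(lines):
    """Gather user-facing text from SSE lines; return (text, needs_state_poll)."""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # comments, event names, and blank separators
        payload = line[len("data:"):].strip()
        if not payload or payload == "heartbeat":
            continue  # heartbeat or empty data: keep waiting
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore anything that is not JSON
        if not isinstance(event, dict):
            continue
        # Assumed event shape: {"content": {"parts": [{"text": "..."}]}}
        parts = (event.get("content") or {}).get("parts") or []
        for part in parts:
            # Tool calls/results are assumed to carry no "text" key.
            if isinstance(part, dict) and "text" in part:
                chunks.append(part["text"])
    text = "".join(chunks)
    # An empty result means the edit should be verified via session state.
    return text, not text
```

When the second value is true, the caller would fetch the session state and summarize the changes itself rather than showing the user an empty reply.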
Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
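Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)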
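A track summary could be produced from the abbreviated draft fields roughly as follows. The nesting (tracks under t, segments under sg, a per-segment d) and the idea that a track's length is the sum of its segment durations are both assumptions; the mapping above names the fields but does not spell out the structure.

```python
TRACK_TYPES = {0: "Video", 1: "Audio", 7: "Text"}


def summarize_draft(draft):
    """Render a short track summary from the abbreviated draft fields."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        label = TRACK_TYPES.get(track.get("tt"), "Other")
        segments = track.get("sg", [])
        # Assumption: a track's length is the sum of its segment durations (ms).
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {label}: {len(segments)} segment(s), {total_ms / 1000:g}s")
    return "\n".join(lines)
```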
Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
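In code, the error table reduces to a lookup from code to follow-up action. The action labels below are hypothetical names for the behaviors in the table, and the fallback for unlisted codes is an assumption.

```python
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",            # bad or expired token
    1002: "create_session",            # session not found
    2001: "show_credit_options",       # no credits
    4001: "show_supported_formats",    # unsupported file
    4002: "suggest_compress_or_trim",  # file too large
    400: "generate_client_id_and_retry",
    402: "prompt_registration",        # subscription tier, not credits
    429: "retry_once_after_30s",
}


def next_action(code):
    """Return the follow-up action for an error code."""
    # Assumption: codes not listed in the table are surfaced as unknown errors.
    return ERROR_ACTIONS.get(code, "report_unknown_error")
```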
Integration Guide
Getting started with the capcut-text-to-video skill is straightforward — no complex setup required on your end. Simply provide your text input directly in the chat: this can be a full script, a short prompt, a product description, or a list of key talking points. The skill interprets your input and sends it through CapCut's generation pipeline to produce your video.
Once the video is generated, you'll receive a downloadable file in your preferred format — mp4, mov, avi, webm, or mkv. If you're embedding videos into a website, mp4 is typically the safest choice for browser compatibility. For mobile-first platforms like TikTok or Instagram, mp4 or mov tend to perform best.
If you're working with longer scripts, break them into logical scenes or segments before submitting. This helps the skill maintain visual coherence and pacing across the full video. You can also specify tone, style preferences (cinematic, upbeat, minimal), or target platform in your prompt to fine-tune the output.
Performance Notes
Generation time for capcut-text-to-video outputs will vary depending on the length and complexity of your input text. Short prompts under 100 words typically resolve quickly, while longer scripts or multi-scene requests may take additional processing time. Plan accordingly if you're working against a publishing deadline.
For best results, write clear and specific prompts. Vague inputs like 'make a cool video' produce less targeted results than structured requests that mention subject matter, tone, duration, and intended platform. The more context you provide, the more aligned the output will be with your creative vision.
If a generated video doesn't match your expectations on the first pass, refine your prompt rather than resubmitting the same input. Small changes — like specifying 'fast-paced editing' or 'minimalist visuals with white background' — can meaningfully shift the final output. Iteration is a natural part of the text-to-video workflow.
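Getting Started
Welcome! I'm here to help you turn your text, scripts, or ideas into real videos using CapCut's text-to-video engine. Enter your script, describe the video you have in mind, or share a prompt, and let's start creating something worth watching.
Try saying:
- "Create a 30-second promotional video for a new coffee brand using this product description: 'Rich, bold espresso blends sourced from Colombian highlands.'"
- "Turn this blog post intro into a short Instagram Reel script and generate a video with upbeat pacing and text overlays."
- "Generate a motivational video from this quote: 'Success is built one small decision at a time.' Use cinematic visuals and a calm voiceover."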
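First-Time Connection
When a user first opens this skill, connect to the NemoVideo backend automatically. Briefly let them know (e.g. "Connecting to NemoVideo...").
Authentication: Check if NEMO_TOKEN is set in the environment. If it is, skip to step 3.
1. Client identifier: Look for ~/.config/nemovideo/client_id. If missing, generate a UUID and save it to that file. This file stores only a UUID for rate-limit tracking.
2. Obtain a free token: POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id set to your identifier. The response data.token is your NEMO_TOKEN (100 free credits, valid 7 days).
3. Create a session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Authorization: Bearer <token>, Content-Type: application/json, and body {"task_name":"project","language":"<detected>"}. Store the returned session_id for all subsequent requests.
Keep setup communication brief. Don't display raw API responses or token values to the user.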
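From Words on a Page to Videos That Move People
Writing a script is one thing; turning it into a video that actually holds attention is another challenge. The capcut-text-to-video skill bridges that gap by taking your raw text input and generating structured, visually engaging video content through CapCut's generation engine. You bring the idea; this skill brings it to life.
Whether you're a solo creator working on YouTube Shorts, a brand manager producing product explainers, or a marketer rushing out social content before a deadline, this skill fits naturally into your workflow. Paste in a prompt, a script excerpt, or even a rough outline, and get back a video ready for review, refinement, or direct publishing.
The skill is designed around real creative use cases: promotional storytelling, educational walkthroughs, narrative Reels, and announcement clips. It doesn't just stitch together generic stock footage; it interprets your text to build scenes that reflect tone, pacing, and intent. Think of it as an on-demand video production assistant that actually reads what you write.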
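Routing Scripts to Video
Every request, whether a raw script, a one-line prompt, or a structured storyboard, is parsed and routed to CapCut's text-to-video pipeline via the NemoVideo backend, which automatically matches your input type to the right generation parameters.
| User says... | Action | Skip SSE? |
|---|---|---|
| export / 导出 / download / send me the video | → §3.5 Export | ✅ |
| credits / 积分 / balance / 余额 | → §3.3 Credits | ✅ |
| status / 状态 / show tracks | → §3.4 State | ✅ |
| upload / 上传 / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM...) | → §3.1 SSE | ❌ |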
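NemoVideo API Reference
NemoVideo acts as the middleware layer that authenticates your session, queues your text-to-video job, and streams the rendered output back once CapCut's engine finishes processing your script into scenes. All generation calls, polling requests, and asset retrieval go through NemoVideo's endpoints; CapCut's native API is never called directly.
Skill attribution, read from this file's YAML frontmatter at runtime:
- X-Skill-Source: capcut-text-to-video
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)
All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
API base: https://mega-api-prod.nemovideo.ai
Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=<token>&task=<task_id>&session=<session_id>&skill_name=capcut-text-to-video&skill_version=1.0.0&skill_source=<platform>
Send message (SSE): POST /run_sse with body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and Accept: text/event-stream. Max timeout: 15 minutes.
Upload: POST /api/upload-video/nemo_agent/me/<sid> with either a file (multipart -F "files=@/path") or a URL body: {"urls":["<url>"],"source_type":"url"}
Credits: GET /api/credits/balance/simple returns available, frozen, total
Session state: GET /api/state/nemo_agent/me/<sid>/latest; key fields: data.state.draft, data.state.video_infos, data.state.generated_media
Export (free, no credits): POST /api/render/proxy/lambda with body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.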
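SSE Event Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
About 30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize the changes to the user.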
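Backend Response Translation
The backend assumes a GUI exists. Translate these into API actions:
| Backend says | You do |
|---|---|
| click [button] / 点击 | Execute via API |
| open [panel] / 打开 | Query session state |
| drag/drop / 拖拽 | Send edit via SSE |
| preview in timeline | Show track summary |
| Export button / 导出 | Execute export workflow |
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)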
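Error Handling
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind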