Getting Started

Welcome — you're one text prompt away from a generated video. Describe your scene, concept, or story and this text-to-video generator will build it into visual content for you. Drop your prompt below to get started.

Try saying:

- "Generate a 30-second promotional video for a coffee brand using a warm, cinematic morning aesthetic with soft lighting and close-up shots of a steaming cup"
- "Create a short explainer video about how solar panels work, aimed at middle school students, using simple animations and an upbeat visual style"
- "Turn this product description into a 15-second social media video clip with bold text overlays and high-energy pacing: 'Our new running shoe is built for speed, comfort, and all-terrain grip'"

First-Time Connection

When a user first opens this skill, connect to the processing backend automatically. Briefly let them know (e.g. "Setting up...").

Authentication: Check if NEMO_TOKEN is set in the environment. If it is, skip to step 2.

1. Obtain a free token: Generate a random UUID as client identifier. POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with header X-Client-Id set to that UUID. The response data.token is your NEMO_TOKEN — 100 free credits, valid 7 days.
2. Create a session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Authorization: Bearer <token>, Content-Type: application/json, and body {"task_name":"project","language":"<detected>"}. Store the returned session_id for all subsequent requests.

Keep setup communication brief. Don't display raw API responses or token values to the user.

From Words on a Page to Video That Moves

Most video creation tools assume you already have footage. This one doesn't. The text-to-video-generator-ai skill starts with nothing but your words — a sentence, a paragraph, a creative brief — and builds video content around them from scratch.

Describe a product launch, a short story, a social media ad, or an explainer concept, and the skill interprets your intent, selects appropriate visual styles, and assembles a coherent video sequence. You control the tone, the subject matter, and the narrative arc. The skill handles the visual translation.

This is especially useful for teams that move fast and need video assets without a full production pipeline. Marketers can prototype ad concepts before committing to a shoot. Educators can generate illustrative clips for lessons. Indie creators can visualize scripts before filming. Whatever your workflow, this skill fits in as the step between idea and execution.

Prompt Routing and Model Dispatch

Each text prompt you submit is parsed for scene complexity, motion descriptors, and style tags before being dispatched to a diffusion model pipeline for rendering. Incoming user messages are first routed by intent:

User says...	Action	Skip SSE?
"export" / "导出" / "download" / "send me the video"	→ §3.5 Export	✅
"credits" / "积分" / "balance" / "余额"	→ §3.3 Credits	✅
"status" / "状态" / "show tracks"	→ §3.4 State	✅
"upload" / "上传" / user sends file	→ §3.2 Upload	✅
Everything else (generate, edit, add BGM…)	→ §3.1 SSE	❌
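The routing table can be approximated with a keyword lookup. This hypothetical sketch uses naive substring matching on the listed trigger phrases, so a word like "exported" would also match. In practice, the agent's own reading of intent may be more reliable.

```python
from typing import Tuple

# Trigger phrases copied from the routing table; checked in order, first match wins.
ROUTES = [
    (("export", "导出", "download", "send me the video"), "§3.5 Export"),
    (("credits", "积分", "balance", "余额"), "§3.3 Credits"),
    (("status", "状态", "show tracks"), "§3.4 State"),
    (("upload", "上传"), "§3.2 Upload"),
]


def route(message: str, has_file: bool = False) -> Tuple[str, bool]:
    """Return (action, skip_sse) for one user message."""
    if has_file:  # a user-sent file always goes to Upload
        return "§3.2 Upload", True
    text = message.lower()
    for keywords, action in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, True
    # Everything else (generate, edit, add BGM...) goes through the SSE channel.
    return "§3.1 SSE", False
```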

Video Synthesis API Reference

All video generation jobs run on distributed GPU clusters via an asynchronous cloud rendering backend, with frame synthesis, temporal coherence processing, and output encoding handled server-side. Your generated video assets are stored in a secure session bucket and delivered via signed CDN URL upon job completion.

Skill attribution (read from this file's YAML frontmatter at runtime):

- X-Skill-Source: text-to-video-generator-ai
- X-Skill-Version: from frontmatter version
- X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
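This sketch shows one way to assemble the attribution headers. `detect_platform` follows the install-path rules above. `read_frontmatter_version` is a deliberately tiny reader: it only looks for a `version:` line between the leading `---` fences and does not parse full YAML. Its `"unknown"` fallback is our assumption, not documented behavior.

```python
import os
from typing import Dict

SKILL_SOURCE = "text-to-video-generator-ai"


def detect_platform(install_path: str) -> str:
    """Map the skill's install path to an X-Skill-Platform value."""
    home = os.path.expanduser("~")
    path = os.path.expanduser(install_path)
    if path.startswith(os.path.join(home, ".clawhub") + os.sep):
        return "clawhub"
    if path.startswith(os.path.join(home, ".cursor", "skills") + os.sep):
        return "cursor"
    return "unknown"


def read_frontmatter_version(text: str) -> str:
    """Find `version:` between the leading '---' fences (not a full YAML parser)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "unknown"  # assumption: fallback when there is no frontmatter
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "version":
            return value.strip().strip("'\"")
    return "unknown"


def build_headers(token: str, version: str, install_path: str) -> Dict[str, str]:
    """Headers every request must carry, per the API reference."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": SKILL_SOURCE,
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```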

API base: https://mega-api-prod.nemovideo.ai

Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id.

Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
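Sending a message and reading the stream might look like the sketch below. The body and `Accept` header come from the reference. The `Content-Type` header is an assumption. `iter_sse_data` is a small generic Server-Sent Events reader that groups `data:` lines until a blank line, and it assumes nothing about what each payload contains.

```python
import json
import urllib.request
from typing import Iterable, Iterator

API_BASE = "https://mega-api-prod.nemovideo.ai"


def build_sse_request(token: str, session_id: str, text: str) -> urllib.request.Request:
    """POST /run_sse with the message body from the reference; asks for an event stream."""
    body = {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }
    return urllib.request.Request(
        f"{API_BASE}/run_sse",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "text/event-stream",
        },
    )


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield each event's `data:` payload; events end at a blank line, other fields are ignored."""
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            if buf:
                yield "\n".join(buf)
                buf = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips exactly one leading space after the colon.
            buf.append(value[1:] if value.startswith(" ") else value)
    if buf:
        yield "\n".join(buf)
```

To consume a live stream, open the request with `urllib.request.urlopen(build_sse_request(token, sid, "Add lo-fi BGM"), timeout=900)`, decode each response line, and pass the lines to `iter_sse_data`. A 900-second timeout mirrors the 15-minute limit, although `urlopen`'s timeout applies to individual socket operations rather than to the whole stream.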

Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}
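Both upload variants can be sketched with the standard library. The file variant hand-builds a `multipart/form-data` body with a single part named `files`, mirroring `curl -F "files=@/path"`. Several details are assumptions: the `application/octet-stream` part type, the helper names, and the `{"urls": [...], "source_type": "url"}` body shape for the URL variant. This sketch also does not escape filenames that contain quotes.

```python
import json
import urllib.request
import uuid
from typing import List

API_BASE = "https://mega-api-prod.nemovideo.ai"


def upload_url(session_id: str) -> str:
    return f"{API_BASE}/api/upload-video/nemo_agent/me/{session_id}"


def build_url_upload(token: str, session_id: str, urls: List[str]) -> urllib.request.Request:
    """Ask the backend to fetch remote media itself (assumed JSON body)."""
    body = json.dumps({"urls": list(urls), "source_type": "url"}).encode("utf-8")
    return urllib.request.Request(
        upload_url(session_id),
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


def build_file_upload(token: str, session_id: str, filename: str, content: bytes) -> urllib.request.Request:
    """Hand-rolled multipart/form-data body with one part named `files`."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        upload_url(session_id),
        data=head + content + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```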

Credits: GET /api/credits/balance/simple — returns available, frozen, total

Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media
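The credits and session-state calls are plain GETs. This sketch builds both requests and extracts the fields named above. The reference does not say whether the balance fields sit under a `data` wrapper, so `extract_balance` accepts either shape (an assumption). Missing state fields come back as `None`.

```python
import urllib.request
from typing import Any, Dict

API_BASE = "https://mega-api-prod.nemovideo.ai"


def build_get(token: str, path: str) -> urllib.request.Request:
    """Plain authenticated GET; add the X-Skill-* headers in real use."""
    return urllib.request.Request(
        f"{API_BASE}{path}", method="GET", headers={"Authorization": f"Bearer {token}"}
    )


def credits_request(token: str) -> urllib.request.Request:
    return build_get(token, "/api/credits/balance/simple")


def state_request(token: str, session_id: str) -> urllib.request.Request:
    return build_get(token, f"/api/state/nemo_agent/me/{session_id}/latest")


def extract_balance(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Assumption: fields may be wrapped in "data"; fall back to the top level otherwise."""
    body = resp["data"] if isinstance(resp.get("data"), dict) else resp
    return {key: body.get(key) for key in ("available", "frozen", "total")}


def extract_state(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Pull data.state.draft / video_infos / generated_media; missing keys become None."""
    state = (resp.get("data") or {}).get("state") or {}
    return {key: state.get(key) for key in ("draft", "video_infos", "generated_media")}
```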

Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
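The export flow can be sketched as follows: build the render request, then poll until `status` is `completed` and read `output.url`. The caller passes the full header set, including the X-Skill-* attribution headers, because export fails with 402 without them. The following are all assumptions: using Unix seconds for `<ts>`, reading `status` and `output` from the top level of the poll response, and the `max_polls` cap (60 polls at 30 s, about 30 minutes). The reference does not describe a failure status.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Optional, Tuple

API_BASE = "https://mega-api-prod.nemovideo.ai"


def build_export_request(
    headers: Dict[str, str], session_id: str, draft: dict, ts: Optional[int] = None
) -> Tuple[str, urllib.request.Request]:
    """Return (render_id, request) for POST /api/render/proxy/lambda."""
    render_id = f"render_{ts if ts is not None else int(time.time())}"
    body = {
        "id": render_id,
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }
    req = urllib.request.Request(
        f"{API_BASE}/api/render/proxy/lambda",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={**headers, "Content-Type": "application/json"},
    )
    return render_id, req


def http_status_fetcher(headers: Dict[str, str]) -> Callable[[str], dict]:
    """Build a fetcher for GET /api/render/proxy/lambda/<id>."""
    def fetch(render_id: str) -> dict:
        req = urllib.request.Request(
            f"{API_BASE}/api/render/proxy/lambda/{render_id}", headers=headers
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)
    return fetch


def poll_until_complete(
    fetch_status: Callable[[str], dict],
    render_id: str,
    interval: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll every `interval` seconds until status == "completed"; return output.url."""
    for _ in range(max_polls):
        job = fetch_status(render_id)
        if job.get("status") == "completed":
            return job["output"]["url"]
        sleep(interval)
    raise TimeoutError(f"{render_id} not completed after {max_polls} polls")
```

`fetch_status` and `sleep` are injectable, so the polling logic can be tested without a network. In normal use, pass `http_status_fetcher(headers)`.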

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
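Checking an extension before upload is a short lookup against the list above. The grouping into video, image, and audio is ours. Only `jpg` appears in the list, so it is unknown whether `.jpeg` is accepted, and this sketch rejects it.

```python
from pathlib import Path
from typing import Optional

# Extensions from the supported-formats list; the grouping by media kind is ours.
SUPPORTED_FORMATS = {
    "video": {"mp4", "mov", "avi", "webm", "mkv"},
    "image": {"jpg", "png", "gif", "webp"},
    "audio": {"mp3", "wav", "m4a", "aac"},
}


def media_kind(filename: str) -> Optional[str]:
    """Return "video", "image", or "audio" for a supported file, else None."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for kind, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return kind
    return None
```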

SSE Event Handling

Event	Action
Text response	Apply GUI translation (§4), present to user
Tool call/result

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
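The no-text fallback works like this: collect any text parts from the stream, and if there are none, fetch session state and summarize it. The event shape assumed here (`content.parts[].text`, mirroring the request's `new_message.parts`) is a guess, and so is the one-line summary. `fetch_state` stands in for the session-state GET.

```python
import json
from typing import Any, Callable, Dict, Iterable


def collect_text(payloads: Iterable[str]) -> str:
    """Concatenate text parts from SSE payloads; the content.parts[].text shape is assumed."""
    texts = []
    for payload in payloads:
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip non-JSON payloads such as keep-alives
        if not isinstance(event, dict):
            continue
        for part in (event.get("content") or {}).get("parts") or []:
            if isinstance(part, dict) and part.get("text"):
                texts.append(part["text"])
    return "".join(texts)


def reply_or_verify(payloads: Iterable[str], fetch_state: Callable[[], Dict[str, Any]]) -> str:
    """Return the streamed text, or fall back to checking session state when there is none."""
    text = collect_text(payloads)
    if text:
        return text
    state = (fetch_state().get("data") or {}).get("state") or {}
    tracks = (state.get("draft") or {}).get("t") or []
    return f"The edit produced no text reply; the current draft has {len(tracks)} track(s)."
```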

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

Backend says	You do
"click [button]" / "点击"	Execute via API
"open [panel]" / "打开"

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

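Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)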
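This sketch decodes the compact draft keys into a readable summary. It assumes `t` is a list of tracks, each with a `tt` code and an `sg` list of segments that carry `d` in milliseconds. It ignores the metadata field `m` because its contents are not described.

```python
from typing import List

# Track-type codes from the draft field mapping; unknown codes are shown numerically.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def describe_draft(draft: dict) -> List[str]:
    """Turn compact draft keys (t, tt, sg, d) into a readable per-track summary."""
    tracks = draft.get("t") or []
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for i, track in enumerate(tracks, start=1):
        code = track.get("tt")
        kind = TRACK_TYPES.get(code, f"type {code}")
        segments = track.get("sg") or []
        # Sum of segment durations in seconds; ignores gaps and overlaps between segments.
        seconds = sum(seg.get("d", 0) for seg in segments) / 1000
        lines.append(f"{i}. {kind}: {len(segments)} segment(s), {seconds:g}s")
    return lines
```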

Error Handling

Code	Meaning	Action
0	Success	Continue
1001

Quick Start Guide

Getting your first video generated is straightforward. Start by writing a clear, specific text prompt describing what you want the video to show. Include details like subject matter, visual tone, intended audience, and approximate length if you have a target in mind.

For example, instead of typing 'make a video about dogs,' try 'create a 20-second upbeat video about golden retriever puppies playing in a park, suitable for a pet adoption campaign.' The more context you provide, the closer the output will match your vision.

Once you submit your prompt, the skill processes your description and returns a generated video or a preview sequence. You can then refine by adjusting your prompt — tightening the mood, changing the pacing description, or specifying a different visual style. Iteration is fast, so don't hesitate to run multiple variations.

Performance Notes

Video generation quality scales directly with prompt specificity. Vague prompts produce generic results; detailed prompts produce targeted, usable content. If your first output feels off-brand or misaligned, the most effective fix is almost always a more descriptive prompt rather than regenerating with the same input.

Longer videos with complex scene transitions take more processing time than short, single-scene clips. If speed matters, break a longer concept into shorter segments and generate them individually before combining.

Text-heavy scenes — like those with on-screen titles, captions, or data visualizations — benefit from explicitly stating font style preferences and placement in your prompt. Without guidance, the skill defaults to standard visual layouts, which may not match your brand guidelines.

Best Practices

Lead with the end goal. Before describing visuals, state the purpose of the video — is it to sell, educate, entertain, or inspire? This framing shapes how the skill interprets everything that follows.

Use reference anchors when possible. Phrases like 'cinematic documentary style,' 'fast-cut social media reel,' or 'minimalist whiteboard animation' give the generator clear stylistic targets that dramatically improve output consistency.

Avoid overloading a single prompt with too many competing ideas. If your concept has multiple distinct segments — an intro, a product demo, and a call-to-action — describe each section separately or structure your prompt with clear scene breaks. This produces cleaner transitions and more coherent pacing across the full video.
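To apply these practices consistently, you can assemble prompts from parts: the goal first, then a style anchor, then one line per scene. The `Scene` type and the line format are hypothetical conventions, not an input format the skill requires. The skill still receives plain text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    label: str        # e.g. "intro", "product demo", "call-to-action"
    description: str  # what the viewer should see
    seconds: int      # target duration for this scene


def build_prompt(goal: str, style: str, audience: str, scenes: List[Scene]) -> str:
    """Goal first, then a style anchor, then one line per scene with an explicit break."""
    total = sum(scene.seconds for scene in scenes)
    lines = [
        f"Goal: {goal}",
        f"Style: {style}",
        f"Audience: {audience}",
        f"Length: about {total} seconds",
    ]
    for i, scene in enumerate(scenes, start=1):
        lines.append(f"Scene {i} ({scene.label}, {scene.seconds}s): {scene.description}")
    return "\n".join(lines)
```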
