Auto Subtitle Video — Add Subtitles to Video Automatically
You have a video. It needs subtitles. That is the entire problem, and it should take less time to solve than it took to read this sentence. Traditional subtitle workflows make it feel like filing taxes instead: transcribe the audio by ear (5-10x the video duration); time each caption to the exact word (another 2-3x); format the SRT file without breaking the timestamp syntax (30 minutes of debugging colons vs. commas); choose a font that is readable on every background in the video (15 minutes of second-guessing); position the captions where platform UI won't hide them (TikTok, YouTube, and Instagram all have different safe zones); and render, hoping the export settings don't break the subtitle encoding.

NemoVideo replaces this entire workflow with a single action: upload a video, receive it back with subtitles. The AI handles transcription (98% accuracy across 90+ languages), timing (word-level precision, with each word synced to the millisecond it is spoken), styling (platform-appropriate fonts, colors, and positioning), and rendering (burned into the video or exported as an SRT/VTT sidecar). The creator's only job is reviewing the output and clicking publish.
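The "colons vs. commas" problem comes from the SRT timing line, which separates hours, minutes, and seconds with colons but puts a comma before the milliseconds. A minimal sketch of a formatter follows; the function names are illustrative and not part of NemoVideo.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: index line, timing line, then the caption text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


print(srt_cue(1, 0.0, 2.5, "You have a video."))
```

Working in integer milliseconds avoids the rounding drift that shows up when hours, minutes, and seconds are each computed from a float.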
Use Cases
- Quick Subtitle: Upload and Done (any length). A creator finishes editing a 60-second Reel and needs captions before posting. NemoVideo processes the video in seconds: it transcribes the audio, generates word-by-word animated captions in bold white with a black outline, positions them in the Instagram safe zone, and returns the captioned video ready to upload. No configuration is needed because the defaults are optimized for social media.
- YouTube Tutorial with Clean Subtitles (10-30 min). A coding tutorial needs professional captions that don't distract from the screen share. NemoVideo uses a smaller font (36px) on a semi-transparent dark background bar, positioned at the bottom without overlapping the code editor, and transcribes technical terminology accurately (function names, library names, error messages). An SRT file is exported alongside for YouTube's closed-caption system.
- Interview with Speaker Labels (5-20 min). A two-person interview for a company blog. NemoVideo detects both speakers by voice, labels the captions ("Sarah, CEO:" / "Interviewer:"), and color-codes each speaker's text. The viewer always knows who is speaking, even when both speakers are off-screen during B-roll cutaways.
- Social Media Batch: 10 Videos at Once. A social media manager has 10 short-form videos due this week. NemoVideo batch-processes all 10 with consistent caption styling (same font, color, position), an individual SRT file for each, and burned-in versions ready for scheduling. What would take 3-4 hours of manual captioning is done while the manager works on something else.
- Event Keynote: Multilingual Captions (30-60 min). A tech conference publishes speaker recordings. NemoVideo transcribes the English keynote and generates subtitle tracks in English, Spanish, Mandarin, Japanese, and French. Each language is exported both as burned-in video (for social media clips) and as SRT (for the conference's video-on-demand platform with language switching).
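The use cases above differ mainly in which optional parameters they set. The sketch below builds request payloads for three of them using the parameter names from the Parameters table. The prompts are made up for illustration, and passing the same language codes to `translate_to` as to `language` is an assumption.

```python
import json

# Fields shared by every request in this sketch.
BASE = {"skill": "auto-subtitle-video", "burn_in": True, "srt_export": True}


def make_payload(prompt: str, **overrides) -> dict:
    """Merge use-case-specific settings over the shared base fields."""
    return {**BASE, "prompt": prompt, **overrides}


# YouTube tutorial: smaller text on a background bar, landscape video.
tutorial = make_payload(
    "Clean captions for a coding tutorial screen share.",
    style="clean-bar",
    font_size=36,
    position="bottom-center",
    format="16:9",
)

# Interview: ask for speaker identification.
interview = make_payload(
    "Two-person interview; label and color-code each speaker.",
    speaker_labels=True,
)

# Keynote: English source with translated subtitle tracks.
keynote = make_payload(
    "English keynote with translated subtitle tracks.",
    language="en",
    translate_to=["es", "zh", "ja", "fr"],  # assumed to accept the same codes as `language`
)

print(json.dumps(keynote, indent=2))
```

Keeping the shared fields in one place is also a simple way to get the consistent styling that the batch use case describes.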
How It Works
Step 1 — Upload Video
Drag and drop or provide a URL. Any format, any duration. NemoVideo detects the language automatically.
Step 2 — Customize (Optional)
The defaults work for most social media use cases. Customize only when you need a specific font, custom colors, translation, speaker labels, or sidecar-only export.
Step 3 — Generate
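```bash
curl -X POST https://mega-api-prod.nemovideo.ai/api/v1/generate \
  -H "Authorization: Bearer $NEMO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "skill": "auto-subtitle-video",
    "prompt": "Add subtitles to a 2-minute Instagram Reel. Auto-detect language (English). Word-by-word highlight animation: active word in yellow (#FBBF24), base text white with a 2px black outline. Font: bold sans-serif, 44px. Position: bottom 20% (Instagram safe zone). Remove filler words. Burn into the video and also export SRT.",
    "auto_detect": true,
    "style": "word-highlight",
    "burn_in": true,
    "srt_export": true,
    "filler_filter": true,
    "format": "9:16"
  }'
```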
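For scripted workflows, the same request can be prepared in Python. This is a minimal sketch using only the standard library. It assumes the generate endpoint `https://mega-api-prod.nemovideo.ai/api/v1/generate` and a bearer token in the `NEMO_TOKEN` environment variable; the `build_request` helper is illustrative, and the network call is left commented out.

```python
import json
import os
import urllib.request

API_URL = "https://mega-api-prod.nemovideo.ai/api/v1/generate"


def build_request(payload: dict, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request to the generate endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "skill": "auto-subtitle-video",
    "prompt": "Add subtitles to a 2-minute Instagram Reel.",
    "auto_detect": True,
    "style": "word-highlight",
    "burn_in": True,
    "srt_export": True,
    "filler_filter": True,
    "format": "9:16",
}

request = build_request(payload, os.environ.get("NEMO_TOKEN", "your-token-here"))

# To actually submit the job, uncomment:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```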
Step 4 — Review and Post
Preview. NemoVideo flags any low-confidence words for quick correction. Export and upload.
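If you work from the word-level JSON export rather than the preview, the same review step can be approximated by filtering on confidence. The schema below (a list of objects with `word`, `start`, `end`, and `confidence`) and the 0.85 threshold are assumptions for illustration; check the actual export before relying on them.

```python
def low_confidence_words(words: list[dict], threshold: float = 0.85) -> list[dict]:
    """Return the words whose confidence falls below the threshold, in spoken order."""
    return [w for w in words if w["confidence"] < threshold]


# Hypothetical word-level transcript entries.
words = [
    {"word": "Install", "start": 0.00, "end": 0.42, "confidence": 0.99},
    {"word": "NumPy", "start": 0.42, "end": 0.90, "confidence": 0.71},
    {"word": "first", "start": 0.90, "end": 1.25, "confidence": 0.97},
]

for w in low_confidence_words(words):
    print(f'{w["start"]:.2f}s  {w["word"]}  ({w["confidence"]:.0%})')
```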
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `prompt` | string | ✅ | Describe the video and caption preferences |
| `auto_detect` | boolean | | Auto-detect spoken language (default: true) |
| `language` | string | | Force language: "en", "es", "fr", "de", "ja", "zh" |
| `translate_to` | array | | Target languages for translation |
| `style` | string | | "word-highlight", "full-sentence", "clean-bar", "minimal", "karaoke" |
| `font` | string | | "bold-sans", "helvetica-light", "monospace", "serif" |
| `font_size` | integer | | Size in pixels (default: 44) |
| `text_color` | string | | Base text hex (default: "#FFFFFF") |
| `highlight_color` | string | | Active word hex (default: "#FBBF24") |
| `position` | string | | "bottom-20", "bottom-center", "center", "top" |
| `filler_filter` | boolean | | Remove um/uh/like (default: false) |
| `speaker_labels` | boolean | | Identify speakers (default: auto) |
| `burn_in` | boolean | | Render into video (default: true) |
| `srt_export` | boolean | | Export SRT sidecar (default: true) |
| `batch` | boolean | | Process multiple videos with consistent styling (default: false) |
| `format` | string | | "16:9", "9:16", "1:1" |
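Because several parameters accept only a fixed set of values, it can help to check a payload locally before sending it. The sketch below is a client-side convenience, not part of the NemoVideo API. The allowed values are copied from the table above, and the 6-digit hex check is an assumption based on the documented defaults.

```python
import re

# Allowed values copied from the Parameters table.
ALLOWED = {
    "style": {"word-highlight", "full-sentence", "clean-bar", "minimal", "karaoke"},
    "font": {"bold-sans", "helvetica-light", "monospace", "serif"},
    "position": {"bottom-20", "bottom-center", "center", "top"},
    "format": {"16:9", "9:16", "1:1"},
}
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")  # assumes 6-digit hex, like the defaults


def validate(params: dict) -> list[str]:
    """Return a list of problems found in a request payload (empty if none)."""
    errors = []
    if not params.get("prompt"):
        errors.append("prompt is required")
    for key, allowed in ALLOWED.items():
        if key in params and params[key] not in allowed:
            errors.append(f"{key} must be one of {sorted(allowed)}")
    for key in ("text_color", "highlight_color"):
        if key in params and not HEX_COLOR.fullmatch(str(params[key])):
            errors.append(f"{key} must be a hex color such as #FFFFFF")
    size = params.get("font_size")
    if size is not None and (not isinstance(size, int) or isinstance(size, bool) or size <= 0):
        errors.append("font_size must be a positive integer")
    return errors


print(validate({"prompt": "Caption my Reel", "style": "karaoke", "font_size": 44}))
print(validate({"style": "comic-sans", "text_color": "white"}))
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue in one pass.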
Output Example
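```json
{
  "job_id": "asv-20260328-001",
  "status": "completed",
  "duration_seconds": 122,
  "format": "mp4",
  "resolution": "1080x1920",
  "file_size_mb": 28.4,
  "transcription": {
    "language": "en",
    "confidence": 0.979,
    "word_count": 312,
    "filler_words_removed": 8
  },
  "outputs": {
    "burned_in": "reel-subtitled.mp4",
    "srt": "reel-en.srt"
  },
  "processing_time_seconds": 6.2
}
```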
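A completed job response can be reduced to a few numbers that are useful during review. The sketch below parses a trimmed copy of this skill's example response. The underscore-separated key names (for example `processing_time_seconds`) are an assumption about the real field names, and `summarize` is an illustrative helper.

```python
import json

# Trimmed copy of the example response; key names are assumed.
RESPONSE = """
{
  "job_id": "asv-20260328-001",
  "status": "completed",
  "duration_seconds": 122,
  "transcription": {"language": "en", "confidence": 0.979, "word_count": 312},
  "outputs": {"burned_in": "reel-subtitled.mp4", "srt": "reel-en.srt"},
  "processing_time_seconds": 6.2
}
"""


def summarize(result: dict) -> dict:
    """Derive a few review-friendly numbers from a completed job response."""
    if result["status"] != "completed":
        raise ValueError(f"job {result['job_id']} is not finished: {result['status']}")
    minutes = result["duration_seconds"] / 60
    return {
        "speed_vs_realtime": round(
            result["duration_seconds"] / result["processing_time_seconds"], 1
        ),
        "words_per_minute": round(result["transcription"]["word_count"] / minutes),
        "files": sorted(result["outputs"].values()),
    }


summary = summarize(json.loads(RESPONSE))
print(summary)
```

For the example values, a 122-second video processed in 6.2 seconds works out to roughly 19.7 times faster than real time.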
Tips
- The default settings are optimized for social media. They use bold white text, a black outline, bottom-safe-zone positioning, and word-by-word highlight. If you're posting to TikTok, Reels, or Shorts, the defaults produce professional results without any customization.
- Batch processing saves hours for content teams. Ten videos are processed simultaneously with consistent styling: every caption uses the same font, size, color, and position, which keeps the brand visually consistent across all content.
- Filler word removal makes speakers sound polished. "Um," "uh," "you know," and "like" are more noticeable in captions than in audio. Removing them makes the speaker appear more articulate without changing what the viewer hears.
- Use a semi-transparent bar for busy backgrounds. When the video has varying backgrounds (outdoor footage, screen shares, product shots), a dark semi-transparent bar behind the text keeps it readable everywhere. Outline-only captions disappear on bright scenes.
- Always export SRT alongside burned-in captions. Social platforms can index SRT text for search discovery, while burned-in captions ensure the viewer sees them regardless of player settings. The two serve different purposes, and one request generates both.
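To illustrate what filler removal does to caption text, here is a deliberately small sketch. It is not NemoVideo's implementation: it only strips "um" and "uh" with a regular expression and leaves "like" and "you know" alone, since those are often meaningful words that need context to classify.

```python
import re

# Matches "um"/"uh" (any length, any case) plus the commas that set them off.
FILLERS = re.compile(r",?\s*\b(?:um+|uh+)\b,?", re.IGNORECASE)


def strip_fillers(caption: str) -> str:
    """Remove um/uh from a caption, tidy spacing, and re-capitalize the first letter."""
    cleaned = FILLERS.sub("", caption)
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
    return cleaned[:1].upper() + cleaned[1:]


print(strip_fillers("Um, so the first step is, uh, to upload the video."))
```

The word boundaries keep words such as "umbrella" intact, which is the main pitfall of a naive string replacement.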
Output Formats
| Format | Description | Use Case |
|---|---|---|
| MP4 (burned-in) | Captions rendered into video pixels | Social media direct upload |
| SRT | Time-coded subtitle file | YouTube / LinkedIn / LMS upload |
| VTT | Web Video Text Tracks | Website player / accessibility |
| JSON | Word-level transcript + timestamps | Developer integration / search |
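SRT and VTT are close relatives: a WebVTT file starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds. If you only have the SRT export and need VTT for a web player, a conversion can be sketched as follows. This handles plain cues only and is not a full parser.

```python
import re

TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header and use '.' before milliseconds."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:  # only touch timing lines, never caption text
            line = TIMING.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


srt = (
    "1\n00:00:00,000 --> 00:00:02,500\nYou have a video.\n\n"
    "2\n00:00:02,500 --> 00:00:04,000\nIt needs subtitles.\n"
)
vtt = srt_to_vtt(srt)
print(vtt)
```

The numeric cue indexes are kept because WebVTT allows an optional identifier line before each timing line.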