Video Audio Replace

Replace a video's original audio with TTS-generated voice while maintaining precise timing alignment. Also supports generating subtitles from video using Whisper.

Full Workflow

Step 1: Generate subtitles from video (optional)

If you don't have an SRT file, generate one from the video using the included script:

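# Generate subtitles from the video (uses faster-whisper; free, runs locally)
generate_subtitles.py video.mp4 -o subtitles.srt -l zh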

Or manually with Python:

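# Use faster-whisper (recommended; runs locally, free)
pip install faster-whisper srt

python3 << 'EOF'
from faster_whisper import WhisperModel
import srt
from datetime import timedelta

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("input_video.mp4", language="zh")

# Build the SRT content
def format_time(seconds):
    # Convert seconds to an SRT timestamp (HH:MM:SS,mmm)
    td = timedelta(seconds=seconds)
    return f"{td.seconds//3600:02d}:{(td.seconds%3600)//60:02d}:{td.seconds%60:02d},{td.microseconds//1000:03d}"

srt_content = ""
for i, seg in enumerate(segments, 1):
    start = format_time(seg.start)
    end = format_time(seg.end)
    srt_content += f"{i}\n{start} --> {end}\n{seg.text.strip()}\n\n"

with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)
EOF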

Step 2: Replace audio with TTS

Use the generated SRT to create a new video with TTS voice.
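Step 2 depends on reading cue timings back out of the SRT file. The following is a minimal sketch of that parsing step, not the tool's actual implementation; the names Cue, to_seconds, and parse_srt are hypothetical and introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


# Matches SRT timestamps such as 00:01:02,345 (a dot separator is tolerated).
_TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp to seconds."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    content = content.lstrip("\ufeff").replace("\r\n", "\n")
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = (to_seconds(part) for part in lines[1].split("-->"))
        text = "\n".join(lines[2:]).strip()
        cues.append(Cue(int(lines[0]), start, end, text))
    return cues
```

Each cue's `end - start` gives the time slot that the generated speech for that cue has to fit into.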

When to use

- Dubbing videos with AI-generated voice
- Converting subtitle files to voice-over
- Creating multilingual video versions

Requirements

API Keys (choose one)

- ElevenLabs: Set ELEVENLABS_API_KEY environment variable
- Edge TTS (free, no key needed): Use --engine edge
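The choice between the two engines can be expressed as a small check on the environment. This is a hedged sketch only: the function name choose_engine and the engine label "elevenlabs" are assumptions (the source only shows `--engine edge` and the ELEVENLABS_API_KEY variable).

```python
import os


def choose_engine(requested=None, env=None):
    """Pick a TTS engine name, failing early when a required key is missing.

    `requested` mirrors the value of --engine (None when the flag is omitted).
    The label "elevenlabs" is an assumed name for the default engine.
    """
    env = os.environ if env is None else env
    if requested == "edge":
        # Edge TTS needs no API key.
        return "edge"
    if not env.get("ELEVENLABS_API_KEY", "").strip():
        raise RuntimeError(
            "ELEVENLABS_API_KEY is not set; export it or pass --engine edge"
        )
    return "elevenlabs"
```

Failing before any audio is generated avoids discovering a missing key halfway through a long video.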

System dependencies

- ffmpeg
- sox (optional, for advanced processing)
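A quick way to confirm these tools are on the PATH before running the workflow is sketched below. The helper name check_dependencies is hypothetical; it only uses `shutil.which` from the standard library.

```python
import shutil


def check_dependencies(which=shutil.which):
    """Return (missing_required, missing_optional) tool names.

    `which` is injectable so the check can be exercised without touching PATH.
    """
    required = ["ffmpeg"]
    optional = ["sox"]
    missing_required = [tool for tool in required if which(tool) is None]
    missing_optional = [tool for tool in optional if which(tool) is None]
    return missing_required, missing_optional


if __name__ == "__main__":
    missing, optional_missing = check_dependencies()
    if missing:
        raise SystemExit(f"Missing required tools: {', '.join(missing)}")
    if optional_missing:
        print(f"Optional tools not found: {', '.join(optional_missing)}")
```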

Usage

Basic usage (ElevenLabs)

video-audio-replace --video input.mp4 --srt subtitles.srt --output output.mp4 --voice Liam

Using Edge TTS (free, no API key)

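video-audio-replace --video input.mp4 --srt subtitles.srt --output output.mp4 --engine edge --voice zh-CN-YunxiNeural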

Options

Option	Description	Default
--video	Input video file	Required
--srt
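The flags that appear in the usage examples can be modelled with `argparse`, as in the sketch below. Only `--video` is documented as required; treating `--srt` and `--output` as required, the default engine name "elevenlabs", and the absence of a default voice are all assumptions made for illustration, since the table above is incomplete.

```python
import argparse


def build_parser():
    """Build a parser for the flags seen in the usage examples."""
    parser = argparse.ArgumentParser(prog="video-audio-replace")
    parser.add_argument("--video", required=True, help="Input video file")
    # Assumption: the SRT file and output path are also mandatory.
    parser.add_argument("--srt", required=True, help="Subtitle file (SRT)")
    parser.add_argument("--output", required=True, help="Output video file")
    # Assumption: ElevenLabs is the default engine and is named "elevenlabs".
    parser.add_argument("--engine", choices=["elevenlabs", "edge"],
                        default="elevenlabs", help="TTS engine")
    # Assumption: no default voice is known, so it is left unset.
    parser.add_argument("--voice", default=None, help="Voice name")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```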

Examples

English voice (ElevenLabs)

video-audio-replace --video 2028.mp4 --srt 2028.srt --output 2028_final.mp4 --voice Liam

Chinese voice (Edge TTS)

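video-audio-replace --video video.mp4 --srt subs.srt --output result.mp4 --engine edge --voice zh-CN-YunxiNeural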

How it works

1. Extract original audio from video
2. Split audio into segments based on subtitle timestamps
3. Generate TTS audio for each subtitle segment
4. Adjust TTS speed (within 0.85x to 1.15x) to match the original segment duration
5. Add silence padding to fill any remaining time gap
6. Merge all segments, preserving the original timing gaps
7. Replace the video audio with the aligned TTS audio
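The speed adjustment and padding steps above come down to a little arithmetic per segment. The sketch below shows one way to compute it; Fit and fit_segment are hypothetical names, and how the tool treats speech that still overruns its slot at the maximum speed is not stated in the source, so the sketch only reports that overflow.

```python
from dataclasses import dataclass

MIN_SPEED = 0.85  # slowest allowed playback factor
MAX_SPEED = 1.15  # fastest allowed playback factor


@dataclass
class Fit:
    speed: float      # tempo factor to apply to the TTS clip
    stretched: float  # clip duration after the tempo change, in seconds
    padding: float    # silence to append so the clip fills its slot
    overflow: float   # seconds by which the clip still exceeds its slot


def fit_segment(tts_duration: float, slot_duration: float) -> Fit:
    """Fit a TTS clip into a subtitle slot using a clamped tempo change."""
    if tts_duration <= 0 or slot_duration <= 0:
        raise ValueError("durations must be positive")
    # A factor above 1 speeds the clip up; below 1 slows it down.
    ideal = tts_duration / slot_duration
    speed = min(MAX_SPEED, max(MIN_SPEED, ideal))
    stretched = tts_duration / speed
    padding = max(0.0, slot_duration - stretched)
    overflow = max(0.0, stretched - slot_duration)
    return Fit(speed, stretched, padding, overflow)
```

For example, a 1.0 s clip in a 2.0 s slot is slowed only to 0.85x (about 1.18 s), and the remaining 0.82 s or so is filled with silence.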

Available Voices

ElevenLabs (requires API key)

- Liam - Energetic male (recommended)
- INLINECODE9 - Professional female
- INLINECODE10 - Deep resonant male
- Run curl with your API key to list all voices
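The same voice listing can be done from Python. This sketch assumes the public ElevenLabs endpoint `GET https://api.elevenlabs.io/v1/voices` with an `xi-api-key` header, and assumes the response is a JSON object with a `voices` array whose entries carry a `name` field; the helper names are hypothetical.

```python
import json
import os
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the request that lists available voices."""
    return urllib.request.Request(
        VOICES_URL,
        headers={"xi-api-key": api_key, "Accept": "application/json"},
    )


def list_voice_names(api_key: str) -> list[str]:
    """Fetch voice names; assumes a {"voices": [{"name": ...}]} response."""
    with urllib.request.urlopen(build_voices_request(api_key), timeout=30) as resp:
        payload = json.load(resp)
    return [voice["name"] for voice in payload.get("voices", [])]


if __name__ == "__main__":
    for name in list_voice_names(os.environ["ELEVENLABS_API_KEY"]):
        print(name)
```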

Edge TTS (free)

- Chinese: zh-CN-XiaoxiaoNeural, zh-CN-YunxiNeural, INLINECODE14
- English: en-US-JennyNeural, INLINECODE16
- Many more languages available

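Video Audio Replace (视频音频替换)

Replace a video's original audio with TTS-generated voice while maintaining precise timing alignment. Also supports generating subtitles from video using Whisper.

Full Workflow

Step 1: Generate subtitles from video (optional)

If you don't have an SRT file, generate one from the video using the included script: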

# Generate subtitles from the video (uses faster-whisper; free, runs locally)
generate_subtitles.py video.mp4 -o subtitles.srt -l zh

Or manually with Python:

# Use faster-whisper (recommended; runs locally, free)
pip install faster-whisper srt

python3 << 'EOF'
from faster_whisper import WhisperModel
import srt
from datetime import timedelta

model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("input_video.mp4", language="zh")

# Build the SRT content
def format_time(seconds):
    # Convert seconds to an SRT timestamp (HH:MM:SS,mmm)
    td = timedelta(seconds=seconds)
    return f"{td.seconds//3600:02d}:{(td.seconds%3600)//60:02d}:{td.seconds%60:02d},{td.microseconds//1000:03d}"

srt_content = ""
for i, seg in enumerate(segments, 1):
    start = format_time(seg.start)
    end = format_time(seg.end)
    srt_content += f"{i}\n{start} --> {end}\n{seg.text.strip()}\n\n"

with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)
EOF

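Step 2: Replace audio with TTS

Use the generated SRT file to create a new video with TTS voice.

When to use

- Dubbing videos with AI-generated voice
- Converting subtitle files to voice-over
- Creating multilingual video versions

Requirements

API Keys (choose one)

- ElevenLabs: Set the ELEVENLABS_API_KEY environment variable
- Edge TTS (free, no key needed): Use --engine edge

System dependencies

- ffmpeg
- sox (optional, for advanced processing)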

Usage

Basic usage (ElevenLabs)

video-audio-replace --video input.mp4 --srt subtitles.srt --output output.mp4 --voice Liam

Using Edge TTS (free, no API key)

video-audio-replace --video input.mp4 --srt subtitles.srt --output output.mp4 --engine edge --voice zh-CN-YunxiNeural

Options

Option	Description	Default
--video	Input video file	Required
--srt

Examples

English voice (ElevenLabs)

video-audio-replace --video 2028.mp4 --srt 2028.srt --output 2028_final.mp4 --voice Liam

Chinese voice (Edge TTS)

video-audio-replace --video video.mp4 --srt subs.srt --output result.mp4 --engine edge --voice zh-CN-YunxiNeural

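How it works

1. Extract the original audio from the video
2. Split the audio into segments based on subtitle timestamps
3. Generate TTS audio for each subtitle segment
4. Adjust TTS speed (within 0.85x to 1.15x) to match the original segment duration
5. Add silence padding to fill any remaining time gap
6. Merge all segments, preserving the original timing gaps
7. Replace the video audio with the aligned TTS audio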
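Several of these steps map naturally onto ffmpeg invocations. The sketch below builds plausible command lines for extracting the audio, changing a clip's tempo, and muxing the new track. It is an illustration using standard ffmpeg options (`-vn`, the `atempo` filter, `-map`, `-c:v copy`), not necessarily the commands the tool itself runs, and the function names are hypothetical.

```python
def extract_audio_cmd(video: str, wav_out: str) -> list[str]:
    """Step 1: dump the original audio track as 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", video, "-vn", "-acodec", "pcm_s16le", wav_out]


def change_speed_cmd(audio_in: str, audio_out: str, speed: float) -> list[str]:
    """Step 4: change tempo without changing pitch.

    The 0.85 to 1.15 window used here is well inside what atempo accepts.
    """
    return ["ffmpeg", "-y", "-i", audio_in,
            "-filter:a", f"atempo={speed:.3f}", audio_out]


def replace_audio_cmd(video: str, audio: str, out: str) -> list[str]:
    """Step 7: keep the video stream as-is and swap in the new audio track."""
    return ["ffmpeg", "-y", "-i", video, "-i", audio,
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-shortest", out]
```

Each list can be passed to `subprocess.run(cmd, check=True)`; building the arguments as a list avoids shell quoting problems with file names that contain spaces.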
