Dialogue Audio
Create realistic multi-speaker dialogue with Dia TTS via inference.sh CLI.
Quick Start
CODEBLOCK0
Install note: The install script only detects your OS/architecture, downloads the matching binary from dist.inference.sh, and verifies its SHA-256 checksum. No elevated permissions or background processes. Manual install & verification available.
Speaker Tags
Dia TTS uses [S1] and [S2] to distinguish two speakers.
| Tag | Role | Voice |
|---|---|---|
| [S1] | Speaker 1 | Automatically assigned voice A |
| [S2] | Speaker 2 | Automatically assigned voice B |
Rules:
- Always start each speaker turn with the tag
- Tags must be uppercase: [S1] not [s1]
- Maximum 2 speakers per generation
- Each speaker maintains consistent voice within a session
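The tag rules above are easy to check mechanically before spending a generation on a malformed script. The following is a minimal sketch, not part of the inference.sh CLI: `check_tags` is a hypothetical helper name, and it only tests the three rules that can be verified from the text itself (uppercase tags, a tag at the start, no speakers beyond two).

```bash
#!/bin/sh
# check_tags is a hypothetical helper, not an infsh command.
# It prints "ok" and returns 0 when the prompt follows the tag rules,
# otherwise it prints the first problem found and returns 1.
check_tags() {
  prompt=$1

  # Rule: tags must be uppercase, so [s1] or [s2] is rejected.
  if printf '%s\n' "$prompt" | grep -Eq '\[s[0-9]+\]'; then
    echo "lowercase tag found"
    return 1
  fi

  # Rule: every turn starts with a tag, so the script must open with one.
  case $prompt in
    '[S1]'*|'[S2]'*) ;;
    *) echo "prompt must start with [S1] or [S2]"; return 1 ;;
  esac

  # Rule: at most two speakers, so any tag other than [S1] or [S2] is rejected.
  if printf '%s\n' "$prompt" | grep -Eq '\[S([03-9]|[0-9][0-9]+)\]'; then
    echo "only [S1] and [S2] are supported"
    return 1
  fi

  echo "ok"
}

check_tags '[S1] Have you tried it? [S2] Not yet.'
```

Running the check first and only calling the generation command when it returns 0 keeps obviously broken scripts out of the queue.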
Emotion & Expression Control
Dia TTS interprets punctuation and non-speech cues for emotional delivery.
Punctuation Effects
| Punctuation | Effect | Example |
|---|---|---|
| . | Neutral, declarative, medium pause | "This is important." |
| ! | Emphasis, excitement, energy | "This is amazing!" |
| ? | Rising intonation, questioning | "Are you sure about that?" |
| ... | Hesitation, trailing off, long pause | "I thought it would work... but it didn't." |
| , | Short breath pause | "First, we analyze. Then, we act." |
| — or -- | Interruption or pivot | "I was going to say — never mind." |
Non-Speech Sounds
Dia TTS supports parenthetical sound descriptions:
CODEBLOCK1
Examples with Emotion
CODEBLOCK2
Pacing Control
Pause Hierarchy
| Technique | Pause Length | Use For |
|---|---|---|
| Comma , | ~0.3 seconds | Between clauses, list items |
| Period . | ~0.5 seconds | Between sentences |
| Ellipsis ... | ~1.0 seconds | Dramatic pause, thinking, hesitation |
| New speaker tag | ~0.3 seconds | Natural turn-taking gap |
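To get a feel for how much silence a script carries, the approximate values in the table can be added up. The sketch below is only an estimate under stated assumptions: `estimate_pauses` is a hypothetical helper, it counts just the four techniques listed in the table, and it assumes one turn-taking gap between consecutive speaker tags. Actual pause lengths will vary from one generation to the next.

```bash
#!/bin/sh
# estimate_pauses is a hypothetical helper, not an infsh command.
# It sums the approximate pause lengths from the table:
#   comma 0.3 s, period 0.5 s, ellipsis 1.0 s, turn change 0.3 s.
estimate_pauses() {
  printf '%s\n' "$1" | awk '
    { text = text $0 " " }
    END {
      # Count ellipses first so their dots are not counted as periods.
      ellipses = gsub(/\.\.\./, "", text)
      periods  = gsub(/\./, "", text)
      commas   = gsub(/,/, "", text)
      tags     = gsub(/\[S[12]\]/, "", text)
      # Assumption: one gap between consecutive turns, none before the first.
      gaps = (tags > 0) ? tags - 1 : 0
      printf "%.1f\n", ellipses * 1.0 + periods * 0.5 + commas * 0.3 + gaps * 0.3
    }'
}

estimate_pauses '[S1] First, we analyze. Then, we act. [S2] Okay... sure.'
```

For the sample script this adds one ellipsis, three periods, two commas, and one turn change. Question marks and exclamation marks are not counted because the table gives no pause length for them.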
Speed Control
- Shorter sentences = faster perceived pace
- Longer sentences with commas = measured, thoughtful pace
- Questions followed by answers = engaging back-and-forth rhythm
CODEBLOCK3
Conversation Structure Patterns
Interview Format
CODEBLOCK4
Tutorial / Explainer
CODEBLOCK5
Debate / Discussion
CODEBLOCK6
Post-Production Tips
Volume Normalization
Both speakers should be at consistent volume. If one is louder:
CODEBLOCK7
Adding Background/Music
CODEBLOCK8
Segmenting Long Conversations
For conversations longer than ~30 seconds, generate in segments:
CODEBLOCK9
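Splitting a long script by hand is error-prone, so the segmentation step can be sketched as a small helper that cuts only at speaker-tag boundaries. This is an illustration under assumptions: `segment_script` is a hypothetical name, and the number of turns per segment is used as a rough stand-in for the ~30 second guideline, since spoken duration is not known before generation.

```bash
#!/bin/sh
# segment_script is a hypothetical helper, not an infsh command.
# Usage: segment_script "<script>" <max_turns>
# Prints one segment per line, each holding at most max_turns speaker turns.
segment_script() {
  printf '%s\n' "$1" | awk -v max="$2" '
    { text = text $0 " " }
    END {
      # Put a line break before every speaker tag, then split into turns.
      gsub(/\[S[12]\]/, "\n&", text)
      n = split(text, turns, "\n")
      seg = ""; count = 0
      for (i = 1; i <= n; i++) {
        t = turns[i]
        gsub(/^ +| +$/, "", t)          # trim surrounding spaces
        if (t == "") continue           # skip the empty piece before the first tag
        seg = (seg == "") ? t : seg " " t
        if (++count == max) { print seg; seg = ""; count = 0 }
      }
      if (seg != "") print seg          # flush the last, possibly shorter, segment
    }'
}

segment_script '[S1] Welcome back. [S2] Glad to be here. [S1] Shall we start?' 2
```

Each printed line can then be used as the prompt for one generation, and the resulting audio files merged afterwards. Because every segment starts at a tag, no speaker turn is cut in half.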
Script Writing Tips
| Do | Don't |
|---|---|
| Write how people talk | Write how people write |
| Short sentences (< 15 words) | Long academic sentences |
| Contractions ("can't", "won't") | Formal ("cannot", "will not") |
| Natural fillers ("So,", "Well,") | Every sentence perfectly formed |
| Vary sentence length | All sentences same length |
| Include reactions ("Exactly!", "Hmm.") | One-sided monologues |
| Read it aloud before generating | Assume it sounds right |
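The sentence-length guideline in the table can be checked before generating. The sketch below is a rough lint, not a definitive rule: `long_sentences` is a hypothetical helper, it treats any run of `.`, `!`, or `?` as a sentence end (so an ellipsis also splits a sentence), and it ignores speaker tags and parenthetical cues when counting words.

```bash
#!/bin/sh
# long_sentences is a hypothetical helper, not an infsh command.
# It prints "<word count>: <sentence>" for every sentence of 15 words or more.
long_sentences() {
  printf '%s\n' "$1" | awk '
    { text = text $0 " " }
    END {
      gsub(/\[S[12]\]/, "", text)     # speaker tags are not spoken words
      gsub(/\([^)]*\)/, "", text)     # neither are cues such as (laughs)
      gsub(/[.!?]+/, "&\n", text)     # break after sentence-ending punctuation
      n = split(text, s, "\n")
      for (i = 1; i <= n; i++) {
        words = split(s[i], w, " ")   # split on runs of whitespace
        if (words >= 15) {
          sub(/^ +/, "", s[i])
          print words ": " s[i]
        }
      }
    }'
}

long_sentences '[S1] Short one. [S2] This sentence keeps going and going well past the point where a listener would expect a breath.'
```

An empty result means every sentence is under the 15-word guideline. Reading the script aloud is still the better test, since word count says nothing about rhythm.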
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| Monologues longer than 3 sentences | Sounds like a lecture, not conversation | Break into exchanges |
| No emotional variation | Flat, robotic delivery | Use punctuation and non-speech cues |
| Missing speaker tags | Voices don't alternate | Start every turn with [S1] or [S2] |
| Formal written language | Sounds unnatural spoken | Use contractions, short sentences |
| No pauses between topics | Feels rushed | Use ... or scene breaks |
| All same energy level | Monotonous | Vary between high/low energy moments |
Related Skills
CODEBLOCK10
Browse all apps: INLINECODE20
Skill name: dialogue-audio
Detailed description:
Dialogue Audio
Create realistic multi-speaker dialogue with Dia TTS via inference.sh CLI.
Quick Start
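```bash
curl -fsSL https://cli.inference.sh | sh && infsh login

# Two-speaker conversation
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Have you tried the new feature yet? [S2] Not yet, but I heard it saves a ton of time. [S1] It really does. My workflow got cut in half. [S2] Okay, I will definitely try it today."
}'
```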
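Install note: The install script only detects your OS/architecture, downloads the matching binary from dist.inference.sh, and verifies its SHA-256 checksum. No elevated permissions or background processes are required. Manual install & verification is also available.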
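Speaker Tags
Dia TTS uses [S1] and [S2] to distinguish two speakers.
| Tag | Role | Voice |
|---|---|---|
| [S1] | Speaker 1 | Automatically assigned voice A |
| [S2] | Speaker 2 | Automatically assigned voice B |
Rules:
- Always start each speaker turn with the tag
- Tags must be uppercase: [S1] not [s1]
- Maximum 2 speakers per generation
- Each speaker maintains consistent voice within a session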
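Emotion & Expression Control
Dia TTS interprets punctuation and non-speech cues for emotional delivery.
Punctuation Effects
| Punctuation | Effect | Example |
|---|---|---|
| . | Neutral, declarative, medium pause | "This is important." |
| ! | Emphasis, excitement, energy | "This is amazing!" |
| ? | Rising intonation, questioning | "Are you sure about that?" |
| ... | Hesitation, trailing off, long pause | "I thought it would work... but it didn't." |
| , | Short breath pause | "First, we analyze. Then, we act." |
| — or -- | Interruption or pivot | "I was going to say — never mind." |
Non-Speech Sounds
Dia TTS supports parenthetical sound descriptions:
- (laughs): laughter
- (sighs): exasperation or relief
- (clears throat): attention-getting pause
- (whispers): softer delivery
- (gasps): surprise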
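Examples with Emotion
```bash
# Excited conversation
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Guess what happened today! [S2] What? Tell me! [S1] We hit ten thousand users! [S2] (gasps) No way! That is incredible! [S1] I know... I still cannot believe it."
}'

# Serious / thoughtful conversation
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] We need to talk about the timeline. [S2] (sighs) I know. It is tight. [S1] Can we cut anything from the scope? [S2] Maybe... but that would mean dropping the analytics dashboard. [S1] That is a tough trade-off."
}'

# Teaching / explaining
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] So how does it actually work? [S2] Great question. Think of it as a pipeline. Data goes in one end, gets processed in the middle, and comes out transformed at the other end. [S1] Like an assembly line? [S2] Exactly! Each step adds something new."
}'
```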
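Pacing Control
Pause Hierarchy
| Technique | Pause Length | Use For |
|---|---|---|
| Comma , | ~0.3 seconds | Between clauses, list items |
| Period . | ~0.5 seconds | Between sentences |
| Ellipsis ... | ~1.0 seconds | Dramatic pause, thinking, hesitation |
| New speaker tag | ~0.3 seconds | Natural turn-taking gap |
Speed Control
- Shorter sentences = faster perceived pace
- Longer sentences with commas = measured, thoughtful pace
- Questions followed by answers = engaging back-and-forth rhythm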
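```bash
# Fast-paced, energetic
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Ready? [S2] Ready. [S1] Here we go! Three features. Five minutes. [S2] Go! [S1] Feature one: real-time sync."
}'

# Slow, contemplative
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] I have been thinking about this for a while... I think we need to change direction. [S2] What do you mean? [S1] The market has shifted. What worked last year... does not work now."
}'
```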
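Conversation Structure Patterns
Interview Format
```bash
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Welcome to the show. Today we have a special guest. Please introduce yourself. [S2] Thanks for having me! I am a product designer, and I have been building tools for creators for about ten years. [S1] What got you into design? [S2] Honestly? I was terrible at coding but loved making things look good. (laughs) So design was the natural path."
}'
```
Tutorial / Explainer
```bash
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Can you walk me through the setup process? [S2] Sure. Step one, install the CLI. It takes about thirty seconds. [S1] And then? [S2] Step two, run the login command. It opens your browser for authentication. [S1] Sounds simple. [S2] It is! Step three, you are ready to run your first app."
}'
```
Debate / Discussion
```bash
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] I think we should go with option A. It is faster to implement. [S2] But option B scales better in the long run. [S1] True, but we need to ship something this quarter. [S2] Fair point... What if we start with A and plan a migration path to B? [S1] That could work. We should build a prototype."
}'
```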
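Post-Production Tips
Volume Normalization
Both speakers should be at consistent volume. If one is louder:
```bash
# Merge and balance audio
infsh app run infsh/video-audio-merger --input '{
  "video": "talking-head.mp4",
  "audio": "dialogue.mp3",
  "audio_volume": 1.0
}'
```
Adding Background/Music
```bash
# Merge dialogue with background music
infsh app run infsh/media-merger --input '{
  "media": ["dialogue.mp3", "background-music.mp3"]
}'
```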
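Segmenting Long Conversations
For conversations longer than ~30 seconds, generate in segments:
```bash
# Segment 1: Introduction
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Welcome back to another episode..."
}'

# Segment 2: Main content
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] So let us dive into the topic for today..."
}'

# Segment 3: Wrap-up
infsh app run falai/dia-tts --input '{
  "prompt": "[S1] This has been a great conversation..."
}'

# Merge all segments
infsh app run infsh/media-merger --input '{
  "media": ["segment1.mp3", "segment2.mp3", "segment3.mp3"]
}'
```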
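Script Writing Tips
| Do | Don't |
|---|---|
| Write how people talk | Write how people write |
| Short sentences (< 15 words) | Long academic sentences |
| Contractions ("can't", "won't") | Formal ("cannot", "will not") |
| Natural fillers ("So,", "Well,") | Every sentence perfectly formed |
| Vary sentence length | All sentences same length |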