ACE-Step Songwriting Guide
Professional music creation knowledge for writing captions, lyrics, and choosing music parameters for ACE-Step.
Output Format
After using this guide, produce three things for the acestep skill:
1. Caption (`-c`): Style/genre/instruments/emotion description
2. Lyrics (`-l`): Complete structured lyrics with tags
3. Parameters: `--duration`, `--bpm`, `--key`, `--time-signature`, `--language`
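As a rough illustration of how the three outputs fit together, the sketch below collects them in one object and emits the flags listed above. `SongRequest` and `to_args` are hypothetical names introduced here; how the acestep skill is actually invoked is not specified in this guide, so the result is only a flag list, not a full command.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SongRequest:
    """Hypothetical container for the caption, lyrics, and optional parameters."""
    caption: str
    lyrics: str
    duration: Optional[int] = None       # seconds
    bpm: Optional[int] = None
    key: Optional[str] = None
    time_signature: Optional[str] = None
    language: Optional[str] = None

    def to_args(self) -> List[str]:
        """Return the flag list; parameters left as None are omitted."""
        args = ["-c", self.caption, "-l", self.lyrics]
        optional = [
            ("--duration", self.duration),
            ("--bpm", self.bpm),
            ("--key", self.key),
            ("--time-signature", self.time_signature),
            ("--language", self.language),
        ]
        for flag, value in optional:
            if value is not None:
                args += [flag, str(value)]
        return args
```

Leaving a parameter unset keeps it out of the flag list, which matches the guide's advice to set metadata manually only when there is a clear requirement.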
Caption: The Most Important Input
The caption is the most important factor affecting the generated music.
It supports multiple formats: simple style words, comma-separated tags, and complex natural-language descriptions.
Common Dimensions
| Dimension | Examples |
|---|---|
| Style/Genre | pop, rock, jazz, electronic, hip-hop, R&B, folk, classical, lo-fi, synthwave |
| Emotion/Atmosphere | melancholic, uplifting, energetic, dreamy, dark, nostalgic, euphoric, intimate |
| Instruments | acoustic guitar, piano, synth pads, 808 drums, strings, brass, electric bass |
| Timbre Texture | warm, bright, crisp, muddy, airy, punchy, lush, raw, polished |
| Era Reference | 80s synth-pop, 90s grunge, 2010s EDM, vintage soul, modern trap |
| Production Style | lo-fi, high-fidelity, live recording, studio-polished, bedroom pop |
| Vocal Characteristics | female vocal, male vocal, breathy, powerful, falsetto, raspy, choir |
| Speed/Rhythm | slow tempo, mid-tempo, fast-paced, groovy, driving, laid-back |
| Structure Hints | building intro, catchy chorus, dramatic bridge, fade-out ending |
Caption Writing Principles
1. Specific beats vague: "sad piano ballad with female breathy vocal" > "a sad song"
2. Combine multiple dimensions: style + emotion + instruments + timbre anchors the direction precisely
3. Use references well: "in the style of 80s synthwave" conveys a complex aesthetic quickly
4. Texture words are useful: warm, crisp, airy, punchy influence mixing and timbre
5. Don't pursue perfection: the caption is a starting point; iterate based on results
6. Granularity determines freedom: less detail = more model creativity; more detail = more control
7. Avoid conflicting words: "classical strings" + "hardcore metal" degrades output
   - Fix: Repetition reinforcement. Repeat the elements you want more of
   - Fix: Turn conflict into evolution. "Start with soft strings, middle becomes metal rock, end turns to hip-hop"
8. Don't put BPM/key/tempo in the caption: use the dedicated parameters instead
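Two of these principles lend themselves to a quick automated check: combining dimensions into one caption, and keeping BPM/key values out of it. The sketch below is one possible way to do that. `build_caption`, `find_metadata`, and the regular expressions are illustrative assumptions, and the patterns are deliberately incomplete.

```python
import re

# Patterns that look like tempo or key metadata (illustrative, not exhaustive).
METADATA_PATTERNS = [
    re.compile(r"\b\d{2,3}\s*bpm\b", re.IGNORECASE),
    re.compile(r"\b[A-G][#b]?\s+(?:[Mm]ajor|[Mm]inor)\b"),
]


def build_caption(*parts: str) -> str:
    """Join non-empty dimension phrases into a comma-separated caption."""
    return ", ".join(p.strip() for p in parts if p.strip())


def find_metadata(caption: str) -> list:
    """Return substrings that look like BPM or key values."""
    hits = []
    for pattern in METADATA_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(caption))
    return hits


caption = build_caption("synthwave", "nostalgic", "analog synth pads", "breathy female vocal")
print(caption)
print(find_metadata("sad piano ballad, 72 BPM, C Major"))
```

Anything `find_metadata` reports is a candidate to move into the dedicated parameters instead.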
Lyrics: The Temporal Script
Lyrics control how the music unfolds over time. They carry:
- Lyric text itself
- Structure tags (`[Verse]`, `[Chorus]`, `[Bridge]`...)
- Vocal style hints (`[raspy vocal]`, `[whispered]`...)
- Instrumental sections (`[guitar solo]`, `[drum break]`...)
- Energy changes (`[building energy]`, `[explosive drop]`...)
Structure Tags
| Category | Tag | Description |
|---|---|---|
| Basic Structure | `[Intro]` | Opening, establish atmosphere |
| | `[Verse]` / `[Verse 1]` | Verse, narrative progression |
| | `[Pre-Chorus]` | Pre-chorus, build energy |
| | `[Chorus]` | Chorus, emotional climax |
| | `[Bridge]` | Bridge, transition or elevation |
| | `[Outro]` | Ending, conclusion |
| Dynamic Sections | `[Build]` | Energy gradually rising |
| | `[Drop]` | Electronic music energy release |
| | `[Breakdown]` | Reduced instrumentation, space |
| Instrumental | `[Instrumental]` | Pure instrumental, no vocals |
| | `[Guitar Solo]` | Guitar solo |
| | `[Piano Interlude]` | Piano interlude |
| Special | `[Fade Out]` | Fade out ending |
| | `[Silence]` | Silence |
Combining Tags
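Use ` - ` inside a tag to add a modifier for finer control, but keep it concise:
✅ `[Chorus - anthemic]`
❌ `[Chorus - anthemic - stacked harmonies - high energy - powerful - epic]`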
Put complex style descriptions in Caption, not in tags.
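A small lint pass can catch overloaded tags before generation. The sketch below assumes one modifier per tag as the limit. That threshold, `TAG_RE`, and `overloaded_tags` are choices made for this example, not rules of ACE-Step.

```python
import re

TAG_RE = re.compile(r"\[([^\[\]]+)\]")
MAX_MODIFIERS = 1  # assumption: at most one " - " modifier per tag


def overloaded_tags(lyrics: str, max_modifiers: int = MAX_MODIFIERS) -> list:
    """Return tags that carry more than max_modifiers ' - ' separated modifiers."""
    flagged = []
    for match in TAG_RE.finditer(lyrics):
        # Split on " - " (with spaces) so hyphenated names like Pre-Chorus stay whole.
        parts = [p.strip() for p in match.group(1).split(" - ")]
        if len(parts) - 1 > max_modifiers:
            flagged.append(match.group(0))
    return flagged
```

Modifiers removed from a flagged tag are usually better expressed in the caption.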
Caption-Lyrics Consistency
Models are not good at resolving conflicts. Checklist:
- Instruments in Caption ↔ Instrumental section tags in Lyrics
- Emotion in Caption ↔ Energy tags in Lyrics
- Vocal description in Caption ↔ Vocal control tags in Lyrics
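The first item of this checklist can be approximated mechanically. The sketch below flags instruments that appear in lyric tags but not in the caption. The keyword list and the function name are illustrative assumptions, and simple substring matching will miss synonyms.

```python
import re

# Illustrative keyword list; extend it for the instruments you actually use.
INSTRUMENTS = ("guitar", "piano", "drum", "strings", "synth", "bass", "brass")


def instruments_missing_from_caption(caption: str, lyrics: str) -> list:
    """Return instrument keywords found in lyric tags but absent from the caption."""
    tags = " ".join(re.findall(r"\[([^\[\]]+)\]", lyrics)).lower()
    caption_lower = caption.lower()
    return [name for name in INSTRUMENTS
            if name in tags and name not in caption_lower]
```

The same pattern could be repeated with emotion or vocal keywords to cover the other two checklist items.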
Vocal Control Tags
| Tag | Effect |
|---|---|
| `[raspy vocal]` | Raspy, textured vocals |
| `[whispered]` | Whispered |
| `[falsetto]` | Falsetto |
| `[powerful belting]` | Powerful, high-pitched singing |
| `[spoken word]` | Rap/recitation |
| `[harmonies]` | Layered harmonies |
| `[call and response]` | Call and response |
| `[ad-lib]` | Improvised embellishments |
Energy and Emotion Tags
| Tag | Effect |
|---|---|
| `[high energy]` | High energy, passionate |
| `[low energy]` | Low energy, restrained |
| `[building energy]` | Increasing energy |
| `[explosive]` | Explosive energy |
| `[melancholic]` | Melancholic |
| `[euphoric]` | Euphoric |
| `[dreamy]` | Dreamy |
| `[aggressive]` | Aggressive |
Lyric Writing Tips
1. 6-10 syllables per line: the model aligns syllables to beats; keep similar counts for lines in the same position (±1-2)
2. Uppercase = stronger intensity: `WE ARE THE CHAMPIONS!` (shouting) vs `walking through the streets` (normal)
3. Parentheses = background vocals: `We rise together (together)`
4. Extend vowels: `Feeeling so aliiive` (use cautiously, effects unstable)
5. Clear section separation: blank lines between sections
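The 6-10 syllable guideline can be checked roughly before generation. The sketch below uses a crude vowel-group heuristic for English. It is an approximation rather than a real syllabifier, and `estimate_syllables` and `lines_out_of_range` are names invented for this example.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+", re.IGNORECASE)


def estimate_syllables(line: str) -> int:
    """Rough English count: one per vowel group, minus a trailing silent 'e'."""
    total = 0
    for word in re.findall(r"[A-Za-z']+", line):
        count = len(VOWEL_GROUP.findall(word))
        lower = word.lower()
        if lower.endswith("e") and not lower.endswith("le") and count > 1:
            count -= 1
        total += max(count, 1)
    return total


def lines_out_of_range(lyrics: str, low: int = 6, high: int = 10) -> list:
    """Return (line, syllables) pairs outside [low, high]; tags and blanks are skipped."""
    flagged = []
    for line in lyrics.splitlines():
        text = line.strip()
        if not text or text.startswith("["):
            continue
        n = estimate_syllables(text)
        if n < low or n > high:
            flagged.append((text, n))
    return flagged
```

Treat flagged lines as prompts for a second look, since short lines can be intentional (for example a hook or an ad-lib).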
Avoiding "AI-flavored" Lyrics
| Red Flag | Description |
|---|---|
| Adjective stacking | "neon skies, electric hearts, endless dreams": vague imagery filler |
| Rhyme chaos | Inconsistent patterns or forced rhymes breaking meaning |
| Blurred boundaries | Lyric content crosses structure tags |
| No breathing room | Lines too long to sing in one breath |
| Mixed metaphors | Water → fire → flying: listeners can't anchor |
Metaphor discipline: One core metaphor per song, explore its multiple aspects.
Music Metadata
Most of the time, let the LM (language model) infer these values automatically. Set them manually only when you have clear requirements.
| Parameter | Range / Type | Description |
|---|---|---|
| `bpm` | 30–300 | Slow 60–80, mid 90–120, fast 130–180 |
| `keyscale` | Key | e.g. `C Major`, `Am`. Common keys (C, G, D, Am, Em) are most stable |
| `timesignature` | Time signature | `4/4` (most common), `3/4` (waltz), `6/8` (swing) |
| `vocal_language` | Language | Usually auto-detected from lyrics |
| `duration` | Seconds | See duration calculation below |
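A minimal sketch of how these parameters might be collected and sanity-checked before use. `build_metadata` is a hypothetical helper. The 30-300 BPM bound comes from the table, while the positive-duration check is an assumption added for this example.

```python
from typing import Optional


def build_metadata(bpm: Optional[int] = None,
                   keyscale: Optional[str] = None,
                   timesignature: Optional[str] = None,
                   vocal_language: Optional[str] = None,
                   duration: Optional[int] = None) -> dict:
    """Return only the parameters that were set, so the rest can be auto-inferred."""
    if bpm is not None and not 30 <= bpm <= 300:
        raise ValueError(f"bpm must be within 30-300, got {bpm}")
    if duration is not None and duration <= 0:
        raise ValueError("duration must be a positive number of seconds")
    candidates = {
        "bpm": bpm,
        "keyscale": keyscale,
        "timesignature": timesignature,
        "vocal_language": vocal_language,
        "duration": duration,
    }
    return {name: value for name, value in candidates.items() if value is not None}
```

For a waltz, for instance, `build_metadata(timesignature="3/4")` sets only the time signature and leaves everything else to be inferred.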
When to Set Manually
| Scenario | Set |
|---|---|
| Daily generation | Let LM auto-infer |
| Clear tempo requirement | `bpm` |
| Specific style (waltz) | `timesignature=3/4` |
| Match other material | `bpm` + `duration` |
| Specific key color | `keyscale` |
Duration Calculation
Estimation Method
- Intro/Outro: 5-10 seconds each
- Instrumental sections: 5-15 seconds each
- Typical structures:
  - 2 verses + 2 choruses: 120-150s minimum
  - 2 verses + 2 choruses + bridge: 180-240s minimum
  - Full song with intro/outro: 210-270s (3.5-4.5 min)
BPM and Duration Relationship
- Slower BPM (60-80): Need MORE duration for same lyrics
- Medium BPM (100-130): Standard duration
- Faster BPM (150-180): Can fit more lyrics, but still need breathing room
Rule of thumb: When in doubt, estimate longer. A song too short feels rushed.
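The estimation method and the BPM relationship can be combined into a rough calculator. The sketch below assumes each lyric line spans two bars of 4/4. That pacing is a guess made for this example, not a documented ACE-Step behavior. The intro/outro and instrumental ranges are taken from the list above, and `estimate_duration` is a hypothetical helper.

```python
def estimate_duration(lyric_lines: int, bpm: float,
                      intro_outro: int = 2, instrumental_sections: int = 0,
                      bars_per_line: float = 2.0, beats_per_bar: int = 4) -> tuple:
    """Return a (low, high) duration estimate in seconds."""
    # Sung time: lines * bars per line * beats per bar * seconds per beat.
    sung = lyric_lines * bars_per_line * beats_per_bar * 60.0 / bpm
    # Intro/outro: 5-10 s each; instrumental sections: 5-15 s each.
    low = sung + intro_outro * 5 + instrumental_sections * 5
    high = sung + intro_outro * 10 + instrumental_sections * 15
    return round(low), round(high)


# 2 verses + 2 choruses of 8 lines each at 120 BPM, with intro and outro.
print(estimate_duration(32, 120))  # (138, 148)
# The same lyrics at 70 BPM need noticeably more time.
print(estimate_duration(32, 70))   # (229, 239)
```

Following the rule of thumb, prefer the high end of the range when choosing `duration`.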
Note: Lyrics tags (piano, powerful, whispered) are consistent with Caption (piano ballad, building to powerful chorus, intimate).