🎵 Voice Note to MIDI
Transform your voice memos, humming, and melodic recordings into clean, quantized MIDI files ready for your DAW.
What It Does
This skill provides a complete audio-to-MIDI conversion pipeline that:
- Stem Separation - Uses HPSS (Harmonic-Percussive Source Separation) to isolate melodic content from drums, noise, and background sounds
- ML-Powered Pitch Detection - Leverages Spotify's Basic Pitch model for accurate fundamental frequency extraction
- Key Detection - Automatically detects the musical key of your recording using Krumhansl-Kessler key profiles
- Intelligent Quantization - Snaps notes to a configurable timing grid with optional key-aware pitch correction
- Post-Processing - Applies octave pruning, overlap-based harmonic removal, and legato note merging for clean output
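As a rough illustration of key detection with Krumhansl-Kessler profiles, the sketch below correlates a duration-weighted pitch-class histogram against the major and minor profiles at each of the 12 possible tonics and keeps the best match. The function name `detect_key` and the `(midi_pitch, duration)` input format are assumptions made for this example; the pipeline's own implementation may weight or normalize the histogram differently.

```python
from math import sqrt

# Krumhansl-Kessler probe-tone profiles, indexed in semitones above the tonic.
KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
KK_MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def detect_key(notes):
    """notes: iterable of (midi_pitch, duration) pairs (hypothetical format)."""
    histogram = [0.0] * 12
    for pitch, duration in notes:
        histogram[pitch % 12] += duration

    best_score, best_key = -2.0, None
    for tonic in range(12):
        # Rotate so that index 0 of the histogram is the candidate tonic.
        rotated = histogram[tonic:] + histogram[:tonic]
        for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
            score = pearson(rotated, profile)
            if score > best_score:
                best_score, best_key = score, f"{NOTE_NAMES[tonic]} {mode}"
    return best_key


# A short phrase that leans on C, E and G.
phrase = [(60, 4), (62, 1), (64, 2), (65, 1), (67, 3), (69, 1), (71, 1)]
print(detect_key(phrase))
```

Short or chromatic recordings give a flat histogram, which is one reason the detected key can be unreliable on very brief memos.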
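Pipeline Architecture

            Audio Input (WAV/M4A/MP3)
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 1: Stem Separation (HPSS)          │
    │   - Isolate harmonic content            │
    │   - Remove drums/percussion             │
    │   - Noise gating                        │
    └─────────────────────────────────────────┘
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 2: Pitch Detection                 │
    │   - Basic Pitch ML model (Spotify)      │
    │   - Polyphonic note detection           │
    │   - Onset/offset estimation             │
    └─────────────────────────────────────────┘
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 3: Analysis                        │
    │   - Pitch class distribution            │
    │   - Key detection                       │
    │   - Dominant note identification        │
    └─────────────────────────────────────────┘
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 4: Quantization & Cleanup          │
    │   - Timing grid snap                    │
    │   - Key-aware pitch correction          │
    │   - Octave pruning (harmonic removal)   │
    │   - Overlap-based pruning               │
    │   - Note merging (legato)               │
    │   - Velocity normalization              │
    └─────────────────────────────────────────┘
                         ↓
         MIDI Output (Standard MIDI File)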
Setup
Prerequisites
- Python 3.11+ (Python 3.14+ recommended)
- FFmpeg (for audio format support)
- pip
Installation
Quick Install (Recommended):

    cd /path/to/voice-note-to-midi
    ./setup.sh

This automated script will:
- Check that Python 3.11+ is installed
- Create the `~/melody-pipeline` directory
- Set up the virtual environment
- Install all dependencies (basic-pitch, librosa, music21, etc.)
- Download and configure the hum2midi script
- Add melody-pipeline to your PATH
Manual Install:
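If you prefer manual setup:

    # Create the working directory and an isolated virtual environment
    mkdir -p ~/melody-pipeline
    cd ~/melody-pipeline
    python3 -m venv venv-bp
    source venv-bp/bin/activate
    # Install the Python dependencies
    pip install basic-pitch librosa soundfile mido music21
    # Make the hum2midi script executable (it must already be in this directory)
    chmod +x ~/melody-pipeline/hum2midi

Add to your PATH (optional):

    # Single quotes keep $HOME and $PATH unexpanded until the shell starts
    echo 'export PATH="$HOME/melody-pipeline:$PATH"' >> ~/.bashrc
    source ~/.bashrc
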
Verify Installation

    cd ~/melody-pipeline
    ./hum2midi --help

Usage
Basic Usage
Convert a voice memo to MIDI:

    ./hum2midi my_humming.wav

This creates my_humming.mid with 16th-note quantization.
Specify Output File

    ./hum2midi input.wav output.mid

Command-Line Options
| Option | Description | Default |
|---|---|---|
| `--grid <value>` | Quantization grid: `1/4`, `1/8`, `1/16`, `1/32` | `1/16` |
| `--min-note <ms>` | Minimum note duration in milliseconds | `50` |
| `--no-quantize` | Skip quantization (output raw Basic Pitch MIDI) | disabled |
| `--key-aware` | Enable key-aware pitch correction | disabled |
| `--no-analysis` | Skip pitch analysis and key detection | disabled |
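
To make the `--grid` and `--min-note` options concrete, here is a minimal sketch of grid snapping on tick positions. It assumes 960 ticks per quarter note (which is what a 240-tick 1/16 grid implies), a 120 BPM tempo, and notes given as `(start_tick, end_tick, pitch)` tuples. The helper names and the order of filtering and snapping are illustrative assumptions, not the script's actual internals.

```python
from fractions import Fraction

TICKS_PER_QUARTER = 960  # assumption: a 1/16 grid of 240 ticks implies 960 per quarter


def grid_ticks(grid: str, ticks_per_quarter: int = TICKS_PER_QUARTER) -> int:
    """Convert a grid such as '1/16' (a fraction of a whole note) to ticks."""
    whole_note = 4 * ticks_per_quarter
    return int(Fraction(grid) * whole_note)


def ms_to_ticks(ms: float, bpm: float = 120.0,
                ticks_per_quarter: int = TICKS_PER_QUARTER) -> int:
    """Convert a duration in milliseconds to ticks at the given tempo."""
    quarter_ms = 60_000 / bpm
    return round(ms / quarter_ms * ticks_per_quarter)


def quantize(notes, grid="1/16", min_note_ms=50):
    """Drop notes shorter than min_note_ms, then snap starts and ends to the grid."""
    step = grid_ticks(grid)
    min_ticks = ms_to_ticks(min_note_ms)
    result = []
    for start, end, pitch in notes:
        if end - start < min_ticks:
            continue  # too short: treated as a detection glitch
        # Note: Python's round() sends exact .5 ties to the even grid line.
        q_start = round(start / step) * step
        q_end = max(round(end / step) * step, q_start + step)  # keep >= one grid step
        result.append((q_start, q_end, pitch))
    return result


raw = [(10, 250, 60), (470, 500, 62), (700, 1190, 64)]
print(grid_ticks("1/16"), ms_to_ticks(50))
print(quantize(raw))
```

With these assumptions, a coarser grid such as `1/8` doubles the step to 480 ticks, so slightly late notes are pulled further from where they were sung.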
Usage Examples
Quantize to eighth notes

    ./hum2midi melody.wav --grid 1/8

Key-aware quantization (recommended for tonal music)

    ./hum2midi song.wav --key-aware

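
One plausible reading of key-aware pitch correction is snapping each out-of-scale pitch to the nearest tone of the detected key. The sketch below does that for major and natural minor scales. The function `snap_to_scale`, its tie-breaking rule (prefer the lower neighbour), and the use of natural minor are assumptions for illustration only.

```python
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)
MINOR_STEPS = (0, 2, 3, 5, 7, 8, 10)  # natural minor (assumption)


def snap_to_scale(pitch: int, tonic: int, mode: str = "major") -> int:
    """Move a MIDI pitch to the nearest scale tone; ties resolve downward.

    tonic is a pitch class (0 = C, 7 = G, ...).
    """
    steps = MAJOR_STEPS if mode == "major" else MINOR_STEPS
    allowed = {(tonic + step) % 12 for step in steps}
    # In these scales every pitch is at most one semitone from a scale tone.
    for offset in (0, -1, 1):
        if (pitch + offset) % 12 in allowed:
            return pitch + offset
    return pitch


# In G major (tonic pitch class 7), F4 (65) is out of scale and moves to E4 (64).
print([snap_to_scale(p, tonic=7) for p in (65, 66, 61, 67)])
```

Because a wrong key estimate would move correct notes as well, this kind of correction is only as good as the key detection that feeds it.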
Require longer minimum notes

    ./hum2midi humming.wav --min-note 100

Skip analysis for faster processing

    ./hum2midi quick.wav --no-analysis

Combine options

    ./hum2midi recording.wav output.mid --grid 1/8 --key-aware --min-note 80

Processing MIDI Input
You can also process existing MIDI files through the quantization pipeline:

    ./hum2midi input.mid output.mid --grid 1/16 --key-aware

This skips the audio processing steps and goes directly to analysis and quantization.
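Sample Output

    ═══════════════════════════════════════════════════════════════
      hum2midi - Melody-to-MIDI Pipeline (Basic Pitch Edition)
      [Key-aware mode enabled]
    ═══════════════════════════════════════════════════════════════
    Input:  my_humming.wav
    Output: my_humming.mid
    → Step 1: Stem Separation (HPSS)
      Separating melodic content...
      Loaded: 5.23s @ 44100Hz
      ✓ Melody stem extracted → 5.23s
    → Step 2: Audio-to-MIDI Conversion (Basic Pitch)
      Running Spotify's Basic Pitch ML model on the melody stem...
      ✓ Raw MIDI generated (Basic Pitch)
    → Step 3: Pitch Analysis & Key Detection
      Detected notes: 42 total, 7 unique pitches
      Note range: C3 - G4
      Pitch classes: C3, E3, G3, A3, C4, D4, G4
      Dominant note: G3 (23.8% of notes)
      Detected key: G major
    → Step 4: Quantization & Cleanup
      Octave pruning: removed 3 harmonic notes above 67 (median+12)
      Overlap pruning: removed 2 harmonic notes at overlapping positions
      Note merging: merged 5 staccato fragments into legato notes (gap <= 60 ticks)
      Grid: 240 ticks (1/16)
      Notes: 38 notes
      Key: G major
      Key-aware: 2 notes corrected to scale
      Tempo: 120 BPM
      ✓ Quantized MIDI saved
    ═══════════════════════════════════════════════════════════════
    ✓ Done! Output: my_humming.mid
    ═══════════════════════════════════════════════════════════════
    📊 Analysis Summary
    ─────────────────────────────────────────────────────────────
    Detected notes: C3, E3, G3, A3, C4, D4, G4
    Detected key: G major
    Quantization: key-aware mode (notes snapped to scale)
    MIDI info: 38 notes, 7 unique pitches, 120 BPM
    Pitches: C3, E3, G3, A3, C4, D4, G4
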
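
The legato merging step reported in the log can be pictured as joining consecutive notes of the same pitch when the silence between them is small. This is a minimal sketch under the assumption of monophonic input, `(start_tick, end_tick, pitch)` tuples, and the 60-tick gap threshold shown in the log. The function name `merge_legato` is hypothetical.

```python
def merge_legato(notes, max_gap=60):
    """Merge same-pitch neighbours whose gap is at most max_gap ticks."""
    merged = []
    for start, end, pitch in sorted(notes):
        if merged:
            prev_start, prev_end, prev_pitch = merged[-1]
            if pitch == prev_pitch and start - prev_end <= max_gap:
                # Extend the previous note instead of starting a new one.
                merged[-1] = (prev_start, max(prev_end, end), pitch)
                continue
        merged.append((start, end, pitch))
    return merged


fragments = [(0, 200, 60), (240, 480, 60), (480, 700, 62), (800, 900, 62)]
print(merge_legato(fragments))
```

The first two fragments are 40 ticks apart and share a pitch, so they become one note. The last two are 100 ticks apart and stay separate.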
Notes & Limitations
Audio Quality Matters
- Clear, loud melody produces the best results
- Background noise can cause false note detection
- Reverb and effects may confuse pitch detection
- Close-mic'd vocals work significantly better than room recordings
Musical Considerations
- Monophonic sources work best (single melody line)
- Polyphonic audio (chords, multiple instruments) will produce messy results
- Vibrato and pitch bends may be quantized to stepped pitches
- Rapid note passages may be missed or merged
Technical Limitations
- Output tempo is fixed at 120 BPM (note times are preserved, but you may need to set the real tempo in your DAW)
- Note velocities are normalized but may need manual adjustment
- Very short notes (<50ms) may be filtered out by default
- Extreme pitch ranges may cause octave detection issues
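
Because the output tempo is fixed at 120 BPM, note times in seconds survive, but bar and beat positions only line up if the melody was actually performed at 120 BPM. The sketch below shows the arithmetic, assuming 960 ticks per quarter note. The helper names are hypothetical and not part of the `hum2midi` script.

```python
def seconds_to_ticks(seconds, bpm=120.0, ticks_per_quarter=960):
    """Tick position of a time in seconds at the given tempo."""
    return round(seconds * bpm / 60.0 * ticks_per_quarter)


def ticks_to_seconds(ticks, bpm=120.0, ticks_per_quarter=960):
    """Time in seconds of a tick position at the given tempo."""
    return ticks * 60.0 / (bpm * ticks_per_quarter)


def retime_ticks(ticks, true_bpm, source_bpm=120.0):
    """Tick position that keeps the same clock time at a different tempo."""
    return round(ticks * true_bpm / source_bpm)


# A note sung 1 second in lands on tick 1920 in the 120 BPM file (beat 3).
tick = seconds_to_ticks(1.0)
# If the melody was really at 90 BPM, the same instant is tick 1440 (beat 2.5).
print(tick, retime_ticks(tick, true_bpm=90))
```

This is why a memo hummed at another tempo can look off-grid in the DAW until the project tempo is adjusted to match.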
Post-Processing Recommendations
After generating MIDI, you may want to:
- Import into your DAW and adjust tempo to match your original recording
- Quantize further if stricter timing is needed
- Adjust note velocities for dynamics
- Apply swing/groove templates if the rigid grid sounds too mechanical
- Edit individual notes that were misdetected (common with fast runs)
Supported Audio Formats
Input formats supported via FFmpeg:
- WAV, AIFF, FLAC (lossless, best quality)
- MP3, M4A, AAC (compressed, acceptable)
- OGG, OPUS (open source formats)
- Most other formats FFmpeg supports
Troubleshooting
No notes detected
- Check that the input file isn't silent or corrupted
- Try lowering the `--min-note` threshold so short notes are not filtered out
- Verify the audio has clear melodic content (not just noise)
Too many notes / messy output
- Confirm that octave pruning and overlap pruning are active (both are on by default)
- Use `--key-aware` to constrain notes to the detected scale
- Check for background noise in the source audio
Wrong key detected
- Key detection works best with at least 8-10 measures of music
- Chromatic passages may confuse the detector
- Manually review and adjust in your DAW if needed
Notes in wrong octave
- Basic Pitch sometimes detects harmonics instead of fundamentals
- The pipeline includes pruning, but some may slip through
- Use your DAW's transpose function for simple octave shifts
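
The sample log describes octave pruning as removing notes above a "median+12" threshold. A minimal sketch of that idea follows, assuming notes are `(start_tick, end_tick, pitch)` tuples. The function name and the choice to prune only upward outliers are assumptions based on that log line, not a description of the script's code.

```python
from statistics import median


def prune_octave_outliers(notes, max_above=12):
    """Drop notes more than max_above semitones above the median pitch."""
    if not notes:
        return []
    centre = median(pitch for _, _, pitch in notes)
    limit = centre + max_above
    return [note for note in notes if note[2] <= limit]


# Five notes around G3 (55) plus one spurious G5 (79), likely a detected harmonic.
melody = [(0, 240, 55), (240, 480, 55), (480, 720, 57),
          (720, 960, 59), (960, 1200, 55), (960, 1200, 79)]
print(prune_octave_outliers(melody))
```

A melody that genuinely spans more than an octave above its median would lose real notes under this rule, which is one reason some octave errors still need manual fixes.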
License
This skill integrates Basic Pitch by Spotify, which is licensed under Apache 2.0. The pipeline script and documentation are provided under MIT license.
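🎵 Voice Note to MIDI
Transform your voice memos, humming, and melodic recordings into clean, quantized MIDI files ready for your digital audio workstation.
What It Does
This skill provides a complete audio-to-MIDI conversion pipeline that includes:
- Stem Separation - Uses HPSS (Harmonic-Percussive Source Separation) to isolate melodic content from drums, noise, and background sounds
- ML-Powered Pitch Detection - Leverages Spotify's Basic Pitch model for accurate fundamental frequency extraction
- Key Detection - Automatically detects the musical key of your recording using Krumhansl-Kessler key profiles
- Intelligent Quantization - Snaps notes to a configurable timing grid with optional key-aware pitch correction
- Post-Processing - Applies octave pruning, overlap-based harmonic removal, and legato note merging for clean output
Pipeline Architecture

            Audio Input (WAV/M4A/MP3)
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 1: Stem Separation (HPSS)          │
    │   - Isolate harmonic content            │
    │   - Remove drums/percussion             │
    │   - Noise gating                        │
    └─────────────────────────────────────────┘
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 2: Pitch Detection                 │
    │   - Basic Pitch ML model (Spotify)      │
    │   - Polyphonic note detection           │
    │   - Onset/offset estimation             │
    └─────────────────────────────────────────┘
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 3: Analysis                        │
    │   - Pitch class distribution            │
    │   - Key detection                       │
    │   - Dominant note identification        │
    └─────────────────────────────────────────┘
                         ↓
    ┌─────────────────────────────────────────┐
    │ Step 4: Quantization & Cleanup          │
    │   - Timing grid snap                    │
    │   - Key-aware pitch correction          │
    │   - Octave pruning (harmonic removal)   │
    │   - Overlap-based pruning               │
    │   - Note merging (legato)               │
    │   - Velocity normalization              │
    └─────────────────────────────────────────┘
                         ↓
         MIDI Output (Standard MIDI File)
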
Setup
Prerequisites
- Python 3.11+ (Python 3.14+ recommended)
- FFmpeg (for audio format support)
- pip
Installation
Quick Install (Recommended):

    cd /path/to/voice-note-to-midi
    ./setup.sh

This automated script will:
- Check that Python 3.11+ is installed
- Create the `~/melody-pipeline` directory
- Set up the virtual environment
- Install all dependencies (basic-pitch, librosa, music21, etc.)
- Download and configure the hum2midi script
- Add melody-pipeline to your PATH
Manual Install:
If you prefer manual setup:

    mkdir -p ~/melody-pipeline
    cd ~/melody-pipeline
    python3 -m venv venv-bp
    source venv-bp/bin/activate
    pip install basic-pitch librosa soundfile mido music21
    chmod +x ~/melody-pipeline/hum2midi

Add to your PATH (optional):

    echo 'export PATH="$HOME/melody-pipeline:$PATH"' >> ~/.bashrc
    source ~/.bashrc

Verify Installation

    cd ~/melody-pipeline
    ./hum2midi --help

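Usage
Basic Usage
Convert a voice memo to MIDI:

    ./hum2midi my_humming.wav

This creates my_humming.mid with 16th-note quantization.
Specify Output File

    ./hum2midi input.wav output.mid

Command-Line Options
| Option | Description | Default |
|---|---|---|
| `--grid <value>` | Quantization grid: `1/4`, `1/8`, `1/16`, `1/32` | `1/16` |
| `--min-note <ms>` | Minimum note duration in milliseconds | `50` |
| `--no-quantize` | Skip quantization (output raw Basic Pitch MIDI) | disabled |
| `--key-aware` | Enable key-aware pitch correction | disabled |
| `--no-analysis` | Skip pitch analysis and key detection | disabled |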
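Usage Examples
Quantize to eighth notes

    ./hum2midi melody.wav --grid 1/8

Key-aware quantization (recommended for tonal music)

    ./hum2midi song.wav --key-aware

Require longer minimum notes

    ./hum2midi humming.wav --min-note 100

Skip analysis for faster processing

    ./hum2midi quick.wav --no-analysis

Combine options

    ./hum2midi recording.wav output.mid --grid 1/8 --key-aware --min-note 80

Processing MIDI Input
You can also process existing MIDI files through the quantization pipeline:

    ./hum2midi input.mid output.mid --grid 1/16 --key-aware

This skips the audio processing steps and goes directly to analysis and quantization.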
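Sample Output

    ═══════════════════════════════════════════════════════════════
      hum2midi - Melody-to-MIDI Pipeline (Basic Pitch Edition)
      [Key-aware mode enabled]
    ═══════════════════════════════════════════════════════════════
    Input:  my_humming.wav
    Output: my_humming.mid
    → Step 1: Stem Separation (HPSS)
      Separating melodic content...
      Loaded: 5.23s @ 44100Hz
      ✓ Melody stem extracted → 5.23s
    → Step 2: Audio-to-MIDI Conversion (Basic Pitch)
      Running Spotify's Basic Pitch ML model on the melody stem...
      ✓ Raw MIDI generated (Basic Pitch)
    → Step 3: Pitch Analysis & Key Detection
      Detected notes: 42 total, 7 unique pitches
      Note range: C3 - G4
      Pitch classes: C3, E3, G3, A3, C4, D4, G4
      Dominant note: G3 (23.8% of notes)
      Detected key: G major
    → Step 4: Quantization & Cleanup
      Octave pruning: removed 3 harmonic notes above 67 (median+12)
      Overlap pruning: removed 2 harmonic notes at overlapping positions
      Note merging: merged 5 staccato fragments into legato notes (gap <= 60 ticks)
      Grid: 240 ticks (1/16)
      Notes: 38 notes
      Key: G major
      Key-aware: 2 notes corrected to scale
      Tempo: 120 BPM
      ✓ Quantized MIDI saved
    ═══════════════════════════════════════════════════════════════
    ✓ Done! Output: my_humming.mid
    ═══════════════════════════════════════════════════════════════
    📊 Analysis Summary
    ─────────────────────────────────────────────────────────────
    Detected notes: C3, E3, G3, A3, C4, D4, G4
    Detected key: G major
    Quantization: key-aware mode (notes snapped to scale)
    MIDI info: 38 notes, 7 unique pitches, 120 BPM
    Pitches: C3, E3, G3, A3, C4, D4, G4
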
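Notes & Limitations
Audio Quality Matters
- Clear, loud melody produces the best results
- Background noise can cause false note detection
- Reverb and effects may confuse pitch detection
- Close-mic'd vocals work significantly better than room recordings
Musical Considerations
- Monophonic sources work best (single melody line)
- Polyphonic audio (chords, multiple instruments) will produce messy results
- Vibrato and pitch bends may be quantized to stepped pitches
- Rapid note passages may be missed or merged
Technical Limitations
- Output tempo is fixed at 120 BPM (note times are preserved, but you may need to set the real tempo in your DAW)
- Note velocities are normalized but may need manual adjustment
- Very short notes (<50ms) may be filtered out by default
- Extreme pitch ranges may cause octave detection issues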