Her Voice 🎙️
Give your agent a voice. Audio responses powered by Kokoro TTS — a compact, naturally expressive model running entirely on-device.
✨ Features
Highly optimized response time thanks to on-the-fly audio streaming technology. 100% free, no API keys required. Inspired by Samantha and Sky.
- ⚡ On-the-fly Streaming: Audio plays as it generates, with very low latency
- 👄 The voice of an angel: Cutting-edge local text-to-speech model Kokoro TTS
- 🧠 TTS Daemon: Keeps the model warm in RAM for instant responses (can be disabled to save RAM)
- 🖥️ Persist Mode: Drag & drop audio, paste text, use as a voice station
- 🔧 Fully Configurable: Voice, speed, visualizer, notification sounds
- 🍎 MLX + PyTorch: Native Metal acceleration on Apple Silicon, PyTorch fallback everywhere else
- 🎨 Real-time Visualizer: Floating 60fps LED bars that react to speech (macOS only)
First-Run Setup
```bash
python3 SKILL_DIR/scripts/setup.py
```
Note: SKILL_DIR is the root directory of this skill — the agent resolves it automatically when running commands.
The setup wizard will:
1. Detect the platform and select a TTS engine (MLX on Apple Silicon, PyTorch elsewhere)
2. Find or install the appropriate TTS backend (mlx-audio or kokoro)
3. Install espeak-ng (Homebrew on macOS, apt on Linux)
4. Patch the espeak loader if needed (macOS compatibility)
5. Compile the native visualizer binary (macOS only)
6. Download the Kokoro model
7. Create the config at ~/.her-voice/config.json
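The engine-selection step can be sketched in a few lines of Python. This is a minimal sketch, not the wizard's actual code: the function name `select_engine` is hypothetical, and it assumes Apple Silicon is identified by `platform.system()` returning `"Darwin"` together with `platform.machine()` returning `"arm64"`.

```python
import platform


def select_engine(configured="auto", system=None, machine=None):
    """Return 'mlx' on Apple Silicon and 'pytorch' elsewhere.

    `configured` mirrors the tts_engine setting; an explicit 'mlx' or
    'pytorch' overrides auto-detection. (Hypothetical helper.)
    """
    if configured in ("mlx", "pytorch"):
        return configured
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    return "pytorch"


print(select_engine())
```

The optional `system` and `machine` arguments exist only so the decision can be exercised without running on each platform.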
Check status anytime:
```bash
python3 SKILL_DIR/scripts/setup.py status
```
Post-Setup: Names & Pronunciation
After setup, configure the agent and user names:
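```bash
python3 SKILL_DIR/scripts/config.py set agent_name Jackie
python3 SKILL_DIR/scripts/config.py set user_name Matúš
python3 SKILL_DIR/scripts/config.py set user_name_tts Mah-toosh
```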
TTS pronunciation tip: If the user's name is non-English, figure out a phonetic English spelling that Kokoro will pronounce correctly. Store it in user_name_tts and use that spelling whenever speaking the name aloud. The real name stays in user_name for display purposes.
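A short sketch of how an agent might choose which spelling to speak, assuming the config has already been loaded into a dictionary. The helper name `spoken_name` is hypothetical; only the keys `user_name` and `user_name_tts` come from this document.

```python
def spoken_name(config):
    """Prefer the phonetic spelling for speech; fall back to the real name."""
    return config.get("user_name_tts") or config.get("user_name", "")


config = {"user_name": "Matúš", "user_name_tts": "Mah-toosh"}
greeting = f"Good morning, {spoken_name(config)}."
print(greeting)
```

The displayed text would still use `config["user_name"]`; only the string handed to TTS uses the phonetic form.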
Speaking Text
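```bash
# Basic usage
python3 SKILL_DIR/scripts/speak.py "Hello, world!"

# Skip the visualizer for this call
python3 SKILL_DIR/scripts/speak.py --no-viz "Quick note"

# Save to a file instead of playing
python3 SKILL_DIR/scripts/speak.py --save /tmp/output.wav "Save this"

# Override voice or speed
python3 SKILL_DIR/scripts/speak.py --voice af_bella --speed 1.2 "Faster!"

# Pipe text from stdin
echo "Piped text" | python3 SKILL_DIR/scripts/speak.py
```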
Options
| Flag | Description |
|---|---|
| --no-viz | Skip the visualizer for this call |
| --persist | Keep visualizer open after playback ends |
| --save PATH | Save audio to WAV file instead of playing |
| --voice NAME | Override the configured voice |
| --speed N | Override the configured speed multiplier |
| --mode MODE | Override visualizer mode (v2 or classic) |
Agent Workflow
When the user wants voice responses:
1. Check voice mode: is voice enabled, or did the user ask for it?
2. Play the notification sound (instant feedback while TTS generates):
   afplay /System/Library/Sounds/Blow.aiff &
3. Speak the response:
   python3 SKILL_DIR/scripts/speak.py "Response text here"
4. Always provide text alongside voice; accessibility matters.
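The four steps above can be sketched as a small Python wrapper. This is an illustration under stated assumptions, not part of the skill: `build_commands` and `respond` are hypothetical names, the `dry_run` flag exists only so the logic can be exercised without audio hardware, and `afplay` is available on macOS only.

```python
import subprocess


def build_commands(text, skill_dir):
    """Return the notification-sound and speak commands as argument lists."""
    sound_cmd = ["afplay", "/System/Library/Sounds/Blow.aiff"]
    speak_cmd = ["python3", f"{skill_dir}/scripts/speak.py", text]
    return sound_cmd, speak_cmd


def respond(text, skill_dir, voice_enabled, dry_run=False):
    """Return the text reply; also speak it when voice mode is on."""
    if voice_enabled:  # step 1: check voice mode
        sound_cmd, speak_cmd = build_commands(text, skill_dir)
        if not dry_run:
            subprocess.Popen(sound_cmd)             # step 2: do not wait for the sound
            subprocess.run(speak_cmd, check=False)  # step 3: waits for speak.py to exit
    return text  # step 4: the text is always returned
```

Passing arguments as a list avoids shell quoting problems when the response text contains quotes or other special characters.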
Notification Sound
The notification sound plays instantly (~0.1s) while TTS generates (~0.3-3s). This gives the user immediate feedback that the agent is responding.
Configure in ~/.her-voice/config.json:
```json
{
  "notification_sound": {
    "enabled": true,
    "sound": "Blow"
  }
}
```
Available macOS sounds: Blow, Bottle, Frog, Funk, Glass, Hero, Morse, Ping, Pop, Purr, Sosumi, Submarine, Tink. Located in /System/Library/Sounds/.
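As a rough sketch, the configured sound name can be mapped to a file path like this. The helper `sound_path` is hypothetical, and the `.aiff` extension is assumed from the `Blow.aiff` example used elsewhere in this document.

```python
from pathlib import PurePosixPath

SOUND_DIR = PurePosixPath("/System/Library/Sounds")


def sound_path(config):
    """Return the notification sound file path, or None when disabled."""
    settings = config.get("notification_sound", {})
    if not settings.get("enabled", True):
        return None
    # "Blow" is the documented default sound name.
    return SOUND_DIR / f"{settings.get('sound', 'Blow')}.aiff"


print(sound_path({"notification_sound": {"enabled": True, "sound": "Glass"}}))
```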
TTS Daemon
The daemon keeps the Kokoro model warm in RAM, eliminating ~1.1s of startup overhead per call.
The daemon auto-resolves the mlx-audio venv — no need to find the venv Python manually.
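```bash
# Start (keeps running in the background)
nohup python3 SKILL_DIR/scripts/daemon.py start > /tmp/her-voice-daemon.log 2>&1 & disown

# Status
python3 SKILL_DIR/scripts/daemon.py status

# Stop
python3 SKILL_DIR/scripts/daemon.py stop

# Restart
python3 SKILL_DIR/scripts/daemon.py restart
```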
speak.py auto-detects the daemon: it uses the daemon if available and falls back to direct model loading otherwise.
The daemon is optional. Without it, speech still works — just ~1s slower per call as the model loads each time. Skip the daemon to save ~2.3GB RAM.
Note: The daemon writes its PID file and socket after the model is fully loaded and ready to accept connections. They live in ~/.her-voice/ with restricted permissions (owner-only access). The daemon won't survive a reboot — start it again after restart if needed.
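Because the socket only appears once the model is loaded, a client can treat a successful connection as a readiness signal. The following is a minimal sketch of such a check, not the logic speak.py actually uses; the function name `daemon_ready` and the 0.5s timeout are assumptions, while the default path matches `daemon.socket_path`.

```python
import os
import socket


def daemon_ready(socket_path="~/.her-voice/tts.sock", timeout=0.5):
    """Return True if something is accepting connections on the Unix socket."""
    path = os.path.expanduser(socket_path)
    if not os.path.exists(path):
        return False  # daemon not started, or model still loading
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(path)
        return True
    except OSError:
        return False  # stale socket file, e.g. after a crash
    finally:
        client.close()
```

A caller would fall back to direct model loading whenever this returns False.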
Visualizer
A floating overlay with three animated LED bars that react to speech in real-time. 60fps, native macOS (Cocoa + AVFoundation). macOS only — on other platforms, audio plays without the visualizer.
Modes
- v2 (default): Three-tier pure red, center raw amplitude, sides with lag
- classic: Original smooth gradient look
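To illustrate what "center raw amplitude, sides with lag" could mean, here is a sketch in which the side bars follow the center bar through exponential smoothing. This document does not say how the visualizer computes the lag, so the smoothing method, the `lag` factor, and the function name `bar_levels` are all assumptions.

```python
def bar_levels(amplitudes, lag=0.3):
    """Yield (left, center, right) levels for each amplitude frame.

    The center bar shows the raw amplitude; both side bars move a
    fraction `lag` of the way toward it on every frame.
    """
    side = 0.0
    frames = []
    for amplitude in amplitudes:
        side += lag * (amplitude - side)
        frames.append((side, amplitude, side))
    return frames


for frame in bar_levels([1.0, 1.0, 0.0], lag=0.5):
    print(frame)
```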
Controls
| Key | Action |
|---|---|
| | Pause/Resume (file mode) |
| ← → | Seek ±5s (file mode) |
| ⌘V | Paste text to speak (persist mode) |
Persist Mode
Keep the visualizer on screen between playbacks. Use as a standalone voice station:
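```bash
# Launch in persist mode (stays open, idle breathing animation)
~/.her-voice/bin/her-voice-viz --persist

# Stream mode + persist (stays open after speech ends)
python3 SKILL_DIR/scripts/speak.py --persist "Hello!"
```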
In persist mode:
- Drag & drop audio files (.wav, .mp3, .aiff, .m4a) onto the visualizer to play them
- ⌘V pastes clipboard text → streams directly from TTS daemon with full visualizer animation
- Idle breathing: subtle center bar pulse when waiting for input
Standalone Usage
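```bash
# Play a file with the visualizer
~/.her-voice/bin/her-voice-viz --audio /path/to/file.wav

# Demo mode (simulated audio)
~/.her-voice/bin/her-voice-viz --demo

# Stream raw PCM
cat audio.raw | ~/.her-voice/bin/her-voice-viz --stream --sample-rate 24000
```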
Disable Visualizer
```bash
python3 SKILL_DIR/scripts/config.py set visualizer.enabled false
```
Configuration
Config file: ~/.her-voice/config.json
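```bash
# View all settings
python3 SKILL_DIR/scripts/config.py status

# Get a value
python3 SKILL_DIR/scripts/config.py get voice

# Set a value (dot notation for nested keys)
python3 SKILL_DIR/scripts/config.py set speed 1.1
python3 SKILL_DIR/scripts/config.py set visualizer.mode classic
```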
Key Settings
| Key | Default | Description |
|---|---|---|
| agent_name | INLINECODE32 | Agent's name (e.g. "Jackie") |
| user_name | "" | User's real name |
| user_name_tts | "" | Phonetic spelling for TTS (e.g. "Mah-toosh" for Matúš) |
| voice | af_heart | Base voice name |
| voice_blend | {af_heart: 0.6, af_sky: 0.4} | Voice blend weights |
| speed | 1.05 | Speech speed multiplier |
| language | en | Language code |
| tts_engine | auto | TTS engine: auto, mlx, or pytorch |
| model | mlx-community/Kokoro-82M-bf16 | Model identifier (MLX) |
| visualizer.enabled | true | Show visualizer overlay |
| visualizer.mode | v2 | Animation mode (v2/classic) |
| visualizer.remember_position | true | Save window position between sessions |
| notification_sound.enabled | true | Play sound before speaking |
| notification_sound.sound | Blow | macOS system sound name |
| daemon.auto_start | true | Advisory flag only: the daemon never self-starts. When true, the agent should start it on first voice use (saves ~1s/call, costs ~2.3GB RAM) |
| daemon.socket_path | ~/.her-voice/tts.sock | Unix socket path |
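Dotted keys such as visualizer.mode address values nested inside the JSON config. The sketch below shows one way to read and write such keys on a plain dictionary; it is not config.py's actual implementation, and the helper names `get_key` and `set_key` are hypothetical.

```python
def get_key(config, dotted_key, default=None):
    """Look up a value such as 'visualizer.mode' in a nested dict."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def set_key(config, dotted_key, value):
    """Set a nested value, creating intermediate dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


cfg = {"speed": 1.05, "visualizer": {"enabled": True, "mode": "v2"}}
set_key(cfg, "visualizer.mode", "classic")
print(get_key(cfg, "visualizer.mode"))
```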
Voice Selection
Voice Blending
Mix multiple voices for a unique sound. Configure voice_blend in config:
CODEBLOCK12
The blended voice is stored as a .safetensors file in the model's voices directory (e.g., af_heart_60_af_sky_40.safetensors). Create it by running TTS once — speak.py looks for the pre-blended file automatically.
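The file name in that example suggests a naming scheme of voice name followed by its weight as a whole-number percentage. The sketch below reproduces the example under that assumption; the scheme is inferred from the single file name given above, voices are assumed to appear in config order, and `blend_filename` is a hypothetical helper.

```python
def blend_filename(voice_blend):
    """Build a file name like 'af_heart_60_af_sky_40.safetensors'."""
    parts = [f"{name}_{round(weight * 100)}" for name, weight in voice_blend.items()]
    return "_".join(parts) + ".safetensors"


print(blend_filename({"af_heart": 0.6, "af_sky": 0.4}))
```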
Error Handling
| Error | Cause | Fix |
|---|---|---|
| "mlx-audio not found" | Venv missing or broken | Run INLINECODE71 |
| "espeak-ng not found" | Phonemizer missing | brew install espeak-ng |
| Compilation failed | Xcode tools missing | xcode-select --install |
| "Model not found" | First run, no download | Run setup.py or speak once |
| Daemon "not running" | Crashed or rebooted | Start daemon again |
| No sound output | macOS audio permissions | Check System Settings → Sound → Output |
| Visualizer not showing | Binary not compiled | Run setup.py |
| "kokoro not found" | PyTorch venv missing | Run setup.py |
| PyTorch CUDA error | GPU driver mismatch | pip install torch --force-reinstall in kokoro venv |
| "soundfile not found" | Missing dependency | pip install soundfile in kokoro venv |
Requirements
- macOS + Apple Silicon recommended for best experience (MLX engine + visualizer + notification sounds)
- Linux/Intel Mac supported via PyTorch Kokoro engine (no visualizer)
- Windows is not supported
- Xcode Command Line Tools for visualizer on macOS (xcode-select --install)
- espeak-ng for phonemization (brew install espeak-ng on macOS, apt install espeak-ng on Linux)
- ~500MB disk (model + venv)
- ~2.3GB RAM when daemon is running
Uninstall
Remove all Her Voice data (config, venvs, compiled binary, daemon state):
CODEBLOCK13
How It Works
1. Kokoro 82M: A compact neural TTS model with two backends, MLX (Apple's framework for native Metal GPU acceleration on Apple Silicon) and PyTorch (works everywhere). The engine is auto-detected based on platform, or can be forced via the tts_engine config option (auto, mlx, or pytorch).
2. Streaming: Audio generates and plays simultaneously. First sound in ~0.3s (with daemon) vs ~3s batch.
3. Visualizer: Native macOS app (Swift/Cocoa) reads raw PCM from stdin, plays via AVAudioEngine with real-time amplitude metering.
4. Daemon: Unix socket server holding the model in RAM. Eliminates Python import + model load overhead on every call.
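Amplitude metering on a raw PCM stream usually comes down to computing a level per chunk of samples. The sketch below shows a root-mean-square level in Python for illustration only; the real visualizer is a Swift app, and the sample format assumed here (16-bit signed little-endian mono) is not stated in this document. `rms_level` is a hypothetical helper.

```python
import math
import struct


def rms_level(chunk: bytes) -> float:
    """RMS amplitude of 16-bit signed little-endian PCM, scaled to 0.0-1.0."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    return math.sqrt(mean_square) / 32768.0


half_scale = struct.pack("<4h", 16384, -16384, 16384, -16384)
print(rms_level(half_scale))
```

A streaming consumer would call this on each chunk read from stdin and feed the result to the bars.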
Her Voice 🎙️
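Give your agent a voice. Audio responses powered by Kokoro TTS, a compact, naturally expressive model that runs entirely on-device.
✨ Features
Response time is highly optimized thanks to on-the-fly audio streaming. 100% free, no API keys required. Inspired by Samantha and Sky.
- ⚡ On-the-fly Streaming: Audio plays as it generates, with very low latency
- 👄 The voice of an angel: Cutting-edge local text-to-speech model Kokoro TTS
- 🧠 TTS Daemon: Keeps the model resident in RAM for instant responses (can be disabled to save RAM)
- 🖥️ Persist Mode: Drag & drop audio, paste text, use as a voice station
- 🔧 Fully Configurable: Voice, speed, visualizer, notification sounds
- 🍎 MLX + PyTorch: Native Metal acceleration on Apple Silicon, PyTorch fallback on other platforms
- 🎨 Real-time Visualizer: Floating 60fps LED bars that react to speech in real time (macOS only)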
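First-Run Setup
```bash
python3 SKILL_DIR/scripts/setup.py
```
Note: SKILL_DIR is the root directory of this skill; the agent resolves it automatically when running commands.
The setup wizard will:
1. Detect the platform and select a TTS engine (MLX on Apple Silicon, PyTorch elsewhere)
2. Find or install the appropriate TTS backend (mlx-audio or kokoro)
3. Install espeak-ng (Homebrew on macOS, apt on Linux)
4. Patch the espeak loader if needed (macOS compatibility)
5. Compile the native visualizer binary (macOS only)
6. Download the Kokoro model
7. Create the config at ~/.her-voice/config.json
Check status anytime:
```bash
python3 SKILL_DIR/scripts/setup.py status
```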
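Post-Setup: Names & Pronunciation
After setup, configure the agent and user names:
```bash
python3 SKILL_DIR/scripts/config.py set agent_name Jackie
python3 SKILL_DIR/scripts/config.py set user_name Matúš
python3 SKILL_DIR/scripts/config.py set user_name_tts Mah-toosh
```
TTS pronunciation tip: If the user's name is not English, work out a phonetic English spelling that Kokoro pronounces correctly. Store it in user_name_tts and use that spelling whenever the name is spoken aloud. The real name stays in user_name for display.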
Speaking Text
```bash
# Basic usage
python3 SKILL_DIR/scripts/speak.py "Hello, world!"

# Skip the visualizer for this call
python3 SKILL_DIR/scripts/speak.py --no-viz "Quick note"

# Save to a file instead of playing
python3 SKILL_DIR/scripts/speak.py --save /tmp/output.wav "Save this"

# Override voice or speed
python3 SKILL_DIR/scripts/speak.py --voice af_bella --speed 1.2 "Faster!"

# Pipe text from stdin
echo "Piped text" | python3 SKILL_DIR/scripts/speak.py
```
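Options
| Flag | Description |
|---|---|
| --no-viz | Skip the visualizer for this call |
| --persist | Keep visualizer open after playback ends |
| --save PATH | Save audio to WAV file instead of playing |
| --voice NAME | Override the configured voice |
| --speed N | Override the configured speed multiplier |
| --mode MODE | Override visualizer mode (v2 or classic) |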
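Agent Workflow
When the user wants voice responses:
1. Check voice mode: is voice enabled, or did the user ask for voice?
2. Play the notification sound (instant feedback while TTS generates):
   ```bash
   afplay /System/Library/Sounds/Blow.aiff &
   ```
3. Speak the response:
   ```bash
   python3 SKILL_DIR/scripts/speak.py "Response text here"
   ```
4. Always provide text alongside voice; accessibility matters.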
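Notification Sound
The notification sound plays instantly (~0.1s) while TTS generates (~0.3-3s). This gives the user immediate feedback that the agent is responding.
Configure in ~/.her-voice/config.json:
```json
{
  "notification_sound": {
    "enabled": true,
    "sound": "Blow"
  }
}
```
Available macOS sounds: Blow, Bottle, Frog, Funk, Glass, Hero, Morse, Ping, Pop, Purr, Sosumi, Submarine, Tink. Located in /System/Library/Sounds/.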
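TTS Daemon
The daemon keeps the Kokoro model resident in RAM, eliminating ~1.1s of startup overhead per call.
The daemon auto-resolves the mlx-audio venv, so there is no need to find the venv Python manually.
```bash
# Start (keeps running in the background)
nohup python3 SKILL_DIR/scripts/daemon.py start > /tmp/her-voice-daemon.log 2>&1 & disown

# Status
python3 SKILL_DIR/scripts/daemon.py status

# Stop
python3 SKILL_DIR/scripts/daemon.py stop

# Restart
python3 SKILL_DIR/scripts/daemon.py restart
```
speak.py auto-detects the daemon: it uses the daemon if available and falls back to direct model loading otherwise.
The daemon is optional. Without it, speech still works, only ~1s slower per call because the model loads each time. Skip the daemon to save ~2.3GB RAM.
Note: The daemon writes its PID file and socket only after the model is fully loaded and ready to accept connections. They live in ~/.her-voice/ with restricted permissions (owner-only access). The daemon does not survive a reboot; start it again after a restart if needed.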
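Visualizer
A floating overlay with three animated LED bars that react to speech in real time. 60fps, native macOS (Cocoa + AVFoundation). macOS only: on other platforms, audio plays without the visualizer.
Modes
- v2 (default): Three-tier pure red, center raw amplitude, sides with lag
- classic: Original smooth gradient look
Controls
| Key | Action |
|---|---|
| | Pause/Resume (file mode) |
| ← → | Seek back/forward ±5s (file mode) |
| ⌘V | Paste text to speak (persist mode) |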
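Persist Mode
Keep the visualizer on screen between playbacks. Use as a standalone voice station:
```bash
# Launch in persist mode (stays open, idle breathing animation)
~/.her-voice/bin/her-voice-viz --persist

# Stream mode + persist (stays open after speech ends)
python3 SKILL_DIR/scripts/speak.py --persist "Hello!"
```
In persist mode:
- Drag & drop audio files (.wav, .mp3, .aiff, .m4a) onto the visualizer to play them
- ⌘V pastes clipboard text → streams directly from the TTS daemon with full visualizer animation
- Idle breathing: subtle center bar pulse while waiting for input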
Standalone Usage
```bash
# Play a file with the visualizer
~/.her-voice/bin/her-voice-viz --audio /path/to/file.wav

# Demo mode (simulated audio)
~/.her-voice/bin/her-voice-viz --demo

# Stream raw PCM
cat audio.raw | ~/.her-voice/bin/her-voice-viz --stream --sample-rate 24000
```
Disable Visualizer
```bash
python3 SKILL_DIR/scripts/config.py set visualizer.enabled false
```
Configuration
Config file: ~/.her-voice/config.json
```bash
# View all settings
python3 SKILL_DIR/scripts/config.py status

# Get a value
python3 SKILL_DIR/scripts/config.py get voice

# Set a value (dot notation for nested keys)
python3 SKILL_DIR/scripts/config.py set speed 1.1
python3 SKILL_DIR/scripts/config.py set visualizer.mode classic
```
Key Settings
Agent