Her Voice 🎙️

Give your agent a voice. Audio responses powered by Kokoro TTS — a compact, naturally expressive model running entirely on-device.

✨ Features

Highly optimized response time thanks to on-the-fly audio streaming technology. 100% free, no API keys required. Inspired by Samantha and Sky.

- ⚡ On-the-fly Streaming: Audio plays as it generates, for very low latency
- 👄 The Voice of an Angel: Kokoro TTS, a cutting-edge local text-to-speech model
- 🧠 TTS Daemon: Keeps the model warm in RAM for instant responses (can be disabled to save RAM)
- 🖥️ Persist Mode: Drag & drop audio, paste text, use as a voice station
- 🔧 Fully Configurable: Voice, speed, visualizer, notification sounds
- 🍎 MLX + PyTorch: Native Metal acceleration on Apple Silicon, PyTorch fallback everywhere else
- 🎨 Real-time Visualizer: Floating 60fps LED bars that react to speech (macOS only)

First-Run Setup

    python3 SKILL_DIR/scripts/setup.py

Note: SKILL_DIR is the root directory of this skill — the agent resolves it automatically when running commands.

The setup wizard will:

1. Detect the platform and select a TTS engine (MLX on Apple Silicon, PyTorch elsewhere)
2. Find or install the appropriate TTS backend (mlx-audio or kokoro)
3. Install espeak-ng (Homebrew on macOS, apt on Linux)
4. Patch the espeak loader if needed (macOS compatibility)
5. Compile the native visualizer binary (macOS only)
6. Download the Kokoro model
7. Create the config at ~/.her-voice/config.json
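
Step 1 can be pictured as a small decision function. The sketch below is an illustration only, not the wizard's actual code: `select_engine` is a hypothetical name, and it assumes that Apple Silicon reports itself as system `Darwin` with machine `arm64` and that an explicit `tts_engine` value (`mlx` or `pytorch`) overrides detection.

```python
import platform


def select_engine(configured="auto", system=None, machine=None):
    """Pick a TTS engine name: 'mlx' on Apple Silicon, 'pytorch' elsewhere.

    Hypothetical helper for illustration; the real setup wizard may differ.
    """
    if configured in ("mlx", "pytorch"):
        return configured  # an explicit tts_engine setting wins
    if configured != "auto":
        raise ValueError(f"unknown tts_engine: {configured!r}")
    system = system or platform.system()
    machine = machine or platform.machine()
    # Assumption: Apple Silicon is reported as Darwin + arm64.
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    return "pytorch"
```

Passing `system` and `machine` explicitly makes the decision easy to check without owning every kind of machine.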

Check status anytime:
    python3 SKILL_DIR/scripts/setup.py status

Post-Setup: Names & Pronunciation

After setup, configure the agent and user names:
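    python3 SKILL_DIR/scripts/config.py set agent_name Jackie
    python3 SKILL_DIR/scripts/config.py set user_name Matúš
    python3 SKILL_DIR/scripts/config.py set user_name_tts Mah-toosh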

TTS pronunciation tip: If the user's name is non-English, figure out a phonetic English spelling that Kokoro will pronounce correctly. Store it in user_name_tts and use that spelling whenever speaking the name aloud. The real name stays in user_name for display purposes.
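
One way an agent might apply this tip is to build the displayed text and the spoken text separately. This is a minimal sketch: `spoken_name` and `greeting` are hypothetical helpers, and only the `user_name` and `user_name_tts` keys come from the configuration described here.

```python
def spoken_name(config):
    """Return the spelling to send to TTS, falling back to the real name."""
    return config.get("user_name_tts") or config.get("user_name", "")


def greeting(config):
    """Build (display_text, tts_text) for the same greeting."""
    display = f"Hello, {config.get('user_name', '')}!"
    spoken = f"Hello, {spoken_name(config)}!"
    return display, spoken
```

With `user_name` set to "Matúš" and `user_name_tts` set to "Mah-toosh", the first string is shown to the user and the second is passed to speak.py.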

Speaking Text

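    # Basic usage
    python3 SKILL_DIR/scripts/speak.py "Hello, world!"

    # Skip the visualizer for this call
    python3 SKILL_DIR/scripts/speak.py --no-viz "Quick note"

    # Save to a file instead of playing
    python3 SKILL_DIR/scripts/speak.py --save /tmp/output.wav "Save this"

    # Override voice or speed
    python3 SKILL_DIR/scripts/speak.py --voice af_bella --speed 1.2 "Faster!"

    # Pipe text from stdin
    echo "Piped text" | python3 SKILL_DIR/scripts/speak.py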

Options

| Flag | Description |
|---|---|
| --no-viz | Skip the visualizer for this call |
| --persist | Keep the visualizer open after speech finishes |

Agent Workflow

When the user wants voice responses:

1. Check voice mode — is voice enabled or did the user ask for it?
2. Play notification sound (instant feedback while TTS generates):

   afplay /System/Library/Sounds/Blow.aiff &

3. Speak the response:

   python3 SKILL_DIR/scripts/speak.py "Response text here"

4. Always provide text alongside voice — accessibility matters.
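
The four steps can be sketched as a single function. This is a hedged illustration rather than the skill's own code: the function names are hypothetical, `SKILL_DIR` stands in for the path the agent resolves, and `afplay` is only available on macOS.

```python
import subprocess

SKILL_DIR = "SKILL_DIR"  # placeholder; the agent resolves the real path


def notification_command(sound="Blow"):
    """Command that plays a macOS system sound."""
    return ["afplay", f"/System/Library/Sounds/{sound}.aiff"]


def speak_command(text, skill_dir=SKILL_DIR):
    """Command that speaks text through speak.py."""
    return ["python3", f"{skill_dir}/scripts/speak.py", text]


def respond_with_voice(text, voice_enabled, sound="Blow"):
    """Always print the text; add sound and speech when voice mode is on."""
    print(text)  # step 4: text always accompanies voice
    if not voice_enabled:  # step 1: check voice mode
        return
    subprocess.Popen(notification_command(sound))  # step 2: do not wait for it
    subprocess.run(speak_command(text), check=False)  # step 3: speak
```

Starting the notification sound with `Popen` mirrors the trailing `&` in the shell command: the sound plays while speech generation is already under way.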

Notification Sound

The notification sound plays instantly (~0.1s) while TTS generates (~0.3-3s). This gives the user immediate feedback that the agent is responding.

Configure in ~/.her-voice/config.json:
    {
      "notification_sound": {
        "enabled": true,
        "sound": "Blow"
      }
    }

Available macOS sounds: Blow, Bottle, Frog, Funk, Glass, Hero, Morse, Ping, Pop, Purr, Sosumi, Submarine, Tink. Located in /System/Library/Sounds/.
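
As an illustration of how these settings could be consumed, the sketch below turns the config into the path of a sound file. `notification_sound_path` is a hypothetical helper, the `.aiff` extension is taken from the `afplay` example in the workflow, and the fallback values (`enabled: true`, `sound: Blow`) follow the documented defaults.

```python
import json

MACOS_SOUNDS = {
    "Blow", "Bottle", "Frog", "Funk", "Glass", "Hero", "Morse",
    "Ping", "Pop", "Purr", "Sosumi", "Submarine", "Tink",
}


def notification_sound_path(config_text):
    """Return the sound file path for the configured sound, or None if disabled."""
    settings = json.loads(config_text).get("notification_sound", {})
    if not settings.get("enabled", True):
        return None
    sound = settings.get("sound", "Blow")
    if sound not in MACOS_SOUNDS:
        raise ValueError(f"not a listed macOS system sound: {sound!r}")
    return f"/System/Library/Sounds/{sound}.aiff"
```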

TTS Daemon

The daemon keeps the Kokoro model warm in RAM, eliminating ~1.1s of startup overhead per call.

The daemon auto-resolves the mlx-audio venv — no need to find the venv Python manually.

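    # Start (keeps running in the background)
    nohup python3 SKILL_DIR/scripts/daemon.py start > /tmp/her-voice-daemon.log 2>&1 & disown

    # Status
    python3 SKILL_DIR/scripts/daemon.py status

    # Stop
    python3 SKILL_DIR/scripts/daemon.py stop

    # Restart
    python3 SKILL_DIR/scripts/daemon.py restart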

speak.py auto-detects the daemon: it uses the daemon if available and otherwise falls back to loading the model directly.

The daemon is optional. Without it, speech still works — just ~1s slower per call as the model loads each time. Skip the daemon to save ~2.3GB RAM.

Note: The daemon writes its PID file and socket after the model is fully loaded and ready to accept connections. They live in ~/.her-voice/ with restricted permissions (owner-only access). The daemon won't survive a reboot — start it again after restart if needed.
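
The detect-then-fall-back behavior can be approximated with a connection probe on the Unix socket. This is a sketch under assumptions, not the logic inside speak.py: `daemon_available` and `choose_path` are hypothetical names, and the default path is the documented `daemon.socket_path` value.

```python
import os
import socket

DEFAULT_SOCKET = os.path.expanduser("~/.her-voice/tts.sock")


def daemon_available(socket_path=DEFAULT_SOCKET, timeout=0.5):
    """Return True if something accepts connections on the Unix socket."""
    if not os.path.exists(socket_path):
        return False
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(socket_path)
        return True
    except OSError:
        return False  # stale socket file, or nothing is listening
    finally:
        sock.close()


def choose_path(socket_path=DEFAULT_SOCKET):
    """Name the route a call would take: warm daemon or direct model load."""
    return "daemon" if daemon_available(socket_path) else "direct"
```

Because the daemon creates its socket only after the model has loaded, a successful connection is a reasonable signal that a request will be served without the load delay.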

Visualizer

A floating overlay with three animated LED bars that react to speech in real-time. 60fps, native macOS (Cocoa + AVFoundation). macOS only — on other platforms, audio plays without the visualizer.

Modes

- v2 (default): Three-tier pure red, center bar shows raw amplitude, side bars follow with lag
- classic: Original smooth gradient look

Controls

| Key | Action |
|---|---|
| ESC | Quit |
| Space | Pause/Resume (file mode) |
| ← → | Seek ±5s (file mode) |
| ⌘V | Paste text to speak (persist mode) |

Persist Mode

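Keep the visualizer on screen between playbacks and use it as a standalone voice station:

    # Launch in persist mode (stays open, idle breathing animation)
    ~/.her-voice/bin/her-voice-viz --persist

    # Stream mode + persist (stays open after speech finishes)
    python3 SKILL_DIR/scripts/speak.py --persist "Hello!"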

In persist mode:

- Drag & drop audio files (.wav, .mp3, .aiff, .m4a) onto the visualizer to play them
- ⌘V pastes clipboard text → streams directly from the TTS daemon with full visualizer animation
- Idle breathing: subtle center bar pulse while waiting for input

Standalone Usage

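    # Play a file with the visualizer
    ~/.her-voice/bin/her-voice-viz --audio /path/to/file.wav

    # Demo mode (simulated audio)
    ~/.her-voice/bin/her-voice-viz --demo

    # Stream raw PCM
    cat audio.raw | ~/.her-voice/bin/her-voice-viz --stream --sample-rate 24000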

Disable Visualizer

    python3 SKILL_DIR/scripts/config.py set visualizer.enabled false

Configuration

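Config file: ~/.her-voice/config.json

    # View all settings
    python3 SKILL_DIR/scripts/config.py status

    # Get a value
    python3 SKILL_DIR/scripts/config.py get voice

    # Set a value (nested keys use dot notation)
    python3 SKILL_DIR/scripts/config.py set speed 1.1
    python3 SKILL_DIR/scripts/config.py set visualizer.mode classic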
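
Nested settings such as visualizer.mode are addressed with dot notation. The sketch below shows one common way to resolve a dotted key against a nested dictionary; `get_value` and `set_value` are hypothetical helpers, and config.py may implement this differently.

```python
def get_value(config, dotted_key):
    """Read a nested value, e.g. get_value(cfg, 'visualizer.mode')."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def set_value(config, dotted_key, value):
    """Write a nested value, creating intermediate dictionaries as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
```

For example, `set_value(cfg, "visualizer.mode", "classic")` updates `cfg["visualizer"]["mode"]` and leaves the other visualizer settings untouched.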

Key Settings

| Key | Default | Description |
|---|---|---|
| agent_name | INLINECODE32 | Agent's name (e.g. "Jackie") |
| user_name | "" | User's real name |
| user_name_tts | "" | Phonetic spelling for TTS (e.g. "Mah-toosh" for Matúš) |
| voice | af_heart | Base voice name |
| voice_blend | {af_heart: 0.6, af_sky: 0.4} | Voice blend weights |
| speed | 1.05 | Speech speed multiplier |
| language | en | Language code |
| tts_engine | auto | TTS engine: auto, mlx, or pytorch |
| model | mlx-community/Kokoro-82M-bf16 | Model identifier (MLX) |
| visualizer.enabled | true | Show visualizer overlay |
| visualizer.mode | v2 | Animation mode (v2/classic) |
| visualizer.remember_position | true | Save window position between sessions |
| notification_sound.enabled | true | Play sound before speaking |
| notification_sound.sound | Blow | macOS system sound name |
| daemon.auto_start | true | Advisory flag only; the daemon never self-starts. When true, the agent should start it on first voice use (saves ~1s/call, costs ~2.3GB RAM) |
| daemon.socket_path | ~/.her-voice/tts.sock | Unix socket path |

Voice Selection

Voice Blending

Mix multiple voices for a unique sound. Configure voice_blend in config:
CODEBLOCK12

The blended voice is stored as a .safetensors file in the model's voices directory (e.g., af_heart_60_af_sky_40.safetensors). Create it by running TTS once — speak.py looks for the pre-blended file automatically.
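
The example file name suggests a pattern of voice name followed by its weight as a percentage. The sketch below reproduces that one example; the general naming rule is an assumption inferred from it, and `blend_filename` is a hypothetical helper rather than part of speak.py.

```python
def blend_filename(weights):
    """Build a name like af_heart_60_af_sky_40.safetensors from blend weights.

    Assumption: each voice contributes '<voice>_<percent>' in config order.
    """
    parts = []
    for voice, weight in weights.items():
        parts.append(f"{voice}_{round(weight * 100)}")
    return "_".join(parts) + ".safetensors"
```

Called with the default blend `{"af_heart": 0.6, "af_sky": 0.4}`, this yields the file name quoted above.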

Error Handling

| Error | Cause | Fix |
|---|---|---|
| "mlx-audio not found" | Venv missing or broken | Run INLINECODE71 |
| "espeak-ng not found" | | |

Requirements

- macOS + Apple Silicon recommended for best experience (MLX engine + visualizer + notification sounds)
- Linux/Intel Mac supported via PyTorch Kokoro engine (no visualizer)
- Windows is not supported
- Xcode Command Line Tools for the visualizer on macOS (xcode-select --install)
- espeak-ng for phonemization (brew install espeak-ng on macOS, apt install espeak-ng on Linux)
- ~500MB disk (model + venv)
- ~2.3GB RAM when the daemon is running

Uninstall

Remove all Her Voice data (config, venvs, compiled binary, daemon state):
CODEBLOCK13

How It Works

1. Kokoro 82M: A compact neural TTS model with two backends, MLX (Apple's framework for native Metal GPU acceleration on Apple Silicon) and PyTorch (works everywhere). The engine is auto-detected based on platform, or can be forced via the tts_engine config option (auto, mlx, or pytorch).
2. Streaming: Audio generates and plays simultaneously. First sound in ~0.3s (with daemon) vs ~3s batch.
3. Visualizer: Native macOS app (Swift/Cocoa) that reads raw PCM from stdin and plays it via AVAudioEngine with real-time amplitude metering.
4. Daemon: Unix socket server holding the model in RAM. Eliminates Python import + model load overhead on every call.
