OmniVoice

Ten operations across four capabilities: identify (认) · manage (存) · transcribe (听) · clone (说).

Dependencies

Component	Install	Purpose
Whisper	pip install openai-whisper	Speech-to-text
Speaker ID

Voice references are stored in voice-refs/ at workspace root.
Metadata lives in TOOLS.md under a "Voice Library" section.
See references/voice-library-format.md for format spec.

Operations

Op 1 · Speaker Identification (声纹查询)

Input: audio → Output: who is speaking (or "unknown")

    python3 scripts/voice_identify.py <audio_file> [--threshold 0.75]

Compares audio against all voice-refs/*-ref*.* using UniSpeech-SAT x-vector embeddings.
First run downloads model (~360MB) to /tmp/hf_models/.

Accuracy: Reliably separates male/female voices. Same-gender speakers need ≥5s audio for best results. Threshold 0.75 is default; raise to 0.85 for stricter matching.
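The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the code in voice_identify.py: it assumes embeddings have already been extracted as plain lists of floats, and the names cosine_similarity and identify, as well as the dictionary layout of the library, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(query, library, threshold=0.75):
    """Return (name, score) for the best match, or ("unknown", score).

    `library` maps speaker name -> reference embedding (assumed layout).
    """
    best_name, best_score = "unknown", -1.0
    for name, ref in library.items():
        score = cosine_similarity(query, ref)
        if score > best_score:
            best_name, best_score = name, score
    # Anything below the threshold is reported as unknown.
    if best_score < threshold:
        return "unknown", best_score
    return best_name, best_score
```

Raising the threshold to 0.85 in a sketch like this simply rejects more borderline matches, which is the trade-off the accuracy note describes.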

Op 2 · Add Voice to Library (声音入库)

Input: audio + speaker name → stores in voice library

1. Copy audio to voice-refs/<name>-ref1.<ext>
2. Transcribe to get reference text: whisper <audio> --model small --output_format txt --output_dir /tmp
3. Add entry to TOOLS.md (see format in references/)
4. Register speaker in voice_identify.py SPEAKER_MAP

Good reference audio: 10-15s clear speech, minimal noise, natural pace. 5s minimum.
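As a rough sketch of step 1 and the 5s minimum, the copy and a duration check might look like this. The helper names add_reference and wav_duration_seconds are hypothetical, and the duration check assumes an uncompressed PCM WAV file (other formats would need a different reader); nothing here is taken from the project's scripts.

```python
import shutil
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def add_reference(src, name, refs_dir="voice-refs", index=1):
    """Copy `src` to <refs_dir>/<name>-ref<index>.<ext> and return the new path."""
    src = Path(src)
    refs = Path(refs_dir)
    refs.mkdir(parents=True, exist_ok=True)
    dest = refs / f"{name}-ref{index}{src.suffix}"
    if dest.exists():
        # Avoid silently overwriting an existing reference.
        raise FileExistsError(f"{dest} already exists; pick another index")
    shutil.copy2(src, dest)
    return dest
```

A caller could reject clips where wav_duration_seconds(src) is under 5 before copying. Steps 2 to 4 (transcription, TOOLS.md, SPEAKER_MAP) are still manual in this sketch.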

Op 3 · Voice Library CRUD (声音库管理)

- List: Check TOOLS.md voice library section + ls voice-refs/
- Add: See Op 2
- Update: Replace file in voice-refs/, update TOOLS.md entry
- Delete: Remove file from voice-refs/, remove TOOLS.md entry, remove from SPEAKER_MAP
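The List operation can be approximated in code by grouping reference files by speaker. The sketch below assumes the <name>-ref<N>.<ext> naming convention from Op 2; list_speakers is a hypothetical helper and does not read TOOLS.md or SPEAKER_MAP.

```python
from pathlib import Path


def list_speakers(refs_dir="voice-refs"):
    """Map speaker name -> sorted reference file names matching *-ref*.*."""
    root = Path(refs_dir)
    if not root.is_dir():
        return {}
    speakers = {}
    for path in sorted(root.glob("*-ref*.*")):
        # Everything before the last "-ref" is treated as the speaker name.
        name = path.name.rsplit("-ref", 1)[0]
        speakers.setdefault(name, []).append(path.name)
    return speakers
```

Comparing this result with the TOOLS.md entries is one way to spot a reference file that was deleted without its metadata, or the reverse.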

Op 4 · Voice Clone (声音克隆)

Input: text + library speaker → Output: audio in that speaker's voice

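    set -a; source <env file containing SF_API_KEY>; set +a

    python3 scripts/cosyvoice_clone.py \
      --text "<text to speak>" \
      --ref voice-refs/<speaker>-ref1.<ext> \
      --ref-text "<transcript of the reference audio>" \
      --output /tmp/clone_output.wav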

Long reference (>15s): truncate first with ffmpeg -y -i <ref> -t 15 -ar 24000 -ac 1 /tmp/ref_trimmed.wav.
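If the trimming step is driven from Python rather than typed by hand, the same ffmpeg invocation can be built as an argument list. This is only a sketch: trim_command and trim_reference are hypothetical names, and ffmpeg is assumed to be on PATH.

```python
import subprocess


def trim_command(ref, out="/tmp/ref_trimmed.wav", seconds=15):
    """Build the ffmpeg argv that keeps the first `seconds` of `ref`
    as 24 kHz mono audio."""
    return [
        "ffmpeg", "-y",
        "-i", str(ref),
        "-t", str(seconds),
        "-ar", "24000",
        "-ac", "1",
        str(out),
    ]


def trim_reference(ref, out="/tmp/ref_trimmed.wav", seconds=15):
    """Run the trim and return the output path; raises if ffmpeg fails."""
    subprocess.run(trim_command(ref, out, seconds), check=True)
    return out
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.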

Op 5 · Transcribe (纯转文字)

Input: audio → Output: text

    whisper <audio_file> --model small --output_format txt --output_dir /tmp --language <lang>

Languages: zh (Chinese), en (English), ja (Japanese). Omit --language to auto-detect.

Op 6 · Transcribe + Identify (转文字+识别)

Input: audio → Output: who said what

Run Op 5 and Op 1 in parallel, report both results together.
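One way to run the two operations side by side is a small thread pool. In this sketch the transcribe and identify arguments are hypothetical callables (for example, thin wrappers that shell out to whisper and voice_identify.py); how they are implemented is left open.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_and_identify(audio, transcribe, identify):
    """Run both callables on `audio` concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(transcribe, audio)
        speaker_future = pool.submit(identify, audio)
        # .result() blocks until each task finishes and re-raises its errors.
        return {
            "speaker": speaker_future.result(),
            "text": text_future.result(),
        }
```

Threads are sufficient here if both callables spend their time waiting on subprocesses, so the two runs overlap instead of executing back to back.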

Op 7 · Speaker Verification (声纹验证)

Input: two audio files → Output: same person or not

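    python3 scripts/voice_identify.py <audio1> --threshold 0.75
    python3 scripts/voice_identify.py <audio2> --threshold 0.75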

Compare the top-ranked speaker from both runs. If they match → same person.
For direct pairwise comparison without a library, extract embeddings and compute cosine similarity (see voice_identify.py internals).
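The comparison of top-ranked speakers can be sketched as below. The output format of voice_identify.py is not specified here, so this assumes each run has been parsed into a dictionary of speaker name to similarity score; top_speaker and same_person are hypothetical helpers.

```python
def top_speaker(scores, threshold=0.75):
    """Return the highest-scoring name, or None if nothing reaches the threshold."""
    if not scores:
        return None
    name = max(scores, key=scores.get)
    return name if scores[name] >= threshold else None


def same_person(scores_a, scores_b, threshold=0.75):
    """True only if both runs confidently pick the same library speaker."""
    top_a = top_speaker(scores_a, threshold)
    top_b = top_speaker(scores_b, threshold)
    return top_a is not None and top_a == top_b
```

Note that two clips from the same person who is not in the library both come back as unknown, so this sketch returns False for them; the direct pairwise comparison covers that case.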

Op 8 · Voice Swap (声音换皮)

Input: audio + library speaker → Output: same words, different voice

1. Transcribe input audio (Op 5)
2. Clone with target speaker's voice (Op 4), using transcribed text

Op 9 · Persona Voice Reply — from Audio (人格化语音回复·语音版)

Input: audio question + library speaker → Output: AI answer in that speaker's voice

1. Transcribe the question (Op 5)
2. Generate answer text via LLM
3. Clone answer with target speaker's voice (Op 4)
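Ops 8 and 9 are both chains of earlier operations, which can be expressed as plain function composition. In this sketch transcribe, generate_answer, and clone are hypothetical callables standing in for Op 5, an LLM call, and Op 4; none of them correspond to a documented API.

```python
def voice_swap(audio, speaker, transcribe, clone):
    """Op 8: same words, different voice."""
    text = transcribe(audio)
    return clone(text, speaker)


def persona_reply(question_audio, speaker, transcribe, generate_answer, clone):
    """Op 9: transcribe the question, answer it, speak the answer."""
    question = transcribe(question_audio)
    answer = generate_answer(question)
    return clone(answer, speaker)
```

Injecting the steps as arguments keeps the chain testable without audio files or network access.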

Op 10 · Persona Voice Reply — from Text (人格化语音回复·文字版)

Input: text question + library speaker → Output: AI answer in that speaker's voice

1. Generate answer text via LLM
2. Clone answer with target speaker's voice (Op 4)

Send Audio (Feishu)

    set -a; source <env file>; set +a
    bash scripts/feishu_send_audio.sh <receiver_id>

Converts wav → opus, uploads, sends as voice message.
Requires FEISHU_APP_ID + FEISHU_APP_SECRET env vars.
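A quick pre-flight check for the two variables might look like this. The helper missing_env is hypothetical and not part of the send script; it only reports which required names are unset or empty.

```python
import os

REQUIRED_FEISHU_VARS = ("FEISHU_APP_ID", "FEISHU_APP_SECRET")


def missing_env(required=REQUIRED_FEISHU_VARS, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling missing_env() before sending gives a clearer error than a failed upload when the env file was not sourced.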

Extract Audio from Video

    ffmpeg -y -i <video_file> -vn -ar 24000 -ac 1 /tmp/extracted_audio.wav
