Discord Voice Plugin for Clawdbot
Real-time voice conversations in Discord voice channels. Join a voice channel, speak, and have your words transcribed, processed by Claude, and spoken back.
Features
- Join/Leave Voice Channels: Via slash commands, CLI, or agent tool
- Voice Activity Detection (VAD): Automatically detects when users are speaking
- Speech-to-Text: Whisper API (OpenAI), Deepgram, or Local Whisper (Offline)
- Streaming STT: Real-time transcription with Deepgram WebSocket (~1s latency reduction)
- Agent Integration: Transcribed speech is routed through the Clawdbot agent
- Text-to-Speech: OpenAI TTS, ElevenLabs, or Kokoro (Local/Offline)
- Audio Playback: Responses are spoken back in the voice channel
- Barge-in Support: Stops speaking immediately when user starts talking
- Auto-reconnect: Automatic heartbeat monitoring and reconnection on disconnect
Requirements
- Discord bot with voice permissions (Connect, Speak, Use Voice Activity)
- API keys for STT and TTS providers
- System dependencies for voice:
  - ffmpeg (audio processing)
  - Native build tools for @discordjs/opus and sodium-native
Installation
1. Install System Dependencies
CODEBLOCK0
2. Install via ClawdHub
CODEBLOCK1
Or manually:
CODEBLOCK2
3. Configure in clawdbot.json
CODEBLOCK3
4. Discord Bot Setup
Ensure your Discord bot has these permissions:
- Connect - Join voice channels
- Speak - Play audio
- Use Voice Activity - Detect when users speak
Add these to your bot's OAuth2 URL, or configure them in the Discord Developer Portal.
Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable/disable the plugin |
| sttProvider | string | "local-whisper" | "whisper", "deepgram", or "local-whisper" |
| streamingSTT | boolean | true | Use streaming STT (Deepgram only, ~1s faster) |
| ttsProvider | string | "openai" | "openai" or "elevenlabs" |
| ttsVoice | string | "nova" | Voice ID for TTS |
| vadSensitivity | string | "medium" | "low", "medium", or "high" |
| bargeIn | boolean | true | Stop speaking when user talks |
| allowedUsers | string[] | [] | User IDs allowed (empty = all) |
| silenceThresholdMs | number | 1500 | Silence before processing (ms) |
| maxRecordingMs | number | 30000 | Max recording length (ms) |
| heartbeatIntervalMs | number | 30000 | Connection health check interval (ms) |
| autoJoinChannel | string | undefined | Channel ID to auto-join on startup |
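As a rough illustration of how the defaults in the table combine with user-supplied values, the sketch below merges a partial config over the documented defaults and applies the `allowedUsers` rule (empty list means everyone). The `VoiceConfig` type and the `resolveConfig` and `isUserAllowed` helpers are hypothetical names introduced here; they are not taken from the plugin's source.

```typescript
// Hypothetical types mirroring the options table; not the plugin's real typings.
type SttProvider = "whisper" | "deepgram" | "local-whisper";
type TtsProvider = "openai" | "elevenlabs";
type VadSensitivity = "low" | "medium" | "high";

interface VoiceConfig {
  enabled: boolean;
  sttProvider: SttProvider;
  streamingSTT: boolean;
  ttsProvider: TtsProvider;
  ttsVoice: string;
  vadSensitivity: VadSensitivity;
  bargeIn: boolean;
  allowedUsers: string[];
  silenceThresholdMs: number;
  maxRecordingMs: number;
  heartbeatIntervalMs: number;
  autoJoinChannel?: string;
}

// Default values copied from the table above.
const DEFAULTS: VoiceConfig = {
  enabled: true,
  sttProvider: "local-whisper",
  streamingSTT: true,
  ttsProvider: "openai",
  ttsVoice: "nova",
  vadSensitivity: "medium",
  bargeIn: true,
  allowedUsers: [],
  silenceThresholdMs: 1500,
  maxRecordingMs: 30000,
  heartbeatIntervalMs: 30000,
};

// Values given by the user override the defaults; everything else is kept.
function resolveConfig(user: Partial<VoiceConfig> = {}): VoiceConfig {
  return { ...DEFAULTS, ...user };
}

// An empty allowedUsers list means every user may talk to the bot.
function isUserAllowed(config: VoiceConfig, userId: string): boolean {
  return config.allowedUsers.length === 0 || config.allowedUsers.indexOf(userId) !== -1;
}
```

For example, `resolveConfig({ sttProvider: "deepgram" })` keeps `ttsVoice` at "nova" while switching the STT provider.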
Provider Configuration
OpenAI (Whisper + TTS)
CODEBLOCK4
ElevenLabs (TTS only)
CODEBLOCK5
Deepgram (STT only)
CODEBLOCK6
Usage
Slash Commands (Discord)
Once registered with Discord, use these commands:
- /discord_voice join <channel> - Join a voice channel
- /discord_voice leave - Leave the current voice channel
- /discord_voice status - Show voice connection status
CLI Commands
CODEBLOCK7
Agent Tool
The agent can use the discord_voice tool:
CODEBLOCK8
The tool supports actions:
- join - Join a voice channel (requires channelId)
- leave - Leave voice channel
- speak - Speak text in the voice channel
- status - Get current voice status
How It Works
1. Join: Bot joins the specified voice channel
2. Listen: VAD detects when users start/stop speaking
3. Record: Audio is buffered while user speaks
4. Transcribe: On silence, audio is sent to STT provider
5. Process: Transcribed text is sent to Clawdbot agent
6. Synthesize: Agent response is converted to audio via TTS
7. Play: Audio is played back in the voice channel
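The steps above can be sketched as a small pipeline. This is a minimal illustration, not the plugin's actual code: the `Pipeline` interface, `shouldFlush`, and `handleUtterance` are hypothetical names, and the provider functions are injected so the flow stays independent of any particular STT or TTS service. The default timings match `silenceThresholdMs` and `maxRecordingMs` from the configuration table.

```typescript
// Hypothetical provider signatures; the real plugin's interfaces may differ.
interface Pipeline {
  transcribe: (audio: Uint8Array) => Promise<string>; // STT provider
  askAgent: (text: string) => Promise<string>;        // Clawdbot agent
  synthesize: (text: string) => Promise<Uint8Array>;  // TTS provider
  play: (audio: Uint8Array) => Promise<void>;         // voice channel playback
}

// Steps 3-4: a buffered utterance is sent to STT once the speaker has been
// silent long enough, or once the recording hits the maximum length.
function shouldFlush(
  nowMs: number,
  recordingStartMs: number,
  lastSpeechMs: number,
  silenceThresholdMs = 1500,
  maxRecordingMs = 30000,
): boolean {
  const silentFor = nowMs - lastSpeechMs;
  const recordedFor = nowMs - recordingStartMs;
  return silentFor >= silenceThresholdMs || recordedFor >= maxRecordingMs;
}

// Steps 4-7: transcribe, ask the agent, synthesize the reply, play it back.
// Returns the agent's reply, or null if nothing intelligible was transcribed.
async function handleUtterance(audio: Uint8Array, p: Pipeline): Promise<string | null> {
  const text = (await p.transcribe(audio)).trim();
  if (text.length === 0) {
    return null; // skip the agent call for empty transcripts
  }
  const reply = await p.askAgent(text);
  const speech = await p.synthesize(reply);
  await p.play(speech);
  return reply;
}
```

Passing the providers in as functions keeps the control flow testable with stubs, which is one reasonable design among several.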
Streaming STT (Deepgram)
When using Deepgram as your STT provider, streaming mode is enabled by default. This provides:
- ~1 second faster end-to-end latency
- Real-time feedback with interim transcription results
- Automatic keep-alive to prevent connection timeouts
- Fallback to batch transcription if streaming fails
To use streaming STT:
CODEBLOCK9
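The fallback behaviour described above (try streaming, fall back to batch transcription if streaming fails) might look roughly like the sketch below. The function name `transcribeWithFallback` and both transcriber signatures are assumptions made for illustration; the plugin's real Deepgram integration is not shown in this document.

```typescript
// Hypothetical signature shared by the streaming and batch transcribers.
type Transcriber = (audio: Uint8Array) => Promise<string>;

interface TranscriptResult {
  text: string;
  mode: "streaming" | "batch"; // which path produced the text
}

// Tries the streaming transcriber first (when enabled) and falls back to
// batch transcription if the streaming call rejects.
async function transcribeWithFallback(
  audio: Uint8Array,
  streaming: Transcriber,
  batch: Transcriber,
  useStreaming = true,
): Promise<TranscriptResult> {
  if (useStreaming) {
    try {
      return { text: await streaming(audio), mode: "streaming" };
    } catch {
      // Streaming failed; continue with the batch path below.
    }
  }
  return { text: await batch(audio), mode: "batch" };
}
```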
Barge-in Support
When enabled (default), the bot will immediately stop speaking if a user starts talking. This creates a more natural conversational flow where you can interrupt the bot.
To disable (let the bot finish speaking):
CODEBLOCK10
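Conceptually, barge-in comes down to stopping playback when a user starts speaking while the bot is talking. The sketch below shows that decision with a minimal, hypothetical `Player` interface; it is not the plugin's implementation and does not depend on any Discord library.

```typescript
// Minimal hypothetical player interface, introduced only for this example.
interface Player {
  isPlaying(): boolean;
  stop(): void;
}

// Called when VAD reports that a user started speaking.
// Returns true if playback was interrupted.
function onUserStartedSpeaking(player: Player, bargeIn: boolean): boolean {
  if (bargeIn && player.isPlaying()) {
    player.stop(); // cut the bot off so the user can take over
    return true;
  }
  return false; // barge-in disabled, or nothing was playing
}
```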
Auto-reconnect
The plugin includes automatic connection health monitoring:
- Heartbeat checks every 30 seconds (configurable)
- Auto-reconnect on disconnect with exponential backoff
- Max 3 attempts before giving up
If the connection drops, you'll see logs like:
CODEBLOCK11
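A retry loop with exponential backoff and a three-attempt limit could be sketched as follows. The base delay and cap (`baseMs`, `capMs`) are illustrative assumptions, since the document does not state the plugin's actual delays, and `reconnectWithBackoff` is a hypothetical helper name.

```typescript
// Delay before the retry that follows failed attempt number `attempt` (1-based):
// base, 2x base, 4x base, ... up to a cap. Values are assumptions.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `connect` up to maxAttempts times, waiting longer after each failure.
// Resolves to true on success and false once all attempts are used up.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await connect();
      return true;
    } catch {
      if (attempt < maxAttempts) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  return false;
}
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real waiting.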
VAD Sensitivity
- low: Picks up quiet speech, may trigger on background noise
- medium: Balanced (recommended)
- high: Requires louder, clearer speech
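One common way to implement levels like these is an energy threshold on the audio frame. The sketch below uses root-mean-square (RMS) energy of 16-bit PCM samples. This is an assumed technique chosen for illustration: the document does not say how the plugin's VAD works, and the threshold values are made up.

```typescript
type Sensitivity = "low" | "medium" | "high";

// Hypothetical RMS thresholds on samples normalized to [-1, 1].
// Following the list above, "low" accepts quiet audio and "high" needs louder speech.
const RMS_THRESHOLDS: Record<Sensitivity, number> = {
  low: 0.01,
  medium: 0.03,
  high: 0.08,
};

// Root-mean-square energy of a frame of signed 16-bit PCM samples.
function rms(samples: Int16Array): number {
  if (samples.length === 0) {
    return 0;
  }
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] / 32768; // normalize to [-1, 1)
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// A frame counts as speech when its energy reaches the level's threshold.
function isSpeech(samples: Int16Array, sensitivity: Sensitivity = "medium"): boolean {
  return rms(samples) >= RMS_THRESHOLDS[sensitivity];
}
```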
Troubleshooting
"Discord client not available"
Ensure the Discord channel is configured and the bot is connected before using voice.
Opus/Sodium build errors
Install build tools:
CODEBLOCK12
No audio heard
1. Check bot has Connect + Speak permissions
2. Check bot isn't server muted
3. Verify TTS API key is valid
Transcription not working
1. Check STT API key is valid
2. Check audio is being recorded (see debug logs)
3. Try adjusting VAD sensitivity
Enable debug logging
CODEBLOCK13
Environment Variables
| Variable | Description |
|---|---|
| INLINECODE43 | Discord bot token (required) |
| OPENAI_API_KEY | OpenAI API key (Whisper + TTS) |
| ELEVENLABS_API_KEY | ElevenLabs API key |
| DEEPGRAM_API_KEY | Deepgram API key |
Limitations
- Only one voice channel per guild at a time
- Maximum recording length: 30 seconds (configurable)
- Requires stable network for real-time audio
- TTS output may have slight delay due to synthesis
License
MIT
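Discord Voice Plugin for Clawdbot
Real-time voice conversations in Discord voice channels. Join a voice channel, speak, and have your words transcribed, processed by Claude, and spoken back.
Features
- Join/Leave Voice Channels: Via slash commands, CLI, or agent tool
- Voice Activity Detection (VAD): Automatically detects when users are speaking
- Speech-to-Text: Whisper API (OpenAI), Deepgram, or Local Whisper (Offline)
- Streaming STT: Real-time transcription with Deepgram WebSocket (~1s latency reduction)
- Agent Integration: Transcribed speech is routed through the Clawdbot agent
- Text-to-Speech: OpenAI TTS, ElevenLabs, or Kokoro (Local/Offline)
- Audio Playback: Responses are spoken back in the voice channel
- Barge-in Support: Stops speaking immediately when user starts talking
- Auto-reconnect: Automatic heartbeat monitoring and reconnection on disconnect
Requirements
- Discord bot with voice permissions (Connect, Speak, Use Voice Activity)
- API keys for STT and TTS providers
- System dependencies for voice:
  - ffmpeg (audio processing)
  - Native build tools for @discordjs/opus and sodium-native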
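Installation
1. Install System Dependencies
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg build-essential python3

# Fedora/RHEL
sudo dnf install ffmpeg gcc-c++ make python3

# macOS
brew install ffmpeg
```
2. Install via ClawdHub
```bash
clawdhub install discord-voice
```
Or manually:
```bash
cd ~/.clawdbot/extensions
# The repository URL is missing from the source; substitute the real one.
git clone <repository-url> discord-voice
cd discord-voice
npm install
```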
3. Configure in clawdbot.json
```json5
{
  plugins: {
    entries: {
      "discord-voice": {
        enabled: true,
        config: {
          sttProvider: "local-whisper",
          ttsProvider: "openai",
          ttsVoice: "nova",
          vadSensitivity: "medium",
          allowedUsers: [], // empty array allows all users
          silenceThresholdMs: 1500,
          maxRecordingMs: 30000,
          openai: {
            apiKey: "sk-...", // or use the OPENAI_API_KEY environment variable
          },
        },
      },
    },
  },
}
```
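4. Discord Bot Setup
Ensure your Discord bot has these permissions:
- Connect - Join voice channels
- Speak - Play audio
- Use Voice Activity - Detect when users speak
Add these to your bot's OAuth2 URL, or configure them in the Discord Developer Portal.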
Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable/disable the plugin |
| sttProvider | string | "local-whisper" | "whisper", "deepgram", or "local-whisper" |
| streamingSTT | boolean | true | Use streaming STT (Deepgram only, ~1s faster) |
| ttsProvider | string | "openai" | "openai" or "elevenlabs" |
| ttsVoice | string | "nova" | Voice ID for TTS |
| vadSensitivity | string | "medium" | "low", "medium", or "high" |
| bargeIn | boolean | true | Stop speaking when user talks |
| allowedUsers | string[] | [] | User IDs allowed (empty = all) |
| silenceThresholdMs | number | 1500 | Silence before processing (ms) |
| maxRecordingMs | number | 30000 | Max recording length (ms) |
| heartbeatIntervalMs | number | 30000 | Connection health check interval (ms) |
| autoJoinChannel | string | undefined | Channel ID to auto-join on startup |
Provider Configuration
OpenAI (Whisper + TTS)
```json5
{
  openai: {
    apiKey: "sk-...",
    whisperModel: "whisper-1",
    ttsModel: "tts-1",
  },
}
```
ElevenLabs (TTS only)
```json5
{
  elevenlabs: {
    apiKey: "...",
    voiceId: "21m00Tcm4TlvDq8ikWAM", // Rachel
    modelId: "eleven_multilingual_v2",
  },
}
```
Deepgram (STT only)
```json5
{
  deepgram: {
    apiKey: "...",
    model: "nova-2",
  },
}
```
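Usage
Slash Commands (Discord)
Once registered with Discord, use these commands:
- /discord_voice join <channel> - Join a voice channel
- /discord_voice leave - Leave the current voice channel
- /discord_voice status - Show voice connection status
CLI Commands
```bash
# Join a voice channel
clawdbot discord_voice join <channelId>

# Leave a voice channel
clawdbot discord_voice leave --guild <guildId>

# Check status
clawdbot discord_voice status
```
Agent Tool
The agent can use the discord_voice tool:
```
Join voice channel 1234567890
```
The tool supports these actions:
- join - Join a voice channel (requires channelId)
- leave - Leave voice channel
- speak - Speak text in the voice channel
- status - Get current voice status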
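How It Works
1. Join: Bot joins the specified voice channel
2. Listen: VAD detects when users start/stop speaking
3. Record: Audio is buffered while user speaks
4. Transcribe: On silence, audio is sent to STT provider
5. Process: Transcribed text is sent to Clawdbot agent
6. Synthesize: Agent response is converted to audio via TTS
7. Play: Audio is played back in the voice channel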
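Streaming STT (Deepgram)
When using Deepgram as your STT provider, streaming mode is enabled by default. This provides:
- ~1 second faster end-to-end latency
- Real-time feedback with interim transcription results
- Automatic keep-alive to prevent connection timeouts
- Fallback to batch transcription if streaming fails
To use streaming STT:
```json5
{
  sttProvider: "deepgram",
  streamingSTT: true, // default
  deepgram: {
    apiKey: "...",
    model: "nova-2",
  },
}
```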
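Barge-in Support
When enabled (default), the bot stops speaking immediately if a user starts talking. This makes the conversation flow more naturally, since you can interrupt the bot.
To disable (let the bot finish speaking):
```json5
{
  bargeIn: false,
}
```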
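Auto-reconnect
The plugin includes automatic connection health monitoring:
- Heartbeat checks every 30 seconds (configurable)
- Auto-reconnect on disconnect with exponential backoff
- Max 3 attempts before giving up
If the connection drops, you'll see logs like:
```
[discord-voice] Disconnected from voice channel
[discord-voice] Reconnection attempt 1/3
[discord-voice] Reconnected successfully
```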
VAD Sensitivity
- low: Picks up quiet speech, may trigger on background noise
- medium: Balanced (recommended)
- high: Requires louder, clearer speech
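Troubleshooting
"Discord client not available"
Ensure the Discord channel is configured and the bot is connected before using voice.
Opus/Sodium build errors
Install build tools:
```bash
npm install -g node-gyp
npm rebuild @discordjs/opus sodium-native
```
No audio heard
1. Check bot has Connect + Speak permissions
2. Check bot isn't server muted
3. Verify TTS API key is valid
Transcription not working
1. Check STT API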