Augent — Audio & Video Intelligence for AI Agents

Augent is an MCP server that gives your agent 22 tools for audio and video intelligence. Download from 1000+ sites via yt-dlp and aria2c, transcribe in 99 languages via faster-whisper, search by keyword or meaning via sentence-transformers, take notes, identify speakers via pyannote-audio, detect chapters, separate audio via Demucs v4, export clips, extract visual frames, record X/Twitter Spaces (requires user-configured auth token in ~/.augent/auth.json), and generate speech via Kokoro TTS. All processing runs locally. Downloads are saved to ~/Downloads/, notes and clips to ~/Desktop/, transcription memory to ~/.augent/memory/.

Config

```json
{
  "mcpServers": {
    "augent": {
      "command": "augent-mcp"
    }
  }
}
```

If augent-mcp is not in PATH, use python3 -m augent.mcp as the command instead.
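The fallback described above can be automated when generating the config. This is a minimal sketch, not part of Augent: the helper name `build_augent_config` is hypothetical, and splitting the fallback into `command` plus an `args` list is an assumption about the client's config format.

```python
import json
import shutil


def build_augent_config(which=shutil.which):
    """Return an mcpServers config dict, preferring the augent-mcp executable."""
    if which("augent-mcp"):
        server = {"command": "augent-mcp"}
    else:
        # Fallback from the text: run the module with python3.
        # The "args" key is an assumption about the client's config format.
        server = {"command": "python3", "args": ["-m", "augent.mcp"]}
    return {"mcpServers": {"augent": server}}


if __name__ == "__main__":
    print(json.dumps(build_augent_config(), indent=2))
```

Passing `which` as a parameter keeps the PATH lookup replaceable, so the function can be checked without installing anything.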

Install

Install via the ClawHub install button above, or use uv tool install augent for the base package or uv tool install "augent[all]" for all features. FFmpeg is required for audio processing.

Tools

Augent exposes 22 MCP tools:

Core

| Tool | Description |
| --- | --- |
| `download_audio` | Download audio from video URLs at maximum speed. Supports YouTube, Vimeo, TikTok, Twitter/X, SoundCloud, and 1000+ sites. Uses aria2c multi-connection + concurrent fragments. |
| `transcribe_audio` | Full transcription of any audio file with per-segment timestamps. Returns text, language, duration, and segments. Cached by file hash. |
| `search_audio` | Search audio for keywords. Returns timestamped matches with context snippets. Supports clip export. |
| `deep_search` | Semantic search: find moments by meaning, not just keywords. Uses sentence-transformers embeddings. |
| `search_memory` | Search across ALL stored transcriptions in one query. Keyword or semantic mode. |
| `take_notes` | All-in-one: download audio from URL, transcribe, and save formatted notes. Supports 5 styles: tldr, notes, highlight, eye-candy, quiz. |
| `clip_export` | Export a video clip from any URL for a specific time range. Downloads only the requested segment. |
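The transcribe_audio description says results are cached by file hash. The sketch below illustrates that idea only; it assumes SHA-256 over the file contents and an in-memory dict, neither of which is stated in this document, and `file_hash` and `cached_transcribe` are hypothetical names.

```python
import hashlib


def file_hash(path, chunk_size=1 << 20):
    """Hash file contents in chunks so large audio files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_transcribe(path, transcribe, cache):
    """Call transcribe(path) only when this file's content hash is not cached."""
    key = file_hash(path)
    if key not in cache:
        cache[key] = transcribe(path)
    return cache[key]
```

Keying on content rather than on the path means a renamed or copied file still hits the cache.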

Analysis

| Tool | Description |
| --- | --- |
| `chapters` | Auto-detect topic chapters with timestamps using embedding similarity. |
| `search_proximity` | Find where two keywords appear near each other (e.g., "startup" within 30 words of "funding"). |
| `identify_speakers` | Speaker diarization: identify who speaks when. No API keys required. |
| `separate_audio` | Isolate vocals from music/noise using Meta's Demucs v4. Feed clean vocals into transcription. |
| `batch_search` | Search multiple audio files in parallel. Ideal for podcast libraries or interview collections. |
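Proximity search ("startup" within 30 words of "funding") can be illustrated on plain text. This is a simplified sketch and not Augent's implementation: the function name `find_proximity_matches` and the word-splitting rule are assumptions, and it works on word positions rather than timestamps.

```python
import re


def find_proximity_matches(text, first, second, max_distance=30):
    """Return (i, j) word-index pairs where `first` and `second` occur
    within `max_distance` words of each other (case-insensitive)."""
    words = [w.lower() for w in re.findall(r"[\w'-]+", text)]
    first_hits = [i for i, w in enumerate(words) if w == first.lower()]
    second_hits = [j for j, w in enumerate(words) if w == second.lower()]
    return [(i, j) for i in first_hits for j in second_hits
            if abs(i - j) <= max_distance]


pairs = find_proximity_matches(
    "the startup raised new funding last year", "startup", "funding"
)
# pairs holds one match: word 1 ("startup") and word 4 ("funding")
```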

Utilities

| Tool | Description |
| --- | --- |
| `text_to_speech` | Convert text to natural speech using Kokoro TTS. 54 voices, 9 languages. Runs in background. |
| `list_files` | List media files in a directory with size info. |
| `list_memories` | Browse all stored transcriptions by title, duration, and date. |
| `memory_stats` | View memory statistics (file count, total duration). |
| `clear_memory` | Clear the transcription memory to free disk space. |
| `tag` | Add, remove, or list tags on transcriptions. Broad topic categories for organizing memories. |
| `highlights` | Export the best moments from a transcription. Auto mode picks top moments; focused mode finds moments matching a topic. |
| `visual` | Extract visual context from video at moments that matter. Query, auto, manual, and assist modes. Frames saved to Obsidian vault. |
| `rebuild_graph` | Rebuild Obsidian graph view data for all transcriptions. Migrates files, computes wikilinks, generates MOC hubs. |
| `spaces` | Download or live-record X/Twitter Spaces. Start, check status, or stop recordings. |

Usage Examples

Take notes from a video

"Take notes from https://youtube.com/watch?v=xxx"

The agent calls take_notes which downloads, transcribes, and returns formatted notes. One tool call does everything.

Search a podcast for topics

"Search this podcast for every mention of AI regulation" — provide the file path or URL.

The agent uses search_audio for exact keyword matches, or deep_search for semantic matches (finds relevant discussion even without exact words).
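The difference between the two modes can be sketched over a list of timestamped segments. This is illustrative only: the segment fields, the function names, and the caller-supplied `embed` function are assumptions, and the toy vectors stand in for real sentence-transformers embeddings.

```python
import math


def keyword_search(segments, keyword):
    """Return segments whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [seg for seg in segments if needle in seg["text"].lower()]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(segments, query_vec, embed, top_k=3):
    """Rank segments by cosine similarity between embed(text) and query_vec."""
    scored = [(cosine(embed(seg["text"]), query_vec), seg) for seg in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:top_k]]


segments = [
    {"start": 0.0, "text": "New AI regulation was proposed"},
    {"start": 12.5, "text": "Lawmakers want rules for machine learning"},
    {"start": 30.0, "text": "Now a word from our sponsor"},
]
# Toy 2-d "embeddings": the first axis loosely means "about AI policy".
toy_vectors = {
    segments[0]["text"]: [1.0, 0.0],
    segments[1]["text"]: [0.9, 0.1],
    segments[2]["text"]: [0.0, 1.0],
}
exact = keyword_search(segments, "AI regulation")
related = semantic_search(segments, [1.0, 0.0], toy_vectors.get, top_k=2)
```

Here `exact` contains only the first segment, while `related` also surfaces the second, which discusses the topic without using the exact words.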

Transcribe and identify speakers

"Transcribe this meeting recording and tell me who said what"

The agent calls transcribe_audio then identify_speakers to label each segment by speaker.
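Labeling each segment by speaker amounts to aligning two timelines. The sketch below shows one common approach, assigning each transcript segment to the speaker turn it overlaps most; the field names (`start`, `end`, `speaker`) and the function `label_segments` are assumptions, not Augent's documented output format.

```python
def label_segments(segments, turns):
    """Attach to each transcript segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            # Length of the intersection of the two time intervals (negative if disjoint).
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


transcript = [
    {"start": 0.0, "end": 4.0, "text": "Hi everyone"},
    {"start": 4.0, "end": 9.0, "text": "Thanks for joining"},
]
speaker_turns = [
    {"start": 0.0, "end": 4.5, "speaker": "SPEAKER_00"},
    {"start": 4.5, "end": 10.0, "speaker": "SPEAKER_01"},
]
labeled = label_segments(transcript, speaker_turns)
```

A segment that overlaps no turn keeps `speaker` set to `None`, which makes gaps in the diarization easy to spot.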

Search across all transcriptions

"Search everything I've ever transcribed for mentions of funding"

The agent uses search_memory to search across all stored transcriptions without needing a file path.

Export a clip

"Clip the part where they talk about pricing"

The agent uses search_audio or deep_search to find the moment, then clip_export to extract just that segment.
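The step from a search match to a clip range can be sketched as follows. The padding value and helper names are hypothetical, and the yt-dlp `--download-sections` argument list is only one plausible way to fetch a single segment; this document does not say how clip_export does it internally.

```python
def clip_range(match_start, match_end, padding=5.0, duration=None):
    """Widen a matched time span by `padding` seconds, clamped to the media length."""
    start = max(0.0, match_start - padding)
    end = match_end + padding
    if duration is not None:
        end = min(end, duration)
    return start, end


def hms(seconds):
    """Format seconds as HH:MM:SS (fractions of a second are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def ytdlp_section_args(url, start, end):
    """Build an argument list asking yt-dlp for one time section only."""
    return ["yt-dlp", "--download-sections", f"*{hms(start)}-{hms(end)}", url]


start, end = clip_range(100.0, 110.0, duration=112.0)
args = ytdlp_section_args("https://example.com/v", start, end)
```

The argument list is built but not executed here, so the sketch runs without yt-dlp installed.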

Separate vocals from noisy audio

"This recording has music in the background, clean it up and transcribe"

The agent calls separate_audio to isolate vocals, then transcribe_audio on the clean vocals track.

Generate speech from text

"Read these notes aloud"

The agent calls text_to_speech to generate an MP3 with natural speech. Supports multiple voices and languages.

Note Styles

When using take_notes, the style parameter controls formatting:

| Style | Description |
| --- | --- |
| `tldr` | Shortest possible summary. One screen. Bold key terms. |
| `notes` | |
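For a sense of what the agent sends, the sketch below builds an MCP `tools/call` request for take_notes and validates the style against the five names listed in the tool description. The `url` argument name is an assumption (only the `style` parameter is named in this document), and `take_notes_request` is a hypothetical helper.

```python
import json

# The five style names listed in the take_notes tool description.
NOTE_STYLES = ("tldr", "notes", "highlight", "eye-candy", "quiz")


def take_notes_request(url, style, request_id=1):
    """Build a JSON-RPC 2.0 tools/call request for the take_notes tool."""
    if style not in NOTE_STYLES:
        raise ValueError(f"unknown style {style!r}; expected one of {NOTE_STYLES}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        # "url" is an assumed argument name; "style" comes from the text above.
        "params": {"name": "take_notes", "arguments": {"url": url, "style": style}},
    }


request = take_notes_request("https://youtube.com/watch?v=xxx", "tldr")
payload = json.dumps(request)
```

In practice the MCP client constructs this message on the agent's behalf; building it by hand is mainly useful for debugging.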

Model Sizes

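tiny is the default and handles nearly everything. Only use larger models for heavy accents, poor audio quality, or maximum accuracy needs.

| Model | Speed | Accuracy |
| --- | --- | --- |
| tiny | Fastest | Excellent (default) |
| base | Fast | Excellent |
| small | Medium | Superior |
| medium | Slow | Outstanding |
| large | Slowest | Maximum |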

File Paths

Augent reads and writes to these locations on your machine:

| Path | Purpose |
| --- | --- |
| `~/Downloads/` | Default directory for downloaded audio files |
| `~/Desktop/` | |

If Obsidian is installed, visual frames are saved to the Obsidian vault's External Files/visual/ directory. The vault path is auto-detected from Obsidian's config.

Network Access

Network access is used for two purposes only:

1. Downloading media from user-provided URLs via yt-dlp and aria2c
2. Downloading ML models on first use (Whisper, sentence-transformers, pyannote, Demucs, Kokoro) from Hugging Face

No telemetry. No background network activity. No data is uploaded.

ML Dependencies

The augent[all] install includes these local ML components:

| Component | Purpose | Size |
| --- | --- | --- |
| faster-whisper | Speech-to-text transcription | ~75MB (tiny model) |
| sentence-transformers | | |

All models run locally. None require API keys or cloud services.

Requirements

- Python 3.10+
- FFmpeg (audio processing)
- yt-dlp + aria2c (for audio downloads)

Links

- [GitHub](https://github
