Augent — Audio & Video Intelligence for AI Agents
Augent is an MCP server that gives your agent 22 tools for audio and video intelligence. Download from 1000+ sites via yt-dlp and aria2c, transcribe in 99 languages via faster-whisper, search by keyword or meaning via sentence-transformers, take notes, identify speakers via pyannote-audio, detect chapters, separate audio via Demucs v4, export clips, extract visual frames, record X/Twitter Spaces (requires user-configured auth token in ~/.augent/auth.json), and generate speech via Kokoro TTS. All processing runs locally. Downloads are saved to ~/Downloads/, notes and clips to ~/Desktop/, transcription memory to ~/.augent/memory/.
Config
```json
{
  "mcpServers": {
    "augent": {
      "command": "augent-mcp"
    }
  }
}
```
If augent-mcp is not in PATH, use python3 -m augent.mcp as the command instead.
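The fallback described above can be generated rather than edited by hand. The following is a minimal sketch, not part of Augent: `build_mcp_config` is a hypothetical helper, and the `args` key is an assumption about how your MCP client passes arguments alongside `command`.

```python
import json
import shutil
from typing import Callable, Optional


def build_mcp_config(which: Callable[[str], Optional[str]] = shutil.which) -> dict:
    """Return an MCP client config entry for Augent.

    Prefers the augent-mcp executable and falls back to the module form
    when the executable is not found on PATH.
    """
    if which("augent-mcp"):
        server = {"command": "augent-mcp"}
    else:
        # Assumption: the client accepts an "args" list next to "command".
        server = {"command": "python3", "args": ["-m", "augent.mcp"]}
    return {"mcpServers": {"augent": server}}


if __name__ == "__main__":
    # Print the config so it can be pasted into the client's settings file.
    print(json.dumps(build_mcp_config(), indent=2))
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.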
Install
Install via the ClawHub install button above, or use uv tool install augent for the base package or uv tool install "augent[all]" for all features. FFmpeg is required for audio processing.
Tools
Augent exposes 22 MCP tools:
Core
| Tool | Description |
|---|---|
| `download_audio` | Download audio from video URLs at maximum speed. Supports YouTube, Vimeo, TikTok, Twitter/X, SoundCloud, and 1000+ sites. Uses aria2c multi-connection + concurrent fragments. |
| `transcribe_audio` | Full transcription of any audio file with per-segment timestamps. Returns text, language, duration, and segments. Cached by file hash. |
| `search_audio` | Search audio for keywords. Returns timestamped matches with context snippets. Supports clip export. |
| `deep_search` | Semantic search: find moments by meaning, not just keywords. Uses sentence-transformers embeddings. |
| `search_memory` | Search across ALL stored transcriptions in one query. Keyword or semantic mode. |
| `take_notes` | All-in-one: download audio from URL, transcribe, and save formatted notes. Supports 5 styles: tldr, notes, highlight, eye-candy, quiz. |
| `clip_export` | Export a video clip from any URL for a specific time range. Downloads only the requested segment. |
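Transcriptions are described above as "cached by file hash". The sketch below illustrates that general idea only. The choice of SHA-256, the in-memory dictionary, and the names `file_cache_key` and `TranscriptCache` are all assumptions for illustration, not Augent's actual implementation.

```python
import hashlib


def file_cache_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large media files fit in memory."""
    digest = hashlib.sha256()  # Assumption: the real hash algorithm is not documented.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class TranscriptCache:
    """Hypothetical in-memory cache keyed by file content, not file name."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, path, transcribe):
        key = file_cache_key(path)
        if key not in self._store:
            # Only run the expensive transcription on a cache miss.
            self._store[key] = transcribe(path)
        return self._store[key]
```

Because the key depends on the bytes rather than the path, a renamed or copied file would reuse the stored transcription in this sketch.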
Analysis
| Tool | Description |
|---|---|
| `chapters` | Auto-detect topic chapters with timestamps using embedding similarity. |
| `search_proximity` | Find where two keywords appear near each other (e.g., "startup" within 30 words of "funding"). |
| `identify_speakers` | Speaker diarization: identify who speaks when. No API keys required. |
| `separate_audio` | Isolate vocals from music/noise using Meta's Demucs v4. Feed clean vocals into transcription. |
| `batch_search` | Search multiple audio files in parallel. Ideal for podcast libraries or interview collections. |
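To make the proximity idea concrete ("startup" within 30 words of "funding"), here is a small illustrative sketch over plain text. The function name `find_proximity`, the tokenization, and the return format are assumptions; the real tool works on transcriptions and may behave differently.

```python
import re


def find_proximity(text: str, first: str, second: str, max_distance: int = 30):
    """Return (index_of_first, index_of_second) word-position pairs that lie
    within max_distance words of each other."""
    # Split into lowercase word tokens so matching ignores case and punctuation.
    words = [w.lower() for w in re.findall(r"[\w'-]+", text)]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return [
        (i, j)
        for i in first_positions
        for j in second_positions
        if abs(i - j) <= max_distance
    ]


if __name__ == "__main__":
    sample = "Our startup closed a new funding round"
    print(find_proximity(sample, "startup", "funding"))
```

In a transcript setting, the word positions would be mapped back to segment timestamps, which this sketch leaves out.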
Utilities
| Tool | Description |
|---|---|
| `text_to_speech` | Convert text to natural speech using Kokoro TTS. 54 voices, 9 languages. Runs in background. |
| `list_files` | List media files in a directory with size info. |
| `list_memories` | Browse all stored transcriptions by title, duration, and date. |
| `memory_stats` | View memory statistics (file count, total duration). |
| `clear_memory` | Clear the transcription memory to free disk space. |
| `tag` | Add, remove, or list tags on transcriptions. Broad topic categories for organizing memories. |
| `highlights` | Export the best moments from a transcription. Auto mode picks top moments; focused mode finds moments matching a topic. |
| `visual` | Extract visual context from video at moments that matter. Query, auto, manual, and assist modes. Frames saved to Obsidian vault. |
| `rebuild_graph` | Rebuild Obsidian graph view data for all transcriptions. Migrates files, computes wikilinks, generates MOC hubs. |
| `spaces` | Download or live-record X/Twitter Spaces. Start, check status, or stop recordings. |
Usage Examples
Take notes from a video
"Take notes from https://youtube.com/watch?v=xxx"
The agent calls take_notes which downloads, transcribes, and returns formatted notes. One tool call does everything.
Search a podcast for topics
"Search this podcast for every mention of AI regulation" — provide the file path or URL.
The agent uses search_audio for exact keyword matches, or deep_search for semantic matches (finds relevant discussion even without exact words).
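For intuition, exact keyword matching over a transcription might look like the sketch below. It assumes each segment is a dictionary with `start`, `end`, and `text` keys; those field names and the `search_segments` helper are hypothetical, not Augent's documented output format.

```python
def search_segments(segments, keyword, context_chars=40):
    """Return timestamped matches with a short context snippet around each hit."""
    needle = keyword.lower()
    matches = []
    for seg in segments:
        text = seg["text"]
        pos = text.lower().find(needle)
        if pos == -1:
            continue  # This segment does not mention the keyword.
        # Keep a window of characters on both sides of the match.
        start = max(0, pos - context_chars)
        end = min(len(text), pos + len(needle) + context_chars)
        matches.append({"start": seg["start"], "snippet": text[start:end].strip()})
    return matches


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 4.0, "text": "Welcome to the show."},
        {"start": 4.0, "end": 9.5, "text": "Today we discuss AI regulation in Europe."},
    ]
    for hit in search_segments(demo, "AI regulation"):
        print(hit["start"], hit["snippet"])
```

Semantic search replaces the substring test with a similarity score between embeddings, which is why it can surface relevant passages that never use the exact words.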
Transcribe and identify speakers
"Transcribe this meeting recording and tell me who said what"
The agent calls transcribe_audio then identify_speakers to label each segment by speaker.
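One common way to combine the two results is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below shows that approach under assumed data shapes (`start`/`end`/`text` for segments, `start`/`end`/`speaker` for turns); it is an illustration of the technique, not a description of how Augent merges them.

```python
def label_segments(segments, turns):
    """Attach to each segment the speaker whose turn overlaps it the longest."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            # Overlap of two intervals; negative when they do not intersect.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


if __name__ == "__main__":
    segments = [
        {"start": 0.0, "end": 4.0, "text": "Hello"},
        {"start": 4.0, "end": 9.0, "text": "Hi there"},
    ]
    turns = [
        {"start": 0.0, "end": 4.5, "speaker": "SPEAKER_00"},
        {"start": 4.5, "end": 10.0, "speaker": "SPEAKER_01"},
    ]
    for row in label_segments(segments, turns):
        print(row["speaker"], row["text"])
```

Segments that overlap no turn keep `speaker` as `None`, which makes gaps in the diarization visible instead of guessing.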
Search across all transcriptions
"Search everything I've ever transcribed for mentions of funding"
The agent uses search_memory to search across all stored transcriptions without needing a file path.
Export a clip
"Clip the part where they talk about pricing"
The agent uses search_audio or deep_search to find the moment, then clip_export to extract just that segment.
Separate vocals from noisy audio
"This recording has music in the background, clean it up and transcribe"
The agent calls separate_audio to isolate vocals, then transcribe_audio on the clean vocals track.
Generate speech from text
"Read these notes aloud"
The agent calls text_to_speech to generate an MP3 with natural speech. Supports multiple voices and languages.
Note Styles
When using take_notes, the style parameter controls formatting:
| Style | Description |
|---|---|
| `tldr` | Shortest possible summary. One screen. Bold key terms. |
| `notes` | Clean sections with nested bullets (default). |
| `highlight` | Notes with callout blocks for key insights and blockquotes with timestamps. |
| `eye-candy` | Maximum visual formatting: callouts, tables, checklists, blockquotes. |
| `quiz` | Multiple-choice questions with answer key. |
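Agents normally issue the tool call for you, but it can help to see roughly what such a request looks like. The sketch below builds an MCP `tools/call` JSON-RPC message for `take_notes`. The `style` parameter and its five values come from the table above; the `url` argument name and the `take_notes_request` helper are assumptions, so check the tool's published schema before relying on them.

```python
import json

# The five styles listed in the table above.
NOTE_STYLES = ("tldr", "notes", "highlight", "eye-candy", "quiz")


def take_notes_request(url: str, style: str = "notes", request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run take_notes."""
    if style not in NOTE_STYLES:
        raise ValueError(f"unknown style {style!r}; expected one of {NOTE_STYLES}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "take_notes",
            # Assumption: the tool accepts "url"; "style" is documented above.
            "arguments": {"url": url, "style": style},
        },
    }


if __name__ == "__main__":
    print(json.dumps(take_notes_request("https://youtube.com/watch?v=xxx", "quiz"), indent=2))
```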
Model Sizes
`tiny` is the default and handles nearly everything. Only use larger models for heavy accents, poor audio quality, or maximum accuracy needs.
| Model | Speed | Accuracy |
|---|---|---|
| tiny | Fastest | Excellent (default) |
| base | Fast | Excellent |
| small | Medium | Superior |
| medium | Slow | Outstanding |
| large | Slowest | Maximum |
File Paths
Augent reads and writes to these locations on your machine:
| Path | Purpose |
|---|---|
| `~/Downloads/` | Default directory for downloaded audio files |
| `~/Desktop/` | Default directory for notes, clips, and TTS output |
| `~/.augent/memory/transcriptions.db` | SQLite database for persistent transcription memory |
| `~/.augent/memory/transcriptions/` | Markdown files for each stored transcription |
| `~/.augent/config.yaml` | User configuration (optional) |
| `~/.augent/auth.json` | Twitter/X authentication cookies for Spaces recording (optional, user-created) |
If Obsidian is installed, visual frames are saved to the Obsidian vault's External Files/visual/ directory. The vault path is auto-detected from Obsidian's config.
Network Access
Network access is used for two purposes only:
1. Downloading media from user-provided URLs via yt-dlp and aria2c
2. Downloading ML models on first use (Whisper, sentence-transformers, pyannote, Demucs, Kokoro) from Hugging Face
No telemetry. No background network activity. No data is uploaded.
ML Dependencies
The augent[all] install includes these local ML components:
| Component | Purpose | Size |
|---|---|---|
| faster-whisper | Speech-to-text transcription | ~75MB (tiny model) |
| sentence-transformers | Semantic search, auto-tagging, chapter detection | ~90MB |
| pyannote-audio | Speaker diarization | ~29MB |
| Demucs v4 | Audio source separation (vocals from noise) | ~80MB |
| Kokoro | Text-to-speech (54 voices, 9 languages) | ~200MB |
All models run locally. None require API keys or cloud services.
Requirements
- Python 3.10+
- FFmpeg (audio processing)
- yt-dlp + aria2c (for audio downloads)
Links
- [GitHub](https://github