ClawText Ingest — Production-Ready Memory Ingestion
Version: 1.3.0 | License: MIT | Status: Production ✅
Author: ragesaq | Category: Memory & Knowledge Management
GitHub: https://github.com/ragesaq/clawtext-ingest
🎯 What It Does
ClawText Ingest transforms external data (Discord forums, files, URLs, JSON, text) into structured, deduplicated memories for AI agents.
The Problem It Solves
- ❌ Manual ingestion: tedious, error-prone, no metadata
- ❌ Duplicate memories: the same data ingested multiple times
- ❌ Unstructured data: no hierarchy, no context preservation
- ❌ One-time imports: no recurring or scheduled ingestion
- ❌ Discord-specific gaps: forum post↔reply structure cannot be preserved
The Solution
✅ One command imports from Discord, files, URLs, or JSON
✅ 100% idempotent — Run 1000x, zero duplicates
✅ Automatic metadata — YAML frontmatter with date, project, type, entities
✅ 6 agent patterns — Autonomous workflows documented and ready
✅ Discord-native — Forum hierarchy preserved, progress bars, auto-batch mode
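To make the "automatic metadata" point concrete, here is a minimal sketch of a memory rendered as Markdown with YAML frontmatter. The helper name `toMemoryDocument` is hypothetical, and the exact layout ClawText Ingest writes may differ; only the four field names (date, project, type, entities) come from the list above.

```javascript
// Hypothetical helper, for illustration only: renders one memory as
// Markdown with a YAML frontmatter header.
function toMemoryDocument(content, { date, project, type, entities = [] }) {
  const lines = [
    '---',
    `date: ${date}`,
    `project: ${project}`,
    `type: ${type}`,
    // Flow-style YAML sequence; JSON-quoted strings are valid YAML scalars.
    `entities: [${entities.map((e) => JSON.stringify(e)).join(', ')}]`,
    '---',
    '',
    content.trim(),
    '',
  ];
  return lines.join('\n');
}

const doc = toMemoryDocument('We chose Postgres for the audit log.', {
  date: '2024-01-15',
  project: 'docs',
  type: 'fact',
  entities: ['Postgres', 'audit log'],
});
console.log(doc);
```

Keeping the metadata in frontmatter means the memory body stays plain Markdown while project and type remain machine-readable.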
✨ Key Features
🎯 Discord Integration (New in v1.3.0)
- Forum + Channel + Thread support
- Hierarchy preservation: post↔reply structure kept in metadata
- Real-time progress: live feedback for large ingestions
- Auto-batch mode: fewer than 500 posts run in full mode, 500 or more in streaming mode
- One-command setup: 5-minute bot creation
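The auto-batch rule above amounts to a single threshold check. The sketch below assumes the mode labels `'full'` and `'streaming'`, which are illustrative names rather than confirmed option values; only the 500-post threshold comes from the list.

```javascript
// Threshold from the feature list; the mode names are illustrative labels.
const STREAMING_THRESHOLD = 500;

// Picks an ingestion mode from the number of posts in a forum.
function chooseIngestMode(postCount) {
  if (!Number.isInteger(postCount) || postCount < 0) {
    throw new RangeError(`postCount must be a non-negative integer, got ${postCount}`);
  }
  return postCount < STREAMING_THRESHOLD ? 'full' : 'streaming';
}

console.log(chooseIngestMode(120)); // full
console.log(chooseIngestMode(500)); // streaming
```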
📁 Multi-Source Ingestion
- Files: glob patterns (Markdown, text, etc.)
- URLs: single or bulk URL ingestion
- JSON: chat exports, API responses
- Raw text: quick knowledge capture
- Batch operations: unified ingestion from multiple sources
🔄 Deduplication & Safety
- SHA1-based: cryptographic hash matching
- 100% idempotent: safe for repeated runs
- Configurable: `checkDedupe: true/false` per operation
- Zero data loss: failed items are tracked, with fallback to per-item ingestion
- Hash persistence: `.ingest_hashes.json` for cross-session tracking
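The deduplication idea can be sketched in a few lines of Node.js. This is not the library's implementation: the store layout (a JSON array of hex digests) and the function names are assumptions made for illustration. Only the use of SHA1 and a persisted hash file come from the list above.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Assumed store layout: a JSON array of hex digests.
function loadHashes(file) {
  try {
    return new Set(JSON.parse(fs.readFileSync(file, 'utf8')));
  } catch (err) {
    if (err.code === 'ENOENT') return new Set(); // first run: nothing seen yet
    throw err;
  }
}

function sha1(text) {
  return crypto.createHash('sha1').update(text, 'utf8').digest('hex');
}

// Returns only items whose content has not been seen before, and
// records their hashes so the next run skips them.
function filterNew(items, file) {
  const seen = loadHashes(file);
  const fresh = [];
  for (const item of items) {
    const digest = sha1(item);
    if (seen.has(digest)) continue;
    seen.add(digest);
    fresh.push(item);
  }
  fs.writeFileSync(file, JSON.stringify([...seen], null, 2));
  return fresh;
}

// Demo against a throwaway hash file in the temp directory.
const hashFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'ingest-')), 'hashes.json');
const firstRun = filterNew(['a', 'b', 'a'], hashFile);
const secondRun = filterNew(['a', 'c'], hashFile);
console.log(firstRun, secondRun);
```

Because the hash set is written back after every run, repeating the same ingestion yields an empty result, which is what makes the operation idempotent.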
🤖 Agent-Ready
- 6 documented patterns: Direct API, Discord Agent, CLI, Cron, Batch, Thread
- Working code examples: copy-paste ready
- Real-world patterns: GitHub sync, Discord monitoring, team decisions
- Error handling: comprehensive error recovery
- Progress callbacks: track ingestion in real time
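Error recovery and progress callbacks can work together roughly as follows. This is a sketch under stated assumptions: `ingestBatch` and `ingestOne` are hypothetical stand-ins for whatever actually writes memories, and the shape of the progress object is invented for the example.

```javascript
// Tries the whole batch first; if that fails, retries item by item so a
// single bad item cannot sink the rest. All names here are hypothetical.
async function ingestWithFallback(items, { ingestBatch, ingestOne, onProgress = () => {} }) {
  const failed = [];
  try {
    await ingestBatch(items);
    onProgress({ done: items.length, total: items.length, percent: 100 });
    return { ingested: items.length, failed };
  } catch {
    // Batch failed: fall through to per-item ingestion below.
  }
  let ingested = 0;
  for (const [i, item] of items.entries()) {
    try {
      await ingestOne(item);
      ingested += 1;
    } catch (err) {
      failed.push({ item, error: err.message }); // tracked, not dropped
    }
    const done = i + 1;
    onProgress({ done, total: items.length, percent: Math.round((done / items.length) * 100) });
  }
  return { ingested, failed };
}

// Demo with stubbed writers: the batch call fails, and so does one item.
const demo = ingestWithFallback(['a', 'bad', 'c'], {
  ingestBatch: async () => { throw new Error('batch rejected'); },
  ingestOne: async (item) => { if (item === 'bad') throw new Error('malformed'); },
  onProgress: (p) => console.log(`${p.percent}% complete`),
});
demo.then((result) => console.log(result));
```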
🛠️ Developer-Friendly
- CLI tool: `clawtext-ingest` + `clawtext-ingest-discord` commands
- Node.js API: simple imports for programmatic use
- TypeScript-ready: clear method signatures
- Extensible: custom transforms, field mapping
- Well-documented: 11 guides, 20+ examples
🔗 ClawText Integration
- Automatic cluster indexing: new memories are indexed after a rebuild
- RAG injection: relevant context injected into agent prompts
- Project routing: organize memories by project/source
- Entity linking: auto-extract and link related entities
🚀 Quick Start
Installation
CODEBLOCK0
Discord Ingestion (5 minutes)
CODEBLOCK1
File Ingestion
CODEBLOCK2
Node.js API
CODEBLOCK3
🤖 Agent Integration (6 Patterns)
Pattern 1: Direct API
For: In-agent code
Use when: Agents need to ingest as part of workflow
CODEBLOCK4
Pattern 2: Discord Agent
For: Autonomous Discord ingestion
Use when: Agents need to fetch Discord forums
CODEBLOCK5
Pattern 3: CLI Subprocess
For: Agents executing commands
Use when: Simpler CLI-based execution needed
CODEBLOCK6
Pattern 4: Cron/Scheduled
For: Recurring tasks
Use when: Daily/hourly ingestion needed
CODEBLOCK7
Pattern 5: Batch Multi-Source
For: Unified ingestion
Use when: Multiple sources in one operation
CODEBLOCK8
Pattern 6: Discord Thread
For: Thread-specific ingestion
Use when: Single thread fetch needed
CODEBLOCK9
→ See AGENT_GUIDE.md for complete examples
📊 Real-World Examples
Example 1: Daily Documentation Sync
CODEBLOCK10
Example 2: Discord Forum Monitoring
CODEBLOCK11
Example 3: Team Decisions Ingestion
CODEBLOCK12
🛒 CLI Commands
clawtext-ingest — File/URL/JSON/Text Ingestion
CODEBLOCK13
clawtext-ingest-discord — Discord Integration
CODEBLOCK14
📚 Documentation
| Guide | Purpose | Read time |
|---|---|---|
| QUICKSTART.md | 5-minute setup | 5 min |
| AGENT_GUIDE.md | 6 autonomous patterns | 10 min |
| API_REFERENCE.md | Complete API docs | 15 min |
| PHASE2CLI_GUIDE.md | CLI commands | 10 min |
| DISCORDBOT_SETUP.md | Bot creation | 5 min |
| CLAYHUB_GUIDE.md | Publication | 5 min |
| INDEX.md | Documentation index | 2 min |
🎯 Who Should Use This
- ✅ AI/Agent developers: building knowledge-aware agents
- ✅ RAG engineers: populating memory for context injection
- ✅ Teams using Discord: leveraging Discord as a knowledge base
- ✅ DevOps/MLOps: automated knowledge ingestion pipelines
- ✅ Researchers: structuring unstructured data sources
⚡ Performance
| Operation | Approx. time | Notes |
|---|---|---|
| Ingest 100 files | ~5 sec | With SHA1 dedup check |
| Ingest 1000 JSON items | ~15 sec | Batch processing |
| Small forum (<100 msgs) | ~10 sec | Full mode |
| Large forum (1000+ msgs) | ~2 min | Auto-batch, streaming |
| Rebuild clusters | ~5-30 sec | Depends on total memories |
✅ Quality Metrics
| Metric | Value |
|---|---|
| Tests | 22/22 passing ✅ |
| Code | 1,254 production lines |
| Documentation | 92 KB across 11 guides |
| Examples | 20+ working examples |
| Coverage | 100% critical paths |
🔗 Integration with ClawText
1. Ingest data: creates memories with YAML metadata
2. Rebuild clusters: ClawText indexes the new memories
3. RAG layer: relevant context is injected on the next prompt
4. Agent response: enhanced with contextual information
CODEBLOCK15
🆘 Support
- Documentation: See INDEX.md for navigation
- Issues: https://github.com/ragesaq/clawtext-ingest/issues
- Examples: 20+ examples in documentation
- Troubleshooting: Built into each guide
📦 Installation & Requirements
Requirements:
- Node.js ≥ 18.0.0
- OpenClaw (for agent patterns)
- ClawText ≥ 1.2.0 (for RAG integration)
Installation:
CODEBLOCK16
Binaries:
- `clawtext-ingest`: file/URL/JSON ingestion
- `clawtext-ingest-discord`: Discord integration
🚀 Why This Over Alternatives
| Feature | ClawText-Ingest | Manual | Generic Importer | API Tool |
|---|---|---|---|---|
| Discord native | ✅ | ❌ | ❌ | ❌ |
| Deduplication | ✅ | ❌ | Partial | ❌ |
| Agent patterns | ✅ | ❌ | ❌ | ❌ |
| Metadata auto | ✅ | ❌ | Partial | ❌ |
| ClawText integration | ✅ | ❌ | ❌ | ❌ |
| Idempotent | ✅ | ❌ | ❌ | Partial |
📄 License
MIT — Use freely, open source, community supported
🙌 Contributing
Contributions welcome! See GitHub issues for current priorities.
Ready to ingest? Start with QUICKSTART.md (5 min) or AGENT_GUIDE.md if you're building agents.
ClawText Ingest: Production-Ready Memory Ingestion
Version: 1.3.0 | License: MIT | Status: Production ✅
Author: ragesaq | Category: Memory & Knowledge Management
GitHub: https://github.com/ragesaq/clawtext-ingest
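🎯 What It Does
ClawText Ingest transforms external data (Discord forums, files, URLs, JSON, text) into structured, deduplicated memories for AI agents.
The Problem It Solves
- ❌ Manual ingestion: tedious, error-prone, no metadata
- ❌ Duplicate memories: the same data ingested multiple times
- ❌ Unstructured data: no hierarchy, no context preservation
- ❌ One-time imports: no recurring or scheduled ingestion
- ❌ Discord-specific gaps: forum post↔reply structure cannot be preserved
The Solution
✅ One command imports from Discord, files, URLs, or JSON
✅ 100% idempotent: run it 1000 times, zero duplicates
✅ Automatic metadata: YAML frontmatter with date, project, type, entities
✅ 6 agent patterns: autonomous workflows documented and ready to use
✅ Discord-native: forum hierarchy preserved, progress bars, auto-batch mode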
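✨ Key Features
🎯 Discord Integration (New in v1.3.0)
- Forum + Channel + Thread support
- Hierarchy preservation: post↔reply structure kept in metadata
- Real-time progress: live feedback for large ingestions
- Auto-batch mode: fewer than 500 posts run in full mode, 500 or more in streaming mode
- One-command setup: 5-minute bot creation
📁 Multi-Source Ingestion
- Files: glob patterns (Markdown, text, etc.)
- URLs: single or bulk URL ingestion
- JSON: chat exports, API responses
- Raw text: quick knowledge capture
- Batch operations: unified ingestion from multiple sources
🔄 Deduplication & Safety
- SHA1-based: cryptographic hash matching
- 100% idempotent: safe for repeated runs
- Configurable: `checkDedupe: true/false` per operation
- Zero data loss: failed items are tracked, with fallback to per-item ingestion
- Hash persistence: `.ingest_hashes.json` for cross-session tracking
🤖 Agent-Ready
- 6 documented patterns: Direct API, Discord Agent, CLI, Cron, Batch, Thread
- Working code examples: copy-paste ready
- Real-world patterns: GitHub sync, Discord monitoring, team decisions
- Error handling: comprehensive error recovery
- Progress callbacks: track ingestion in real time
🛠️ Developer-Friendly
- CLI tool: `clawtext-ingest` + `clawtext-ingest-discord` commands
- Node.js API: simple imports for programmatic use
- TypeScript-ready: clear method signatures
- Extensible: custom transforms, field mapping
- Well-documented: 11 guides, 20+ examples
🔗 ClawText Integration
- Automatic cluster indexing: new memories are indexed after a rebuild
- RAG injection: relevant context injected into agent prompts
- Project routing: organize memories by project/source
- Entity linking: auto-extract and link related entities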
🚀 Quick Start
Installation
```bash
# Via npm
npm install clawtext-ingest
# Via OpenClaw
openclaw install clawtext-ingest
```
Discord Ingestion (5 minutes)
```bash
# 1. Set up the Discord bot (see DISCORDBOT_SETUP.md)
# 2. Get the bot token and set the DISCORD_TOKEN environment variable
# 3. Inspect the forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose
# 4. Ingest with a progress bar
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord --forum-id FORUM_ID
# 5. Rebuild the ClawText clusters
clawtext-ingest rebuild
```
File Ingestion
```bash
clawtext-ingest ingest-files --input="docs/*.md" --project=docs
```
Node.js API
```javascript
import { ClawTextIngest } from 'clawtext-ingest';

const ingest = new ClawTextIngest();

// Ingest files
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs', type: 'fact' });

// Ingest JSON; keyMap names the fields that hold content, date, and author
await ingest.fromJSON(chatArray, { project: 'team' }, {
  keyMap: { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' }
});

// Rebuild clusters so the new memories are available for RAG injection
await ingest.rebuildClusters();
```
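The `keyMap` option tells the ingester which fields of each JSON record hold the content, date, and author. The sketch below re-implements that mapping step for illustration only: `mapRecord` and the output field names are hypothetical, and the library's real behavior (for example how it treats empty messages) may differ.

```javascript
// Hypothetical re-implementation of the field-mapping step.
function mapRecord(record, keyMap) {
  const { contentKey, dateKey, authorKey } = keyMap;
  const content = record[contentKey];
  if (typeof content !== 'string' || content.trim() === '') {
    return null; // nothing worth ingesting
  }
  return {
    content,
    date: record[dateKey] ?? null,
    author: record[authorKey] ?? null,
  };
}

const sampleChat = [
  { message: 'Ship v1.3.0 on Friday', timestamp: '2024-03-01T10:00:00Z', user: 'alice' },
  { message: '', timestamp: '2024-03-01T10:01:00Z', user: 'bob' },
];
const sampleKeyMap = { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' };

// Records that map to null (here, the empty message) are dropped.
const mapped = sampleChat.map((r) => mapRecord(r, sampleKeyMap)).filter(Boolean);
console.log(mapped);
```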
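🤖 Agent Integration (6 Patterns)
Pattern 1: Direct API
For: In-agent code
Use when: Agents need to ingest as part of a workflow
```javascript
import { ClawTextIngest } from 'clawtext-ingest';

const ingest = new ClawTextIngest();
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs' });
```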
Pattern 2: Discord Agent
For: Autonomous Discord ingestion
Use when: Agents need to fetch Discord forums
```javascript
const runner = new DiscordIngestionRunner(ingest);
await runner.ingestForumAutonomous({
  forumId, mode: 'batch', token: process.env.DISCORD_TOKEN
});
```
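Pattern 3: CLI Subprocess
For: Agents executing commands
Use when: Simpler CLI-based execution is needed
```javascript
// execAsync is assumed to be promisify(exec), from node:util and node:child_process
await execAsync('clawtext-ingest-discord fetch-discord --forum-id ID');
```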
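A slightly more defensive version of the subprocess pattern is sketched below. It assumes the `clawtext-ingest-discord` binary is on `PATH` and that `DISCORD_TOKEN` is already set in the environment; `buildFetchArgs` and `fetchForum` are hypothetical helper names. It uses `execFile` rather than `exec` so the forum ID is passed as an argument and never interpreted by a shell.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Builds the argument list for the fetch-discord subcommand shown above.
// Discord IDs are numeric snowflakes, so anything else is rejected early.
function buildFetchArgs(forumId) {
  if (!/^\d+$/.test(String(forumId))) {
    throw new TypeError(`forumId must be numeric, got ${forumId}`);
  }
  return ['fetch-discord', '--forum-id', String(forumId)];
}

// Runs the CLI and resolves with its stdout. The child process inherits
// process.env, so DISCORD_TOKEN must be set before calling this.
async function fetchForum(forumId) {
  const { stdout } = await execFileAsync('clawtext-ingest-discord', buildFetchArgs(forumId));
  return stdout;
}

console.log(buildFetchArgs('123456789012345678'));
```

Validating the ID before spawning keeps malformed input from ever reaching the command line.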
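Pattern 4: Cron/Scheduled
For: Recurring tasks
Use when: Daily or hourly ingestion is needed
```javascript
// Assumed expression: minute 0 of every hour; adjust to match your schedule
cron.schedule('0 * * * *', () => agentIngest());
```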
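Pattern 5: Batch Multi-Source
For: Unified ingestion
Use when: Multiple sources are ingested in one operation
```javascript
await ingest.ingestAll([
  { type: 'files', data: ['docs/**/*.md'], metadata: { /* ... */ } },
  { type: 'json', data: chatExport, metadata: { /* ... */ } }
]);
```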
Pattern 6: Discord Thread
For: Thread-specific ingestion
Use when: A single thread needs to be fetched
```javascript
await runner.ingestThread(threadId);
```
→ See AGENT_GUIDE.md for complete examples
📊 Real-World Examples
Example 1: Daily Documentation Sync
```javascript
async function syncDocsDaily() {
  const ingest = new ClawTextIngest();
  // Local Markdown docs and a hosted API page, ingested in one call
  const result = await ingest.ingestAll([
    { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
    { type: 'urls', data: ['https://docs.example.com/api'], metadata: { project: 'api-docs' } }
  ]);
  await ingest.rebuildClusters();
  return result;
}
```
Example 2: Discord Forum Monitoring
```javascript
async function monitorDiscordForum(forumId) {
  const ingest = new ClawTextIngest();
  const runner = new DiscordIngestionRunner(ingest);
  const result = await runner.ingestForumAutonomous({
    forumId,
    mode: 'batch',
    token: process.env.DISCORD_TOKEN,
    // Log progress as the forum is fetched
    onProgress: (p) => console.log(`${p.percent}% complete...`)
  });
  return result;
}
```
Example 3: Team Decisions Ingestion
```javascript
async function ingestTeamDecisions() {
  const ingest = new ClawTextIngest();
  // Architecture decision records plus an exported Slack thread
  const result = await ingest.ingestAll([
    { type: 'files', data: ['decisions/adr/**/*.md'], metadata: { type: 'adr' } },
    { type: 'json', data: slackThread, metadata: { type: 'decision', source: 'slack' } }
  ]);
  await ingest.rebuildClusters();
  return result;
}
```
🛒 CLI Commands
clawtext