Skill Veille - RSS Aggregator
RSS feed aggregator with URL deduplication and topic-based deduplication for OpenClaw agents.
Fetches articles from 20+ configured sources, filters already-seen URLs (TTL 14 days),
and deduplicates articles covering the same story using Jaccard similarity + named entities.
No external dependencies: stdlib Python only (urllib, xml.etree, email.utils).
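As a rough illustration of the topic deduplication idea, the sketch below compares two headlines by the Jaccard similarity of their word sets. The tokenizer, the `same_story` helper, and the 0.4 threshold are assumptions made for this example, not values taken from the skill, and the named-entity part of the comparison is left out.

```python
import re


def tokens(title: str) -> set:
    """Lowercase alphanumeric tokens of length >= 3 (a crude stand-in for real tokenization)."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) >= 3}


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_story(title_a: str, title_b: str, threshold: float = 0.4) -> bool:
    """Hypothetical decision rule: treat two titles as one story above the threshold."""
    return jaccard(tokens(title_a), tokens(title_b)) >= threshold
```

With these definitions, "Critical OpenSSL flaw patched" and "OpenSSL patches critical flaw" share three of five distinct tokens, so they would be treated as the same story at the assumed threshold.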
Trigger phrases
- "fais une veille"
- "quoi de neuf en securite / tech / crypto / IA ?"
- "donne-moi les news du jour"
- "articles recents sur [sujet]"
- "veille RSS"
- "digest du matin"
- "nouvelles non vues"
Quick Start
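```bash
# 1. Setup
python3 scripts/setup.py
# 2. Validate
python3 scripts/init.py
# 3. Fetch + score + send (full pipeline)
python3 scripts/veille.py fetch --filter-seen --filter-topic \
  | python3 scripts/veille.py score \
  | python3 scripts/veille.py send
```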
Setup
Requirements
- Python 3.9+
- Network access to RSS feeds (public, no auth required)
- No pip installs needed
Installation
```bash
# From the skill directory
python3 scripts/setup.py
# Validate
python3 scripts/init.py
```
The wizard creates:
- ~/.openclaw/config/veille/config.json (from config.example.json)
- ~/.openclaw/data/veille/ (data directory)
Customizing sources
Edit ~/.openclaw/config/veille/config.json and add/remove entries in the "sources" dict:
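```json
{
  "sources": {
    "My Blog": "https://example.com/feed.xml",
    "BleepingComputer": "https://www.bleepingcomputer.com/feed/"
  }
}
```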
Storage and credentials
Files written by this skill
| Path | Written by | Purpose | Contains secrets |
|---|---|---|---|
| ~/.openclaw/config/veille/config.json | setup.py | Sources, outputs, options | NO |
| ~/.openclaw/data/veille/seen_urls.json | veille.py | URL dedup store (TTL 14d) | NO |
| ~/.openclaw/data/veille/topic_seen.json | veille.py | Topic dedup store (TTL 5d) | NO |
Files read from outside the skill
| Path | Read by | Key accessed | When |
|---|---|---|---|
| ~/.openclaw/openclaw.json | dispatch.py | channels.telegram.botToken (read-only) | Only when telegram_bot output is enabled and no bot_token is set in the output config |
This is the only cross-config read. To avoid it entirely, set bot_token explicitly in your output config:
```json
{ "type": "telegram_bot", "bot_token": "YOUR_BOT_TOKEN", "chat_id": "...", "enabled": true }
```
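The lookup order described above (explicit bot_token first, OpenClaw config only as a fallback) could be sketched as follows. The function name `resolve_bot_token` and its signature are hypothetical; only the key path channels.telegram.botToken and the stderr logging come from this document.

```python
import json
import sys
from pathlib import Path
from typing import Optional


def resolve_bot_token(output_cfg: dict, openclaw_json: Path) -> Optional[str]:
    """Prefer the explicit bot_token; otherwise read channels.telegram.botToken."""
    token = output_cfg.get("bot_token")
    if token:
        return token  # no cross-config read happens on this path
    # Fallback: log the cross-config read to stderr before performing it.
    print(f"reading Telegram bot token from {openclaw_json}", file=sys.stderr)
    data = json.loads(openclaw_json.read_text(encoding="utf-8"))
    return data.get("channels", {}).get("telegram", {}).get("botToken")
```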
Output credentials (optional)
Credentials are only used if you enable the corresponding output. None are required for core functionality (RSS fetch + dedup).
| Output | Credential source | What is used |
|---|---|---|
| telegram_bot | ~/.openclaw/openclaw.json or bot_token in output config | Bot token (read-only) |
| mail-client | Delegated to mail-client skill (its own creds) | Nothing read directly |
| mail-client (SMTP fallback) | smtp_user / smtp_pass in output config | SMTP login |
| nextcloud | Delegated to nextcloud-files skill (its own creds) | Nothing read directly |
Cleanup on uninstall
```bash
python3 scripts/setup.py --cleanup
```
Security model
Credential isolation
- API keys are read from dedicated files (default ~/.openclaw/secrets/), never from config.json. The scorer warns at runtime if a key file has overly permissive filesystem permissions.
- SMTP credentials (fallback only) are stored in the output config block. Use the mail-client skill delegation to avoid storing SMTP passwords.
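A permission warning of the kind described above might look like the sketch below on a POSIX system. The helper names and the exact rule (warn when group or others have any access) are assumptions for illustration; the skill's actual check may differ.

```python
import os
import stat
import sys
from pathlib import Path


def permissions_too_open(mode: int) -> bool:
    """True if group or others have any read/write/execute bit set."""
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def read_api_key(path: Path) -> str:
    """Read a key file, warning on stderr if its mode is wider than owner-only."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if permissions_too_open(mode):
        print(f"warning: {path} has mode {oct(mode)}; consider chmod 600", file=sys.stderr)
    return path.read_text(encoding="utf-8").strip()
```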
Subprocess boundaries
- Dispatch delegates to other OpenClaw skills via subprocess.run() (never shell=True). Script paths are validated to reside under ~/.openclaw/workspace/skills/ before execution, preventing path traversal.
- No credentials are passed as subprocess arguments. Each skill manages its own authentication.
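The boundary described above can be sketched as a path check followed by a shell-free subprocess.run() call. The names `is_under` and `run_skill_script` are hypothetical, and the real dispatcher may validate paths differently; the point is that the path is resolved before the containment test, so ".." segments and symlinks cannot escape the skills directory.

```python
import subprocess
import sys
from pathlib import Path

SKILLS_ROOT = Path.home() / ".openclaw" / "workspace" / "skills"


def is_under(path: Path, root: Path) -> bool:
    """True if the fully resolved path is root itself or lies inside it."""
    resolved, resolved_root = path.resolve(), root.resolve()
    return resolved == resolved_root or resolved_root in resolved.parents


def run_skill_script(script: Path, args: list, root: Path = SKILLS_ROOT):
    """Run another skill's script with an argument list, never through a shell."""
    if not is_under(script, root):
        raise ValueError(f"refusing to run script outside {root}: {script}")
    # A list of arguments with the default shell=False is never parsed by a shell.
    return subprocess.run(
        [sys.executable, str(script.resolve()), *args],
        capture_output=True, text=True, check=False,
    )
```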
File output safety
- The file output type validates the target path before writing: only ~/.openclaw/ is allowed by default. Additional directories can be whitelisted via config.security.allowed_output_dirs. Sensitive paths (.ssh, .gnupg, /etc/, .bashrc, etc.) are always blocked regardless of allowlist.
- Written content is checked for suspicious patterns (shell shebangs, SSH keys, PGP blocks, code injection) and size-limited to 1 MB.
Cross-config reads
- The only cross-config file read is ~/.openclaw/openclaw.json for the Telegram bot token, and only when telegram_bot output is enabled without an explicit bot_token. This read is logged to stderr. Set bot_token in the output config to eliminate this read entirely.
Autonomous dispatch
- When scheduled (cron), the skill can send messages/files to configured outputs without user interaction. All dispatch actions are logged to stderr with an audit summary. Use enabled: false on any output to disable it without removing its config.
CLI reference
fetch
```
python3 veille.py fetch [--hours N] [--filter-seen] [--filter-topic] [--sources FILE]
```
Options:
- --hours N : lookback window in hours (default: from config, usually 24)
- --filter-seen : filter already-seen URLs (uses seen_urls.json TTL store)
- --filter-topic : deduplicate by topic (uses topic_seen.json + Jaccard similarity)
- --sources FILE : path to custom JSON sources file
Output (JSON on stdout):
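```json
{
  "hours": 24,
  "count": 42,
  "skipped_url": 5,
  "skipped_topic": 3,
  "articles": [...],
  "wrapped_listing": "=== UNTRUSTED EXTERNAL CONTENT ..."
}
```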
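To make the lookback window concrete, here is a minimal stdlib-only sketch that parses an RSS 2.0 document and keeps items published within the last N hours, newest first. The function name and the per-article fields (title, url, published) are assumptions for illustration, since the article schema is not shown above; Atom feeds and network fetching are out of scope here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def recent_items(rss_xml: str, hours: int, now: datetime) -> list:
    """Return RSS 2.0 items published within the last `hours` (now must be tz-aware)."""
    cutoff = now - timedelta(hours=hours)
    found = []
    for item in ET.fromstring(rss_xml).iter("item"):
        raw_date = item.findtext("pubDate")
        if not raw_date:
            continue  # undated items cannot be placed in the window
        try:
            published = parsedate_to_datetime(raw_date)
        except (TypeError, ValueError):
            continue  # skip items whose date cannot be parsed
        if published.tzinfo is None:
            published = published.replace(tzinfo=timezone.utc)
        if published >= cutoff:
            found.append((published, item))
    found.sort(key=lambda pair: pair[0], reverse=True)  # date descending
    return [
        {
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "published": published.isoformat(),
        }
        for published, item in found
    ]
```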
seen-stats
```
python3 veille.py seen-stats
```
Shows URL seen store statistics (count, TTL, file path).
topic-stats
```
python3 veille.py topic-stats
```
Shows topic deduplication store statistics.
mark-seen
```
python3 veille.py mark-seen URL [URL ...]
```
Marks one or more URLs as already seen (prevents them from appearing in future fetches with --filter-seen).
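One way a TTL-based seen store could work is sketched below: a JSON file mapping each URL to the time it was first seen, with expired entries dropped on load. The file layout, the function names, and the `url` article field are assumptions for this example; only the 14-day TTL comes from this document.

```python
import json
import time
from pathlib import Path

TTL_SECONDS = 14 * 24 * 3600  # 14-day TTL, as stated for the URL store


def load_store(path: Path, now: float) -> dict:
    """Load {url: first_seen_timestamp}, dropping entries older than the TTL."""
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return {url: ts for url, ts in raw.items() if now - ts < TTL_SECONDS}


def mark_seen(path: Path, urls, now=None) -> dict:
    """Record URLs as seen, keeping the earliest timestamp for known URLs."""
    now = time.time() if now is None else now
    store = load_store(path, now)
    for url in urls:
        store.setdefault(url, now)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(store, indent=2), encoding="utf-8")
    return store


def filter_unseen(path: Path, articles: list, now=None) -> list:
    """Keep only articles whose URL is not in the (unexpired) store."""
    now = time.time() if now is None else now
    store = load_store(path, now)
    return [a for a in articles if a["url"] not in store]
```

Because expiry is applied at load time, a URL marked as seen reappears in results once its entry is older than the TTL.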
score
```
python3 veille.py score [--dry-run]
```
Reads a digest JSON from stdin (output of fetch) and scores articles using an OpenAI-compatible LLM.
Returns enriched JSON with scored, ghost_picks, and per-article score/reason fields.
Options:
- --dry-run : print summary on stderr without calling the LLM API
When llm.enabled is false (default), articles pass through unchanged ("scored": false).
Pipeline usage:
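```bash
python3 veille.py fetch --filter-seen --filter-topic | python3 veille.py score | python3 veille.py send
```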
send
```
python3 veille.py send [--profile NAME]
```
Reads a digest JSON from stdin and dispatches to all enabled outputs configured in config.json.
Accepts both raw fetch output (articles key) and LLM-processed digests (categories key).
Output types: telegram_bot, mail-client, nextcloud, file.
- telegram_bot : bot token auto-read from OpenClaw config; no extra setup if Telegram is already configured.
- mail-client : delegates to mail-client skill if installed, falls back to raw SMTP config.
- nextcloud : delegates to nextcloud-files skill if installed (append mode by default with date separator).
- file : writes digest to a local file. Path must be under ~/.openclaw/ (default) or a directory listed in config.security.allowed_output_dirs. Sensitive paths and suspicious content are blocked (see Security model).
Configure outputs interactively:
CODEBLOCK13
config
CODEBLOCK14
Prints the active configuration (no secrets).
LLM scoring configuration
The llm key in config.json controls the optional LLM-based article scoring:
CODEBLOCK15
| Key | Default | Description |
|---|---|---|
| enabled | false | Enable LLM scoring (requires API key) |
| INLINECODE79 | https://api.openai.com/v1 | OpenAI-compatible API endpoint |
| api_key_file | ~/.openclaw/secrets/openai_api_key | Path to file containing the API key |
| model | gpt-4o-mini | Model to use for scoring |
| top_n | 10 | Max articles to send to LLM per batch |
| ghost_threshold | 5 | Score threshold for ghost_picks (blog-worthy articles) |
Scoring rules:
- Only the first top_n articles are sent to the LLM. Articles beyond top_n are excluded from the digest entirely. fetch returns articles sorted by date desc, so top_n selects the most recent ones. Increase top_n to evaluate more articles per run (higher token cost).
- Score >= ghost_threshold : added to ghost_picks list
- Score >= 3 : kept in articles list
- Score <= 2 : excluded from output
- Articles are sorted by score (descending)
When disabled, the score subcommand passes data through unchanged.
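The scoring rules can be read as a small post-processing step once each article has a score. The sketch below assumes scores are already attached to the first top_n articles (no LLM call is made here) and that each article is a dict with a `score` field; the function name is hypothetical.

```python
def apply_scoring_rules(articles: list, top_n: int = 10, ghost_threshold: int = 5) -> dict:
    """Apply the documented thresholds to date-sorted, already-scored articles."""
    considered = articles[:top_n]  # articles beyond top_n are dropped entirely
    kept = [a for a in considered if a["score"] >= 3]  # score <= 2 is excluded
    kept.sort(key=lambda a: a["score"], reverse=True)  # highest score first
    ghost_picks = [a for a in kept if a["score"] >= ghost_threshold]
    return {"scored": True, "articles": kept, "ghost_picks": ghost_picks}
```

Note that an article at or above ghost_threshold appears in both lists under this reading, since the ghost_picks condition is a stricter version of the condition for staying in articles.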
Nextcloud output mode
The nextcloud output now defaults to append mode with a date separator. Each dispatch adds content below a ## YYYY-MM-DD HH:MM header, preserving previous entries.
Set "mode": "overwrite" in the output config to restore the old behavior:
CODEBLOCK16
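The difference between the two modes can be illustrated with a small pure function that computes the new file content. This is only a sketch under assumptions: the function name is hypothetical, and it assumes overwrite mode writes the new digest without a date header.

```python
from datetime import datetime


def merge_digest(existing: str, new_body: str, when: datetime, mode: str = "append") -> str:
    """Return the file content after one dispatch in the given mode."""
    if mode == "overwrite":
        return new_body  # assumed old behavior: replace the file content
    # Append mode: add the new entry below a "## YYYY-MM-DD HH:MM" header.
    section = f"## {when:%Y-%m-%d %H:%M}\n\n{new_body.strip()}\n"
    if not existing.strip():
        return section
    return existing.rstrip("\n") + "\n\n" + section
```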
File output configuration
The file output writes digests to the local filesystem. By default, only paths under ~/.openclaw/ are allowed. To authorize additional directories, use config.security.allowed_output_dirs:
CODEBLOCK17
Blocked paths (always rejected, even if inside an allowed directory):
.ssh, .gnupg, .config/systemd, crontab, /etc/, .bashrc, .profile, .bash_profile, .zshrc, INLINECODE113
Content validation. Written content is rejected if it:
- Exceeds 1 MB
- Contains shell shebangs (#!/), SSH keys, PGP blocks, or code injection patterns (eval(, exec(, __import__(, import os, import subprocess)
All blocked attempts are logged to stderr with the reason.
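The two checks described in this section could be approximated as below. This is a sketch, not the skill's implementation: the function names are hypothetical, blocked names are matched as whole path components, and the SSH/PGP marker strings are assumed examples of what such patterns might look like.

```python
from pathlib import Path
from typing import Optional

MAX_BYTES = 1024 * 1024  # 1 MB limit from the text
BLOCKED_NAMES = {".ssh", ".gnupg", "crontab", ".bashrc", ".profile", ".bash_profile", ".zshrc"}
SUSPICIOUS = (
    "#!/", "eval(", "exec(", "__import__(", "import os", "import subprocess",
    "-----BEGIN PGP", "-----BEGIN OPENSSH PRIVATE KEY",  # assumed marker strings
)


def path_rejection(target: Path, allowed_dirs: list) -> Optional[str]:
    """Return a rejection reason for the target path, or None if it is acceptable."""
    resolved = target.expanduser().resolve()
    text = resolved.as_posix()
    # Sensitive locations are rejected first, so the allowlist cannot override them.
    if text.startswith("/etc/") or ".config/systemd" in text:
        return "sensitive path"
    if any(part in BLOCKED_NAMES for part in resolved.parts):
        return "sensitive path"
    for directory in allowed_dirs:
        root = directory.expanduser().resolve()
        if resolved == root or root in resolved.parents:
            return None
    return "outside allowed directories"


def content_rejection(content: str) -> Optional[str]:
    """Return a rejection reason for the content, or None if it is acceptable."""
    if len(content.encode("utf-8")) > MAX_BYTES:
        return "content exceeds 1 MB"
    for pattern in SUSPICIOUS:
        if pattern in content:
            return f"suspicious pattern: {pattern!r}"
    return None
```

A caller would pass the default directory plus any configured extras, for example `[Path("~/.openclaw"), *extra_dirs]`, and log the returned reason to stderr when it is not None.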
Templates (agent usage)
Basic digest
CODEBLOCK18
Prompt template
CODEBLOCK19
Agent workflow example
CODEBLOCK20
Pipeline (CLI)
CODEBLOCK21
Filtering by keyword (post-fetch)
CODEBLOCK22
Ideas
- Add keyword-based filtering (--keywords security,cve,linux)
- Add per-source TTL override in config
- Export digest as HTML or Markdown
- Schedule with cron: INLINECODE121
- Weight articles by source tier for LLM prioritization
- Add OPML import/export for source list management
- Integrate with ntfy or Telegram for real-time alerts on high-priority articles
Combine with
- mail-client : send the digest by email after fetching
CODEBLOCK23
- nextcloud-files : archive the daily digest as a Markdown file
```bash
veille fetch --filter-seen | jq -r .wrapped_listing > /tmp/digest.md
nextcloud-files upload /tmp/digest.md /Digests/$(date +%Y-%m-%d).md
```
Troubleshooting
See references/troubleshooting.md for detailed troubleshooting steps.
Common issues:
- No articles returned: check --hours value, verify feed URLs in config
- XML parse error on a feed: some feeds use non-standard XML; the skill skips broken items silently
- All articles filtered as seen: run seen-stats to check store size; reset with INLINECODE125
- Import error: ensure you run veille.py from its directory or via full path
- File output blocked: path is outside ~/.openclaw/. Add the target directory to config.security.allowed_output_dirs (see File output configuration)