Skill Veille - RSS Aggregator

RSS feed aggregator with URL deduplication and topic-based deduplication for OpenClaw agents.
Fetches articles from 20+ configured sources, filters already-seen URLs (TTL 14 days),
and deduplicates articles covering the same story using Jaccard similarity + named entities.

No external dependencies: stdlib Python only (urllib, xml.etree, email.utils).
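
As a rough illustration of what a stdlib-only feed reader can look like, the sketch below parses RSS 2.0 items with the three modules named above. The function names and the returned dictionary fields are hypothetical and are not taken from the skill's source; Atom feeds and namespaced elements are not handled.

```python
import urllib.request
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Union


def parse_rss(xml_data: Union[str, bytes]) -> list:
    """Extract title, link and publication date from RSS 2.0 <item> elements."""
    root = ET.fromstring(xml_data)
    articles = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        try:
            published = parsedate_to_datetime(raw_date) if raw_date else None
        except (TypeError, ValueError):
            published = None  # malformed date: keep the article, drop the date
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "published": published,
        })
    return articles


def fetch_feed(url: str, timeout: float = 10.0) -> list:
    """Download a feed over HTTP(S) and parse it (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_rss(resp.read())
```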

Trigger phrases

- "fais une veille"
- "quoi de neuf en securite / tech / crypto / IA ?"
- "donne-moi les news du jour"
- "articles recents sur [sujet]"
- "veille RSS"
- "digest du matin"
- "nouvelles non vues"

Quick Start

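  # 1. Setup
  python3 scripts/setup.py

  # 2. Validate
  python3 scripts/init.py

  # 3. Fetch + score + send (full pipeline)
  python3 scripts/veille.py fetch --filter-seen --filter-topic \
    | python3 scripts/veille.py score \
    | python3 scripts/veille.py send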

Setup

Requirements

- Python 3.9+
- Network access to RSS feeds (public, no auth required)
- No pip installs needed

Installation

  # From the skill directory
  python3 scripts/setup.py

  # Validate
  python3 scripts/init.py

The wizard creates:

- ~/.openclaw/config/veille/config.json (from config.example.json)
- ~/.openclaw/data/veille/ (data directory)

Customizing sources

Edit ~/.openclaw/config/veille/config.json and add/remove entries in the "sources" dict:

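  {
    "sources": {
      "My Blog": "https://example.com/feed.xml",
      "BleepingComputer": "https://www.bleepingcomputer.com/feed/"
    }
  }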

Storage and credentials

Files written by this skill

Path	Written by	Purpose	Contains secrets
~/.openclaw/config/veille/config.json	setup.py	Sources, outputs, options	NO
~/.openclaw/data/veille/seen_urls.json

Files read from outside the skill

Path	Read by	Key accessed	When
~/.openclaw/openclaw.json	dispatch.py	channels.telegram.botToken (read-only)	Only when `telegram_bot` output is enabled and no `bot_token` is set in the output config

This is the only cross-config read. To avoid it entirely, set bot_token explicitly in your output config:

  { "type": "telegram_bot", "bot_token": "YOUR_BOT_TOKEN", "chat_id": "...", "enabled": true }

Output credentials (optional)

Credentials are only used if you enable the corresponding output. None are required for core functionality (RSS fetch + dedup).

Output	Credential source	What is used
telegram_bot	~/.openclaw/openclaw.json or `bot_token` in output config	Bot token (read-only)
mail-client

Cleanup on uninstall

  python3 scripts/setup.py --cleanup

Security model

Credential isolation

- API keys are read from dedicated files (default ~/.openclaw/secrets/), never from config.json. The scorer warns at runtime if a key file has overly permissive filesystem permissions.
- SMTP credentials (fallback only) are stored in the output config block. Delegate to the mail-client skill to avoid storing SMTP passwords.
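
The permission warning described above could be implemented along these lines. This is a minimal sketch assuming POSIX permission bits; the function names and the warning text are hypothetical, not the scorer's actual code.

```python
import os
import stat
import sys


def key_file_too_permissive(path: str) -> bool:
    """Return True if group or others have any access to the key file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def read_api_key(path: str) -> str:
    """Read a key from a dedicated file, warning on loose permissions."""
    if key_file_too_permissive(path):
        print(f"warning: {path} is accessible by group/others; "
              "consider chmod 600", file=sys.stderr)
    with open(path, encoding="utf-8") as fh:
        return fh.read().strip()
```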

Subprocess boundaries

- Dispatch delegates to other OpenClaw skills via subprocess.run() (never shell=True). Script paths are validated to reside under ~/.openclaw/workspace/skills/ before execution, preventing path traversal.
- No credentials are passed as subprocess arguments; each skill manages its own authentication.
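
To make the path check concrete, here is a sketch of how a script path might be validated before being handed to subprocess.run(). The helper names are hypothetical; only the skills directory and the absence of shell=True come from the description above.

```python
import subprocess
import sys
from pathlib import Path

SKILLS_ROOT = Path.home() / ".openclaw" / "workspace" / "skills"


def is_allowed_script(script: str, root: Path = SKILLS_ROOT) -> bool:
    """True only if the resolved script path stays under the skills root."""
    resolved = Path(script).expanduser().resolve()
    # resolve() collapses ".." segments and symlinks before the comparison
    return resolved.is_relative_to(root.resolve())


def run_skill(script: str, args: list) -> subprocess.CompletedProcess:
    """Run another skill's script with an argument list, never via a shell."""
    if not is_allowed_script(script):
        raise PermissionError(f"script outside skills directory: {script}")
    return subprocess.run([sys.executable, script, *args],
                          capture_output=True, text=True, check=False)
```

Resolving the path first matters: a string prefix check alone would accept something like `skills/../../evil.py`.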

File output safety

- The file output type validates the target path before writing: only ~/.openclaw/ is allowed by default. Additional directories can be allowlisted via config.security.allowed_output_dirs. Sensitive paths (.ssh, .gnupg, /etc/, .bashrc, etc.) are always blocked regardless of the allowlist.
- Written content is checked for suspicious patterns (shell shebangs, SSH keys, PGP blocks, code injection) and size-limited to 1 MB.
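
The allowlist-plus-blocklist logic can be sketched as follows. The function name and the reduced set of blocked names are assumptions made for illustration; the skill's real list is longer and its matching rules may differ.

```python
from pathlib import Path

# Subset of the sensitive names mentioned in this document (illustrative only)
BLOCKED_NAMES = {".ssh", ".gnupg", ".bashrc", ".profile", ".bash_profile", ".zshrc"}


def is_allowed_output(path, allowed_dirs=()) -> bool:
    """Blocked names always lose; otherwise the path must sit under an allowed root."""
    resolved = Path(path).expanduser().resolve()
    if Path("/etc") in resolved.parents or BLOCKED_NAMES.intersection(resolved.parts):
        return False
    roots = [Path.home() / ".openclaw", *(Path(d).expanduser() for d in allowed_dirs)]
    return any(resolved.is_relative_to(root.resolve()) for root in roots)
```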

Cross-config reads

- The only cross-config file read is ~/.openclaw/openclaw.json for the Telegram bot token, and only when telegram_bot output is enabled without an explicit bot_token. This read is logged to stderr. Set bot_token in the output config to eliminate this read entirely.
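
The fallback order described above might look like this in code. It is a sketch under assumptions: the function name and log message are hypothetical, and the nested key `channels.telegram.botToken` is assumed to be where the token lives in openclaw.json.

```python
import json
import sys
from pathlib import Path
from typing import Optional

OPENCLAW_CONFIG = Path.home() / ".openclaw" / "openclaw.json"


def resolve_bot_token(output_cfg: dict,
                      openclaw_path: Path = OPENCLAW_CONFIG) -> Optional[str]:
    """Prefer an explicit bot_token; otherwise fall back to the OpenClaw config."""
    token = output_cfg.get("bot_token")
    if token:
        return token  # explicit token: the cross-config read never happens
    print(f"[dispatch] reading Telegram bot token from {openclaw_path}",
          file=sys.stderr)
    try:
        data = json.loads(openclaw_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    return data.get("channels", {}).get("telegram", {}).get("botToken")
```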

Autonomous dispatch

- When scheduled (cron), the skill can send messages/files to configured outputs without user interaction. All dispatch actions are logged to stderr with an audit summary. Use enabled: false on any output to disable it without removing its config.

CLI reference

`fetch`

  python3 veille.py fetch [--hours N] [--filter-seen] [--filter-topic] [--sources FILE]

Options:

- --hours N : lookback window in hours (default: from config, usually 24)
- --filter-seen : filter already-seen URLs (uses the seen_urls.json TTL store)
- --filter-topic : deduplicate by topic (uses topic_seen.json + Jaccard similarity)
- --sources FILE : path to a custom JSON sources file

Output (JSON on stdout):
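  {
    "hours": 24,
    "count": 42,
    "skipped_url": 5,
    "skipped_topic": 3,
    "articles": [...],
    "wrapped_listing": "=== UNTRUSTED EXTERNAL CONTENT ..."
  }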

`seen-stats`

  python3 veille.py seen-stats

Shows URL seen store statistics (count, TTL, file path).

`topic-stats`

  python3 veille.py topic-stats

Shows topic deduplication store statistics.
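
The topic store exists to support similarity-based deduplication. As a rough illustration of the Jaccard part only, the sketch below compares title tokens. The tokenizer, the 0.5 threshold, and the function names are assumptions; the skill also uses named entities, which this example leaves out.

```python
import re


def tokens(title: str) -> set:
    """Lowercase alphanumeric tokens longer than two characters (hypothetical tokenizer)."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) > 2}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedupe_by_topic(titles: list, threshold: float = 0.5) -> list:
    """Keep a title only if it is not too similar to one already kept."""
    kept = []  # list of (title, token set) pairs
    for title in titles:
        tok = tokens(title)
        if all(jaccard(tok, seen) < threshold for _, seen in kept):
            kept.append((title, tok))
    return [title for title, _ in kept]
```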

`mark-seen`

  python3 veille.py mark-seen URL [URL ...]

Marks one or more URLs as already seen (prevents them from appearing in future fetches with --filter-seen).
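
A TTL-based seen store can be sketched as a JSON map from URL to first-seen timestamp. The file layout and function names below are assumptions for illustration; only the 14-day TTL comes from this document.

```python
import json
import time
from pathlib import Path
from typing import Optional

TTL_SECONDS = 14 * 24 * 3600  # 14 days


def load_seen(path: Path, now: Optional[float] = None) -> dict:
    """Load {url: first_seen_timestamp}, dropping entries older than the TTL."""
    now = time.time() if now is None else now
    try:
        store = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return {}
    return {url: ts for url, ts in store.items() if now - ts < TTL_SECONDS}


def mark_seen(path: Path, urls: list, now: Optional[float] = None) -> dict:
    """Record URLs as seen, keeping the original timestamp of known ones."""
    now = time.time() if now is None else now
    store = load_seen(path, now)
    for url in urls:
        store.setdefault(url, now)
    path.write_text(json.dumps(store, indent=2), encoding="utf-8")
    return store


def filter_unseen(articles: list, store: dict) -> list:
    """Drop articles whose URL is already in the store."""
    return [a for a in articles if a["url"] not in store]
```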

`score`

  python3 veille.py score [--dry-run]

Reads a digest JSON from stdin (output of fetch) and scores articles using an OpenAI-compatible LLM.
Returns enriched JSON with scored, ghost_picks, and per-article score/reason fields.

Options:

- --dry-run : print summary on stderr without calling the LLM API

When llm.enabled is false (default), articles pass through unchanged ("scored": false).

Pipeline usage:
  python3 veille.py fetch --filter-seen --filter-topic | python3 veille.py score | python3 veille.py send

`send`

  python3 veille.py send [--profile NAME]

Reads a digest JSON from stdin and dispatches to all enabled outputs configured in config.json.
Accepts both raw fetch output (articles key) and LLM-processed digests (categories key).

Output types: telegram_bot, mail-client, nextcloud, file.

- telegram_bot: the bot token is read automatically from the OpenClaw config, so no extra setup is needed if Telegram is already configured.
- mail-client: delegates to the mail-client skill if installed; otherwise falls back to the raw SMTP config.
- nextcloud: delegates to the nextcloud-files skill if installed (append mode by default, with a date separator).
- file: writes the digest to a local file. The path must be under ~/.openclaw/ (default) or a directory listed in config.security.allowed_output_dirs. Sensitive paths and suspicious content are blocked (see Security model).

Configure outputs interactively:
CODEBLOCK13

`config`

CODEBLOCK14

Prints the active configuration (no secrets).

LLM scoring configuration

The llm key in config.json controls the optional LLM-based article scoring:

CODEBLOCK15

Key	Default	Description
enabled	false	Enable LLM scoring (requires API key)
INLINECODE79

Scoring rules:

- Only the first top_n articles are sent to the LLM. Articles beyond top_n are excluded from the digest entirely. fetch returns articles sorted by date descending, so top_n selects the most recent ones. Increase top_n to evaluate more articles per run (higher token cost).
- Score >= ghost_threshold : added to the ghost_picks list
- Score >= 3 : kept in the articles list
- Score <= 2 : excluded from output
- Articles are sorted by score (descending)

When disabled, the score subcommand passes data through unchanged.
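
The scoring rules can be expressed as a small post-processing step applied once the LLM has returned its scores. In this sketch the function name, the `url` key used to join scores to articles, and integer scores are assumptions; top_n and ghost_threshold are passed in explicitly because their defaults are not stated here.

```python
def apply_scores(articles: list, scores: dict, top_n: int, ghost_threshold: int) -> dict:
    """Apply the cut-offs above to LLM scores given as {url: score}."""
    candidates = articles[:top_n]  # anything beyond top_n is dropped entirely
    scored = [dict(a, score=scores[a["url"]]) for a in candidates if a["url"] in scores]
    scored.sort(key=lambda a: a["score"], reverse=True)
    return {
        "scored": True,
        "articles": [a for a in scored if a["score"] >= 3],
        "ghost_picks": [a for a in scored if a["score"] >= ghost_threshold],
    }
```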

Nextcloud output mode

The nextcloud output now defaults to append mode with a date separator. Each dispatch adds content below a ## YYYY-MM-DD HH:MM header, preserving previous entries.

Set "mode": "overwrite" in the output config to restore the old behavior:

CODEBLOCK16
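
The difference between the two modes can be sketched as a pure function over the existing file content. The function name and blank-line spacing are assumptions; only the `## YYYY-MM-DD HH:MM` header format and the mode names come from the description above.

```python
from datetime import datetime
from typing import Optional


def render_entry(existing: str, digest: str, mode: str = "append",
                 now: Optional[datetime] = None) -> str:
    """Return the new file content for a dispatch in append or overwrite mode."""
    if mode == "overwrite":
        return digest  # old behavior: previous entries are discarded
    now = now or datetime.now()
    header = now.strftime("## %Y-%m-%d %H:%M")
    prefix = existing.rstrip("\n") + "\n\n" if existing.strip() else ""
    return f"{prefix}{header}\n\n{digest}"
```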

File output configuration

The file output writes digests to the local filesystem. By default, only paths under ~/.openclaw/ are allowed. To authorize additional directories, use config.security.allowed_output_dirs:

CODEBLOCK17

Blocked paths (always rejected, even if inside an allowed directory):
.ssh, .gnupg, .config/systemd, crontab, /etc/, .bashrc, .profile, .bash_profile, .zshrc, INLINECODE113

Content validation: written content is rejected if it:

- Exceeds 1 MB
- Contains shell shebangs (#!/), SSH keys, PGP blocks, or code injection patterns (eval(, exec(, __import__(, import os, import subprocess)

All blocked attempts are logged to stderr with the reason.
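
A content check of this kind could be approximated with a size test and a handful of regular expressions. The patterns below are illustrative guesses at what "SSH keys" and "PGP blocks" mean in practice; they are not the skill's actual rules.

```python
import re
from typing import Optional

MAX_BYTES = 1024 * 1024  # 1 MB

SUSPICIOUS = [
    re.compile(r"^#!/", re.MULTILINE),                      # shell shebang
    re.compile(r"-----BEGIN (?:OPENSSH|RSA|DSA|EC) PRIVATE KEY-----"),
    re.compile(r"-----BEGIN PGP "),                         # PGP armored block
    re.compile(r"\b(?:eval|exec|__import__)\("),            # code injection
    re.compile(r"\bimport (?:os|subprocess)\b"),
]


def validate_content(text: str) -> Optional[str]:
    """Return a rejection reason, or None if the content looks acceptable."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        return "content exceeds 1 MB"
    for pattern in SUSPICIOUS:
        if pattern.search(text):
            return f"suspicious pattern: {pattern.pattern}"
    return None
```

Pattern matching like this will produce false positives, for example on an article that quotes `import os`, so a caller would normally log the reason rather than fail silently.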

Templates (agent usage)

Basic digest

CODEBLOCK18

Prompt template

CODEBLOCK19

Agent workflow example

CODEBLOCK20

Pipeline (CLI)

CODEBLOCK21

Filtering by keyword (post-fetch)

CODEBLOCK22
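
A post-fetch keyword filter can also be written in a few lines of Python operating on the fetch JSON. This sketch assumes each article carries a `title` field, which the output format shown in this document does not spell out; the function name is hypothetical.

```python
def filter_by_keywords(digest: dict, keywords: list) -> dict:
    """Keep only articles whose title contains at least one keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    matches = [
        a for a in digest.get("articles", [])
        if any(k in a.get("title", "").lower() for k in wanted)
    ]
    # Return a copy with the article list and count updated
    return {**digest, "articles": matches, "count": len(matches)}
```

To use it in a pipeline, load the fetch output with `json.load(sys.stdin)`, pass it through the function, and write the result back with `json.dump`.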

Ideas

- Add keyword-based filtering (--keywords security,cve,linux)
- Add per-source TTL override in config
- Export digest as HTML or Markdown
- Schedule with cron: INLINECODE121
- Weight articles by source tier for LLM prioritization
- Add OPML import/export for source list management
- Integrate with ntfy or Telegram for real-time alerts on high-priority articles

Combine with

- mail-client : send the digest by email after fetching

CODEBLOCK23

- nextcloud-files : archive the daily digest as a Markdown file

  veille fetch --filter-seen | jq .wrapped_listing -r > /tmp/digest.md
  nextcloud-files upload /tmp/digest.md /Digests/$(date +%Y-%m-%d).md

Troubleshooting

See references/troubleshooting.md for detailed troubleshooting steps.

Common issues:

- No articles returned: check the --hours value and verify the feed URLs in config
- XML parse error on a feed: some feeds use non-standard XML; the skill skips broken items silently
- All articles filtered as seen: run seen-stats to check store size; reset with INLINECODE125
- Import error: ensure you run veille.py from its directory or via its full path
- File output blocked: the path is outside ~/.openclaw/. Add the target directory to config.security.allowed_output_dirs (see File output configuration)
