Agentic Paper Digest
When to use
- Fetch a digest of recent papers from arXiv and Hugging Face.
- Produce JSON output for downstream agents.
- Run a local API server when a polling workflow is needed.
Prereqs
- Python 3 and network access.
- LLM access via `OPENAI_API_KEY`, or an OpenAI-compatible provider via `LITELLM_API_BASE` + `LITELLM_API_KEY`.
- `git` is optional for bootstrap; otherwise curl/wget (or Python) is used to download the repo.
Get the code and install
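- Preferred: run the bootstrap helper script. It uses git when available and falls back to a zip download otherwise.
```bash
bash "{baseDir}/scripts/bootstrap.sh"
```
- Override the clone location by setting `PROJECT_DIR`.
```bash
PROJECT_DIR="$HOME/agentic_paper_digest" bash "{baseDir}/scripts/bootstrap.sh"
```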
Run (CLI preferred)
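```bash
bash "{baseDir}/scripts/run_cli.sh"
```
- Pass through CLI flags as needed.
```bash
bash "{baseDir}/scripts/run_cli.sh" --window-hours 24 --sources arxiv,hf
```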
Run (API optional)
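```bash
bash "{baseDir}/scripts/run_api.sh"
```
- Trigger runs and read results.
```bash
curl -X POST http://127.0.0.1:8000/api/run
curl http://127.0.0.1:8000/api/status
curl http://127.0.0.1:8000/api/papers
```
- Stop the API server if needed.
```bash
bash "{baseDir}/scripts/stop_api.sh"
```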
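For a polling workflow, the trigger-then-poll pattern can be sketched in Python as below. This is a minimal sketch, not the project's client: the response schema of `GET /api/status` is not documented here, so the completion check is a caller-supplied predicate, and the `"state"` field in the commented usage is purely hypothetical. The helper names `http_json` and `poll_until` are also illustrative, and the sketch assumes the endpoints return JSON.

```python
import json
import time
import urllib.request
from typing import Any, Callable

BASE_URL = "http://127.0.0.1:8000"  # address used by the curl examples


def http_json(method: str, path: str, timeout: float = 30.0) -> Any:
    """Call the local API and decode the body (assumes a JSON response)."""
    req = urllib.request.Request(BASE_URL + path, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll_until(
    get_status: Callable[[], Any],
    is_done: Callable[[Any], bool],
    interval_s: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call get_status() until is_done(status) is true or max_polls is reached."""
    for attempt in range(max_polls):
        status = get_status()
        if is_done(status):
            return status
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"status not done after {max_polls} polls")


# Usage against a running server (the "state" field is a hypothetical example):
#   http_json("POST", "/api/run")
#   poll_until(lambda: http_json("GET", "/api/status"),
#              lambda s: s.get("state") != "running")
#   papers = http_json("GET", "/api/papers")
```

Passing `get_status` and `is_done` in as callables keeps the loop independent of the actual status schema, so only the predicate needs adjusting once the real response is known.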
Outputs
- CLI: `--json` prints `run_id`, `seen`, `kept`, `window_start`, and `window_end`.
- Data store: `data/papers.sqlite3` (under `PROJECT_DIR`).
- API: `POST /api/run`, `GET /api/status`, `GET /api/papers`, `GET/POST /api/topics`, `GET/POST /api/settings`.
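A downstream agent might validate the CLI summary before acting on it. The sketch below assumes the `--json` output is a single JSON object on stdout containing the five fields listed above; the function name and the sample values in the usage note are illustrative, not taken from the project.

```python
import json

EXPECTED_KEYS = ("run_id", "seen", "kept", "window_start", "window_end")


def parse_run_summary(stdout_text: str) -> dict:
    """Parse the CLI's JSON summary and check that the documented fields exist."""
    summary = json.loads(stdout_text)
    if not isinstance(summary, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in EXPECTED_KEYS if key not in summary]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return summary
```

For example, `parse_run_summary(output)["kept"]` gives the number of papers kept, and a `ValueError` signals that the output did not have the expected shape.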
Configuration
Config files live in `PROJECT_DIR/config`. Environment variables can be set in the shell or via a `.env` file. The wrapper scripts auto-load `.env` from `PROJECT_DIR` (override with `ENV_FILE=/path/to/.env`).
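The lookup order described above can be illustrated with a small Python sketch. It is not the wrappers' actual implementation (they are shell scripts): the fallback to `$HOME/agentic_paper_digest` when `PROJECT_DIR` is unset and the simplified `KEY=VALUE` parsing are both assumptions made for illustration.

```python
from pathlib import Path
from typing import Mapping


def resolve_env_file(environ: Mapping[str, str]) -> Path:
    """ENV_FILE wins; otherwise use .env under PROJECT_DIR."""
    override = environ.get("ENV_FILE")
    if override:
        return Path(override)
    # Assumed fallback when PROJECT_DIR is not set.
    project_dir = environ.get("PROJECT_DIR") or str(Path.home() / "agentic_paper_digest")
    return Path(project_dir) / ".env"


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Reading the resolved file with `parse_env_text(path.read_text())` shows which variables a run would pick up, which is handy when checking that a key is actually set.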
Environment (.env or exported vars)
- `OPENAI_API_KEY`: required for OpenAI models (litellm reads this).
- `LITELLM_API_BASE`, `LITELLM_API_KEY`: use an OpenAI-compatible proxy/provider.
- `LITELLM_MODEL_RELEVANCE`, `LITELLM_MODEL_SUMMARY`: models for relevance and summarization (the summary model defaults to the relevance model if unset).
- `LITELLM_TEMPERATURE_RELEVANCE`, `LITELLM_TEMPERATURE_SUMMARY`: lower values give more deterministic output.
- `LITELLM_MAX_RETRIES`: retry count for LLM calls.
- `LITELLM_DROP_PARAMS=1`: drop unsupported params to avoid provider errors.
- `WINDOW_HOURS`, `APP_TZ`: recency window and timezone.
- `ARXIV_CATEGORIES`: comma-separated categories (default includes `cs.CL,cs.AI,cs.LG,stat.ML,cs.CR`).
- `ARXIV_API_BASE`, `HF_API_BASE`: override source endpoints if needed.
- `ARXIV_MAX_RESULTS`, `ARXIV_PAGE_SIZE`: arXiv paging limits.
- `MAX_CANDIDATES_PER_SOURCE`: cap candidates per source before LLM filtering.
- `FETCH_TIMEOUT_S`, `REQUEST_TIMEOUT_S`: source fetch and per-request timeouts.
- `ENABLE_PDF_TEXT=1`: include first-page PDF text in summaries; requires PyMuPDF (`pip install pymupdf`).
- `DATA_DIR`: location for `papers.sqlite3`.
- `CORS_ORIGINS`: comma-separated origins allowed by the API server (UI use).
- Path overrides: `TOPICS_PATH`, `SETTINGS_PATH`, `AFFILIATION_BOOSTS_PATH`.
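Two of the rules above, the summary model falling back to the relevance model and `ARXIV_CATEGORIES` being comma-separated, can be sketched as follows. The function names are hypothetical, and the sketch deliberately leaves out the project's own built-in defaults, which are not fully listed here.

```python
from typing import Mapping, Optional, Tuple


def resolve_models(environ: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (relevance_model, summary_model); summary falls back to relevance."""
    relevance = environ.get("LITELLM_MODEL_RELEVANCE") or None
    summary = environ.get("LITELLM_MODEL_SUMMARY") or relevance
    return relevance, summary


def split_categories(raw: str) -> list:
    """Split a comma-separated ARXIV_CATEGORIES value, dropping empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```

With only `LITELLM_MODEL_RELEVANCE` set, both values come back the same, which matches the fallback described in the list.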
Config files
- `config/topics.json`: list of topics with `id`, `label`, `description`, `max_per_topic`, and `keywords`. The relevance classifier must output topic IDs exactly as defined here. `max_per_topic` also caps results in `GET /api/papers` when `apply_topic_caps=1`.
- `config/settings.json`: overrides fetch limits (`arxiv_max_results`, `arxiv_page_size`, `fetch_timeout_s`, `max_candidates_per_source`). Updated via `POST /api/settings`.
- `config/affiliations.json`: list of `{pattern, weight}` boosts applied by substring match over affiliations. Weights add up and are capped at 1.0. Invalid JSON disables boosts, so keep the file strict JSON (no trailing commas).
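The affiliation boost rule (substring match, additive weights, cap at 1.0, invalid JSON disables boosts) can be sketched like this. It is an illustration under assumptions, not the project's code: matching is assumed to be case-insensitive, each pattern is assumed to count at most once per paper, and the file is assumed to be a top-level JSON list.

```python
import json
from typing import List, Tuple


def load_boosts(text: str) -> List[Tuple[str, float]]:
    """Parse affiliations.json content; invalid JSON yields no boosts."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # e.g. a trailing comma disables boosts entirely
    return [(item["pattern"], float(item["weight"])) for item in data]


def affiliation_boost(affiliations: List[str], boosts: List[Tuple[str, float]]) -> float:
    """Sum the weights of all matching patterns, capped at 1.0."""
    total = 0.0
    for pattern, weight in boosts:
        # Assumption: case-insensitive substring match, counted once per pattern.
        if any(pattern.lower() in aff.lower() for aff in affiliations):
            total += weight
    return min(total, 1.0)
```

Because weights accumulate, several modest boosts can still saturate at 1.0 for a paper with many matching affiliations, so small individual weights are the safer choice.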
Mandatory workflow (follow step-by-step)
1. You MUST first open and read the configuration from the GitHub repo you downloaded (https://github.com/matanle51/agenticpaperdigest):
   - Load `config/topics.json`, `config/settings.json`, and `config/affiliations.json` (if present).
   - Note the current topic IDs, caps, and fetch limits before asking the user to change them.
2. Ask the user for their preferences on the following, and help them decide:
   - Topics of interest → update `config/topics.json` (`topics[].id/label/description/keywords`, `max_per_topic`). Show the current defaults and ask whether to keep or change them.
   - Time window (hours) → set `WINDOW_HOURS` (or pass `--window-hours` to the CLI) only if the user cares; otherwise keep the default of 24h.
   - Fetch parameters → ask the user to set `ARXIV_CATEGORIES`, `ARXIV_MAX_RESULTS`, `ARXIV_PAGE_SIZE`, and `MAX_CANDIDATES_PER_SOURCE`, and explain what each one is for. Show the current values and ask whether to keep the defaults.
   - Model/provider → set `OPENAI_API_KEY` or `LITELLM_API_KEY` (+ `LITELLM_API_BASE` if using a proxy), and set `LITELLM_MODEL_RELEVANCE`/`LITELLM_MODEL_SUMMARY`.
   - Do NOT ask by default: timezone, quality vs cost, timeouts, PDF text, affiliation biasing, sources list. Use defaults unless the user requests changes.
3. Confirm the workspace path: ask where to clone/run. Default to `PROJECT_DIR="$HOME/agentic_paper_digest"` if the user doesn't care. Never hardcode `/Users/...` paths.
4. Bootstrap the repo: run the bootstrap script (unless the repo already exists and the user says to skip).
5. Create or verify `.env`:
   - If `.env` is missing, create it from `.env.example` (in the repo), then ask the user to fill in keys and any requested preferences.
   - Ensure at least one of `OPENAI_API_KEY` or `LITELLM_API_KEY` is set before running.
6. Apply config changes:
   - Edit the JSON files directly (or use `POST /api/topics` and `POST /api/settings` if the API is running).
7. Run the pipeline:
   - Prefer `scripts/run_cli.sh` for one-off JSON output.
   - Use `scripts/run_api.sh` only if the user explicitly asks for UI/API access or polling.
8. Report results:
   - If results are sparse, suggest increasing `WINDOW_HOURS` or `ARXIV_MAX_RESULTS`, or broadening topics.
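Step 1 of the workflow (read the config and note topic IDs, caps, and fetch limits) could be sketched as below. The exact layout of `topics.json` is an assumption: the sketch accepts either a top-level list of topics or an object with a `topics` array. The function names are illustrative.

```python
import json
from pathlib import Path
from typing import Any, Optional

LIMIT_KEYS = ("arxiv_max_results", "arxiv_page_size",
              "fetch_timeout_s", "max_candidates_per_source")


def load_json_if_present(path: Path) -> Optional[Any]:
    """Return parsed JSON, or None when the file does not exist."""
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize_config(config_dir: Path) -> dict:
    """Collect topic caps and fetch limits to show the user before changing them."""
    topics_doc = load_json_if_present(config_dir / "topics.json") or []
    settings = load_json_if_present(config_dir / "settings.json") or {}
    affiliations = load_json_if_present(config_dir / "affiliations.json")
    # Assumption: topics are either a top-level list or under a "topics" key.
    topics = topics_doc.get("topics", []) if isinstance(topics_doc, dict) else topics_doc
    return {
        "topic_caps": {t["id"]: t.get("max_per_topic") for t in topics},
        "fetch_limits": {key: settings.get(key) for key in LIMIT_KEYS},
        "has_affiliations": affiliations is not None,
    }
```

Calling `summarize_config(Path(project_dir) / "config")` yields a compact summary that can be shown to the user when asking whether to keep or change the defaults.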
Getting good results
- Help the user define topics and keep them focused and mutually exclusive so the classifier can choose the right IDs.
- Use a stronger model for summaries than for relevance if quality matters.
- If using an OpenAI model, default to gpt-5-mini for a good tradeoff.
- Increase `WINDOW_HOURS` or `ARXIV_MAX_RESULTS` when results are sparse, or lower them if results are too noisy.
- Tune `ARXIV_CATEGORIES` to your research domains.
- Enable PDF text (`ENABLE_PDF_TEXT=1`) when abstracts are too thin.
- Use modest affiliation weights to bias ranking without swamping relevance.
- Be proactive and help the user tune the skill for good results!
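One rough way to check that topics are mutually exclusive is to look for keywords shared between topics. The sketch below is a heuristic added for illustration, not something the project provides; it assumes each topic is a dict with `id` and `keywords` as described for `config/topics.json`, and the topic IDs in the usage note are made up.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def keyword_overlaps(topics: List[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Map each pair of topic IDs to the keywords they share (case-insensitive)."""
    overlaps = {}
    for first, second in combinations(topics, 2):
        shared = ({k.lower() for k in first.get("keywords", [])}
                  & {k.lower() for k in second.get("keywords", [])})
        if shared:
            overlaps[(first["id"], second["id"])] = sorted(shared)
    return overlaps
```

An empty result does not prove the topics are disjoint, but a non-empty one, such as `{("agents", "rl"): ["planning"]}`, points at descriptions worth tightening.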
Troubleshooting
- Port 8000 busy: run `bash "{baseDir}/scripts/stop_api.sh"` or pass `--port` to the API command.
- Empty results: increase `WINDOW_HOURS` or verify the API key in `.env`.
- Missing API key errors: export `OPENAI_API_KEY` or `LITELLM_API_KEY` in the shell before running.
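A pre-flight check for the missing-key case might look like the following sketch. The function name and message text are illustrative; the only rule taken from this document is that at least one of the two keys must be set.

```python
from typing import Mapping, Optional


def missing_key_message(environ: Mapping[str, str]) -> Optional[str]:
    """Return an error message when neither API key is set, else None."""
    has_openai = bool(environ.get("OPENAI_API_KEY", "").strip())
    has_litellm = bool(environ.get("LITELLM_API_KEY", "").strip())
    if has_openai or has_litellm:
        return None
    return "Set OPENAI_API_KEY or LITELLM_API_KEY (in the shell or in .env) before running."
```

Running `missing_key_message(os.environ)` before launching the CLI turns a provider error deep in the run into an immediate, readable message.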