Paper Ingest Normalizer
Convert raw literature inputs into standardized records safe for project memory, paper databases, and downstream synthesis pipelines.
Input
One of the following is required:
- pdf_path: local path to PDF file
- url: link to paper/article
- raw_text: extracted or pasted text
- metadata_blob: existing metadata dict
Plus:
- project_id: required for any writeback
- source_type: one of pdf, doi, url, text, metadata
- optional tags: list of strings for categorization
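The input contract above can be checked before any parsing starts. The sketch below is one possible shape, not the skill's actual implementation: IngestRequest, validate_request, and the tags attribute (standing in for the "optional tags" input) are hypothetical names, and "one of the following" is read here as "at least one".

```python
from dataclasses import dataclass, field
from typing import List, Optional

SOURCE_FIELDS = ("pdf_path", "url", "raw_text", "metadata_blob")
SOURCE_TYPES = {"pdf", "doi", "url", "text", "metadata"}


@dataclass
class IngestRequest:
    """Hypothetical container for the inputs listed above."""
    pdf_path: Optional[str] = None
    url: Optional[str] = None
    raw_text: Optional[str] = None
    metadata_blob: Optional[dict] = None
    project_id: Optional[str] = None
    source_type: Optional[str] = None
    tags: List[str] = field(default_factory=list)  # assumed name for "optional tags"


def validate_request(req: IngestRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is usable."""
    problems = []
    # Assumption: "one of the following" means at least one source is populated.
    if not any(getattr(req, name) for name in SOURCE_FIELDS):
        problems.append("one of pdf_path, url, raw_text, metadata_blob is required")
    if req.source_type is not None and req.source_type not in SOURCE_TYPES:
        problems.append(f"unknown source_type: {req.source_type!r}")
    if not req.project_id:
        problems.append("project_id is required for any writeback")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller ask for everything that is missing in a single round trip.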
Output Schema
Return a structured object:
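title: string
authors: string[] | null
year: number | null
source: string # journal, conference, preprint, etc.
doi_or_url: string | null
project_id: string
paper_type: string # experimental, theoretical, review, etc.
material_system: string | null # e.g. perovskite solar cell, graphene FET
device_type: string | null # e.g. FTO/glass, flexible substrate
key_variables: string[] | null # independent variables studied
key_metrics: string[] | null # what was measured (PCE, mobility, etc.)
core_findings: string # 2-3 sentence neutral summary
claimed_mechanism: string | null
limitations: string | null
normalized_summary: string # 1-2 paragraph structured summary
uncertain_fields: string[] | null # fields that could not be verified
writeback_ready: boolean # true only when key identity fields are present
writeback_payload: object # record to write into project memory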
Rules
1. Never write into project memory without project_id. Ask if not provided.
2. Separate direct observations from claimed interpretations. Mark inference vs. direct extraction.
3. Preserve uncertainty. Use null for missing fields and list them in uncertain_fields.
4. Do not invent missing bibliographic fields. Don't hallucinate authors, year, etc.
5. Do not over-claim. Keep core_findings and normalized_summary grounded in what the text actually says.
6. Never conflate abstract with findings. The abstract states intentions; findings are what the data supports.
7. If writeback_ready = false, list explicitly which fields are missing and why.
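The rules on project_id and on explaining a blocked writeback could be combined into one check that returns human-readable reasons. This is a minimal sketch under stated assumptions: writeback_blockers is a hypothetical helper, the record is a plain dict keyed by the output field names, and the reason strings are illustrative wording only.

```python
from typing import Any, Dict, List


def writeback_blockers(record: Dict[str, Any]) -> List[str]:
    """List every reason the record should not be written to project memory yet."""
    reasons = []
    # Never write without a project_id; the caller should ask for it.
    if not record.get("project_id"):
        reasons.append("project_id: not provided; ask before writing")
    # Each uncertain field is reported by name so nothing is silently dropped.
    for name in record.get("uncertain_fields") or []:
        reasons.append(f"{name}: could not be determined from the source")
    return reasons


if __name__ == "__main__":
    partial = {"title": "Example title", "year": None, "uncertain_fields": ["year"]}
    for reason in writeback_blockers(partial):
        print(reason)
```

An empty list means nothing in these two rules blocks the write; it does not check the grounding rules, which need a human or model judgment.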
PDF Extraction
For PDFs, use the summarize skill or pdfplumber/PyMuPDF to extract text before processing.
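As a rough illustration of the pdfplumber route, the sketch below reads every page and joins the non-empty results. It assumes pdfplumber is installed; extract_pdf_text and join_page_texts are hypothetical helper names, while pdfplumber.open, pdf.pages, and page.extract_text are the library's own calls.

```python
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages where extraction returned nothing."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(pdf_path: str) -> str:
    """Extract text from every page of a PDF using pdfplumber."""
    # Third-party dependency, imported lazily so join_page_texts works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is fully consumed before the file is closed.
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

Scanned PDFs with no text layer will produce an empty string here; in that case the extracted text should be treated as missing rather than guessed.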
Workflow
1. Identify source type: determine which input field is populated
2. Extract raw content: PDF text, URL content, or use provided raw text
3. Parse bibliographic fields: title, authors, year, source, DOI
4. Identify research content: material system, device type, variables, metrics
5. Distill findings: separate what was measured from what was claimed
6. Assemble writeback_payload: structured record matching the schema above
7. Assess completeness: set writeback_ready based on presence of key identity fields
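The first workflow step, identifying the source type from whichever input is populated, might look like the sketch below. The precedence order (pdf_path, then url, then raw_text, then metadata_blob) and the DOI pattern are assumptions made for illustration; infer_source_type is a hypothetical name.

```python
import re
from typing import Optional

# Assumed heuristic: a bare DOI or a doi.org link counts as source type "doi".
DOI_PATTERN = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/\S+$", re.IGNORECASE
)


def infer_source_type(
    pdf_path: Optional[str] = None,
    url: Optional[str] = None,
    raw_text: Optional[str] = None,
    metadata_blob: Optional[dict] = None,
) -> Optional[str]:
    """Return pdf, doi, url, text, or metadata, or None if nothing is populated."""
    if pdf_path:
        return "pdf"
    if url:
        return "doi" if DOI_PATTERN.match(url.strip()) else "url"
    if raw_text:
        return "text"
    if metadata_blob:
        return "metadata"
    return None
```

If the caller supplies source_type explicitly, that value would normally take priority over this inference.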
Failure Handling
If parsing is incomplete:
- Return partial structured output with all successfully extracted fields
- Populate uncertain_fields with the list of fields that could not be determined
- Set writeback_ready = false when title, authors, or year are missing
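The failure-handling steps above can be sketched as a single pass over a partially parsed record. This is an illustration under stated assumptions, not the skill's code: assess_completeness is a hypothetical name, the record is a plain dict, and an empty string or empty list is treated the same as a missing value.

```python
from typing import Any, Dict

KEY_IDENTITY_FIELDS = ("title", "authors", "year")


def assess_completeness(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of record with uncertain_fields and writeback_ready set."""
    result = dict(record)  # keep every field that was extracted successfully
    missing = [name for name in KEY_IDENTITY_FIELDS if not result.get(name)]
    uncertain = list(result.get("uncertain_fields") or [])
    for name in missing:
        result[name] = None  # missing values are null, never invented
        if name not in uncertain:
            uncertain.append(name)
    result["uncertain_fields"] = uncertain or None
    result["writeback_ready"] = not missing
    return result
```

This only covers the three identity fields named above; the project_id requirement is a separate check and is deliberately not folded in here.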
Cross-Reference
For synthesis after normalization, see the research skill for paper synthesis workflows.
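Paper Ingest Normalizer
Convert raw literature inputs into standardized records suitable for project memory, paper databases, and downstream synthesis pipelines.
Input
One of the following is required:
- pdf_path: local path to the PDF file
- url: link to the paper/article
- raw_text: extracted or pasted text
- metadata_blob: existing metadata dict
In addition:
- project_id: required for any writeback
- source_type: one of pdf, doi, url, text, metadata
- optional tags: list of strings for categorization
Output Schema
Return a structured object: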
title: string
authors: string[] | null
year: number | null
source: string # journal, conference, preprint, etc.
doi_or_url: string | null
project_id: string
paper_type: string # experimental, theoretical, review, etc.
material_system: string | null # e.g. perovskite solar cell, graphene FET
device_type: string | null # e.g. FTO/glass, flexible substrate
key_variables: string[] | null # independent variables studied
key_metrics: string[] | null # what was measured (PCE, mobility, etc.)
core_findings: string # 2-3 sentence neutral summary
claimed_mechanism: string | null
limitations: string | null
normalized_summary: string # 1-2 paragraph structured summary
uncertain_fields: string[] | null # fields that could not be verified
writeback_ready: boolean # true only when key identity fields are present
writeback_payload: object # record to write into project memory
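Rules
1. Never write into project memory without project_id. Ask if not provided.
2. Separate direct observations from claimed interpretations. Mark inference vs. direct extraction.
3. Preserve uncertainty. Use null for missing fields and list them in uncertain_fields.
4. Do not invent missing bibliographic fields. Do not make up authors, year, etc.
5. Do not over-claim. Keep core_findings and normalized_summary grounded in what the text actually says.
6. Never conflate abstract with findings. The abstract states intentions; findings are what the data supports.
7. If writeback_ready = false, list explicitly which fields are missing and why.
PDF Extraction
For PDFs, use the summarize skill or pdfplumber/PyMuPDF to extract text before processing.
Workflow
1. Identify source type: determine which input field is populated
2. Extract raw content: PDF text, URL content, or use provided raw text
3. Parse bibliographic fields: title, authors, year, source, DOI
4. Identify research content: material system, device type, variables, metrics
5. Distill findings: separate what was measured from what was claimed
6. Assemble writeback_payload: structured record matching the schema above
7. Assess completeness: set writeback_ready based on presence of key identity fields
Failure Handling
If parsing is incomplete:
- Return partial structured output with all successfully extracted fields
- Populate uncertain_fields with the list of fields that could not be determined
- Set writeback_ready = false when title, authors, or year are missing
Cross-Reference
For synthesis after normalization, see the research skill for paper synthesis workflows.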