Paper Ingest Normalizer

Convert raw literature inputs into standardized records safe for project memory, paper databases, and downstream synthesis pipelines.

Input

One of the following is required:

- pdf_path — local path to PDF file
- url — link to paper/article
- raw_text — extracted or pasted text
- metadata_blob — existing metadata dict

Plus:

- project_id — required for any writeback
- source_type — one of: pdf, doi, url, text, metadata
- optional tags — list of strings for categorization
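A minimal sketch of how these input requirements might be checked before any processing. The function name `validate_inputs`, the `writeback` flag, and the reading of "one of" as "at least one" are assumptions made for illustration, not part of the skill definition.

```python
from typing import Any

# Content inputs named in the Input section; at least one must be populated.
CONTENT_FIELDS = ("pdf_path", "url", "raw_text", "metadata_blob")


def validate_inputs(inputs: dict[str, Any], writeback: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the inputs are usable."""
    problems = []
    # None and empty strings/dicts both count as "not provided".
    if not any(inputs.get(name) for name in CONTENT_FIELDS):
        problems.append("one of " + ", ".join(CONTENT_FIELDS) + " is required")
    # project_id is only mandatory when the result will be written back.
    if writeback and not inputs.get("project_id"):
        problems.append("project_id is required for writeback")
    return problems
```

Returning a list of problems rather than raising makes it easy to ask the user for everything that is missing in one message.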

Output Schema

Return a structured object:

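title: string
authors: string[] | null
year: number | null
source: string # journal, conference, preprint, etc.
doi_or_url: string | null
project_id: string
paper_type: string # experimental, theoretical, review, etc.
material_system: string | null # e.g. perovskite solar cell, graphene FET
device_type: string | null # e.g. FTO/glass, flexible substrate
key_variables: string[] | null # independent variables studied
key_metrics: string[] | null # measured outcomes (PCE, mobility, etc.)
core_findings: string # 2-3 sentence neutral summary
claimed_mechanism: string | null
limitations: string | null
normalized_summary: string # 1-2 paragraph structured summary
uncertain_fields: string[] | null # fields that could not be verified
writeback_ready: boolean # true only when key identity fields are present
writeback_payload: object # record to write into project memory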
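One way to hold this structure in Python is a dataclass. This is a sketch only: the class name `PaperRecord` and the `to_dict` helper are hypothetical, and every field defaults to None (an assumption) so that partially parsed papers can still be represented.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class PaperRecord:
    # Bibliographic identity
    title: Optional[str] = None
    authors: Optional[list[str]] = None
    year: Optional[int] = None
    source: Optional[str] = None
    doi_or_url: Optional[str] = None
    project_id: Optional[str] = None
    # Research content
    paper_type: Optional[str] = None
    material_system: Optional[str] = None
    device_type: Optional[str] = None
    key_variables: Optional[list[str]] = None
    key_metrics: Optional[list[str]] = None
    # Findings and interpretation
    core_findings: Optional[str] = None
    claimed_mechanism: Optional[str] = None
    limitations: Optional[str] = None
    normalized_summary: Optional[str] = None
    # Bookkeeping
    uncertain_fields: list[str] = field(default_factory=list)
    writeback_ready: bool = False
    writeback_payload: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        """Return the record as a plain dict, e.g. for JSON serialization."""
        return asdict(self)
```

Defaulting `writeback_ready` to False means a record is never treated as safe to write until it has been checked explicitly.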

Rules

1. Never write into project memory without project_id. Ask if not provided.
2. Separate direct observations from claimed interpretations. Mark inference vs. direct extraction.
3. Preserve uncertainty. Use null for missing fields; list in uncertain_fields.
4. Do not invent missing bibliographic fields. Don't hallucinate authors, year, etc.
5. Do not over-claim. Keep core_findings and normalized_summary grounded in what the text actually says.
6. Never conflate abstract with findings. The abstract states intentions; findings are what the data supports.
7. If writeback_ready = false, list explicitly which fields are missing and why.

PDF Extraction

For PDFs, use the summarize skill or pdfplumber/PyMuPDF to extract text before processing.
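As an illustration of the pdfplumber route, the sketch below reads every page and joins the text. The helper names `join_pages` and `extract_pdf_text` are hypothetical; scanned PDFs without a text layer would yield an empty string and need OCR, which is not covered here.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages with no extractable text."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(pdf_path: str) -> str:
    """Extract text from every page of a PDF using pdfplumber."""
    import pdfplumber  # imported lazily so join_pages works without it installed

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None or "" for pages without a text layer.
        return join_pages(page.extract_text() for page in pdf.pages)
```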

Workflow

1. Identify source type — determine which input field is populated
2. Extract raw content — PDF text, URL content, or use provided raw text
3. Parse bibliographic fields — title, authors, year, source, DOI
4. Identify research content — material system, device type, variables, metrics
5. Distill findings — separate what was measured from what was claimed
6. Assemble writeback_payload — structured record matching the schema above
7. Assess completeness — set writeback_ready based on presence of key identity fields (title, authors, year)
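Step 1 of the workflow can be sketched as a small dispatch function. The priority order (PDF first, then URL, raw text, metadata) and the DOI pattern are assumptions chosen for illustration; the DOI in the usage note is a placeholder.

```python
import re
from typing import Any, Optional

# Assumed heuristic: a bare DOI ("10.xxxx/...") or a doi.org link counts as "doi".
DOI_PATTERN = re.compile(
    r"^(https?://(dx\.)?doi\.org/)?10\.\d{4,9}/\S+$", re.IGNORECASE
)


def identify_source_type(inputs: dict[str, Any]) -> Optional[str]:
    """Return pdf, doi, url, text, or metadata; None if no input is populated."""
    if inputs.get("pdf_path"):
        return "pdf"
    url = inputs.get("url")
    if url:
        return "doi" if DOI_PATTERN.match(url.strip()) else "url"
    if inputs.get("raw_text"):
        return "text"
    if inputs.get("metadata_blob"):
        return "metadata"
    return None
```

For example, `identify_source_type({"url": "https://doi.org/10.1000/xyz123"})` yields `"doi"`, while an ordinary article link yields `"url"`.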

Failure Handling

If parsing is incomplete:

- Return partial structured output with all successfully extracted fields
- Populate uncertain_fields with the list of fields that could not be determined
- Set writeback_ready = false when title, authors, or year is missing
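The failure-handling behavior above might look like the following sketch. The function name `assess_completeness` is hypothetical, and folding the project_id requirement from the Rules into writeback_ready is an assumption.

```python
from typing import Any

# Identity fields whose absence blocks writeback.
IDENTITY_FIELDS = ("title", "authors", "year")


def assess_completeness(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of record with uncertain_fields and writeback_ready set."""
    result = dict(record)
    # A field counts as missing when it is absent, None, or empty.
    missing = [name for name in IDENTITY_FIELDS if not result.get(name)]
    for name in missing:
        result[name] = None  # preserve uncertainty instead of guessing a value
    uncertain = list(result.get("uncertain_fields") or [])
    uncertain += [name for name in missing if name not in uncertain]
    result["uncertain_fields"] = uncertain
    # Assumption: writeback also requires a project_id, per the Rules section.
    result["writeback_ready"] = not missing and bool(result.get("project_id"))
    return result
```

The partial record is still returned in full, so everything that was extracted successfully remains available to the caller.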

Cross-Reference

For synthesis after normalization, see the research skill for paper synthesis workflows.

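Paper Ingest Normalizer

Convert raw literature inputs into standardized records suitable for project memory, paper databases, and downstream synthesis pipelines.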

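Input

One of the following must be provided:

- pdf_path — local path to the PDF file
- url — link to paper/article
- raw_text — extracted or pasted text
- metadata_blob — existing metadata dict

In addition:

- project_id — required for any writeback
- source_type — one of: pdf, doi, url, text, metadata
- optional tags — list of strings for categorization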

Output Schema

Return a structured object:

title: string
authors: string[] | null
year: number | null
source: string # journal, conference, preprint, etc.
doi_or_url: string | null
project_id: string
paper_type: string # experimental, theoretical, review, etc.
material_system: string | null # e.g. perovskite solar cell, graphene FET
device_type: string | null # e.g. FTO/glass, flexible substrate
key_variables: string[] | null # independent variables studied
key_metrics: string[] | null # measured outcomes (PCE, mobility, etc.)
core_findings: string # 2-3 sentence neutral summary
claimed_mechanism: string | null
limitations: string | null
normalized_summary: string # 1-2 paragraph structured summary
uncertain_fields: string[] | null # fields that could not be verified
writeback_ready: boolean # true only when key identity fields are present
writeback_payload: object # record to write into project memory

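Rules

1. Never write into project memory without project_id. Ask if not provided.
2. Separate direct observations from claimed interpretations. Mark inference vs. direct extraction.
3. Preserve uncertainty. Use null for missing fields; list them in uncertain_fields.
4. Do not invent missing bibliographic fields. Do not make up authors, year, etc.
5. Do not over-claim. Keep core_findings and normalized_summary grounded in what the text actually says.
6. Never conflate abstract with findings. The abstract states intentions; findings are what the data supports.
7. If writeback_ready = false, list explicitly which fields are missing and why.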

PDF Extraction

For PDF files, use the summarize skill or pdfplumber/PyMuPDF to extract text before processing.

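Workflow

1. Identify source type — determine which input field is populated
2. Extract raw content — PDF text, URL content, or use provided raw text
3. Parse bibliographic fields — title, authors, year, source, DOI
4. Identify research content — material system, device type, variables, metrics
5. Distill findings — separate what was measured from what was claimed
6. Assemble writeback_payload — structured record matching the schema above
7. Assess completeness — set writeback_ready based on presence of key identity fields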

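Failure Handling

If parsing is incomplete:

- Return partial structured output with all successfully extracted fields
- Populate uncertain_fields with the list of fields that could not be determined
- Set writeback_ready = false when title, authors, or year is missing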

Cross-Reference

For synthesis after normalization, see the paper synthesis workflows in the research skill.
