Paper Cluster Survey V2.2

Overview

Turn raw paper URLs and PDFs into usable review inputs. Extract structured metadata and text evidence first, then classify the papers, produce a classification table, and write a review that follows common academic survey conventions instead of a rigid fill-in-the-blanks template.

Workflow

1. Normalize the source set

- Accept multiple local PDF paths, arXiv URLs, DOI URLs, and general paper URLs.
- Use `scripts/normalize-sources.mjs` when the source set is mixed or should be stored as a reusable manifest.
- Preserve the original source string for traceability.
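
The normalization step can be pictured with a minimal sketch. This is not the bundled `scripts/normalize-sources.mjs`; the function names (`normalizeSource`, `buildManifest`), the `kind` labels, and the manifest shape are assumptions made for illustration only.

```javascript
// Illustrative only: the bundled scripts/normalize-sources.mjs may classify
// sources differently. All names and the manifest shape here are assumptions.
function normalizeSource(raw) {
  const original = String(raw); // kept untouched for traceability
  const value = original.trim();

  // arXiv abstract or PDF links, e.g. .../abs/<id> or .../pdf/<id>.pdf
  const arxiv = value.match(
    /^https?:\/\/(?:www\.)?arxiv\.org\/(?:abs|pdf)\/([^?#\s]+?)(?:\.pdf)?(?:[?#].*)?$/i
  );
  if (arxiv) return { original, kind: "arxiv", id: arxiv[1] };

  // DOI resolver links; DOIs start with the "10." prefix
  const doi = value.match(/^https?:\/\/(?:dx\.)?doi\.org\/(10\.[^?#\s]+)/i);
  if (doi) return { original, kind: "doi", id: doi[1] };

  // Any other http(s) link is treated as a general paper URL
  if (/^https?:\/\//i.test(value)) return { original, kind: "url", id: null };

  // Non-URL strings ending in .pdf are treated as local PDF paths
  if (/\.pdf$/i.test(value)) return { original, kind: "local_pdf", id: null };

  return { original, kind: "unknown", id: null };
}

function buildManifest(sources) {
  return { sources: sources.map(normalizeSource) };
}
```

Each entry keeps the untrimmed `original` string next to the derived `kind`, so a later step can always trace a record back to exactly what the user supplied.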

2. Extract paper records before reasoning

- Use `scripts/extract-paper-records.mjs` to turn PDFs and URLs into structured records before classification.
- The extraction pass should gather as much of the following as possible:
  - title
  - authors
  - year
  - venue
  - abstract
  - task
  - method
  - datasets
  - metrics
  - main_contribution
  - limitations
  - source
  - extraction_notes
- Treat extracted records as the primary context for classification and survey drafting.
- If important fields are missing, fall back to direct source reading only for the specific missing details.
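
As a rough sketch of the "fall back only for missing details" rule, the check below lists which record fields are empty and which of those would justify re-opening the source. The field names come from the list above; the helper names and the default choice of "important" fields (`abstract`, `task`, `method`) are assumptions, not part of the bundled scripts.

```javascript
// Field names mirror the extraction list; everything else is hypothetical.
const RECORD_FIELDS = [
  "title", "authors", "year", "venue", "abstract", "task", "method",
  "datasets", "metrics", "main_contribution", "limitations", "source",
  "extraction_notes",
];

// A field counts as missing when it is absent, null, blank, or an empty array.
function missingFields(record) {
  return RECORD_FIELDS.filter((field) => {
    const value = record[field];
    if (value === undefined || value === null) return true;
    if (typeof value === "string") return value.trim() === "";
    if (Array.isArray(value)) return value.length === 0;
    return false;
  });
}

// Only the important gaps trigger a direct read of the raw source.
// The default "important" set is an assumption for this sketch.
function fieldsToReread(record, important = ["abstract", "task", "method"]) {
  return missingFields(record).filter((field) => important.includes(field));
}
```

An empty result from `fieldsToReread` means the extracted record can be used as-is, even if less critical fields such as `venue` are still blank.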

Read extraction-pipeline.md when deciding how much to trust the extracted fields and when to re-open the raw source.

3. Verify evidence quality

- Do not classify from titles alone when abstract or body text is available.
- Prefer the abstract, introduction, and method sections as evidence.
- Mark uncertain inferences explicitly.
- If the extractor had to fall back to weak methods, keep claims conservative.
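
One way to make these checks concrete is a small evidence assessment, sketched below. The input shape (an object with optional `abstract`, `introduction`, and `method` text plus a `weakFallback` flag) is an assumption for illustration; the real extractor output may record this differently.

```javascript
// Hypothetical evidence check: reports which preferred sections were
// available and whether the resulting classification should be marked uncertain.
function assessEvidence(evidence) {
  const preferred = ["abstract", "introduction", "method"];
  const used = preferred.filter(
    (key) => typeof evidence[key] === "string" && evidence[key].trim() !== ""
  );
  const titleOnly = used.length === 0;
  return {
    // Fall back to the title only when nothing better is available
    used: titleOnly ? ["title"] : used,
    // Title-only evidence or a weak extraction fallback calls for conservative claims
    uncertain: titleOnly || evidence.weakFallback === true,
  };
}
```

The `uncertain` flag can then be carried into the classification table so that weakly supported assignments stay visibly marked.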

4. Design the classification scheme

- Produce a classification scheme before writing the review.
- Prefer task-based categories first.
- If tasks are too similar, classify by method family.
- Use application-domain categories only when they best explain the corpus.
- Keep the taxonomy shallow unless the corpus is large.
- Assign one primary category to each paper unless the user explicitly wants multi-label grouping.
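
The task-first, method-second preference can be sketched as follows. Deciding that tasks are "too similar" is a judgment call; this sketch substitutes a crude proxy (fewer than two distinct `task` values) purely as an assumption, and the function name is hypothetical.

```javascript
// Hypothetical grouping: prefer task-based categories, fall back to method
// family when the corpus has fewer than two distinct tasks (a crude proxy).
function groupByPrimaryCategory(records) {
  const distinctTasks = new Set(
    records.map((record) => record.task).filter(Boolean)
  ).size;
  const basis = distinctTasks >= 2 ? "task" : "method";

  const groups = {};
  for (const record of records) {
    // Exactly one primary category per paper; a shallow, single-level taxonomy
    const category = record[basis] || "uncategorized";
    if (!groups[category]) groups[category] = [];
    groups[category].push(record.title);
  }
  return { basis, groups };
}
```

Papers lacking the chosen field land in an explicit `uncategorized` bucket rather than being assigned a guessed category.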

Read taxonomy-guidelines.md when the category design is ambiguous.

5. Output the classification table

- Always provide one classification table before the review.
- Include:
  - paper
  - year
  - category
  - rationale
  - evidence used
- Keep rationales brief and evidence-based.
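
A minimal sketch of rendering the table is shown below. The column names follow the list above; the pipe-table layout, the `renderClassificationTable` name, and the use of `unknown` for absent values are assumptions for illustration.

```javascript
// Hypothetical renderer: one row per paper, columns as listed above.
function renderClassificationTable(rows) {
  const columns = ["paper", "year", "category", "rationale", "evidence used"];
  // Missing values are shown as "unknown" rather than guessed;
  // literal pipes are escaped so they do not break the table.
  const cell = (value) => String(value ?? "unknown").replace(/\|/g, "\\|");
  const header = `| ${columns.join(" | ")} |`;
  const divider = `| ${columns.map(() => "---").join(" | ")} |`;
  const body = rows.map(
    (row) => `| ${columns.map((column) => cell(row[column])).join(" | ")} |`
  );
  return [header, divider, ...body].join("\n");
}
```

Keeping `evidence used` as its own column makes it easy to spot rows that were classified from weak evidence.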

6. Decide the review shape

Default rule:

- Write one integrated literature review for the entire corpus after the classification table.

Exception:

- If the user explicitly asks for "each category write a survey", "分别写综述", "per-category review", or equivalent, write separate review sections for each category.
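
The default-versus-exception rule can be sketched as a simple phrase check. The three trigger phrases are the ones quoted above; the function name is hypothetical, and literal matching like this does not cover the "or equivalent" case, which still needs judgment.

```javascript
// Phrases taken from the rule above; literal matching is a simplification.
const PER_CATEGORY_PHRASES = [
  "each category write a survey",
  "分别写综述",
  "per-category review",
];

// Returns "per-category" only on an explicit request, otherwise the default.
function reviewShape(userRequest) {
  const text = String(userRequest).toLowerCase();
  const explicit = PER_CATEGORY_PHRASES.some((phrase) =>
    text.includes(phrase.toLowerCase())
  );
  return explicit ? "per-category" : "integrated";
}
```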

7. Write the review in academic survey style

The review must read like a normal survey paper, not a bullet summary dump.

- Use a concise academic title.
- Include an abstract when the output is formal enough to justify it.
- Include keywords when they help position the review.
- Use an introduction that explains background, significance, scope, source selection, and the organizing logic of the review.
- Organize the main body by the most defensible basis for the corpus:
  - chronology
  - research themes
  - method families
  - viewpoints or schools
- End with a conclusion or concluding discussion.
- Add future directions, outlook, or open problems when the corpus supports them.
- List references in GB/T 7714 style when bibliographic data is available.

Typical sections in a strong review are:

- title
- abstract
- keywords
- introduction
- themed main sections
- discussion, conclusion, or both
- future directions or open problems when useful
- references

Not every output needs every section. Match the structure to the user's request, the corpus size, and the field while staying recognizably review-like.

Read review-paper-style.md when drafting the prose review or choosing section structure.

8. Keep the prose review-like

- Prefer connected academic prose over bullet dumps.
- Use tables only to support comparison, not to replace the review.
- Do not invent datasets, metrics, or reference details.
- If extracted metadata is incomplete, keep partial references and state what is missing.
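
The partial-reference rule might look like the sketch below. It loosely approximates the GB/T 7714 journal-article pattern (`Authors. Title[J]. Venue, Year.`) and is not a full implementation of the standard: the `[J]` marker assumes a journal article, author-name normalization is skipped, and the `formatReference` name and `[missing: ...]` note format are assumptions.

```javascript
// Hypothetical formatter: emits only what the record contains and names the gaps.
function formatReference(record) {
  const missing = [];
  const has = (name, value) => {
    const present = Array.isArray(value) ? value.length > 0 : Boolean(value);
    if (!present) missing.push(name);
    return present;
  };

  const pieces = [];
  if (has("authors", record.authors)) pieces.push(`${record.authors.join(", ")}.`);
  // [J] assumes a journal article; other document types use other markers
  if (has("title", record.title)) pieces.push(`${record.title}[J].`);

  const tail = [];
  if (has("venue", record.venue)) tail.push(record.venue);
  if (has("year", record.year)) tail.push(String(record.year));
  if (tail.length > 0) pieces.push(`${tail.join(", ")}.`);

  const text = pieces.join(" ");
  // Nothing is invented: absent fields are listed instead of filled in
  return missing.length > 0 ? `${text} [missing: ${missing.join(", ")}]` : text;
}
```

A reference with no venue is still emitted, but it carries an explicit note so a reader knows the entry is incomplete.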

Output Contract

Return results in this order unless the user asks otherwise:

1. Corpus summary
2. Classification scheme
3. Classification table
4. Formal review article
5. References
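
The default ordering can be sketched as a small assembler. The section names follow the list above; the function name, the input shape (an object keyed by section name), and the plain-text heading layout are assumptions for illustration.

```javascript
// Section names follow the output contract; the rest is hypothetical.
const OUTPUT_ORDER = [
  "Corpus summary",
  "Classification scheme",
  "Classification table",
  "Formal review article",
  "References",
];

// Emits the available sections in contract order, skipping absent ones.
function assembleOutput(sections) {
  return OUTPUT_ORDER
    .filter((name) => typeof sections[name] === "string" && sections[name] !== "")
    .map((name) => `${name}\n\n${sections[name]}`)
    .join("\n\n");
}
```

Because the order lives in one array, a user-requested reordering only needs a different array rather than changes to the assembly logic.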

If the user wants structured output, read output-schema.md.

Bundled Scripts

`scripts/normalize-sources.mjs`

- Normalize mixed PDF and URL inputs into a JSON manifest.
- Use when the source set is large, mixed, or should be reused.

`scripts/extract-paper-records.mjs`

- Fetch URLs, resolve likely paper metadata, and extract paper text evidence from URLs or PDFs.
- Prefer this script before asking the model to reason over a large source set.
- Use its output as the main context object for classification and review drafting.

`scripts/render-formal-review-template.mjs`

- Render a flexible academic-review scaffold from structured paper records.
- Default to one integrated review.
- Use `--per-category` only when the user explicitly asks for separate category reviews.

Quality Bar

- Run extraction before classification unless the user already gave structured paper records.
- Keep classification and review consistent with extracted evidence.
- Re-read raw sources only to fill important gaps.
- If the extractor had to rely on weak fallbacks, say so.
