Paper Cluster Survey V2.2
Overview
Turn raw paper URLs and PDFs into usable review inputs. Extract structured metadata and text evidence first, then classify the papers, produce a classification table, and write a review that follows common academic survey conventions instead of a rigid fill-in-the-blanks template.
Workflow
1. Normalize the source set
- Accept multiple local PDF paths, arXiv URLs, DOI URLs, and general paper URLs.
- Use scripts/normalize-sources.mjs when the source set is mixed or should be stored as a reusable manifest.
- Preserve the original source string for traceability.
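A minimal sketch of what this normalization step could look like. The function names (`normalizeSource`, `buildManifest`), the `kind` labels, and the manifest shape are assumptions made for illustration; the bundled scripts/normalize-sources.mjs may work differently.

```javascript
// Hypothetical sketch: classify one raw source string and keep the original.
// The field names (original, kind, locator) are illustrative assumptions.
function normalizeSource(raw) {
  const trimmed = raw.trim();
  let kind = "url";
  let locator = trimmed;

  if (!/^https?:\/\//i.test(trimmed)) {
    // Anything that is not an http(s) URL is treated as a local path.
    kind = /\.pdf$/i.test(trimmed) ? "local-pdf" : "unknown";
  } else {
    try {
      const url = new URL(trimmed);
      const host = url.hostname.toLowerCase();
      if (host === "arxiv.org" || host.endsWith(".arxiv.org")) {
        kind = "arxiv";
      } else if (host === "doi.org" || host === "dx.doi.org") {
        kind = "doi";
        // The DOI itself is the URL path without the leading slash.
        locator = decodeURIComponent(url.pathname.slice(1));
      }
    } catch {
      // Malformed URLs are kept in the manifest but flagged.
      kind = "unknown";
    }
  }
  // Preserve the untouched input string for traceability.
  return { original: raw, kind, locator };
}

// Build a reusable manifest object from a mixed list of inputs.
function buildManifest(rawSources) {
  return { sources: rawSources.map(normalizeSource) };
}
```

Keeping `original` separate from `locator` means a later step can always report exactly what the user supplied, even after the input has been cleaned up.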
2. Extract paper records before reasoning
- Use scripts/extract-paper-records.mjs to turn PDFs and URLs into structured records before classification.
- The extraction pass should gather as much of the following as possible:
  - title
  - authors
  - year
  - venue
  - abstract
  - task
  - method
  - datasets
  - metrics
  - main_contribution
  - limitations
  - source
  - extraction_notes
- Treat extracted records as the primary context for classification and survey drafting.
- If important fields are missing, only fall back to direct source reading for the specific missing details.
Read extraction-pipeline.md when deciding how much to trust the extracted fields and when to re-open the raw source.
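The field list in this step can be turned into a simple completeness check, which helps decide which specific details justify going back to the raw source. The sketch below is illustrative only: the field names come from the list above, but `RECORD_FIELDS`, `isMissing`, and `missingFields` are hypothetical helpers, and the actual record format produced by scripts/extract-paper-records.mjs is not specified here.

```javascript
// Field names are taken from the extraction list; the helpers are hypothetical.
const RECORD_FIELDS = [
  "title", "authors", "year", "venue", "abstract", "task", "method",
  "datasets", "metrics", "main_contribution", "limitations", "source",
  "extraction_notes",
];

// A field counts as missing when it is absent, null, blank, or an empty array.
function isMissing(value) {
  if (value === undefined || value === null) return true;
  if (typeof value === "string") return value.trim() === "";
  if (Array.isArray(value)) return value.length === 0;
  return false;
}

// Return the names of the fields that still need to be filled in.
function missingFields(record) {
  return RECORD_FIELDS.filter((field) => isMissing(record[field]));
}
```

Only the fields reported by `missingFields` would then be candidates for targeted re-reading of the source.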
3. Verify evidence quality
- Do not classify from titles alone when abstract or body text is available.
- Prefer abstract, introduction, and method section.
- Mark uncertain inferences explicitly.
- If the extractor had to fall back to weak methods, keep claims conservative.
4. Design the classification scheme
- Produce a classification scheme before writing the review.
- Prefer task-based categories first.
- If tasks are too similar, classify by method family.
- Use application-domain categories only when they best explain the corpus.
- Keep the taxonomy shallow unless the corpus is large.
- Assign one primary category to each paper unless the user explicitly wants multi-label grouping.
Read taxonomy-guidelines.md when the category design is ambiguous.
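The preference order above (task first, method family as the fallback, one primary category per paper) can be sketched as follows. The function names and the rule that "too similar" means fewer than two distinct task labels are assumptions made for illustration; real corpora usually need judgment rather than a fixed threshold.

```javascript
// Hypothetical sketch: pick the grouping basis for a set of paper records.
function chooseBasis(records) {
  const distinct = (field) =>
    new Set(records.map((r) => r[field]).filter(Boolean)).size;
  // Assumption: "too similar" is approximated as fewer than two distinct tasks.
  return distinct("task") >= 2 ? "task" : "method";
}

// Assign exactly one primary category to each paper, using the chosen basis.
function assignPrimaryCategories(records) {
  const basis = chooseBasis(records);
  return records.map((r) => ({
    title: r.title,
    category: r[basis] || "uncategorized",
    basis,
  }));
}
```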
5. Output the classification table
- Always provide one classification table before the review.
- Include:
  - paper
  - year
  - category
  - rationale
  - evidence used
- Keep rationales brief and evidence-based.
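As an illustration, the table columns listed above could be rendered from row objects like this. The pipe-delimited layout, the `renderClassificationTable` name, and the "unknown" placeholder are assumptions; the source does not prescribe a table format.

```javascript
// Column names come from the list above; everything else is a hypothetical sketch.
const COLUMNS = ["paper", "year", "category", "rationale", "evidence used"];

function renderClassificationTable(rows) {
  // Missing values are shown as "unknown" rather than guessed.
  const cell = (value) =>
    String(value ?? "unknown").replace(/\|/g, "/").replace(/\s+/g, " ").trim();
  const line = (cells) => `| ${cells.join(" | ")} |`;
  return [
    line(COLUMNS),
    line(COLUMNS.map(() => "---")),
    ...rows.map((row) => line(COLUMNS.map((c) => cell(row[c])))),
  ].join("\n");
}
```

Rendering a missing value as "unknown" keeps the table honest about gaps in the extracted evidence.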
6. Decide the review shape
Default rule:
- Write one integrated literature review for the entire corpus after the classification table.
Exception:
- If the user explicitly asks for "each category write a survey", "分别写综述", "per-category review", or equivalent, write separate review sections for each category.
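The default-plus-exception rule can be sketched as a small decision function. The phrase list is taken from the rule above; the `reviewShape` name and the substring-matching strategy are assumptions, and plain string matching cannot recognize the "or equivalent" requests, which still need judgment.

```javascript
// Trigger phrases quoted in the rule above; the matching strategy is an assumption.
const PER_CATEGORY_PHRASES = [
  "each category write a survey",
  "分别写综述",
  "per-category review",
];

// Return "per-category" only on an explicit request, otherwise "integrated".
function reviewShape(userRequest) {
  const text = userRequest.toLowerCase();
  const explicit = PER_CATEGORY_PHRASES.some((phrase) => text.includes(phrase));
  return explicit ? "per-category" : "integrated";
}
```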
7. Write the review in academic survey style
The review must read like a normal survey paper, not a bullet summary dump.
- Use a concise academic title.
- Include an abstract when the output is formal enough to justify it.
- Include keywords when they help position the review.
- Use an introduction that explains background, significance, scope, source selection, and the organizing logic of the review.
- Organize the main body by the most defensible basis for the corpus:
  - chronology
  - research themes
  - method families
  - viewpoints or schools
- End with a conclusion or concluding discussion.
- Add future directions, outlook, or open problems when the corpus supports them.
- List references in GB/T 7714 style when bibliographic data is available.
Typical sections in a strong review are:
- title
- abstract
- keywords
- introduction
- themed main sections
- discussion, conclusion, or both
- future directions or open problems when useful
- references
Not every output needs every section. Match the structure to the user's request, the corpus size, and the field while staying recognizably review-like.
Read review-paper-style.md when drafting the prose review or choosing section structure.
8. Keep the prose review-like
- Prefer connected academic prose over bullet dumps.
- Use tables only to support comparison, not to replace the review.
- Do not invent datasets, metrics, or reference details.
- If extracted metadata is incomplete, keep partial references and state what is missing.
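One way to keep a partial reference while stating what is missing is sketched below. The `partialReference` helper and its output layout are hypothetical, and the sketch does not attempt GB/T 7714 formatting; it only shows the principle of reporting gaps instead of filling them.

```javascript
// Hypothetical sketch: build a reference line from whatever metadata exists
// and state what is missing instead of inventing it.
// This does not implement GB/T 7714 formatting.
function partialReference(record) {
  const missing = [];
  const pick = (field, present) => {
    const value = record[field];
    const empty = value == null || value.length === 0;
    if (empty) {
      missing.push(field);
      return null;
    }
    return present(value);
  };
  const parts = [
    pick("authors", (authors) => authors.join(", ")),
    pick("title", (title) => title),
    pick("venue", (venue) => venue),
    pick("year", (year) => String(year)),
  ].filter((part) => part !== null);
  const note = missing.length ? ` [missing: ${missing.join(", ")}]` : "";
  return parts.join(". ") + "." + note;
}
```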
Output Contract
Return results in this order unless the user asks otherwise:
1. Corpus summary
2. Classification scheme
3. Classification table
4. Formal review article
5. References
If the user wants structured output, read output-schema.md.
Bundled Scripts
scripts/normalize-sources.mjs
- Normalize mixed PDF and URL inputs into a JSON manifest.
- Use when the source set is large, mixed, or should be reused.
scripts/extract-paper-records.mjs
- Fetch URLs, resolve likely paper metadata, and extract paper text evidence from URLs or PDFs.
- Prefer this script before asking the model to reason over a large source set.
- Use its output as the main context object for classification and review drafting.
scripts/render-formal-review-template.mjs
- Render a flexible academic-review scaffold from structured paper records.
- Default to one integrated review.
- Use --per-category only when the user explicitly asks for separate category reviews.
Quality Bar
- Run extraction before classification unless the user already gave structured paper records.
- Keep classification and review consistent with extracted evidence.
- Use raw source re-reading only to fill important gaps.
- If the extractor had to rely on weak fallbacks, say so.