Cross-Ref: PR & Issue Linker
You find hidden connections between PRs and issues that humans miss at scale.
The core loop is: fetch → analyze in parallel → cluster → verify → report → act.
Before doing anything, read references/principles.md. Those rules override
everything in this file when there's a conflict.
Overview
Repos accumulate duplicate PRs and orphaned issue→PR links over time. Manual
cross-referencing doesn't scale past a few dozen items. This skill uses parallel
Sonnet subagents to analyze up to 1000 PRs and 1000 issues simultaneously,
finding two kinds of links:
1. Duplicate PRs: PRs that address the same bug or feature (even with
different approaches or wording)
2. Issue→PR links: open issues that already have a PR solving them but
no explicit "fixes #N" reference
Results are grouped into thematic clusters, scored by actionability,
and presented with available actions (comment, close, label) — not just
as a flat list of pairs.
Configuration
The user provides these at invocation time (ask if not given):
| Parameter | Default | Description |
|---|---|---|
| repo | (ask) | GitHub owner/repo to analyze |
| pr_count | 1000 | How many recent PRs to scan |
| issue_count | 1000 | How many recent issues to scan |
| pr_state | all | PR state filter: open, closed, all |
| issue_state | open | Issue state filter: open, closed, all |
| batch_size | 50 | PRs per subagent batch |
| confidence_threshold | medium | Minimum confidence to include in report: low, medium, high |
| mode | plan | plan = report only (default, always start here). execute = act on findings. |
Default mode is plan (dry-run). The skill always starts by generating
the report. The user must explicitly choose to execute actions after reviewing
the findings. This matters because actions can't be undone.
Workflow
Phase 1: Data Collection
Fetch PR and issue metadata from the GitHub API. This phase is deterministic
and uses the shell script — no AI needed.
CODEBLOCK0
This produces:
- workspace/prs.json: Full PR metadata
- workspace/issues.json: Full issue metadata (PRs filtered out)
- workspace/existing-refs.json: Pre-extracted explicit cross-references
- workspace/pr-index.txt: Compact one-line-per-PR index
- workspace/issue-index.txt: Compact one-line-per-issue index
The existing references map captures what's already linked (via "fixes #N",
"closes #N", etc.) so subagents can focus on what's missing.
Phase 2: Parallel Analysis (Sonnet Subagents)
This is where the intelligence happens. Split PRs into batches and spawn
parallel Sonnet subagents. Each subagent receives:
- Its batch of PRs (full metadata from prs.json, ~50 PRs)
- The complete issue index (compact, ~60KB)
- The complete PR index (compact, ~60KB), used for duplicate detection
- The existing references map (so it skips already-linked items)
Spawn subagents using the Task tool:
CODEBLOCK1
Subagent prompt template:
Important: When building each subagent prompt, paste the FULL contents of
references/principles.md into the "Decision Principles" section below.
Do not summarize or condense — include the complete text. This ensures
subagents always use the latest principles without drift.
CODEBLOCK2
Parallelism: Spawn ALL batch subagents simultaneously. With batch_size=50
and 1000 PRs, that's 20 parallel subagents. This is the power of the skill —
what would take hours sequentially completes in minutes.
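The batch arithmetic above can be sketched as follows. This is a minimal illustration, assuming PRs are identified by number; the helper name make_batches is hypothetical and not part of the skill's scripts.

```python
from typing import List, Sequence


def make_batches(pr_numbers: Sequence[int], batch_size: int = 50) -> List[List[int]]:
    """Split PR numbers into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        list(pr_numbers[i:i + batch_size])
        for i in range(0, len(pr_numbers), batch_size)
    ]


if __name__ == "__main__":
    batches = make_batches(range(1, 1001), batch_size=50)
    print(len(batches))  # one subagent per batch: 20 for 1000 PRs
```

Each batch would then be handed to one subagent, so the batch count is also the number of parallel subagent calls.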
Phase 3: Merge, Deduplicate & Cluster
After all subagents return:
1. Collect all JSON results into a single array
2. Deduplicate duplicate_pr entries (A→B and B→A are the same link)
3. Merge confidence: if two subagents found the same link, take the
higher confidence and merge both evidence strings
4. Filter by confidence_threshold
5. Build clusters: group related findings into thematic clusters (see below)
6. Score clusters by actionability (see below)
7. Sort clusters by score (highest first)
Save to workspace/results-unverified.json.
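A minimal sketch of steps 1 to 4, assuming each finding is a dict with the fields from the subagent output format (type, pr, issue, pr_a, pr_b, confidence, evidence). The function names and the " | " evidence separator are illustrative choices, not something the skill prescribes.

```python
from typing import Dict, List, Tuple

CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def link_key(finding: dict) -> Tuple:
    """Return an order-independent key so A->B and B->A collapse to one link."""
    if finding["type"] == "duplicate_pr":
        a, b = sorted((finding["pr_a"], finding["pr_b"]))
        return ("duplicate_pr", a, b)
    return ("issue_link", finding["pr"], finding["issue"])


def merge_findings(findings: List[dict], threshold: str = "medium") -> List[dict]:
    """Deduplicate findings, keep the higher confidence, and join evidence."""
    merged: Dict[Tuple, dict] = {}
    for f in findings:
        key = link_key(f)
        if key not in merged:
            merged[key] = dict(f)  # copy so the input list is not mutated
            continue
        kept = merged[key]
        if f["evidence"] not in kept["evidence"]:
            kept["evidence"] = kept["evidence"] + " | " + f["evidence"]
        if CONFIDENCE_RANK[f["confidence"]] > CONFIDENCE_RANK[kept["confidence"]]:
            kept["confidence"] = f["confidence"]
    floor = CONFIDENCE_RANK[threshold]
    return [f for f in merged.values() if CONFIDENCE_RANK[f["confidence"]] >= floor]
```

Sorting the two PR numbers before building the key is what makes the duplicate check symmetric.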
Clustering Algorithm
Instead of reporting isolated pairs, group connected findings into clusters.
Two findings belong to the same cluster if they share any PR or issue number.
Example: If you find PR#100 ↔ PR#101 (duplicate) and PR#100 ↔ Issue#50
(link), these form a single cluster: "Cluster: Issue#50 + PR#100 + PR#101".
Cluster structure:
CODEBLOCK3
The theme is a one-line summary that describes what this cluster is about
— the shared root cause or feature area. Generate it from the root_cause
fields of the cluster's findings.
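The "share any PR or issue number" rule amounts to finding connected components. One way to sketch it is with a small union-find, assuming findings carry the fields from the subagent output format; build_clusters and finding_items are hypothetical names, and theme generation and scoring are left out.

```python
from collections import defaultdict
from typing import Dict, List


def finding_items(finding: dict) -> List[str]:
    """List the two items a finding connects, e.g. ["PR#100", "Issue#50"]."""
    if finding["type"] == "duplicate_pr":
        return [f"PR#{finding['pr_a']}", f"PR#{finding['pr_b']}"]
    return [f"PR#{finding['pr']}", f"Issue#{finding['issue']}"]


def build_clusters(findings: List[dict]) -> List[dict]:
    """Group findings that share any PR or issue number."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for f in findings:
        a, b = finding_items(f)
        parent[find(a)] = find(b)  # union the two items of each finding

    grouped: Dict[str, List[dict]] = defaultdict(list)
    for f in findings:
        grouped[find(finding_items(f)[0])].append(f)

    clusters = []
    for cluster_id, members in enumerate(grouped.values(), start=1):
        items = sorted({item for f in members for item in finding_items(f)})
        clusters.append({"cluster_id": cluster_id, "items": items, "findings": members})
    return clusters
```

With the example above (PR#100 ↔ PR#101 and PR#100 ↔ Issue#50), both findings end up in one cluster containing all three items.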
Actionability Scoring
Each cluster gets a score based on these signals (clamp result to 0-10):
| Signal | Points | Why it matters |
|---|---|---|
| All items open | +3 | Can still be acted on |
| At least one high-confidence finding | +2 | Strong evidence |
| Multiple findings in cluster | +1 | More connections = more value |
| Issue has >5 reactions/comments | +1 | High community interest |
| PR is not draft | +1 | Ready for review |
| Cluster has a clear canonical PR | +1 | Easy to pick a winner |
| Any manual_review_required | -2 | Needs human judgment |
| All items closed | -3 | Low urgency |
Clusters scoring 7+ are actionable (green in report).
Clusters scoring 4-6 need review (yellow).
Clusters scoring 0-3 are low priority (gray).
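The scoring table and tiers can be sketched as a lookup plus a clamp. The signal key names below are hypothetical labels for the table rows; how each signal is derived from the PR and issue metadata is left to the caller.

```python
# Points per signal, copied from the table above. Key names are illustrative.
SIGNAL_POINTS = {
    "all_open": 3,
    "has_high_confidence": 2,
    "multiple_findings": 1,
    "issue_has_engagement": 1,  # issue has >5 reactions/comments
    "pr_not_draft": 1,
    "has_canonical_pr": 1,
    "any_manual_review": -2,
    "all_closed": -3,
}


def score_cluster(signals: dict) -> int:
    """Sum the points of every signal that is set, then clamp to 0-10."""
    total = sum(points for name, points in SIGNAL_POINTS.items() if signals.get(name))
    return max(0, min(10, total))


def tier(score: int) -> str:
    """Map a score to the report tier: 7+ green, 4-6 yellow, 0-3 gray."""
    if score >= 7:
        return "actionable"
    if score >= 4:
        return "needs_review"
    return "low_priority"
```

With these weights the highest reachable score is 9, so the upper clamp only matters if signals are added later.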
Phase 3b: Evidence Verification
The batch subagents work from truncated bodies (500 chars) and compact indexes.
That's good enough for discovery but not for final decisions. This phase takes
the candidates and verifies them against deeper data.
Spawn a single verification subagent (Sonnet) that:
1. Reads INLINECODE39
2. For each high/medium candidate, fetches deeper evidence via gh:
   - Duplicate PRs: gh pr diff {id} --name-only for both PRs to confirm
     they actually touch the same files. If the file lists don't overlap at all,
     downgrade to low or remove.
   - Issue→PR links: gh issue view {id} --json body,comments to read the
     full issue body (not truncated) and check if any commenter already noted
     the connection.
   - For both: gh pr view {id} --json body to read the full PR body
     when the truncated version was ambiguous.
3. For manual_review_required items: attempt to resolve with deeper data.
   If still ambiguous after deep check, keep the flag; it goes to the user.
4. Upgrades, downgrades, or removes candidates based on the deeper evidence.
5. Recalculates cluster scores after confidence changes.
6. Writes the verified results to workspace/results.json.
Verification subagent prompt:
CODEBLOCK4
This phase catches false positives that slipped through the discovery phase.
The batch subagents are optimized for recall (find everything plausible); the
verifier is optimized for precision (keep only what's real).
Skip this phase if the total candidate count is under 5 — the cost of
verification outweighs the benefit for small result sets.
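The duplicate-PR file check can be sketched like this, assuming the gh CLI is installed and authenticated. The function names are hypothetical, and the sketch always downgrades to low on zero overlap, although the rule above also allows removing the candidate.

```python
import subprocess
from typing import Set


def parse_name_only(output: str) -> Set[str]:
    """Turn newline-separated `gh pr diff --name-only` output into a set of paths."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def changed_files(pr_number: int, repo: str) -> Set[str]:
    """Fetch the changed file list for one PR via the gh CLI (needs network access)."""
    result = subprocess.run(
        ["gh", "pr", "diff", str(pr_number), "--name-only", "--repo", repo],
        capture_output=True, text=True, check=True,
    )
    return parse_name_only(result.stdout)


def recheck_duplicate(confidence: str, files_a: Set[str], files_b: Set[str]) -> str:
    """Keep the confidence if the PRs share at least one file, else downgrade to low."""
    if files_a & files_b:
        return confidence
    return "low"
```

Keeping the overlap decision in a pure function separate from the gh call makes the rule checkable without hitting the API.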
Phase 4: Generate Report
Present the report to the user organized by clusters, not flat pairs.
Report structure:
CODEBLOCK5
Phase 4b: Suggested Actions Per Cluster
For each cluster, suggest appropriate actions based on confidence and item states.
For duplicate PRs (high confidence, both open):
1. 💬 Comment: link the PRs so authors can coordinate
2. 🏷️ Label: add the duplicate label to the weaker PR
3. ❌ Close: close the weaker PR as duplicate (only if very clear)
For duplicate PRs (one open, one closed):
1. 💬 Comment: note the connection for context (lower priority)
For issue→PR links (high confidence):
1. 💬 Comment on issue: note that a PR addresses this
2. 🏷️ Label issue: add has-pr or similar
For manual_review_required items:
1. ⚠️ Flag for human: present in a separate section, no automated action
Action rules:
- Never suggest closing without high confidence + verification
- Never suggest labeling without at least medium confidence
- Always suggest commenting as the minimum action (it's the safest)
- For clusters with mixed confidence, suggest the action matching the
lowest-confidence finding (conservative)
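The action rules can be read as a small gating function. The sketch below assumes a cluster is summarized by the confidence levels of its findings plus a verified flag from Phase 3b; allowed_actions is a hypothetical name, and it ignores open/closed state and manual_review_required, which the rules above handle separately.

```python
from typing import List

RANK = {"low": 0, "medium": 1, "high": 2}


def allowed_actions(confidences: List[str], verified: bool) -> List[str]:
    """Return the actions a cluster may be offered, gated by its weakest finding."""
    lowest = min(confidences, key=RANK.__getitem__)  # conservative: weakest finding wins
    actions = ["comment"]  # commenting is always the minimum action
    if RANK[lowest] >= RANK["medium"]:
        actions.append("label")
    if lowest == "high" and verified:
        actions.append("close")
    return actions
```

For a cluster with one high and one medium finding this yields comment and label but never close, which matches the mixed-confidence rule.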
Phase 5: Interactive Action Strategy
After presenting the report, ask the user how they want to proceed.
Read references/commenting-strategy.md for rate-limiting details.
Present action choices per cluster:
For each actionable cluster, let the user pick:
- Comment only: just link the items
- Comment + label: link and add labels
- Comment + close: link and close duplicates (high confidence only)
- Skip: do nothing for this cluster
- Manual: I'll handle this one myself
Then present the timing strategy. Read references/commenting-strategy.md for
the full tier definitions, rate calculations, and daily budget math. Present
the user with the strategy table from that file, populated with the actual
counts from the report. If total actions exceed the daily budget, show the
multi-day plan as described in commenting-strategy.md.
Always offer Dry Run (report only, no actions) as the default choice.
Also offer Skip — save the report but don't act at all.
Phase 6: Execute Actions
If the user chooses to act, build workspace/approved-comments.json and
execute with rate limiting via the shell script.
approved-comments.json schema (array of objects):
[
{
"target_number": 1234,
"type": "issue_link|duplicate_pr",
"body": "The full comment text to post",
"cluster_id": 1,
"finding_index": 0
}
]
- target_number: the issue or PR number to comment on (used by post-comments.sh)
- type: finding type, used for logging only
- body: the complete comment text
- cluster_id and finding_index: traceability back to the report
CODEBLOCK7
For label and close actions, execute them inline (not via the script)
since they don't need the same rate limiting as comments:
CODEBLOCK8
Always execute in this order within a cluster:
1. Post comments first (so the context exists before close/label)
2. Add labels
3. Close (only after comment is posted)
Comment style: Comments should feel like they're from a helpful maintainer,
not a bot. Vary the opener and closer for each comment to avoid sounding
repetitive. Always mention the PR author by name.
Comment templates (vary the opener each time):
Openers (rotate through these, never use the same one twice in a row):
- "Heads up: this might be related."
- "Worth a look:"
- "Noticed a possible connection here."
- "This could be relevant to what you're working on."
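One way to satisfy "never use the same one twice in a row" is to exclude the previous opener before choosing. This is a sketch only: pick_opener is a hypothetical helper, the opener strings are taken from the list above with punctuation simplified, and random choice is one possible rotation strategy (a round-robin would also work).

```python
import random
from typing import List, Optional

OPENERS: List[str] = [
    "Heads up: this might be related.",
    "Worth a look:",
    "Noticed a possible connection here.",
    "This could be relevant to what you're working on.",
]


def pick_opener(previous: Optional[str], rng: random.Random) -> str:
    """Pick a random opener that differs from the one used in the previous comment."""
    candidates = [o for o in OPENERS if o != previous]
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    last = None
    for _ in range(5):
        last = pick_opener(last, rng)
        print(last)
```

Passing the random.Random instance in explicitly keeps the choice reproducible when a run is resumed with a fixed seed.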
For issue→PR links (comment on the issue):
CODEBLOCK9
For duplicate PRs (comment on the newer PR):
CODEBLOCK10
Every comment includes a correction path because wrong links erode trust.
Save progress to workspace/comment-progress.json for resume support.
Error Handling
- API rate limit hit: Pause, show remaining reset time, save progress.
- Subagent returns invalid JSON: Log the error, skip that batch, warn user.
Don't retry; the batch results are lost but other batches continue.
- PR/issue not found (deleted): Skip silently, note in report.
- Network error during commenting: Save progress immediately, offer resume.
- Subagent returns empty results: Normal, since not every batch has links.
- Close/label fails: Log the error, continue with remaining actions.
Never retry a close; the user should investigate manually.
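The invalid-JSON and empty-result cases can be handled in one parsing step. A minimal sketch, assuming each subagent returns its answer as a raw string; parse_batch_output and the warning text are illustrative.

```python
import json
from typing import List


def parse_batch_output(raw: str, batch_id: int, warnings: List[str]) -> List[dict]:
    """Parse one subagent's output; on invalid JSON, record a warning and skip the batch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        warnings.append(f"batch {batch_id}: invalid JSON ({exc.msg}), batch skipped")
        return []
    if not isinstance(data, list):
        warnings.append(f"batch {batch_id}: expected a JSON array, batch skipped")
        return []
    return data  # an empty list is a normal result, not an error
```

The collected warnings can then be shown to the user alongside the report, while the remaining batches continue unaffected.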
Workspace Structure
CODEBLOCK11
Resume Support
If a previous run exists in the workspace:
- Phase 1-3: Skip if results.json exists and user confirms
- Phase 4: Skip if report.md exists and user confirms
- Phase 5-6: Resume from comment-progress.json if commenting was interrupted
- Ask: "Found a previous run with {N} results. Resume commenting or start fresh?"
Tips for Operators
- Start with a smaller count (100 PRs, 100 issues) to validate before scaling
- Always review the report in plan mode before executing actions
- The compact index approach keeps memory usage manageable; don't fetch full
PR bodies (500 char truncation is intentional)
- For very active repos (>10K PRs), increase batch_size to reduce subagent count
- Token costs: ~20 subagent calls for 1000 PRs at batch_size=50, each with
~120KB context. Plan accordingly.
- The gh CLI token needs repo scope (private) or public_repo (public),
plus issues:write for posting comments.
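Cross-Ref: PR & Issue Linker
You find hidden connections between PRs and issues that humans miss at scale.
The core loop is: fetch → analyze in parallel → cluster → verify → report → act.
Before doing anything, read references/principles.md. When the rules in that file conflict with this file, the rules in that file take precedence.
Overview
Repos accumulate duplicate PRs and orphaned issue→PR links over time. Manual cross-referencing doesn't scale past a few dozen items. This skill uses parallel Sonnet subagents to analyze up to 1000 PRs and 1000 issues simultaneously, looking for two kinds of links:
1. Duplicate PRs: PRs that address the same bug or feature (even with different approaches or wording)
2. Issue→PR links: open issues that already have a PR solving them but no explicit fixes #N reference
Results are grouped into thematic clusters, scored by actionability, and presented with available actions (comment, close, label), not just as a flat list of pairs.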
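Configuration
The user provides these parameters at invocation time (ask if not given):
| Parameter | Default | Description |
|---|---|---|
| repo | (ask) | GitHub owner/repo to analyze |
| pr_count | 1000 | How many recent PRs to scan |
| issue_count | 1000 | How many recent issues to scan |
| pr_state | all | PR state filter: open, closed, all |
| issue_state | open | Issue state filter: open, closed, all |
| batch_size | 50 | PRs per subagent batch |
| confidence_threshold | medium | Minimum confidence to include in report: low, medium, high |
| mode | plan | plan = report only (default, always start here). execute = act on findings. |
Default mode is plan (dry-run). The skill always starts by generating the report. The user must explicitly choose to execute actions after reviewing the findings. This matters because actions can't be undone.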
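Workflow
Phase 1: Data Collection
Fetch PR and issue metadata from the GitHub API. This phase is deterministic and uses the shell script; no AI is needed.
bash
scripts/fetch-data.sh dir> [pr_count] [issue_count] [pr_state] [issue_state]
This produces:
- workspace/prs.json: Full PR metadata
- workspace/issues.json: Full issue metadata (PRs filtered out)
- workspace/existing-refs.json: Pre-extracted explicit cross-references
- workspace/pr-index.txt: Compact one-line-per-PR index
- workspace/issue-index.txt: Compact one-line-per-issue index
The existing references map captures what's already linked (via fixes #N, closes #N, etc.) so subagents can focus on what's missing.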
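Phase 2: Parallel Analysis (Sonnet Subagents)
This is where the intelligence happens. Split PRs into batches and spawn parallel Sonnet subagents. Each subagent receives:
- Its batch of PRs (full metadata from prs.json, ~50 PRs)
- The complete issue index (compact, ~60KB)
- The complete PR index (compact, ~60KB), used for duplicate detection
- The existing references map (so it skips already-linked items)
Spawn subagents using the Task tool:
For each batch B of {batch_size} PRs:
Task(
subagent_type=general-purpose,
model=sonnet,
prompt=<see below>
)
Subagent prompt template:
Important: When building each subagent prompt, paste the full contents of references/principles.md into the Decision Principles section below. Do not summarize or condense; include the complete text. This ensures subagents always use the latest principles without drift.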
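You are a cross-reference analyst for a GitHub repository. Your job is to find connections between PRs and issues that are not yet explicitly linked.
Decision Principles (these override everything else)
{paste the full contents of references/principles.md here}
Your Batch
You are analyzing PRs {start_num} to {end_num} of {total_prs} total.
PR Details (your batch)
{full PR metadata for this batch from prs.json}
Complete Issue Index
{contents of issue-index.txt}
Complete PR Index
{contents of pr-index.txt}
Known References
{contents of existing-refs.json}
Your Task
Find two types of connections:
1. Issue→PR Links
For each PR in your batch, determine whether it solves any issue in the index. Evidence must include at least one of the following:
- Both describe the same error message or failure path
- The PR modifies the component/module the issue describes as broken
- The PR body explicitly references the problem the issue describes (even without #N)
Title similarity alone is not enough. Skip any link that already exists in Known References.
2. Duplicate PRs
For each PR in your batch, check whether any other PR in the complete PR index addresses the same problem. Evidence must include at least one of the following:
- Both modify the same files for the same reason
- Both fix the same bug/behavior (even with different approaches)
- One is a resubmission or continuation of the other (same branch, similar body)
Touching the same code area alone is not enough; the PRs must address the same specific problem.
3. Flag Uncertainty
If you encounter a pair where the evidence is ambiguous (you see a possible connection but cannot confirm it from the available data), mark it with status: manual_review_required instead of guessing a confidence. Include what is missing (e.g., need to see the full diff to confirm file overlap).
Output Format
Return only a JSON array. No other text.
[
{
"type": "issue_link",
"pr": 5678,
"pr_author": "@username",
"issue": 1234,
"confidence": "high|medium|low",
"status": "confirmed|manual_review_required",
"root_cause": "One sentence: what is the shared problem behind these links",
"evidence": "Specific: same error message, same files, same component, etc.",
"missing_evidence": "null or what is needed to confirm this link"
},
{
"type": "duplicate_pr",
"pr_a": 5678,
"pr_b": 5679,
"pr_a_author": "@username_a",
"pr_b_author": "@username_b",
"confidence": "high|medium|low",
"status": "confirmed|manual_review_required",
"root_cause": "One sentence: what is the shared problem behind these links",
"evidence": "Specific: same files modified, same branch, resubmission, etc.",
"missing_evidence": "null or what is needed to confirm this link"
}
]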
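Parallelism: Spawn all batch subagents simultaneously. With batch_size=50 and 1000 PRs, that's 20 parallel subagents. This is the strength of the skill: work that would take hours sequentially completes in minutes.
Phase 3: Merge, Deduplicate & Cluster
After all subagents return:
1. Collect all JSON results into a single array
2. Deduplicate duplicate_pr entries (A→B and B→A are the same link)
3. Merge confidence: if two subagents found the same link, take the higher confidence and merge both evidence strings
4. Filter by confidence_threshold
5. Build clusters: group related findings into thematic clusters (see below)
6. Score clusters by actionability (see below)
7. Sort clusters by score (highest first)
Save to workspace/results-unverified.json.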
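Clustering Algorithm
Instead of reporting isolated pairs, group related findings into clusters. Two findings belong to the same cluster if they share any PR or issue number.
Example: If you find PR#100 ↔ PR#101 (duplicate) and PR#100 ↔ Issue#50 (link), these form a single cluster: Cluster: Issue#50 + PR#100 + PR#101.
Cluster structure:
json
{
"cluster_id": 1,
"theme": "Onboard token mismatch: OPENCLAWGATEWAYTOKEN ignored",
"items": ["PR#22662", "PR#22658", "Issue#22638"],
"findings": [ ...individual findings in this cluster... ],
"score": 8.5,
"cluster_status": "actionable|needs_review|manual_review_required",
"suggested_actions": [ ...see Phase 4b... ]
}
The theme is a one-line summary describing what this cluster is about: the shared root cause or feature area. From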