Personal Genomics Analysis Skill

Overview

This skill guides you through a structured, multi-phase workflow for analyzing consumer
genetic testing data and producing actionable health insights. The workflow is interactive —
you gather information from the user at key decision points rather than making assumptions.

The analysis pipeline is designed to be:

- Evidence-based: every risk assessment cites published research (PMIDs)
- Interactive: the user's medical history, lifestyle, and concerns shape the analysis
- Progressive: start broad, then deep-dive into areas that matter most to the user
- Actionable: end with concrete recommendations (supplements, lifestyle, screening schedule)

Phase 1: Data Intake & Format Detection

Supported Input Formats

Read references/supported_formats.md for detailed format specifications. In brief:

Platform	File Type	Key Characteristics
WeGene	TSV (.txt)	rsid, chromosome, position, genotype (tab-separated columns)
23andMe

What to Do

1. List the user's uploaded files and identify their formats by reading the first 20-50 lines
2. Report back what you found: platform, number of variants, reference genome build (GRCh37/GRCh38 if detectable), data quality indicators
3. Ask the user what they'd like to focus on. Present the available analysis modules:

   - Health risk assessment (disease predisposition)
   - Pharmacogenomics (drug metabolism & response)
   - Nutrition & metabolism genetics
   - Exercise & fitness genetics
   - Ancestry (mtDNA/Y haplogroups if WGS data available)
   - All of the above (recommended for first-time analysis)

Parsing Strategy

Write a Python script that:

- Auto-detects the input format from file headers
- Builds a unified genotype dictionary: {rsid: genotype_string}
- For VCF files, also indexes by chr:pos for position-based lookups
- Handles both compressed (.gz) and uncompressed files
- Reports parsing statistics (total variants, by chromosome, etc.)

When both chip data (WeGene/23andMe) and WGS (VCF) are available, use a dual-source
lookup strategy: check chip data first (faster), fall back to VCF by rsid or chr:pos.
This maximizes coverage since chip and WGS may cover different variant sets.
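
The parsing and dual-source lookup described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes chip exports are four-column TSV files (rsid, chromosome, position, genotype) with `#` comment lines and `--` as the no-call marker, that the VCF is single-sample, and that only single-base alleles matter. All function names (`open_text`, `detect_format`, `parse_lines`, `lookup`) are hypothetical.

```python
import gzip
from collections import Counter
from itertools import chain, islice

NO_CALLS = {"", "--"}  # assumed no-call markers in chip exports


def open_text(path):
    """Open a plain or gzip-compressed file as text."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, "rt", encoding="utf-8", errors="replace")


def detect_format(head_lines):
    """Guess the format from the first lines (heuristic, not exhaustive)."""
    if any(line.startswith("##fileformat=VCF") for line in head_lines):
        return "vcf"
    for line in head_lines:
        if line.startswith("#") or not line.strip():
            continue
        return "chip_tsv" if len(line.rstrip("\r\n").split("\t")) == 4 else "unknown"
    return "unknown"


def vcf_genotype(ref, alt, format_keys, sample):
    """Turn a GT such as 0/1 into an allele string such as 'CT' (SNVs only)."""
    keys = format_keys.split(":")
    if "GT" not in keys:
        return None
    gt = sample.split(":")[keys.index("GT")].replace("|", "/")
    alleles = [ref] + alt.split(",")
    try:
        called = [alleles[int(i)] for i in gt.split("/")]
    except (ValueError, IndexError):
        return None  # missing call such as ./.
    if any(len(a) != 1 for a in called):
        return None  # this sketch skips indels and symbolic alleles
    return "".join(called)


def parse_lines(lines):
    """Return (format, by_rsid, by_pos, per_chromosome_counts)."""
    it = iter(lines)
    head = list(islice(it, 50))  # the first lines are enough to detect the format
    fmt = detect_format(head)
    by_rsid, by_pos, per_chrom = {}, {}, Counter()
    for line in chain(head, it):
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\r\n").split("\t")
        if fmt == "chip_tsv" and len(f) >= 4:
            rsid, chrom, gt = f[0], f[1], f[3]
        elif fmt == "vcf" and len(f) >= 10:
            chrom, pos, rsid = f[0], f[1], f[2]
            gt = vcf_genotype(f[3], f[4], f[8], f[9])
            if gt is None:
                continue
            bare = chrom[3:] if chrom.startswith("chr") else chrom
            by_pos[f"{bare}:{pos}"] = gt  # position index, VCF only
        else:
            continue
        if rsid.startswith("rs") and rsid[2:].isdigit():
            by_rsid[rsid] = gt
        per_chrom[chrom] += 1  # counts every parsed record, with or without an rsid
    return fmt, by_rsid, by_pos, per_chrom


def lookup(rsid, chip, vcf_by_rsid, vcf_by_pos, pos_key=None):
    """Chip first, then VCF by rsid, then VCF by chr:pos."""
    gt = chip.get(rsid)
    if gt is not None and gt not in NO_CALLS:
        return gt, "chip"
    if rsid in vcf_by_rsid:
        return vcf_by_rsid[rsid], "vcf_rsid"
    if pos_key and pos_key in vcf_by_pos:
        return vcf_by_pos[pos_key], "vcf_pos"
    return None, "missing"


# Usage (file names are placeholders):
# with open_text("chip_export.txt.gz") as fh:
#     fmt, chip, _, stats = parse_lines(fh)
# print(fmt, sum(stats.values()), dict(stats))
```

One caveat with this approach: a VCF that lists only variant sites does not distinguish "homozygous reference" from "not covered", so a `missing` result from the VCF fallback should be reported as unknown rather than as a normal genotype.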

Phase 2: Initial Comprehensive Analysis

SNP Database

Read references/snp_database.md for the curated SNP database organized by category.
The database covers ~120 clinically relevant SNPs across these categories:

- Health risks: cancer (BRCA1/2), cardiovascular (9p21.3, MTHFR), metabolic (TCF7L2),
  neurological (APOE, LRRK2), autoimmune, and more
- Pharmacogenomics: CYP2C19, CYP2D6, CYP2C9, CYP1A2, SLCO1B1, VKORC1, ALDH2, etc.
- Nutrition: lactose tolerance (MCM6), vitamin metabolism (MTHFR, VDR, BCMO1, FUT2),
  caffeine sensitivity (CYP1A2), alcohol flush (ALDH2)
- Exercise: muscle fiber type (ACTN3), endurance (PPARGC1A), recovery (IL6), VO2max (ACE)

Each SNP entry includes: gene, variant name, risk allele, condition/trait, evidence level,
PMID reference, and a plain-language explanation.

Analysis Script Structure

Generate a Python analysis script that:

1. Loads the unified genotype dictionary from Phase 1
2. Looks up each SNP in the database
3. Determines risk level based on genotype (homozygous risk, heterozygous, or normal)
4. Handles special cases:

   - APOE typing: requires combining rs429358 + rs7412 to determine ε2/ε3/ε4 status
   - CYP2C19 metabolizer status: combines multiple star-allele SNPs
   - MTHFR compound: checks both C677T (rs1801133) and A1298C (rs1801131)

5. Generates an HTML report with:

   - Summary dashboard (key findings, risk counts by category)
   - Tabbed sections for each category
   - Color-coded risk levels (high/medium/low/protective)
   - Citations for each finding
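
As an illustration of the genotype-to-risk step and the APOE special case, here is a small sketch. It assumes genotypes are two-letter strings reported on the forward strand (rs429358 and rs7412 both as T/C), that the risk allele for each SNP comes from the database entry, and that the function names `risk_level` and `apoe_type` are hypothetical. Unphased data cannot separate ε2/ε4 from the very rare ε1/ε3 combination, so the table reports the common interpretation.

```python
def risk_level(genotype, risk_allele):
    """Map a two-letter genotype to homozygous risk, heterozygous, or normal."""
    if not genotype or len(genotype) != 2 or "-" in genotype:
        return "no_call"
    copies = genotype.count(risk_allele)
    return {2: "homozygous_risk", 1: "heterozygous", 0: "normal"}[copies]


# Keys are (rs429358, rs7412) genotypes with alleles sorted alphabetically.
# rs429358 C marks the ε4 allele; rs7412 T marks the ε2 allele.
APOE_TABLE = {
    ("TT", "TT"): "ε2/ε2",
    ("TT", "CT"): "ε2/ε3",
    ("TT", "CC"): "ε3/ε3",
    ("CT", "CT"): "ε2/ε4",
    ("CT", "CC"): "ε3/ε4",
    ("CC", "CC"): "ε4/ε4",
}


def apoe_type(genotypes):
    """Combine rs429358 and rs7412; None if either is missing or the pair is unexpected."""
    g1 = genotypes.get("rs429358")
    g2 = genotypes.get("rs7412")
    if not g1 or not g2:
        return None
    key = ("".join(sorted(g1)), "".join(sorted(g2)))
    return APOE_TABLE.get(key)


print(apoe_type({"rs429358": "TC", "rs7412": "CC"}))  # ε3/ε4
print(risk_level("AG", "A"))                          # heterozygous
```

Returning `None` for unexpected combinations keeps the report from asserting an APOE type when the input looks like a strand flip or a no-call.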

Report Output

Generate an interactive HTML report with:

- Clean, readable design with high contrast (dark text on light backgrounds)
- Sticky navigation tabs
- Risk indicators with clear color coding
- Expandable detail sections for each SNP
- A summary section with the most clinically significant findings

Follow the user's language (Chinese or English) for all report text.
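
One way to produce such a report with only the standard library is sketched below. The structure is an assumption for illustration: findings are dicts with `risk`, `gene`, `rsid`, `genotype`, `explanation`, and `pmid` keys, category names are simple slugs usable as HTML ids, and the colors in `RISK_COLORS` are arbitrary choices. Native `<details>` elements provide the expandable sections without JavaScript.

```python
from html import escape

# Hypothetical palette: dark tones that stay readable on a white background.
RISK_COLORS = {
    "high": "#b71c1c",
    "medium": "#e65100",
    "low": "#1b5e20",
    "protective": "#0d47a1",
}


def render_finding(f):
    """Render one SNP as an expandable block with a color-coded risk label."""
    color = RISK_COLORS.get(f["risk"], "#333333")
    return (
        "<details class='snp'>"
        f"<summary><span style='color:{color};font-weight:bold'>"
        f"{escape(f['risk'].upper())}</span> "
        f"{escape(f['gene'])} {escape(f['rsid'])} ({escape(f['genotype'])})</summary>"
        f"<p>{escape(f['explanation'])}</p>"
        f"<p>PMID: {escape(f['pmid'])}</p>"
        "</details>"
    )


def render_report(findings_by_category, title="Genomics report"):
    """Build a single-file HTML report with a sticky navigation bar."""
    nav = "".join(
        f"<a href='#{escape(c)}'>{escape(c)}</a> " for c in findings_by_category
    )
    sections = "".join(
        f"<section id='{escape(c)}'><h2>{escape(c)}</h2>"
        + "".join(render_finding(f) for f in findings)
        + "</section>"
        for c, findings in findings_by_category.items()
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title>"
        "<style>body{color:#111;background:#fff;font-family:sans-serif}"
        "nav{position:sticky;top:0;background:#fff;padding:8px}"
        "nav a{margin-right:12px}</style></head>"
        f"<body><nav>{nav}</nav>{sections}</body></html>"
    )
```

Escaping every value with `html.escape` matters here because explanations and user-supplied text end up in the page; the `meta charset` declaration keeps Chinese report text intact.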

Phase 3: User Interview & Deep Dive

This is the critical interactive phase. After presenting initial results:

Gather Context

Ask the user about:

1. Known health conditions: what diagnoses do they already have?
2. Family history: especially first-degree relatives with serious conditions
3. Current medications: for drug interaction awareness
4. Lifestyle factors: diet, exercise, sun exposure, smoking/alcohol
5. Specific concerns: what worries them most?

This information is essential because genetic risk is only part of the picture. A person
with a family history of early heart attack AND multiple CAD risk SNPs faces very different
odds than someone with the same SNPs but no family history.

Deep Risk Analysis

Based on the user's health profile, conduct a targeted deep-dive. Read
references/deep_risk_snps.md for extended SNP panels organized by disease pathway:

- Lipid metabolism (~20 SNPs): LDLR, APOB, PCSK9, HMGCR, CETP, LPL, APOA5, etc.
- Coronary artery disease (~15 SNPs): 9p21.3, LPA, MTHFR, CRP, IL6, F5, F2, etc.
- Uric acid / gout (~10 SNPs): SLC2A9, ABCG2, SLC22A12, SLC17A1, etc.
- Diabetes risk (~10 SNPs): TCF7L2, KCNJ11, SLC30A8, PPARG, FTO, etc.
- Statin pharmacogenomics (~5 SNPs): SLCO1B1, CYP3A4, ABCB1, etc.

For each category relevant to the user:

1. Query ALL SNPs in the extended panel (use both chip + VCF dual-source)
2. Tally risk alleles and categorize (high/moderate/low/protective)
3. Compute a qualitative risk profile (not a numeric "score"; explain why)
4. Cross-reference with the user's actual health status and family history
5. Note any SNPs that could NOT be found (missing data)
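
The tallying steps above might look like the following sketch. It assumes each panel entry is a dict with `rsid`, `gene`, and `risk_allele` keys and that `lookup` is any callable returning a two-letter genotype or `None`; both the data layout and the names `summarize_panel` and `describe` are hypothetical. The output stays descriptive on purpose: adding up risk alleles into one number would treat SNPs with very different effect sizes as equal.

```python
def summarize_panel(panel, lookup):
    """Group panel SNPs by risk-allele count and record the ones with no data."""
    found, missing = [], []
    for snp in panel:
        gt = lookup(snp["rsid"])
        if not gt or "-" in gt:
            missing.append(snp["rsid"])  # not genotyped or no-call
            continue
        copies = gt.count(snp["risk_allele"])
        found.append({**snp, "genotype": gt, "risk_copies": copies})
    return {
        "homozygous_risk": [s for s in found if s["risk_copies"] == 2],
        "heterozygous": [s for s in found if s["risk_copies"] == 1],
        "missing": missing,
        "coverage": len(found) / len(panel) if panel else 0.0,
    }


def describe(summary, panel_name):
    """Produce a one-line qualitative summary instead of a numeric score."""
    parts = [
        f"{panel_name}: {len(summary['homozygous_risk'])} homozygous-risk and "
        f"{len(summary['heterozygous'])} heterozygous findings"
    ]
    if summary["missing"]:
        parts.append(
            f"{len(summary['missing'])} SNPs not covered "
            f"({', '.join(summary['missing'])})"
        )
    return "; ".join(parts)
```

Passing `genotypes.get` as `lookup` works for a single dictionary; a dual-source lookup can be wrapped in a small function that returns only the genotype.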

Variant Verification (if CRAM/BAM available)

If the user has provided alignment files:

- Use samtools/bcftools to verify key high-risk variants directly from reads
- Report read depth and allele balance for critical SNPs
- Flag any low-confidence calls

Note: samtools may need to be compiled from source in sandboxed environments.
See references/tool_setup.md for instructions.
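
A sketch of how the verification step could be wired up in Python follows. `mpileup_command` only builds an argument list for `bcftools mpileup` (its output would normally be piped into `bcftools call` to get genotypes); `assess_call` takes per-allele read depths such as the FORMAT/AD values and flags weak calls. The thresholds (`min_depth=10`, a 0.25 to 0.75 window for heterozygous calls) and both function names are assumptions chosen for illustration, not established cut-offs.

```python
def mpileup_command(alignment, reference_fasta, region):
    """Build (not run) a bcftools mpileup command that annotates AD and DP."""
    return [
        "bcftools", "mpileup",
        "-f", reference_fasta,          # reference FASTA, required for CRAM
        "-r", region,                   # e.g. "19:45411941-45411941"
        "-a", "FORMAT/AD,FORMAT/DP",    # per-allele and total depth
        alignment,
    ]


def assess_call(ad, is_het, min_depth=10, het_range=(0.25, 0.75)):
    """Compute depth and alt-allele fraction from AD counts and flag weak calls.

    ad is a list of read counts, reference allele first.
    """
    depth = sum(ad)
    alt_fraction = (depth - ad[0]) / depth if depth else None
    flags = []
    if depth < min_depth:
        flags.append("low_depth")
    if is_het and alt_fraction is not None:
        low, high = het_range
        if not (low <= alt_fraction <= high):
            flags.append("skewed_allele_balance")
    return {"depth": depth, "alt_fraction": alt_fraction, "flags": flags}


print(assess_call([40, 4], is_het=True)["flags"])  # ['skewed_allele_balance']
```

The command list can be passed to `subprocess.run` once the tools are installed; keeping command construction separate from execution makes the logic testable in environments where bcftools is not available.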

Ancestry Analysis (if WGS available)

For whole-genome sequencing data:

- mtDNA haplogroup: Check diagnostic variants against PhyloTree. Important: VCF
  files report variants against rCRS (which is haplogroup H). Absence of a variant
  means the person carries the rCRS allele at that position. Look for the 9bp deletion
  at position 8270-8278 (B haplogroup marker, common in East Asian populations).
- Y chromosome haplogroup (if male): Check ISOGG diagnostic SNPs (e.g., M122 for
  O2 haplogroup, common in East Asian populations).
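
The "absence means rCRS allele" rule can be made explicit in code. The sketch below is deliberately generic: marker labels and haplogroup definitions are supplied by the caller (for example, extracted from PhyloTree), and the names `marker_states` and `rank_haplogroups` are hypothetical. It assumes every marker position was actually covered by reads, and it ranks candidates by the share of defining markers observed rather than walking the full haplogroup tree.

```python
def marker_states(observed, markers):
    """Label each marker as derived (present in the VCF) or rCRS (absent).

    observed: set of variant labels found in the mitochondrial VCF.
    markers:  iterable of variant labels that define a haplogroup.
    """
    return {m: ("derived" if m in observed else "rCRS") for m in markers}


def rank_haplogroups(observed, definitions):
    """Rank candidate haplogroups by the fraction of defining markers observed.

    definitions: dict mapping haplogroup name -> list of marker labels.
    Returns (name, hits, total) tuples, best match first.
    """
    scored = []
    for name, markers in definitions.items():
        hits = sum(1 for m in markers if m in observed)
        scored.append((name, hits, len(markers)))
    return sorted(
        scored,
        key=lambda s: s[1] / s[2] if s[2] else 0.0,
        reverse=True,
    )
```

A ranking like this is only a first pass; a candidate with all markers observed still needs its parent haplogroups checked before it is reported.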

Phase 4: Personalized Recommendations

Based on all gathered information, produce actionable recommendations.

Supplement Plan

Read references/supplement_guide.md for evidence-based supplement recommendations
mapped to genetic findings. The guide covers:

- Which genetic variants warrant which supplements
- Dosage ranges with citations
- Drug-supplement interactions to watch for
- Priority tiers (core / recommended / optional)
- Age-specific timing and duration advice
- When to recheck labs

Always organize supplements into tiers:

1. Core: strongly supported by genetics + current health status
2. Recommended: good evidence, beneficial given risk profile
3. Optional: supporting evidence, lower priority

Screening & Monitoring Schedule

Based on the risk profile, suggest:

- Which lab tests to monitor and how often
- Age milestones for specific screenings (e.g., coronary CTA at 30 if strong family history)
- Target values for key metrics

Output Formats

Offer to generate:

- HTML report: comprehensive, interactive, printable
- Excel spreadsheet: dosing schedule table for daily reference
- Summary document: one-page overview for sharing with a physician

Important Principles

Medical Disclaimer

Every report MUST include a clear disclaimer: genetic analysis provides risk estimates, not diagnoses. Results should be discussed with a qualified healthcare provider. Consumer genetic testing has limitations in coverage and accuracy compared to clinical-grade testing.

Evidence Standards

- Always cite PMIDs for risk associations
- Distinguish between GWAS-level evidence and functional/clinical evidence
- Note when evidence is primarily from non-Asian populations (if the user appears to be
  of East Asian descent based on their data or stated ethnicity)
- Use language like "increased risk" rather than "you will get"

Language

Follow the user's language. If the user writes in Chinese, produce reports in Chinese. If in English, use English. For SNP names and gene symbols, always keep the standard scientific nomenclature regardless of language.

Iterative Approach

Don't try to do everything at once. The workflow is designed as a conversation:

1. Parse → show what you found → ask what to focus on
2. Initial analysis → present results → gather health context
3. Deep dive → present findings → discuss implications
4. Recommendations → deliver in requested format

Each phase should end with a clear handoff to the user before proceeding.
