Personal Genomics Analysis Skill
Overview
This skill guides you through a structured, multi-phase workflow for analyzing consumer
genetic testing data and producing actionable health insights. The workflow is interactive —
you gather information from the user at key decision points rather than making assumptions.
The analysis pipeline is designed to be:
- Evidence-based: every risk assessment cites published research (PMIDs)
- Interactive: the user's medical history, lifestyle, and concerns shape the analysis
- Progressive: start broad, then deep-dive into areas that matter most to the user
- Actionable: end with concrete recommendations (supplements, lifestyle, screening schedule)
Phase 1: Data Intake & Format Detection
Supported Input Formats
Read references/supported_formats.md for detailed format specifications. In brief:
| Platform | File Type | Key Characteristics |
|---|---|---|
| WeGene | TSV (.txt) | rsid \t chromosome \t position \t genotype |
| 23andMe | TSV (.txt) | # rsid \t chromosome \t position \t genotype (comment header with #) |
| AncestryDNA | TSV (.txt) | rsid \t chromosome \t position \t allele1 \t allele2 (separate allele columns) |
| VCF | .vcf / .vcf.gz | Standard VCF v4.x, may contain WGS or chip data |
| CRAM/BAM | .cram / .bam | Alignment files for variant verification, depth analysis |
What to Do
1. List the user's uploaded files and identify their formats by reading the first 20-50 lines
2. Report back what you found: platform, number of variants, reference genome build (GRCh37/GRCh38 if detectable), data quality indicators
3. Ask the user what they'd like to focus on. Present the available analysis modules:
   - Health risk assessment (disease predisposition)
   - Pharmacogenomics (drug metabolism & response)
   - Nutrition & metabolism genetics
   - Exercise & fitness genetics
   - Ancestry (mtDNA/Y haplogroups if WGS data available)
   - All of the above (recommended for first-time analysis)
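The format check in step 1 could be sketched roughly as follows. This is a heuristic built only from the table above, not a validated detector: the function name `detect_format` is hypothetical, and real exports may carry extra comment or header lines that need additional rules.

```python
def detect_format(head_lines):
    """Guess the platform from the first lines of a genotype file.

    Heuristic only (assumption): relies on the column layouts listed in the
    format table, so unusual exports may be misclassified.
    """
    lines = [ln.rstrip("\r\n") for ln in head_lines if ln.strip()]
    if any(ln.startswith("##fileformat=VCF") for ln in lines):
        return "VCF"
    has_hash_header = any(ln.startswith("#") for ln in lines)
    # Keep data rows only: drop comments and a plain "rsid ..." column-name row.
    data = [ln for ln in lines
            if not ln.startswith("#") and not ln.lower().startswith("rsid")]
    if not data:
        return "unknown"
    n_cols = len(data[0].split("\t"))
    if n_cols == 5:
        return "AncestryDNA"   # separate allele1 / allele2 columns
    if n_cols == 4:
        return "23andMe" if has_hash_header else "WeGene"
    return "unknown"
```

Anything reported as "unknown" is best shown to the user as raw header lines rather than guessed at.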
Parsing Strategy
Write a Python script that:
- Auto-detects the input format from file headers
- Builds a unified genotype dictionary: {rsid: genotype_string}
- For VCF files, also indexes by chr:pos for position-based lookups
- Handles both compressed (.gz) and uncompressed files
- Reports parsing statistics (total variants, by chromosome, etc.)
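A minimal sketch of the chip-file part of such a script is shown here, assuming tab-separated input in one of the 4- or 5-column layouts from the format table. The names `open_text` and `parse_chip_file` are illustrative, and VCF parsing is deliberately left out.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open plain or gzip-compressed text transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def parse_chip_file(path):
    """Return ({rsid: genotype_string}, per-chromosome variant counts)."""
    genotypes = {}
    per_chrom = Counter()
    with open_text(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue  # blank line or comment header
            fields = line.rstrip("\r\n").split("\t")
            if fields[0].lower() == "rsid" or len(fields) < 4:
                continue  # column-name row or malformed line
            rsid, chrom = fields[0], fields[1]
            # 4 columns: genotype is one string; 5 columns: join allele1 + allele2.
            genotype = fields[3] if len(fields) == 4 else fields[3] + fields[4]
            genotypes[rsid] = genotype
            per_chrom[chrom] += 1
    return genotypes, per_chrom
```

The total variant count is `len(genotypes)`, and `per_chrom` gives the by-chromosome breakdown for the parsing report.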
When both chip data (WeGene/23andMe) and WGS (VCF) are available, use a dual-source
lookup strategy: check chip data first (faster), fall back to VCF by rsid or chr:pos.
This maximizes coverage since chip and WGS may cover different variant sets.
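The dual-source lookup might look something like the sketch below. It assumes three hypothetical dictionaries built during parsing (`chip`, `vcf_by_rsid`, `vcf_by_pos`) plus an optional rsid-to-position map. Treating "--" as a no-call is also an assumption; adjust it to the no-call encoding of the chip format in use.

```python
NO_CALL = {None, "", "--"}  # assumption: "--" marks a chip no-call


def lookup_genotype(rsid, chip, vcf_by_rsid, vcf_by_pos=None, rsid_to_pos=None):
    """Return (genotype, source); source says which dataset answered."""
    gt = chip.get(rsid)
    if gt not in NO_CALL:
        return gt, "chip"
    gt = vcf_by_rsid.get(rsid)
    if gt not in NO_CALL:
        return gt, "vcf:rsid"
    # Last resort: position-based lookup, e.g. when the VCF ID column is empty.
    if vcf_by_pos and rsid_to_pos and rsid in rsid_to_pos:
        gt = vcf_by_pos.get(rsid_to_pos[rsid])
        if gt not in NO_CALL:
            return gt, "vcf:pos"
    return None, "missing"
```

Returning the source alongside the genotype makes it easy to report which findings came from chip data and which from WGS.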
Phase 2: Initial Comprehensive Analysis
SNP Database
Read references/snp_database.md for the curated SNP database organized by category.
The database covers ~120 clinically relevant SNPs across these categories:
- Health risks: cancer (BRCA1/2), cardiovascular (9p21.3, MTHFR), metabolic (TCF7L2),
  neurological (APOE, LRRK2), autoimmune, and more
- Pharmacogenomics: CYP2C19, CYP2D6, CYP2C9, CYP1A2, SLCO1B1, VKORC1, ALDH2, etc.
- Nutrition: lactose tolerance (MCM6), vitamin metabolism (MTHFR, VDR, BCMO1, FUT2),
  caffeine sensitivity (CYP1A2), alcohol flush (ALDH2)
- Exercise: muscle fiber type (ACTN3), endurance (PPARGC1A), recovery (IL6), VO2max (ACE)
Each SNP entry includes: gene, variant name, risk allele, condition/trait, evidence level,
PMID reference, and a plain-language explanation.
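One way to hold such an entry in the analysis script is a small dataclass, sketched below. The field names are assumptions derived from the list above (the `rsid` key is added for lookups), and the sample values are placeholders rather than real database content.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SnpEntry:
    rsid: str              # lookup key (assumed; not named in the field list)
    gene: str
    variant_name: str
    risk_allele: str
    trait: str             # condition or trait
    evidence_level: str
    pmid: Optional[str]    # None until a citation has been verified
    explanation: str       # plain-language summary


# Placeholder entry for illustration only; not a real variant or citation.
example = SnpEntry(
    rsid="rs0000000",
    gene="GENE1",
    variant_name="c.000A>G",
    risk_allele="G",
    trait="example trait",
    evidence_level="placeholder",
    pmid=None,
    explanation="Placeholder text shown to the user.",
)
```

Keeping `pmid` optional makes uncited entries easy to detect before they reach a report.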
Analysis Script Structure
Generate a Python analysis script that:
1. Loads the unified genotype dictionary from Phase 1
2. Looks up each SNP in the database
3. Determines risk level based on genotype (homozygous risk, heterozygous, or normal)
4. Handles special cases:
   - APOE typing: requires combining rs429358 + rs7412 to determine ε2/ε3/ε4 status
   - CYP2C19 metabolizer status: combines multiple star-allele SNPs
   - MTHFR compound: checks both C677T (rs1801133) and A1298C (rs1801131)
5. Generates an HTML report with:
   - Summary dashboard (key findings, risk counts by category)
   - Tabbed sections for each category
   - Color-coded risk levels (high/medium/low/protective)
   - Citations for each finding
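Steps 3 and 4 can be illustrated with a short sketch. It assumes genotypes are two-letter strings reported on the plus strand, and it resolves the ambiguous double-heterozygote case as ε2/ε4 because the alternative (ε1/ε3) involves a very rare haplotype. Both function names are hypothetical.

```python
def classify_genotype(genotype, risk_allele):
    """Map a genotype string to a risk level by counting risk alleles."""
    if not genotype or "-" in genotype:
        return "no call"
    n = genotype.upper().count(risk_allele.upper())
    return {0: "normal", 1: "heterozygous", 2: "homozygous risk"}.get(n, "unknown")


def apoe_genotype(rs429358, rs7412):
    """Combine the two APOE SNPs into an ε2/ε3/ε4 diplotype string.

    Plus-strand alleles assumed: rs429358 C marks ε4, rs7412 T marks ε2,
    and everything else is ε3. Returns None when typing is not possible.
    """
    g1, g2 = (rs429358 or "").upper(), (rs7412 or "").upper()
    if len(g1) != 2 or len(g2) != 2 or set(g1) - set("CT") or set(g2) - set("CT"):
        return None  # no-call or unexpected alleles
    n_e4 = g1.count("C")
    n_e2 = g2.count("T")
    if n_e4 + n_e2 > 2:
        return None  # implies the rare ε1 haplotype or a genotyping error
    alleles = ["ε2"] * n_e2 + ["ε3"] * (2 - n_e2 - n_e4) + ["ε4"] * n_e4
    return "/".join(alleles)
```

For example, `apoe_genotype("CT", "CC")` yields one ε3 and one ε4 allele. A `None` result should be reported as "could not be typed" rather than as a normal finding.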
Report Output
Generate an interactive HTML report with:
- Clean, readable design with high contrast (dark text on light backgrounds)
- Sticky navigation tabs
- Risk indicators with clear color coding
- Expandable detail sections for each SNP
- A summary section with the most clinically significant findings
Follow the user's language (Chinese or English) for all report text.
Phase 3: User Interview & Deep Dive
This is the critical interactive phase. After presenting initial results:
Gather Context
Ask the user about:
1. Known health conditions: what diagnoses do they already have?
2. Family history: especially first-degree relatives with serious conditions
3. Current medications: for drug interaction awareness
4. Lifestyle factors: diet, exercise, sun exposure, smoking/alcohol
5. Specific concerns: what worries them most?
This information is essential because genetic risk is only part of the picture. A person
with a family history of early heart attack AND multiple CAD risk SNPs faces very different
odds than someone with the same SNPs but no family history.
Deep Risk Analysis
Based on the user's health profile, conduct a targeted deep-dive. Read
references/deep_risk_snps.md for extended SNP panels organized by disease pathway:
- Lipid metabolism (~20 SNPs): LDLR, APOB, PCSK9, HMGCR, CETP, LPL, APOA5, etc.
- Coronary artery disease (~15 SNPs): 9p21.3, LPA, MTHFR, CRP, IL6, F5, F2, etc.
- Uric acid / gout (~10 SNPs): SLC2A9, ABCG2, SLC22A12, SLC17A1, etc.
- Diabetes risk (~10 SNPs): TCF7L2, KCNJ11, SLC30A8, PPARG, FTO, etc.
- Statin pharmacogenomics (~5 SNPs): SLCO1B1, CYP3A4, ABCB1, etc.
For each category relevant to the user:
1. Query ALL SNPs in the extended panel (use the chip + VCF dual-source lookup)
2. Tally risk alleles and categorize (high/moderate/low/protective)
3. Build a qualitative risk profile rather than a numeric "score", and explain to the user why no single number is given
4. Cross-reference with the user's actual health status and family history
5. Note any SNPs that could NOT be found (missing data)
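The tallying and missing-data steps could be sketched as below. The panel here is a hypothetical list of (rsid, risk allele) pairs with placeholder IDs, and `lookup` is any callable that returns a genotype string or None. The function only groups SNPs; it deliberately produces no numeric score.

```python
def tally_panel(panel, lookup):
    """Group panel SNPs by risk-allele count and record missing ones.

    panel:  iterable of (rsid, risk_allele) pairs
    lookup: callable rsid -> genotype string, or None when not found
    """
    summary = {
        "homozygous risk": [],
        "heterozygous": [],
        "no risk allele": [],
        "missing": [],
    }
    for rsid, risk_allele in panel:
        gt = lookup(rsid)
        if not gt or "-" in gt:
            summary["missing"].append(rsid)  # absent or no-call
            continue
        n = gt.upper().count(risk_allele.upper())
        if n >= 2:
            summary["homozygous risk"].append(rsid)
        elif n == 1:
            summary["heterozygous"].append(rsid)
        else:
            summary["no risk allele"].append(rsid)
    return summary


# Placeholder data for illustration only.
panel = [("rs_demo_1", "T"), ("rs_demo_2", "A"), ("rs_demo_3", "G"), ("rs_demo_4", "C")]
genotypes = {"rs_demo_1": "TT", "rs_demo_2": "AG", "rs_demo_3": "AA"}
result = tally_panel(panel, genotypes.get)
```

The grouped lists feed the qualitative write-up, and the "missing" list is what gets reported to the user as uncovered variants.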
Variant Verification (if CRAM/BAM available)
If the user has provided alignment files:
- Use samtools/bcftools to verify key high-risk variants directly from reads
- Report read depth and allele balance for critical SNPs
- Flag any low-confidence calls
Note: samtools may need to be compiled from source in sandboxed environments.
See references/tool_setup.md for instructions.
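As a rough sketch of the verification step, the helpers below build a `bcftools mpileup` command for a single position and evaluate depth and allele balance from reference/alternate read counts. The thresholds (minimum depth 10, heterozygous balance between 0.25 and 0.75) are illustrative assumptions, not values from this skill, and the option set should be checked against the installed bcftools version.

```python
def mpileup_command(alignment, reference_fasta, chrom, pos):
    """Build an argument list for piling up reads at one position."""
    region = f"{chrom}:{pos}-{pos}"
    return [
        "bcftools", "mpileup",
        "-f", reference_fasta,          # reference FASTA (required for CRAM)
        "-r", region,                   # restrict to the site of interest
        "-a", "FORMAT/AD,FORMAT/DP",    # request allelic and total depth
        alignment,
    ]


def assess_call(ref_depth, alt_depth, called_het,
                min_depth=10, het_range=(0.25, 0.75)):
    """Report depth and allele balance, with flags for low-confidence calls."""
    depth = ref_depth + alt_depth
    balance = alt_depth / depth if depth else None
    flags = []
    if depth < min_depth:
        flags.append("low depth")
    if called_het and balance is not None:
        if not (het_range[0] <= balance <= het_range[1]):
            flags.append("skewed allele balance")
    return {"depth": depth, "allele_balance": balance, "flags": flags}
```

The command list can be passed to `subprocess.run` once samtools/bcftools are available. Any non-empty `flags` list marks a call to present as low-confidence.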
Ancestry Analysis (if WGS available)
For whole-genome sequencing data:
- mtDNA haplogroup: Check diagnostic variants against PhyloTree. Important: VCF
  files report variants against rCRS (which is haplogroup H). Absence of a variant
  means the person carries the rCRS allele at that position. Look for the 9bp deletion
  at position 8270-8278 (B haplogroup marker, common in East Asian populations).
- Y chromosome haplogroup (if male): Check ISOGG diagnostic SNPs (e.g., M122 for
  O2 haplogroup, common in East Asian populations).
Phase 4: Personalized Recommendations
Based on all gathered information, produce actionable recommendations.
Supplement Plan
Read references/supplement_guide.md for evidence-based supplement recommendations
mapped to genetic findings. The guide covers:
- Which genetic variants warrant which supplements
- Dosage ranges with citations
- Drug-supplement interactions to watch for
- Priority tiers (core / recommended / optional)
- Age-specific timing and duration advice
- When to recheck labs
Always organize supplements into tiers:
1. Core: strongly supported by genetics + current health status
2. Recommended: good evidence, beneficial given risk profile
3. Optional: supporting evidence, lower priority
Screening & Monitoring Schedule
Based on the risk profile, suggest:
- Which lab tests to monitor and how often
- Age milestones for specific screenings (e.g., coronary CTA at 30 if strong family history)
- Target values for key metrics
Output Formats
Offer to generate:
- HTML report: comprehensive, interactive, printable
- Excel spreadsheet: dosing schedule table for daily reference
- Summary document: one-page overview for sharing with a physician
Important Principles
Medical Disclaimer
Every report MUST include a clear disclaimer: genetic analysis provides risk estimates,
not diagnoses. Results should be discussed with a qualified healthcare provider. Consumer
genetic testing has limitations in coverage and accuracy compared to clinical-grade testing.
Evidence Standards
- Always cite PMIDs for risk associations
- Distinguish between GWAS-level evidence and functional/clinical evidence
- Note when evidence is primarily from non-Asian populations (if the user appears to be
  of East Asian descent based on their data or stated ethnicity)
- Use language like "increased risk" rather than "you will get"
Language
Follow the user's language. If the user writes in Chinese, produce reports in Chinese.
If in English, use English. For SNP names and gene symbols, always keep the standard
scientific nomenclature regardless of language.
Iterative Approach
Don't try to do everything at once. The workflow is designed as a conversation:
1. Parse → show what you found → ask what to focus on
2. Initial analysis → present results → gather health context
3. Deep dive → present findings → discuss implications
4. Recommendations → deliver in requested format
Each phase should end with a clear handoff to the user before proceeding.