CNV Caller & Plotter

Detect copy number variations (CNVs) from whole genome sequencing (WGS) data and generate genome-wide visualization plots for cancer genomics, rare disease analysis, and population genetics studies. Provides CNV calling, segmentation analysis, and publication-ready visualization.

Key Capabilities:

- CNV Detection from WGS: Identify copy number gains and losses from aligned sequencing data
- Genomic Segmentation: Divide genome into bins/windows for copy number estimation
- Flexible Input Support: Process BAM, VCF, and other standard genomics formats
- Publication-Quality Plots: Generate genome-wide CNV profiles in PNG, PDF, or SVG formats
- Standard Output Formats: Export CNV calls in BED format for downstream analysis

When to Use

✅ Use this skill when:

- Analyzing cancer genomes to identify somatic copy number alterations (SCNAs)
- Studying rare diseases with suspected copy number variation etiology
- Performing population genetics studies comparing CNV frequencies across groups
- Generating genome-wide CNV visualizations for publications or reports
- Creating BED format CNV calls for integration with other analysis pipelines
- Performing comparative CNV analysis between tumor and normal samples
- Validating CNV calls from SNP arrays with sequencing data

❌ Do NOT use when:

- Working with targeted sequencing panels (exome/targeted capture) → Use specialized tools like CNVkit or ExomeDepth
- Detecting structural variations involving translocations or inversions → Use structural-variant-caller
- Analyzing single-cell RNA-seq data → Use single-cell specific CNV tools (e.g., inferCNV)
- Detecting small indels (<50bp) → Use variant-caller for small variant detection
- Requiring clinical-grade CNV detection for diagnostic purposes → Use validated clinical pipelines with proper QC
- Working with low-coverage data (<10x) → Results may be unreliable; consider SNP array-based methods

Related Skills:

- Upstream: fastqc-report-interpreter, alignment-quality-checker, variant-caller
- Downstream: circos-plot-generator, go-kegg-enrichment, heatmap-beautifier

Integration with Other Skills

Upstream Skills:

- fastqc-report-interpreter: Assess sequencing quality before CNV calling; low quality data may produce unreliable CNVs
- alignment-quality-checker: Verify BAM file quality and coverage uniformity; uneven coverage causes CNV artifacts
- variant-caller: Generate SNV/indel calls for combined CNV-SNV analysis in cancer samples

Downstream Skills:

- circos-plot-generator: Create circular genome plots integrating CNVs with other genomic features
- go-kegg-enrichment: Perform pathway enrichment on genes within CNV regions
- heatmap-beautifier: Visualize CNV profiles across multiple samples

Complete Workflow:

Raw WGS Data → fastqc-report-interpreter → alignment-quality-checker → cnv-caller-plotter → circos-plot-generator → Publication Figures

Core Capabilities

1. Copy Number Variation Detection

Identify genomic regions with copy number gains (amplifications) or losses (deletions) from WGS data by analyzing read depth patterns.

CODEBLOCK1
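As a rough illustration of the read-depth idea described above, the sketch below turns per-bin read depths into integer copy-number estimates by normalizing to the sample median, which is assumed to represent the diploid state. The function name `estimate_copy_number` and the example depths are hypothetical and are not taken from `scripts/main.py`.

```python
import math
from statistics import median


def estimate_copy_number(bin_depths, ploidy=2):
    """Estimate per-bin copy number from read depth.

    Assumes the median bin depth corresponds to the baseline ploidy.
    Returns a list of (log2_ratio, copy_number) tuples, one per bin.
    """
    baseline = median(bin_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    results = []
    for depth in bin_depths:
        ratio = depth / baseline
        # Bins with no reads have an undefined log2 ratio; report -inf.
        log2_ratio = math.log2(ratio) if ratio > 0 else float("-inf")
        copy_number = round(ploidy * ratio)
        results.append((log2_ratio, copy_number))
    return results


# Hypothetical mean depths for six consecutive bins.
depths = [30, 31, 45, 15, 29, 30]
copy_numbers = [cn for _, cn in estimate_copy_number(depths)]
print(copy_numbers)
```

A real caller would also correct for GC content and mappability before this step; the median baseline breaks down in highly aneuploid samples.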

Parameters:

| Parameter | Type | Required | Description | Default |
| --- | --- | --- | --- | --- |
| input_file | str | Yes | Path to input BAM or VCF file | None |
| reference | str | Yes | Path to reference genome FASTA | None |
| bin_size | int | No | Size of genomic bins for segmentation (bp) | 1000 |

CNV Calling Strategy:

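| Approach | Best For | Sensitivity | Specificity |
| --- | --- | --- | --- |
| Read Depth Analysis | Large CNVs (>10kb) | High | Medium |
| Paired-end Mapping | Medium CNVs (1-10kb) | Medium | High |
| Split-read Analysis | Small CNVs (<1kb) | Medium | High |
| Combined Approach | Comprehensive detection | High | High |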

Best Practices:

- ✅ Use appropriate bin size: 1000bp for WGS, smaller for targeted analysis
- ✅ Ensure sufficient coverage: Minimum 15-20x for reliable CNV detection
- ✅ Match reference genome: Use same reference as alignment (hg19 vs hg38)
- ✅ Check coverage uniformity: GC bias can cause false positive CNVs

Common Issues and Solutions:

Issue: False positive CNVs in repetitive regions

- Symptom: Many CNV calls in centromeres, telomeres, or segmental duplications
- Solution: Filter CNVs overlapping known problematic regions; use mappability filters

Issue: Low sensitivity for small CNVs

- Symptom: Missing CNVs <5kb despite adequate coverage
- Solution: Reduce bin size; use split-read or paired-end signals in addition to depth

2. Genomic Segmentation and Binning

Divide the genome into windows/bins for copy number estimation, enabling systematic analysis of the entire genome.

CODEBLOCK2

Bin Size Selection Guide:

| Bin Size | Resolution | Use Case | Coverage Required |
| --- | --- | --- | --- |
| 100 bp | High | Small CNVs (<5kb) | >30x |
| 1000 bp | Standard | General WGS analysis | >15x |
| 10000 bp | Low | Large chromosomal alterations | >5x |
| Variable | Adaptive | Mixed resolution | >20x |
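The coverage column in the guide above follows from how many reads land in each bin. The sketch below estimates the bin count, the expected reads per bin, and the relative sampling noise under a simple Poisson assumption. The function name `bin_statistics` and the 150 bp read length are illustrative assumptions, not values taken from the tool.

```python
import math


def bin_statistics(genome_size, bin_size, coverage, read_length=150):
    """Return (num_bins, reads_per_bin, relative_noise) for a binning scheme.

    relative_noise is the coefficient of variation of a Poisson count,
    1 / sqrt(mean), which ignores GC and mappability effects.
    """
    num_bins = math.ceil(genome_size / bin_size)
    reads_per_bin = coverage * bin_size / read_length
    relative_noise = 1 / math.sqrt(reads_per_bin)
    return num_bins, reads_per_bin, relative_noise


genome_size = 3_000_000_000  # approximate human genome size, 3 Gb
for bin_size, coverage in [(100, 30), (1000, 15), (10000, 5)]:
    n, reads, noise = bin_statistics(genome_size, bin_size, coverage)
    print(f"{bin_size} bp bins at {coverage}x: {n:,} bins, "
          f"{reads:.0f} reads/bin, noise ~{noise:.1%}")
```

Smaller bins hold fewer reads, so their depth estimates are noisier unless coverage rises to compensate.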

Best Practices:

- ✅ Match bin size to expected CNV size: Use smaller bins for detecting small CNVs
- ✅ Consider coverage depth: Higher coverage enables smaller bins
- ✅ Exclude unmappable regions: Filter bins with zero or very low mappability
- ✅ Normalize for GC content: GC-rich regions have different coverage patterns

Common Issues and Solutions:

Issue: Noisy segmentation due to small bins

- Symptom: Erratic copy number estimates with high variance
- Solution: Increase bin size; apply smoothing algorithms; use larger bins for baseline

Issue: Missing large CNVs with large bins

- Symptom: Large deletions/amplifications not called when spanning multiple bins
- Solution: Use statistical segmentation (CBS, PSCBS) to join adjacent altered bins
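To make the idea of joining adjacent altered bins concrete, here is a deliberately simple run-length merge: contiguous bins on the same chromosome with the same integer copy number collapse into one segment. This is not CBS or PSCBS, which test for change points statistically; `merge_adjacent_bins` is a hypothetical helper used only for illustration.

```python
def merge_adjacent_bins(bins):
    """Merge contiguous bins that share a chromosome and copy number.

    bins: list of (chrom, start, end, copy_number) tuples, already sorted
    by chromosome and start position, using half-open coordinates.
    """
    segments = []
    for chrom, start, end, cn in bins:
        if segments:
            last_chrom, last_start, last_end, last_cn = segments[-1]
            # Extend the previous segment only if this bin touches it.
            if last_chrom == chrom and last_cn == cn and last_end == start:
                segments[-1] = (chrom, last_start, end, cn)
                continue
        segments.append((chrom, start, end, cn))
    return segments


bins = [
    ("chr1", 0, 1000, 2),
    ("chr1", 1000, 2000, 3),
    ("chr1", 2000, 3000, 3),
    ("chr2", 0, 1000, 3),
]
segments = merge_adjacent_bins(bins)
print(segments)
```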

3. Genome-Wide Visualization

Generate publication-quality plots showing copy number profiles across all chromosomes for visual interpretation and presentation.

CODEBLOCK3
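One detail behind any genome-wide plot is placing all chromosomes on a single x-axis. The sketch below computes cumulative chromosome offsets so that a (chromosome, position) pair maps to one genome-wide coordinate. The chromosome lengths are toy values and the helper names are hypothetical. Real lengths would come from the reference `.fai` index.

```python
def genome_offsets(chrom_lengths):
    """Return the cumulative start offset of each chromosome.

    chrom_lengths: list of (chrom, length) pairs in plotting order.
    """
    offsets = {}
    running_total = 0
    for chrom, length in chrom_lengths:
        offsets[chrom] = running_total
        running_total += length
    return offsets


def to_genome_x(chrom, pos, offsets):
    """Map a chromosome-local position to a genome-wide x coordinate."""
    return offsets[chrom] + pos


# Toy lengths for illustration only.
toy_lengths = [("chr1", 1000), ("chr2", 800), ("chr3", 500)]
offsets = genome_offsets(toy_lengths)
print(to_genome_x("chr2", 100, offsets))
```

The resulting x values can be passed to any plotting library, with the offsets also serving as positions for chromosome boundary lines.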

Output Formats:

| Format | Extension | Best For | File Size |
| --- | --- | --- | --- |
| PNG | .png | Web, presentations, quick viewing | Medium |
| PDF | .pdf | Publications, high-quality printing | Large |
| SVG | .svg | Vector editing, scalable graphics | Small |

Best Practices:

- ✅ Use PDF for publications: Vector format maintains quality at any zoom
- ✅ Include baseline (CN=2): Reference line helps interpret gains/losses
- ✅ Color-blind friendly palette: Use distinct colors for gains vs losses
- ✅ Annotate key regions: Mark known cancer genes or regions of interest

Common Issues and Solutions:

Issue: Plot too crowded with many CNVs

- Symptom: Overlapping points make plot unreadable
- Solution: Use segmentation to merge adjacent calls; adjust point size/alpha

Issue: ChrY not displayed for female samples

- Symptom: Missing chromosome in plot for female subjects
- Solution: Dynamically detect sex from coverage; adjust plot accordingly

4. BED Format Export

Export CNV calls in standard BED format for compatibility with genome browsers and downstream analysis tools.

CODEBLOCK4

BED Format Specification:

| Column | Field | Description | Example |
| --- | --- | --- | --- |
| 1 | chrom | Chromosome name | chr1, chrX |
| 2 | start | Start position (0-based, inclusive) | 1000000 |
| 3 | end | End position (exclusive; numerically equal to the 1-based inclusive end) | 2000000 |
| 4 | name | CNV annotation | CN=3 |
| 5 | score | Optional quality score | . |
| 6 | strand | Strand info (usually .) | . |

Best Practices:

- ✅ Use 0-based coordinates: Standard BED format uses a 0-based start and an exclusive (half-open) end
- ✅ Include copy number in name: Makes CNV status immediately visible
- ✅ Sort by chromosome and position: Required for many tools (bedtools, IGV)
- ✅ Validate format: Check with bedtools or genome browser before distribution

Common Issues and Solutions:

Issue: BED file rejected by genome browser

- Symptom: IGV or UCSC Genome Browser shows error loading BED
- Solution: Ensure proper chromosome naming (chr1 vs 1); sort file; check for tabs vs spaces

Issue: Coordinate system confusion

- Symptom: CNVs appear shifted by 1bp in different tools
- Solution: BED is 0-based and half-open, while GFF and VCF are 1-based; convert when moving between formats
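The coordinate and sorting rules above can be sketched as a small BED writer. It assumes, purely for illustration, that the input calls use 1-based inclusive coordinates (as in VCF or GFF) and carry `chrom`, `start`, `end`, and `cn` keys. The helper names are hypothetical and are not the tool's own export function.

```python
def chrom_sort_key(chrom):
    """Order chr1..chr22 numerically, then named chromosomes (X, Y, M)."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def to_bed_lines(calls):
    """Convert 1-based inclusive CNV calls to sorted, tab-separated BED lines."""
    rows = []
    for call in calls:
        bed_start = call["start"] - 1  # 1-based inclusive -> 0-based start
        bed_end = call["end"]          # inclusive end equals the exclusive BED end
        rows.append((call["chrom"], bed_start, bed_end, f"CN={call['cn']}"))
    rows.sort(key=lambda row: (chrom_sort_key(row[0]), row[1], row[2]))
    return ["\t".join(str(field) for field in row) for row in rows]


calls = [
    {"chrom": "chr10", "start": 101, "end": 200, "cn": 1},
    {"chrom": "chr2", "start": 1, "end": 1000, "cn": 3},
    {"chrom": "chrX", "start": 51, "end": 60, "cn": 1},
]
bed_lines = to_bed_lines(calls)
print("\n".join(bed_lines))
```

Some tools expect plain lexicographic chromosome order instead of the natural order used here, so match the sort to the downstream consumer.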

5. Tumor-Normal Comparison

Compare CNV profiles between tumor and matched normal samples to identify somatic copy number alterations (SCNAs).

CODEBLOCK5

Somatic vs Germline Classification:

| Category | Tumor CN | Normal CN | Interpretation |
| --- | --- | --- | --- |
| Somatic Amplification | >2 | 2 | Tumor-specific gain |
| Somatic Deletion | <2 | 2 | Tumor-specific loss |
| Germline CNV | ≠2 | ≠2 | Inherited CNV |
| LOH | 1 | 2 | Loss of heterozygosity |
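The classification table above can be expressed as a small decision function. This is a sketch under simplifying assumptions: integer copy numbers, a diploid baseline, and any region where the normal sample deviates from baseline treated as germline. The LOH row overlaps with somatic deletion (tumor CN 1, normal CN 2), so this sketch reports that case as a deletion. `classify_cnv` is a hypothetical name.

```python
def classify_cnv(tumor_cn, normal_cn, baseline=2):
    """Classify a region from tumor and matched-normal copy numbers."""
    if normal_cn != baseline:
        # The matched normal already deviates, so the event is inherited.
        return "germline"
    if tumor_cn > baseline:
        return "somatic_amplification"
    if tumor_cn < baseline:
        return "somatic_deletion"
    return "neutral"


for tumor_cn, normal_cn in [(4, 2), (1, 2), (3, 3), (2, 2)]:
    print(tumor_cn, normal_cn, classify_cnv(tumor_cn, normal_cn))
```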

Best Practices:

- ✅ Use matched normal when available: Essential for distinguishing somatic vs germline
- ✅ Consider tumor purity: Low purity samples have attenuated CNV signals
- ✅ Validate key findings: Use orthogonal methods (FISH, qPCR) for important CNVs
- ✅ Account for clonality: Subclonal CNVs may be present at lower frequencies

Common Issues and Solutions:

Issue: Normal sample contamination in tumor

- Symptom: CNV signals weaker than expected; fractional copy numbers
- Solution: Estimate tumor purity; use purity-corrected CNV calling
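Purity correction can be illustrated with the standard mixture model: the observed depth ratio reflects a blend of tumor cells at the true copy number and normal cells at two copies. The sketch assumes the ratio is measured relative to a diploid baseline and that purity is already known. `purity_corrected_cn` is a hypothetical helper.

```python
def purity_corrected_cn(observed_ratio, purity, normal_cn=2):
    """Recover tumor copy number from an observed depth ratio.

    Model: observed_ratio * normal_cn = purity * tumor_cn + (1 - purity) * normal_cn
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    mixed_cn = observed_ratio * normal_cn  # average copy number across all cells
    return (mixed_cn - (1 - purity) * normal_cn) / purity


# A ratio of 1.5 looks like CN=3 in a pure sample but implies CN=4 at 50% purity.
print(purity_corrected_cn(1.5, purity=1.0))
print(purity_corrected_cn(1.5, purity=0.5))
```

The correction amplifies noise as purity falls, which is one reason very low purity samples remain hard to call.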

Issue: Germline CNVs misclassified as somatic

- Symptom: Many "somatic" CNVs that look like common polymorphisms
- Solution: Filter against population CNV databases (DGV, gnomAD-SV)
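Filtering against a population database usually means asking whether a call substantially overlaps a known CNV. A common convention is 50% reciprocal overlap, sketched below. The `known_cnvs` list is a hypothetical in-memory stand-in for records loaded from a resource such as DGV or gnomAD-SV, and the helper names are illustrative.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions for two intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def is_known_polymorphism(call, known_cnvs, min_overlap=0.5):
    """True if the call reciprocally overlaps any known CNV on the same chromosome."""
    return any(
        call["chrom"] == known["chrom"]
        and reciprocal_overlap(call["start"], call["end"],
                               known["start"], known["end"]) >= min_overlap
        for known in known_cnvs
    )


known_cnvs = [{"chrom": "chr1", "start": 50, "end": 150}]
call = {"chrom": "chr1", "start": 0, "end": 100}
print(is_known_polymorphism(call, known_cnvs))
```

Requiring overlap in both directions keeps a small call from being filtered just because it sits inside a much larger database entry.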

6. Quality Control and Filtering

Apply quality filters to remove artifactual CNV calls and improve result reliability.

CODEBLOCK6

Quality Metrics:

| Metric | Threshold | Purpose |
| --- | --- | --- |
| Quality Score | >20 | Overall confidence in CNV call |
| Size | | |

Best Practices:

- ✅ Apply size filters: Remove CNVs <1kb (often artifacts)
- ✅ Filter repetitive regions: Exclude known problematic regions
- ✅ Use multiple evidence types: Combine depth, paired-end, and split-read signals
- ✅ Validate high-impact CNVs: Use orthogonal methods for therapeutic targets
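The quality and size filters above can be sketched as follows. The thresholds mirror the values mentioned in this section (quality above 20, size of at least 1 kb). The `quality` key and the function name `filter_cnv_calls` are assumptions about how calls might be represented, not the tool's documented schema.

```python
def filter_cnv_calls(calls, min_quality=20, min_size=1000):
    """Keep calls whose quality exceeds min_quality and whose size is at least min_size."""
    kept = []
    for call in calls:
        size = call["end"] - call["start"]
        if call["quality"] > min_quality and size >= min_size:
            kept.append(call)
    return kept


calls = [
    {"chrom": "chr1", "start": 1000, "end": 6000, "cn": 3, "quality": 45},
    {"chrom": "chr1", "start": 9000, "end": 9500, "cn": 1, "quality": 60},   # too small
    {"chrom": "chr2", "start": 2000, "end": 8000, "cn": 1, "quality": 12},   # low quality
]
passed = filter_cnv_calls(calls)
print(len(passed), "of", len(calls), "calls passed")
```

When known driver regions matter, it can help to log the rejected calls so they can be reviewed by hand.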

Common Issues and Solutions:

Issue: Too many low-quality CNV calls

- Symptom: Hundreds or thousands of CNVs called
- Solution: Increase quality thresholds; apply population frequency filters

Issue: True CNVs filtered out

- Symptom: Known cancer driver CNVs missing from results
- Solution: Use gene-specific filters; manually review regions of interest

Complete Workflow Example

From WGS data to CNV visualization:

CODEBLOCK7

Python API Usage:

CODEBLOCK8

Expected Output Files:

CODEBLOCK9

Common Patterns

Pattern 1: Cancer Genome Analysis (Tumor-Normal Pair)

Scenario: Identify somatic copy number alterations in a cancer sample compared to matched normal tissue.

CODEBLOCK10

Workflow:

1. Process both tumor and normal BAM files
2. Call CNVs in each sample independently
3. Compare to identify somatic alterations
4. Filter germline polymorphisms against population databases
5. Annotate cancer genes within CNV regions
6. Generate publication-quality visualization
7. Validate key driver alterations with orthogonal methods

Output Example:
CODEBLOCK11

Pattern 2: Rare Disease CNV Detection

Scenario: Detect pathogenic CNVs in a patient with suspected genomic disorder.

CODEBLOCK12

Workflow:

1. Call CNVs with high sensitivity settings
2. Filter against common population CNVs (DGV, gnomAD)
3. Prioritize rare CNVs (<1% frequency)
4. Annotate with disease-associated genes
5. Assess inheritance pattern (if parental data available)
6. Cross-reference with phenotype/HPO terms
7. Generate clinical report with prioritized findings

Output Example:
CODEBLOCK13

Pattern 3: Population CNV Analysis

Scenario: Compare CNV profiles across multiple samples to identify recurrent alterations.

CODEBLOCK14

Workflow:

1. Call CNVs in all samples with consistent parameters
2. Merge and harmonize CNV calls across samples
3. Identify recurrent CNV regions
4. Perform burden analysis (total CNV load)
5. Test association with phenotype/status
6. Correct for multiple testing
7. Visualize CNV landscape across cohort
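For the multiple-testing step in the workflow above, one widely used option is the Benjamini-Hochberg false discovery rate procedure. The source does not say which correction the tool applies, so this is only an illustrative choice, and the p-values below are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical association p-values for four recurrent CNV regions.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
print([round(q, 4) for q in q_values])
```

Regions whose adjusted value falls below the chosen FDR level (often 0.05) would be reported as significant.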

Output Example:
CODEBLOCK15

Pattern 4: Cell Line Characterization

Scenario: Characterize CNV profile of a cancer cell line for research or quality control.

CODEBLOCK16

Workflow:

1. Generate high-quality CNV profile from WGS
2. Compare to reference profiles (CCLE, COSMIC)
3. Verify expected cancer driver alterations
4. Identify subclonal populations
5. Assess genome stability metrics
6. Generate QC report for cell line authentication
7. Document for reproducibility

Output Example:

Cell Line: MCF-7
Identity confirmed: Yes (99.2% match to reference)

Expected alterations detected:
  chr8:128000000-129000000: CN=8 (MYC) ✓
  chr20:50000000-52000000: CN=6 (ZNF217) ✓

Additional alterations:
  chr17:35000000-37000000: CN=3 (ERBB2) ✓
  
Ploidy: 2.8 (aneuploid)
Genome instability score: High

Quality Checklist

Pre-analysis Checks:

- [ ] CRITICAL: Verify input BAM file is properly aligned and indexed
- [ ] Confirm reference genome version matches alignment (hg19 vs hg38)
- [ ] Check sequencing coverage is sufficient (>15x for WGS, >30x for high resolution)
- [ ] Assess coverage uniformity (low uniformity causes CNV artifacts)
- [ ] Review FastQC reports for quality issues
- [ ] Ensure matched normal sample is available for cancer analysis
- [ ] Verify sample identity (check sex chromosomes match metadata)
- [ ] Confirm no sample swaps or contamination

During Analysis:

- [ ] Select appropriate bin size for expected CNV size and coverage
- [ ] Apply GC content normalization if necessary
- [ ] Check for batch effects if analyzing multiple samples
- [ ] Monitor for high false positive rates in repetitive regions
- [ ] Validate sex chromosome calls against known sex
- [ ] Assess mitochondrial CNVs as quality control metric
- [ ] Review coverage plots for technical artifacts
- [ ] Check concordance with SNP array data if available

Post-analysis Verification:

- [ ] CRITICAL: Filter CNVs in known problematic regions (centromeres, telomeres)
- [ ] Remove common germline CNVs using population databases (DGV, gnomAD)
- [ ] Validate cancer driver alterations in known genes
- [ ] Check for CNV calls that disrupt single exons (often artifacts)
- [ ] Review very large CNVs (>50Mb) for technical artifacts
- [ ] Assess CNV burden against population norms
- [ ] Verify BED file format compliance
- [ ] Generate and review genome-wide plots

Before Clinical or Publication Use:

- [ ] CRITICAL: Have results reviewed by experienced analyst
- [ ] Validate pathogenic CNVs with orthogonal methods (FISH, qPCR, MLPA)
- [ ] Cross-reference with clinical databases (ClinVar, OMIM, DECIPHER)
- [ ] Document all parameters and filters applied
- [ ] Assess reproducibility by re-running with different parameters
- [ ] Check for batch effects in multi-sample analyses
- [ ] Confirm CNV coordinates with latest genome build
- [ ] Archive raw data and analysis scripts for reproducibility

Common Pitfalls

Input Data Issues:

- ❌ Using low coverage data → Noisy CNV calls with many false positives

- ✅ Minimum 15-20x coverage for reliable WGS CNV calling

- ❌ Mismatched reference genomes → CNVs called in wrong coordinates

- ✅ Verify BAM uses same reference as CNV caller (hg19 vs hg38)

- ❌ Not using matched normal for tumors → Cannot distinguish somatic vs germline

- ✅ Always use matched normal when available; use population controls otherwise

- ❌ Poor coverage uniformity → GC bias causes false CNVs

- ✅ Check coverage plots; apply GC correction algorithms

Analysis Parameter Issues:

- ❌ Bin size too large → Miss small CNVs (<10kb)

- ✅ Use 100-500bp bins for high-resolution analysis; 1000bp for standard WGS

- ❌ Bin size too small → Excessive noise in low coverage regions

- ✅ Balance resolution with coverage; use adaptive binning if available

- ❌ Inadequate quality filtering → Too many false positive CNVs

- ✅ Apply minimum quality scores; filter by size and read support

- ❌ Not filtering common CNVs → Report common polymorphisms as pathogenic

- ✅ Filter against DGV, gnomAD, and other population databases

Interpretation Issues:

- ❌ Ignoring tumor purity → Misinterpret subclonal CNVs

- ✅ Estimate tumor purity; adjust CNV calling thresholds accordingly

- ❌ Not validating key findings → Report false positive driver alterations

- ✅ Validate cancer-relevant CNVs with orthogonal methods

- ❌ Over-interpreting small CNVs → Single-exon deletions are often artifacts

- ✅ Focus on larger CNVs (>10kb) unless supported by multiple evidence types

- ❌ Ignoring parental data → Cannot determine inheritance in rare disease

- ✅ Include parental samples for de novo vs inherited classification

Output and Reporting Issues:

- ❌ Unclear coordinate system → Confusion between 0-based and 1-based

- ✅ Clearly document coordinate system used; BED is 0-based, VCF is 1-based

- ❌ Missing quality metrics → Cannot assess confidence in CNV calls

- ✅ Include quality scores, supporting reads, and log2 ratios

- ❌ Not archiving raw data → Results cannot be reproduced

- ✅ Save BAM files, parameter settings, and analysis scripts

- ❌ Inadequate documentation → Others cannot interpret results

- ✅ Document all filters, thresholds, and databases used

Troubleshooting

Problem: No CNVs detected

- Symptoms: Empty or nearly empty CNV call set
- Causes:
  - Coverage too low (<10x)
  - Bin size too large for small CNVs
  - Quality thresholds too stringent
  - Sample is actually diploid with no CNVs
- Solutions:
  - Verify coverage depth from BAM file
  - Reduce bin size for higher resolution
  - Relax quality filters temporarily
  - Check coverage uniformity across genome

Problem: Too many CNV calls (hundreds or thousands)

- Symptoms: Excessive number of CNV calls, many small or low-quality
- Causes:
  - Low coverage or high noise
  - Bin size too small
  - No quality filtering applied
  - Sample from highly polymorphic population
- Solutions:
  - Apply minimum quality score filter (Q>20)
  - Filter by minimum size (>1kb)
  - Remove calls in segmental duplications
  - Filter against population CNV databases

Problem: False positives in repetitive regions

- Symptoms: CNVs concentrated in centromeres, telomeres, or segmental duplications (SDs)
- Causes:
  - Low mappability in repetitive regions
  - Uneven coverage due to alignment issues
  - Reference genome gaps
- Solutions:
  - Filter CNVs overlapping known problematic regions
  - Use mappability filters (require mappability >0.8)
  - Exclude centromeres and telomeres from analysis
  - Use high-mappability reads only

Problem: CNV signals too weak in tumor samples

- Symptoms: Known cancer alterations not detected or weak signal
- Causes:
  - Low tumor purity (<20%)
  - Normal cell contamination
  - Subclonal alterations at low frequency
- Solutions:
  - Estimate tumor purity from VAF distribution
  - Use purity-corrected CNV calling
  - Lower thresholds for detection
  - Consider single-cell sequencing for subclonal analysis

Problem: Sex chromosomes have unexpected copy numbers

- Symptoms: XX sample showing CN=1 for X, or XY showing CN=2
- Causes:
  - Sex chromosome aneuploidy (e.g., Klinefelter, Turner syndromes)
  - Mislabeled sample sex
  - Pseudoautosomal region miscalls
- Solutions:
  - Verify sample sex from coverage ratios (X/Y)
  - Check clinical records for known sex chromosome abnormalities
  - Exclude pseudoautosomal regions from analysis
  - Analyze autosomes and sex chromosomes separately
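Verifying sample sex from coverage can be sketched by comparing chromosome X and Y depth to the autosomal depth. The thresholds below are illustrative assumptions, not validated cutoffs, and `infer_sex` is a hypothetical helper. Anything outside the two expected patterns is reported as ambiguous so that possible aneuploidies or sample mix-ups get a manual look.

```python
def infer_sex(x_depth, y_depth, autosome_depth):
    """Guess karyotypic sex from mean depths; thresholds are illustrative only."""
    if autosome_depth <= 0:
        raise ValueError("autosome depth must be positive")
    x_ratio = x_depth / autosome_depth
    y_ratio = y_depth / autosome_depth
    if x_ratio > 0.8 and y_ratio < 0.1:
        return "XX"
    if 0.3 < x_ratio < 0.7 and y_ratio > 0.2:
        return "XY"
    # Unexpected combination, e.g. possible aneuploidy or contamination.
    return "ambiguous"


print(infer_sex(x_depth=30, y_depth=0.5, autosome_depth=30))
print(infer_sex(x_depth=15, y_depth=14, autosome_depth=30))
print(infer_sex(x_depth=30, y_depth=15, autosome_depth=30))
```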

Problem: Batch effects in multi-sample analysis

- Symptoms: CNV patterns correlate with sequencing batch rather than biology
- Causes:
  - Different sequencing platforms or chemistries
  - Coverage differences between batches
  - Different alignment parameters
- Solutions:
  - Normalize coverage across batches
  - Use same alignment and processing pipeline for all samples
  - Include batch as covariate in association testing
  - Apply batch correction algorithms

Problem: Cannot install or run tool

- Symptoms: Import errors, missing dependencies, execution failures
- Causes:
  - Missing Python packages (pysam, numpy, matplotlib)
  - Incompatible Python version
  - Missing reference genome index files
- Solutions:
  - Install required packages: pip install pysam numpy matplotlib pandas
  - Use Python 3.8 or higher
  - Create reference genome index: samtools faidx reference.fa
  - Check BAM file index exists: sample.bam.bai
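The index checks in the list above can be automated with a short pre-flight script. This sketch only looks for the conventional index file names next to the inputs. The function name `missing_index_files` is hypothetical, and the `sample.bai` naming is included as an assumption because some pipelines drop the `.bam` part of the name.

```python
from pathlib import Path


def missing_index_files(bam_path, reference_path):
    """Return the index files that are expected but not found on disk."""
    bam = Path(bam_path)
    ref = Path(reference_path)
    missing = []
    # Accept either sample.bam.bai or sample.bai as the BAM index.
    bam_indexes = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    if not any(index.exists() for index in bam_indexes):
        missing.append(str(bam_indexes[0]))
    # samtools faidx writes reference.fa.fai next to the FASTA.
    fai = Path(str(ref) + ".fai")
    if not fai.exists():
        missing.append(str(fai))
    return missing


for path in missing_index_files("sample.bam", "reference.fa"):
    print(f"Missing index: {path}")
```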

References

Available in references/ directory:

- (No reference files currently available for this skill)

External Resources:

- Database of Genomic Variants (DGV): http://dgv.tcag.ca
- gnomAD Structural Variants: https://gnomad.broadinstitute.org
- ClinVar: https://www.ncbi.nlm.nih.gov/clinvar
- DECIPHER: https://www.deciphergenomics.org
- COSMIC: https://cancer.sanger.ac.uk

Scripts

Located in scripts/ directory:

- main.py - Main CNV calling and plotting engine

CNV Detection Methods Comparison

| Method | Input | Sensitivity | Resolution | Best For |
| --- | --- | --- | --- | --- |
| Read Depth (this tool) | BAM | Medium | 1-10 kb | Large CNVs, WGS |
| Paired-end Mapping | | | | |

Parameters

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| INLINECODE24, INLINECODE25 | string | - | Yes | Input BAM/VCF file |
| INLINECODE26, INLINECODE27 | string | - | Yes | Reference genome FASTA |
| --output, -o | string | ./cnv_output | No | Output directory |
| --bin-size | int | 1000 | No | Bin size for analysis |
| --plot-format | string | png | No | Plot format (png, pdf, svg) |

Usage

Basic Usage

CODEBLOCK18

Risk Assessment

| Risk Indicator | Assessment | Level |
| --- | --- | --- |
| Code Execution | Python script executed locally | Low |
| Network Access | | |

Security Checklist

- [x] No hardcoded credentials or API keys
- [x] No unauthorized file system access
- [x] Input validation for file paths
- [x] Output directory restricted
- [x] Error messages sanitized
- [x] CRITICAL: HIPAA compliance required for patient data

Prerequisites

CODEBLOCK19

Evaluation Criteria

Success Metrics

- [x] Successfully processes BAM/VCF files
- [x] Detects copy number variations
- [x] Generates visualization plots
- [x] Outputs results in BED format

Test Cases

1. Basic Calling: BAM input → CNV calls with coordinates
2. Plot Generation: CNV calls → Genome-wide plot
3. Custom Bin Size: Different bin sizes → Appropriate resolution

Lifecycle Status

- Current Stage: Active
- Next Review Date: 2026-03-09
- Known Issues: Placeholder CNV calling logic
- Planned Improvements:
  - Implement actual CNV calling algorithm
  - Add tumor/normal comparison
  - Enhance visualization options

Last Updated: 2026-02-09 Skill ID: 162 Version: 2.0 (K-Dense Standard)

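CNV Caller & Plotter

Detect copy number variations (CNVs) from whole genome sequencing (WGS) data and generate genome-wide visualization plots for cancer genomics, rare disease analysis, and population genetics studies. Provides CNV calling, segmentation analysis, and publication-ready figures.

Key Capabilities:

- CNV Detection from WGS: Identify copy number gains and losses from aligned sequencing data
- Genomic Segmentation: Divide the genome into bins/windows for copy number estimation
- Flexible Input Support: Process BAM, VCF, and other standard genomics formats
- Publication-Quality Plots: Generate genome-wide CNV profiles in PNG, PDF, or SVG formats
- Standard Output Formats: Export CNV calls in BED format for downstream analysis

When to Use

✅ Use this skill when:

- Analyzing cancer genomes to identify somatic copy number alterations (SCNAs)
- Studying rare diseases suspected to be caused by copy number variation
- Performing population genetics studies comparing CNV frequencies across groups
- Generating genome-wide CNV visualizations for publications or reports
- Creating BED format CNV calls for integration with other analysis pipelines
- Performing comparative CNV analysis between tumor and normal samples
- Validating CNV calls from SNP arrays with sequencing data

❌ Do NOT use when:

- Working with targeted sequencing panels (whole exome/targeted capture) → Use specialized tools such as CNVkit or ExomeDepth
- Detecting structural variations involving translocations or inversions → Use structural-variant-caller
- Analyzing single-cell RNA-seq data → Use single-cell specific CNV tools (e.g., inferCNV)
- Detecting small indels (<50bp) → Use variant-caller for small variant detection
- Requiring clinical-grade CNV detection for diagnostic purposes → Use validated clinical pipelines with proper quality control
- Working with low-coverage data (<10x) → Results may be unreliable; consider SNP array-based methods

Related Skills:

- Upstream: fastqc-report-interpreter, alignment-quality-checker, variant-caller
- Downstream: circos-plot-generator, go-kegg-enrichment, heatmap-beautifier

Integration with Other Skills

Upstream Skills:

- fastqc-report-interpreter: Assess sequencing quality before CNV calling; low quality data may produce unreliable CNVs
- alignment-quality-checker: Verify BAM file quality and coverage uniformity; uneven coverage causes CNV artifacts
- variant-caller: Generate SNV/indel calls for combined CNV-SNV analysis in cancer samples

Downstream Skills:

- circos-plot-generator: Create circular genome plots integrating CNVs with other genomic features
- go-kegg-enrichment: Perform pathway enrichment on genes within CNV regions
- heatmap-beautifier: Visualize CNV profiles across multiple samples

Complete Workflow:

Raw WGS Data → fastqc-report-interpreter → alignment-quality-checker → cnv-caller-plotter → circos-plot-generator → Publication Figures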

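Core Capabilities

1. Copy Number Variation Detection

Identify genomic regions with copy number gains (amplifications) or losses (deletions) from WGS data by analyzing read depth patterns.

    from scripts.main import CNVCaller

    # Initialize the CNV caller with the chosen bin size
    caller = CNVCaller(bin_size=1000)

    # Call CNVs from a BAM file
    cnv_calls = caller.call_cnvs(input_file="sample.bam", reference="hg38.fa")

    # Inspect the detected CNVs
    for cnv in cnv_calls:
        print(f"{cnv['chrom']}:{cnv['start']}-{cnv['end']}")
        print(f"  Copy number: {cnv['cn']}")
        if cnv["cn"] > 2:
            print("  Type: amplification (gain)")
        elif cnv["cn"] < 2:
            print("  Type: deletion (loss)")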

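Parameters:

| Parameter | Type | Required | Description | Default |
| --- | --- | --- | --- | --- |
| input_file | str | Yes | Path to input BAM or VCF file | None |
| reference | str | Yes | Path to reference genome FASTA | None |
| bin_size | int | No | Size of genomic bins for segmentation (bp) | 1000 |

CNV Calling Strategy:

| Approach | Best For | Sensitivity | Specificity |
| --- | --- | --- | --- |
| Read Depth Analysis | Large CNVs (>10kb) | High | Medium |
| Paired-end Mapping | Medium CNVs (1-10kb) | Medium | High |
| Split-read Analysis | Small CNVs (<1kb) | Medium | High |
| Combined Approach | Comprehensive detection | High | High |

Best Practices:

- ✅ Use appropriate bin size: 1000bp for WGS, smaller bins for targeted analysis
- ✅ Ensure sufficient coverage: Minimum 15-20x for reliable CNV detection
- ✅ Match reference genome: Use same reference as alignment (hg19 vs hg38)
- ✅ Check coverage uniformity: GC bias can cause false positive CNVs

Common Issues and Solutions:

Issue: False positive CNVs in repetitive regions

- Symptom: Many CNV calls in centromeres, telomeres, or segmental duplications
- Solution: Filter CNVs overlapping known problematic regions; use mappability filters

Issue: Low sensitivity for small CNVs

- Symptom: Missing CNVs <5kb despite adequate coverage
- Solution: Reduce bin size; use split-read or paired-end signals in addition to depth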

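2. Genomic Segmentation and Binning

Divide the genome into windows/bins for copy number estimation, enabling systematic analysis of the entire genome.

    from scripts.main import CNVCaller

    # Different bin sizes for different applications
    bin_configs = {
        "high_resolution": 100,    # for small CNV detection
        "standard": 1000,          # WGS default
        "low_resolution": 10000,   # for large-scale alterations
    }

    for config_name, bin_size in bin_configs.items():
        caller = CNVCaller(bin_size=bin_size)
        print(f"\n{config_name} (bin_size={bin_size}bp):")

        # Estimate the approximate number of bins for the human genome
        genome_size = 3_000_000_000  # 3 Gb
        num_bins = genome_size // bin_size
        print(f"  Estimated bins: ~{num_bins:,}")
        print(f"  Resolution: {bin_size}bp")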

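Bin Size Selection Guide:

| Bin Size | Resolution | Use Case | Coverage Required |
| --- | --- | --- | --- |
| 100 bp | High | Small CNVs (<5kb) | >30x |
| 1000 bp | Standard | General WGS analysis | >15x |
| 10000 bp | Low | Large chromosomal alterations | >5x |
| Variable | Adaptive | Mixed resolution | >20x |

Best Practices:

- ✅ Match bin size to expected CNV size: Use smaller bins for detecting small CNVs
- ✅ Consider coverage depth: Higher coverage enables smaller bins
- ✅ Exclude unmappable regions: Filter bins with zero or very low mappability
- ✅ Normalize for GC content: GC-rich regions have different coverage patterns

Common Issues and Solutions:

Issue: Noisy segmentation due to small bins

- Symptom: Erratic copy number estimates with high variance
- Solution: Increase bin size; apply smoothing algorithms; use larger bins for baseline

Issue: Missing large CNVs with large bins

- Symptom: Large deletions/amplifications not called when spanning multiple bins
- Solution: Use statistical segmentation (CBS, PSCBS) to join adjacent altered bins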

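3. Genome-Wide Visualization

Generate publication-quality plots showing copy number profiles across all chromosomes for visual interpretation and presentation.

    from scripts.main import CNVCaller

    caller = CNVCaller(bin_size=1000)

    # Example CNV calls for plotting
    cnv_calls = [ {chrom: chr1, start: 1000000, end: 200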
