CRISPR Screen Analyzer
Analyze pooled CRISPR screening data to identify essential genes and drug resistance/sensitivity candidates, and to compute screen quality metrics. Supports Robust Rank Aggregation (RRA) analysis, quality control assessment, and hit identification for functional genomics studies.
Key Capabilities:
- Quality Control Assessment: Calculate Gini index, read depth, and dropout metrics to evaluate screen quality
- Log Fold Change Calculation: Compute sgRNA-level fold changes between treatment and control conditions
- Statistical Analysis: Perform Robust Rank Aggregation (RRA) to identify significantly enriched or depleted sgRNAs
- Hit Identification: Apply FDR and fold change thresholds to identify candidate genes
- Multi-Sample Support: Process multiple replicates and treatment conditions simultaneously
When to Use
✅ Use this skill when:
- Analyzing genome-wide viability screens to identify essential genes required for cell survival
- Performing drug resistance screens to find genes whose knockout confers resistance
- Conducting drug sensitivity screens to identify synthetic lethal interactions
- Performing quality control assessment of CRISPR screen data before downstream analysis
- Comparing multiple treatment conditions (e.g., drug vs DMSO, hypoxia vs normoxia)
- Validating screen quality before publication or further experimental validation
- Generating hit lists for secondary screens or validation experiments
❌ Do NOT use when:
- Analyzing single-cell CRISPR data (Perturb-seq, CROP-seq) → Use specialized single-cell analysis tools
- Working with arrayed CRISPR screens (well-by-well format) → Use standard differential expression analysis
- Performing CRISPR activation (CRISPRa) or interference (CRISPRi) screens → May need adjusted normalization
- Requiring Bayesian (e.g., BAGEL) or MAGeCK-based statistical analysis → This tool uses its own z-score-based RRA-style scoring; run BAGEL or MAGeCK directly if you need those algorithms
- Analyzing small custom libraries (<1000 sgRNAs) → Statistical power may be insufficient
- Time-course CRISPR screens → Requires specialized trajectory analysis methods
Related Skills:
- Upstream: `crispr-grna-designer`, `fastqc-report-interpreter`
- Downstream: `go-kegg-enrichment`, `pathway-visualization`, `hit-validation-planner`
Integration with Other Skills
Upstream Skills:
- `crispr-grna-designer`: Design sgRNA libraries before screening; validate library composition
- `fastqc-report-interpreter`: Assess sequencing quality before CRISPR screen analysis
- `alignment-quality-checker`: Verify sgRNA alignment rates and mapping quality
Downstream Skills:
- `go-kegg-enrichment`: Perform pathway enrichment on identified hit genes
- `pathway-visualization`: Visualize hits in pathway contexts
- `hit-validation-planner`: Design follow-up experiments for candidate genes
- `gene-essentiality-predictor`: Compare screen results with known essential gene databases
Complete Workflow:
Library Design (crispr-grna-designer) → Transduction → Sequencing → fastqc-report-interpreter → crispr-screen-analyzer → go-kegg-enrichment → Hit Validation
Core Capabilities
1. Quality Control Metrics Calculation
Assess CRISPR screen quality using established metrics including Gini index, read depth, and sgRNA dropout rates.
CODEBLOCK1
QC Metrics Explained:
| Metric | Target Range | Interpretation |
|---|---|---|
| Gini Index | <0.3 | Measures library evenness; lower = more uniform |
| Total Reads | >10M per sample | Sufficient depth for statistical power |
| Zero-count sgRNAs | <5% | Acceptable dropout; higher indicates library loss |
| Read Distribution | Log-normal | Should follow expected distribution |
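As a rough illustration of how two of the metrics in the table could be computed from raw counts, the sketch below calculates a Gini index and a zero-count fraction for a single sample. The function names and the toy counts are assumptions made for this example; `scripts/main.py` may compute these metrics differently (for instance on log-transformed counts).

```python
def gini_index(counts):
    """Gini index of non-negative read counts (0 = perfectly even library)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Formula on ascending values: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def zero_count_fraction(counts):
    """Fraction of sgRNAs with no reads in this sample."""
    return sum(1 for c in counts if c == 0) / len(counts)


even_sample = [100, 100, 100, 100]   # hypothetical, perfectly uniform library
skewed_sample = [0, 0, 0, 400]       # hypothetical, one sgRNA takes every read

print(gini_index(even_sample))            # 0.0
print(gini_index(skewed_sample))          # 0.75
print(zero_count_fraction(skewed_sample)) # 0.75
```

With only four sgRNAs the maximum possible Gini value is 0.75; real libraries with tens of thousands of sgRNAs approach 1.0 in the fully skewed case.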
Best Practices:
- ✅ Check Gini index first: Values >0.4 indicate potential library bias or bottleneck
- ✅ Compare replicates: QC metrics should be consistent across replicates
- ✅ Assess time points: Later time points typically show higher dropout
- ✅ Validate early: Poor QC may require screen repetition
Common Issues and Solutions:
Issue: High Gini index (>0.4)
- Symptom: Uneven sgRNA representation suggesting library bottleneck
- Solution: Check MOI (multiplicity of infection); verify puromycin selection; consider repeating screen
Issue: Excessive zero-count sgRNAs (>10%)
- Symptom: Many sgRNAs not detected in final samples
- Causes: Low sequencing depth, library degradation, or strong selection
- Solution: Increase sequencing depth; verify library quality at transduction
2. Log Fold Change Calculation
Calculate log2 fold changes between treatment and control conditions to identify enriched or depleted sgRNAs.
CODEBLOCK2
LFC Calculation:
CODEBLOCK3
Interpretation:
| LFC Range | Interpretation | Biological Meaning |
|---|---|---|
| LFC < -2 | Strong depletion | Essential gene or drug sensitivity |
| LFC -2 to -1 | Moderate depletion | Moderate effect |
| LFC -1 to 1 | No change | No significant effect |
| LFC 1 to 2 | Moderate enrichment | Moderate resistance |
| LFC > 2 | Strong enrichment | Resistance gene or suppressor |
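To make the calculation and the interpretation bands concrete, here is a minimal sketch that computes a log2 fold change with a pseudocount of 1 from replicate means and maps it onto the table above. The helper names and the example counts are hypothetical, and how the boundary values (exactly -1, 1, and so on) are assigned is an assumption, since the table does not specify it.

```python
import math


def log2_fold_change(control_counts, treatment_counts, pseudocount=1.0):
    """log2((mean(treatment) + pseudocount) / (mean(control) + pseudocount))."""
    control_mean = sum(control_counts) / len(control_counts)
    treatment_mean = sum(treatment_counts) / len(treatment_counts)
    return math.log2((treatment_mean + pseudocount) / (control_mean + pseudocount))


def interpret_lfc(lfc):
    """Map an LFC value onto the interpretation bands (boundary handling assumed)."""
    if lfc < -2:
        return "strong depletion"
    if lfc < -1:
        return "moderate depletion"
    if lfc <= 1:
        return "no change"
    if lfc <= 2:
        return "moderate enrichment"
    return "strong enrichment"


# Hypothetical normalized counts for one sgRNA across three replicates
lfc = log2_fold_change(control_counts=[799, 799, 799], treatment_counts=[99, 99, 99])
print(lfc, interpret_lfc(lfc))  # -3.0 strong depletion
```

The pseudocount keeps the ratio defined when either mean is zero: `log2_fold_change([0], [0])` evaluates to 0.0 rather than raising an error.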
Best Practices:
- ✅ Use pseudocount of 1 to avoid log(0) issues
- ✅ Average replicates to reduce technical variance
- ✅ Visualize distribution to identify batch effects or outliers
- ✅ Check positive controls (known essential genes should have negative LFC)
Common Issues and Solutions:
Issue: Skewed LFC distribution
- Symptom: Mean LFC significantly different from 0
- Causes: Library size differences, batch effects, or strong selection
- Solution: Apply TMM or DESeq2 normalization; check for batch effects
Issue: Extreme outliers
- Symptom: Few sgRNAs with very large LFC values
- Solution: Winsorize extreme values; verify these are not technical artifacts
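The two fixes above can be approximated with very little code. The sketch below median-centers an LFC vector and winsorizes it at chosen percentiles. Median-centering is used here only as a simple stand-in; it is not TMM or DESeq2 normalization, and the function names, percentile method (nearest rank), and example values are all assumptions for illustration.

```python
import math
import statistics


def median_center(lfcs):
    """Shift LFCs so that their median is 0."""
    med = statistics.median(lfcs)
    return [x - med for x in lfcs]


def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clamp values to nearest-rank lower/upper percentiles."""
    ordered = sorted(values)
    n = len(ordered)

    def percentile(pct):
        idx = min(n - 1, max(0, math.ceil(pct * n / 100) - 1))
        return ordered[idx]

    low, high = percentile(lower_pct), percentile(upper_pct)
    return [min(max(x, low), high) for x in values]


# Hypothetical sgRNA LFCs with one extreme outlier (9.0)
lfcs = [-0.2, 0.0, 0.1, 0.3, 0.2, 9.0, -0.1, 0.4, 0.5, 0.6]
clipped = winsorize(lfcs, lower_pct=10, upper_pct=90)
print(max(lfcs), max(clipped))            # 9.0 0.6
print(median_center([1.0, 2.0, 3.0]))     # [-1.0, 0.0, 1.0]
```

Winsorizing changes values rather than dropping them, so the sgRNA stays in the ranking; it is still worth checking whether a clipped sgRNA reflects a technical artifact before trusting it.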
3. Robust Rank Aggregation (RRA) Statistical Analysis
Perform statistical analysis to identify significantly enriched or depleted sgRNAs using z-score and FDR correction.
CODEBLOCK4
RRA Analysis Steps:
1. Z-score calculation: INLINECODE12
2. P-value calculation: Two-tailed normal test
3. FDR correction: Benjamini-Hochberg procedure
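The three steps above can be sketched with the standard library alone. The z-score definition used here (standardizing each LFC against the mean and standard deviation of all sgRNAs) is an assumption, since the tool's own formula is not reproduced in this document; the p-value and Benjamini-Hochberg steps follow their textbook definitions. Function names are hypothetical.

```python
import math
import statistics


def z_scores(lfcs):
    """Standardize LFCs against the mean and sample standard deviation (assumed definition)."""
    mean = statistics.fmean(lfcs)
    sd = statistics.stdev(lfcs)
    return [(x - mean) / sd for x in lfcs]


def two_tailed_p(z):
    """Two-tailed p-value under a standard normal: P(|Z| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical sgRNA LFCs: mostly near zero, one strongly depleted
lfcs = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -4.0, 0.1]
pvals = [two_tailed_p(z) for z in z_scores(lfcs)]
fdrs = benjamini_hochberg(pvals)
for lfc, p, q in zip(lfcs, pvals, fdrs):
    print(f"LFC={lfc:+.1f}  p={p:.4f}  FDR={q:.4f}")
```

Note that this is sgRNA-level scoring; a full RRA implementation additionally aggregates sgRNA ranks per gene, which this sketch does not attempt.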
Statistical Output:
| Column | Description | Usage |
|---|---|---|
| INLINECODE13 | sgRNA identifier | Mapping to genes |
| INLINECODE14 | Log fold change | Effect size |
| `pvalue` | Raw p-value | Statistical significance |
| `fdr` | Adjusted p-value (FDR) | Multiple testing correction |
Best Practices:
- ✅ Use FDR < 0.05 as standard significance threshold
- ✅ Consider FDR < 0.01 for high-confidence hits
- ✅ Combine p-value and LFC for hit prioritization
- ✅ Validate top hits experimentally before publication
Common Issues and Solutions:
Issue: No significant hits despite visible effects
- Symptom: Biological effects present but no FDR-significant results
- Causes: High variance, insufficient replicates, or weak effects
- Solution: Increase replicate number; use more permissive FDR threshold; use gene-level aggregation
Issue: Too many significant hits
- Symptom: Hundreds or thousands of FDR-significant sgRNAs
- Causes: Low variance, strong selection, or batch effects
- Solution: Apply more stringent FDR threshold; add LFC cutoff; filter by effect size
4. Hit Identification with Thresholds
Apply statistical and biological thresholds to identify candidate genes for follow-up validation.
CODEBLOCK5
Hit Classification:
| Category | Criteria | Biological Interpretation |
|---|---|---|
| Essential | FDR<0.05, LFC<-1 | Required for cell viability |
| Drug Sensitive | FDR<0.05, LFC<-1 | Synthetic lethal with treatment |
| Drug Resistant | FDR<0.05, LFC>1 | Confers resistance to treatment |
| Suppressor | FDR<0.05, LFC>1 | Suppresses phenotype of interest |
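A small sketch of how the thresholds in the table might be applied. Because "Essential" and "Drug Sensitive" share the same criteria (as do "Drug Resistant" and "Suppressor"), the function below only reports the direction of the effect; which biological label applies depends on the comparison being made (for example T14 vs T0 versus drug vs DMSO). The function name, defaults, and gene entries are hypothetical.

```python
def classify_hit(lfc, fdr, fdr_threshold=0.05, lfc_threshold=1.0):
    """Return 'depleted', 'enriched', or 'not_significant' for one gene or sgRNA."""
    if fdr >= fdr_threshold:
        return "not_significant"
    if lfc < -lfc_threshold:
        return "depleted"    # essential (viability) or drug sensitive (drug vs control)
    if lfc > lfc_threshold:
        return "enriched"    # drug resistant or suppressor
    return "not_significant"


# Hypothetical results: gene -> (LFC, FDR)
results = {
    "GENE_A": (-2.4, 0.001),
    "GENE_B": (1.8, 0.02),
    "GENE_C": (-1.5, 0.20),   # large effect but fails the FDR cutoff
    "GENE_D": (0.3, 0.01),    # significant but effect too small
}
hits = {gene: classify_hit(lfc, fdr) for gene, (lfc, fdr) in results.items()}
print(hits)
```

Requiring both cutoffs is what removes GENE_C and GENE_D here: one is not statistically supported and the other is statistically supported but biologically negligible.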
Best Practices:
- ✅ Use consistent thresholds across related screens for comparability
- ✅ Require multiple sgRNAs per gene for confidence (≥2 recommended)
- ✅ Validate with orthogonal methods (siRNA, rescue experiments)
- ✅ Compare with known essential genes as positive controls
Common Issues and Solutions:
Issue: Single sgRNA hits
- Symptom: Only one sgRNA per gene significant
- Solution: Require ≥2 significant sgRNAs per gene; check for off-target effects
Issue: Off-target effects dominating
- Symptom: Known essential genes not identified; unexpected hits prominent
- Solution: Use second-generation libraries with improved specificity; validate with rescue
5. Gene-Level Aggregation
Aggregate sgRNA-level results to gene-level statistics for biological interpretation.
CODEBLOCK6
Gene Aggregation Methods:
| Method | Description | Best For |
|---|---|---|
| Mean LFC | Average across sgRNAs | General hit calling |
| Best FDR | Most significant sgRNA | Conservative approach |
| Second-best | Second most significant | Reduces outlier effects |
| STARS/RRA | Rank-based aggregation | Standard CRISPR analysis |
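The first three methods in the table are simple enough to sketch directly. The example below groups sgRNA-level rows by gene and reports mean LFC, best and second-best FDR, and whether all sgRNAs point in the same direction. It does not implement STARS or RRA. The row layout `(gene, lfc, fdr)`, the function name, and the `min_sgrnas` default are assumptions for illustration.

```python
from collections import defaultdict
from statistics import fmean


def aggregate_by_gene(sgrna_rows, min_sgrnas=3):
    """Summarize (gene, lfc, fdr) rows into per-gene statistics."""
    by_gene = defaultdict(list)
    for gene, lfc, fdr in sgrna_rows:
        by_gene[gene].append((lfc, fdr))

    summary = {}
    for gene, rows in by_gene.items():
        if len(rows) < min_sgrnas:
            continue  # too few sgRNAs for a reliable gene-level call
        lfcs = [lfc for lfc, _ in rows]
        fdrs = sorted(fdr for _, fdr in rows)
        summary[gene] = {
            "n_sgrnas": len(rows),
            "mean_lfc": fmean(lfcs),
            "best_fdr": fdrs[0],
            "second_best_fdr": fdrs[1] if len(fdrs) > 1 else fdrs[0],
            # Concordant means every sgRNA moves in the same direction
            "concordant": all(x < 0 for x in lfcs) or all(x > 0 for x in lfcs),
        }
    return summary


# Hypothetical sgRNA-level results
rows = [
    ("GENE_A", -2.0, 0.001), ("GENE_A", -1.0, 0.01), ("GENE_A", -3.0, 0.0005),
    ("GENE_B", 1.5, 0.02), ("GENE_B", -0.5, 0.6), ("GENE_B", 2.0, 0.01),
    ("GENE_C", -2.2, 0.003),  # only one sgRNA, so it is skipped
]
for gene, stats in aggregate_by_gene(rows).items():
    print(gene, stats)
```

In this toy data GENE_B has a positive mean LFC but discordant sgRNAs, which is the kind of gene that would be flagged for review rather than reported as a hit.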
Best Practices:
- ✅ Require ≥3 sgRNAs per gene for reliable gene-level calling
- ✅ Use mean LFC for primary analysis; best FDR for validation
- ✅ Check sgRNA concordance - all should show same direction
- ✅ Remove genes with conflicting sgRNAs from hit list
Common Issues and Solutions:
Issue: Discordant sgRNAs for same gene
- Symptom: Some sgRNAs positive, others negative for same gene
- Causes: Off-target effects, library errors, or complex biology
- Solution: Exclude genes with discordant sgRNAs; investigate specific cases
6. Multi-Condition Comparison
Compare CRISPR screen results across multiple treatment conditions or time points.
CODEBLOCK7
Multi-Condition Analysis:
| Comparison Type | Question Addressed | Interpretation |
|---|---|---|
| Drug vs Control | What genes mediate drug response? | Resistance/sensitivity mechanisms |
| Condition A vs B | Differential genetic dependencies | Context-specific essentiality |
| Time-course | How does genetic dependency change? | Temporal dynamics |
| Cell line comparison | Cell-type specific dependencies | Lineage-specific vulnerabilities |
Best Practices:
- ✅ Use same control across multiple treatments for comparability
- ✅ Check correlation between replicates and conditions
- ✅ Look for condition-specific hits for mechanism insights
- ✅ Validate common hits as robust findings
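One way to separate condition-specific hits from common ones is to compare gene-level LFCs from two screens analyzed against the same kind of control. The sketch below does this for depleted genes only; the function name, the thresholds, and the gene-level values are hypothetical and would need tuning for a real screen.

```python
def condition_specific_hits(lfc_a, lfc_b, hit_threshold=-1.0, min_diff=1.0):
    """Split genes into A-specific, B-specific, and common depleted hits."""
    result = {"a_specific": [], "b_specific": [], "common": []}
    for gene in sorted(lfc_a.keys() & lfc_b.keys()):
        a, b = lfc_a[gene], lfc_b[gene]
        hit_a, hit_b = a < hit_threshold, b < hit_threshold
        if hit_a and hit_b:
            result["common"].append(gene)
        elif hit_a and (b - a) >= min_diff:
            result["a_specific"].append(gene)
        elif hit_b and (a - b) >= min_diff:
            result["b_specific"].append(gene)
    return result


# Hypothetical gene-level LFCs from two conditions
condition_a = {"GENE_1": -3.0, "GENE_2": -0.2, "GENE_3": -2.5, "GENE_4": 0.1}
condition_b = {"GENE_1": -0.1, "GENE_2": -2.8, "GENE_3": -2.9, "GENE_4": 0.0}
print(condition_specific_hits(condition_a, condition_b))
# {'a_specific': ['GENE_1'], 'b_specific': ['GENE_2'], 'common': ['GENE_3']}
```

The `min_diff` requirement avoids calling a gene condition-specific when it sits just on either side of the hit threshold in the two conditions.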
Common Issues and Solutions:
Issue: High variability between replicates
- Symptom: Low correlation between replicates of same condition
- Solution: Increase replicate number; check for technical batch effects
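Replicate agreement can be checked before any statistics are run. The following sketch computes a Pearson correlation on log2-transformed counts for two replicates and compares it with the 0.7 guideline used in the quality checklist. The function names and the counts are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def replicate_correlation(counts_a, counts_b):
    """Correlate log2(count + 1) values so a few high-count sgRNAs do not dominate."""
    log_a = [math.log2(c + 1) for c in counts_a]
    log_b = [math.log2(c + 1) for c in counts_b]
    return pearson(log_a, log_b)


# Hypothetical counts for the same five sgRNAs in two replicates
rep1 = [120, 80, 0, 300, 45]
rep2 = [110, 95, 2, 280, 50]
r = replicate_correlation(rep1, rep2)
print(f"r = {r:.3f}", "OK" if r > 0.7 else "check for batch effects")
```

Correlating on the log scale is a design choice: on raw counts, a handful of very abundant sgRNAs can produce a high correlation even when most of the library disagrees.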
Complete Workflow Example
From count matrix to hit identification:
CODEBLOCK8
Python API Usage:
CODEBLOCK9
Expected Output Files:
CODEBLOCK10
Common Patterns
Pattern 1: Viability Screen (Essential Gene Identification)
Scenario: Identify genes essential for cell survival by comparing T0 (transduction) vs T14 (14 days post-transduction).
CODEBLOCK11
Workflow:
1. Collect cells at T0 (immediately after transduction)
2. Maintain parallel culture for 14 days (T14)
3. Harvest T14 cells when control cells reach confluence
4. Sequence both T0 and T14 samples
5. Analyze depletion of sgRNAs at T14 relative to T0
6. Identify genes with significantly depleted sgRNAs (essential genes)
7. Validate top hits with individual sgRNA validation
Output Example:
CODEBLOCK12
Pattern 2: Drug Resistance Screen
Scenario: Identify genes whose knockout confers resistance to a cytotoxic drug (e.g., vemurafenib in BRAF-mutant melanoma).
CODEBLOCK13
Workflow:
1. Transduce cells with genome-wide sgRNA library
2. Split into drug-treated and DMSO control groups
3. Treat with drug at appropriate concentration (IC70-IC90)
4. Maintain drug selection for 2-3 weeks, until sensitive (non-resistant) cells have died off
5. Harvest resistant colonies from drug-treated group
6. Compare sgRNA representation: Drug vs DMSO
7. Identify enriched sgRNAs (resistance genes)
8. Validate resistance with individual sgRNAs and drug dose-response
Output Example:
CODEBLOCK14
Pattern 3: Drug Sensitivity/Synthetic Lethality Screen
Scenario: Identify genes that, when knocked out, sensitize cells to drug treatment (synthetic lethal interactions).
CODEBLOCK15
Workflow:
1. Transduce cells with sgRNA library
2. Treat with sub-lethal drug concentration (IC30)
3. Maintain for 2 weeks under drug selection
4. Compare sgRNA representation: Drug-treated vs control
5. Identify depleted sgRNAs (synthetic lethal/sensitizer genes)
6. Validate with individual sgRNAs and combination assays
7. Compare with genetic dependency maps (DepMap)
Output Example:
CODEBLOCK16
Pattern 4: Comparative Screen (Cell Line vs Cell Line)
Scenario: Compare genetic dependencies between two cell lines to identify lineage-specific vulnerabilities.
CODEBLOCK17
Workflow:
1. Perform viability screens in multiple cell lines in parallel
2. Normalize each screen independently
3. Compare gene-level essentiality scores across lines
4. Identify genes essential in one lineage but not another
5. Validate lineage-specific dependencies
6. Explore therapeutic relevance (tumor-type specific targets)
Output Example:
```
Comparative Screen: Melanoma vs Lung Cancer
Melanoma-specific essential: 127 genes
Lung-specific essential: 203 genes
Common essential: 1,847 genes

Top Melanoma-Specific Dependencies:
  MITF: LFC diff = -4.5 (essential in melanoma, not lung)
  SOX10: LFC diff = -3.8
  TYR: LFC diff = -3.2

Top Lung-Specific Dependencies:
  NKX2-1: LFC diff = -3.9
  TP63: LFC diff = -3.1

Therapeutic Implications:
  - Lineage-specific targets identified
  - Potential for tumor-type selective therapy
```
Quality Checklist
Pre-Analysis Checks:
- [ ] CRITICAL: Verify library composition matches expected sgRNA list
- [ ] Check sequencing depth (>10M reads per sample recommended)
- [ ] Confirm sample annotations match count matrix columns
- [ ] Verify control and treatment sample assignments are correct
- [ ] Check for batch effects (different sequencing runs, library preps)
- [ ] Review positive control performance (known essential genes)
- [ ] Confirm negative controls show no significant effects
- [ ] Validate replicate consistency (correlation >0.7 expected)
During Analysis:
- [ ] Calculate and review QC metrics (Gini, read depth, dropout)
- [ ] CRITICAL: Check Gini index <0.4 for library quality
- [ ] Examine LFC distribution for normality and outliers
- [ ] Verify positive controls are significantly depleted (viability screens)
- [ ] Check for batch effects using PCA or correlation heatmaps
- [ ] Apply appropriate statistical thresholds (FDR < 0.05 standard)
- [ ] Require multiple sgRNAs per gene for hit calling (≥2 recommended)
- [ ] Compare hit lists with published data for similar screens
Post-Analysis Verification:
- [ ] CRITICAL: Validate top hits show concordance across sgRNAs
- [ ] Check known positive controls are recovered
- [ ] Assess negative control performance (should not be significant)
- [ ] Compare replicate correlation for hits vs non-hits
- [ ] Review hit gene functions for biological plausibility
- [ ] Check for potential off-target effects (seed sequence analysis)
- [ ] Verify hit numbers are reasonable (10s-100s, not 1000s)
- [ ] Generate visualization (MA plots, volcano plots, heatmaps)
Before Validation or Publication:
- [ ] CRITICAL: Validate top 5-10 hits with individual sgRNAs
- [ ] Perform rescue experiments to confirm on-target effects
- [ ] Compare with orthogonal datasets (DepMap, published screens)
- [ ] Check for cell line-specific vs pan-essential classification
- [ ] Assess therapeutic relevance of identified hits
- [ ] Plan secondary screens if primary screen quality issues found
- [ ] Document all parameters and thresholds used
- [ ] Prepare data for public deposition (if applicable)
Common Pitfalls
Experimental Design Issues:
- ❌ Insufficient sequencing depth → Poor statistical power, missed hits
  - ✅ Minimum 10M reads per sample; 20M+ for complex libraries
- ❌ Library bottleneck → Gini index >0.4, skewed representation
  - ✅ Maintain MOI <0.3; use sufficient cell numbers (500-1000x library coverage)
- ❌ Inadequate replicates → High variance, irreproducible results
  - ✅ Use ≥3 biological replicates per condition
- ❌ Wrong time point → Too early (no selection) or too late (extensive dropout)
  - ✅ Optimize time point based on doubling time and selection pressure
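The coverage and depth guidelines above translate into simple arithmetic. The sketch below estimates how many cells to transduce for a given library size, coverage, and MOI, and the mean reads per sgRNA implied by a sequencing depth. It assumes Poisson-distributed infection, so the transduced fraction is 1 - exp(-MOI); the function names and the 80,000-sgRNA library size are hypothetical.

```python
import math


def cells_to_transduce(n_sgrnas, coverage=500, moi=0.3):
    """Cells needed so transduced cells cover each sgRNA `coverage` times.

    Assumes Poisson infection: fraction of cells with >= 1 integration = 1 - exp(-MOI).
    """
    transduced_fraction = 1.0 - math.exp(-moi)
    return math.ceil(n_sgrnas * coverage / transduced_fraction)


def mean_reads_per_sgrna(total_reads, n_sgrnas):
    """Average sequencing depth per sgRNA for one sample."""
    return total_reads / n_sgrnas


library_size = 80_000  # hypothetical genome-wide library
print(f"{cells_to_transduce(library_size, coverage=500, moi=0.3):,} cells to transduce")
print(f"{mean_reads_per_sgrna(10_000_000, library_size):.0f} reads per sgRNA at 10M reads")
```

At an MOI of 0.3 only about a quarter of the cells receive an sgRNA, so the number of cells to plate is roughly four times the number of transduced cells the coverage target requires.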
Analysis Issues:
- ❌ Ignoring QC metrics → Analyzing poor quality data
  - ✅ Always review Gini index, read depth, and dropout before analysis
- ❌ Incorrect sample assignment → Control/treatment mix-up
  - ✅ Double-check sample annotation file; validate with positive controls
- ❌ Single sgRNA hits → Potential off-target effects
  - ✅ Require ≥2 significant sgRNAs per gene; check concordance
- ❌ Over-reliance on p-values → Many false positives with large library
  - ✅ Use FDR correction; add LFC threshold; validate experimentally
Interpretation Issues:
- ❌ Ignoring cell number effects → Different growth rates confound results
  - ✅ Normalize for cell doublings; use appropriate controls
- ❌ Off-target effects dominating → False positive hits
  - ✅ Use improved libraries (e.g., Brunello, Brie); validate with rescue
- ❌ Pan-essential vs selective → Misclassifying broadly essential genes
  - ✅ Compare with DepMap data; use differential analysis for specificity
- ❌ Not validating hits → Publishing false positives
  - ✅ Validate top hits with individual sgRNAs; perform rescue experiments
Technical Issues:
- ❌ Batch effects → Confounding by library prep or sequencing batch
  - ✅ Randomize samples across batches; include batch in statistical model
- ❌ Contamination → Cross-sample contamination affects quantification
  - ✅ Use unique dual indexes (UDIs); check for index hopping
- ❌ Reference genome mismatch → sgRNAs not mapping correctly
  - ✅ Use same genome version as library design; check sgRNA sequences
- ❌ Incomplete annotation → sgRNAs missing gene mapping
  - ✅ Verify library annotation file is complete and current
Troubleshooting
Problem: No significant hits despite strong biological effect
- Symptoms: Clear phenotype but no FDR-significant sgRNAs
- Causes:
  - High variance between replicates
  - Insufficient sequencing depth
  - Weak effect sizes
  - Stringent statistical thresholds
- Solutions:
  - Increase replicate number
  - Increase sequencing depth
  - Use more permissive FDR threshold (0.1)
  - Consider gene-level aggregation
Problem: Too many significant hits (1000s)
- Symptoms: Excessive number of hits, many likely false positives
- Causes:
  - Low variance (overdispersion underestimated)
  - Strong selection pressure
  - Library quality issues
  - Noisy data
- Solutions:
  - Use more stringent FDR threshold (0.01)
  - Increase LFC threshold (1.5 or 2.0)
  - Filter by sgRNA concordance
  - Review QC metrics and repeat if poor quality
Problem: High Gini index (>0.4)
- Symptoms: Library representation highly skewed
- Causes:
  - Library bottleneck at transduction
  - Insufficient cell numbers
  - High MOI leading to multiple integrations
- Solutions:
  - Use lower MOI (<0.3)
  - Increase cell numbers (500-1000x library size)
  - Improve transduction efficiency
  - Consider repeating screen
Problem: Known essential genes not identified
- Symptoms: Positive controls (RPL30, RPS19) not significantly depleted
- Causes:
  - Insufficient selection time
  - Library quality issues
  - Analysis errors
- Solutions:
  - Extend time point for viability screens
  - Check library composition and representation
  - Verify analysis parameters (control vs treatment assignment)
Problem: Discordant sgRNAs for same gene
- Symptoms: Only 1-2 of 5 sgRNAs significant for hit genes
- Causes:
  - Off-target effects
  - Variable sgRNA efficiency
  - Library design issues
- Solutions:
  - Require ≥3 significant sgRNAs for gene-level hits
  - Check sgRNA sequences for off-target potential
  - Use improved second-generation libraries
  - Validate with independent sgRNAs
Problem: Batch effects between replicates
- Symptoms: Low correlation between replicates of same condition
- Causes:
  - Different library prep batches
  - Different sequencing runs
  - Technical variation
- Solutions:
  - Include batch as covariate in analysis
  - Use ComBat or similar batch correction
  - Re-sequence inconsistent replicates
  - Randomize samples across batches in future
Problem: Negative controls showing significant effects
- Symptoms: Non-targeting controls (NTC) or safe-targeting sgRNAs in hit list
- Causes:
  - Technical artifacts
  - Random chance with large library
  - Library design issues
- Solutions:
  - Review NTC performance; should not be systematically enriched/depleted
  - If systematic, investigate technical issues
  - Use NTC distribution to set empirical thresholds
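Setting empirical thresholds from the NTC distribution can be as simple as taking percentiles of the non-targeting control LFCs and flagging anything outside that range. The sketch below uses nearest-rank percentiles; the function names, the 2.5/97.5 cutoffs, and the synthetic NTC values are assumptions for illustration.

```python
import math


def empirical_thresholds(ntc_lfcs, lower_pct=2.5, upper_pct=97.5):
    """Nearest-rank percentiles of the non-targeting control LFC distribution."""
    ordered = sorted(ntc_lfcs)
    n = len(ordered)

    def percentile(pct):
        idx = min(n - 1, max(0, math.ceil(pct * n / 100) - 1))
        return ordered[idx]

    return percentile(lower_pct), percentile(upper_pct)


def outside_ntc_range(lfc, thresholds):
    """True if an sgRNA's LFC falls outside the range spanned by most NTCs."""
    low, high = thresholds
    return lfc < low or lfc > high


# Synthetic NTC LFCs spread evenly between -0.50 and 0.49
ntc_lfcs = [(i - 50) / 100 for i in range(100)]
thresholds = empirical_thresholds(ntc_lfcs)
print(thresholds)                              # (-0.48, 0.47)
print(outside_ntc_range(-1.2, thresholds))     # True
print(outside_ntc_range(0.1, thresholds))      # False
```

If the NTC percentiles themselves sit far from zero, that points to the systematic technical issue described above rather than to a usable threshold.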
References
Available in references/ directory:
- (No reference files currently available for this skill)
External Resources:
- AddGene CRISPR Libraries: https://www.addgene.org/crispr/libraries/
- DepMap Portal: https://depmap.org/portal/
- MAGeCK Documentation: https://sourceforge.net/p/mageck/wiki/Home/
- BAGEL Algorithm: https://github.com/hart-lab/bagel
- CRISPR Screen Analysis Best Practices: https://pubmed.ncbi.nlm.nih.gov/29651053/
Scripts
Located in scripts/ directory:
- `main.py` - CRISPR screen analysis engine with QC, RRA, and hit identification
Common CRISPR Screen Types
| Screen Type | Comparison | Expected Hits | Typical Duration |
|---|---|---|---|
| Viability | T14 vs T0 | Essential genes depleted | 10-14 days |
| Drug Resistance | Drug vs DMSO | Resistance genes enriched | 14-21 days |
| Drug Sensitivity | Drug vs DMSO | Sensitizers depleted | 14-21 days |
| Comparative | Cell A vs Cell B | Lineage-specific dependencies | 10-14 days |
| Sensitizer | Drug A+B vs Drug A | Combination targets | 10-14 days |
Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| INLINECODE20, INLINECODE21 | string | - | Yes | sgRNA count matrix file |
| INLINECODE22, INLINECODE23 | string | - | Yes | Sample annotation file |
| `--control` | string | - | No | Control samples (comma-separated) |
| `--treatment`, `-t` | string | - | No | Treatment samples (comma-separated) |
| `--output`, `-o` | string | - | No | Output directory |
| `--fdr` | float | 0.05 | No | FDR threshold |
Usage
Basic Usage
CODEBLOCK19
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python script executed locally | Low |
| Network Access | No external API calls | Low |
| File System Access | Read count files, write results | Low |
| Data Exposure | Processes genomic screening data | Medium |
| PHI Risk | May contain cell line genetic info | Low |
Security Checklist
- [x] No hardcoded credentials or API keys
- [x] No unauthorized file system access
- [x] Input validation for file paths
- [x] Output directory restricted
- [x] Error messages sanitized
- [x] Script execution in sandboxed environment
Prerequisites
CODEBLOCK20
Evaluation Criteria
Success Metrics
- [x] Successfully loads sgRNA count matrices
- [x] Calculates QC metrics (Gini index, zero counts)
- [x] Performs RRA analysis
- [x] Identifies significant hits with FDR control
Test Cases
1. Basic Analysis: Count matrix + samplesheet → QC metrics + hit list
2. RRA Analysis: Control vs Treatment → Ranked gene list with p-values
3. QC Metrics: Count data → Gini scores, zero sgRNA counts
Lifecycle Status
- Current Stage: Active
- Next Review Date: 2026-03-09
- Known Issues: None
- Planned Improvements:
  - Add MAGeCK integration
  - Support for multiple analysis methods
  - Enhanced visualization
Last Updated: 2026-02-09
Skill ID: 183
Version: 2.0 (K-Dense Standard)
CRISPR Screen Analyzer
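Analyze pooled CRISPR screening data to identify essential genes and drug resistance/sensitivity candidates, and to compute screen quality metrics. Supports Robust Rank Aggregation (RRA) analysis, quality control assessment, and hit identification for functional genomics studies.
Key Capabilities:
- Quality Control Assessment: Calculate Gini index, read depth, and dropout metrics to evaluate screen quality
- Log Fold Change Calculation: Compute sgRNA-level fold changes between treatment and control conditions
- Statistical Analysis: Perform Robust Rank Aggregation (RRA) to identify significantly enriched or depleted sgRNAs
- Hit Identification: Apply FDR and fold change thresholds to identify candidate genes
- Multi-Sample Support: Process multiple replicates and treatment conditions simultaneously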
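When to Use
✅ Use this skill when:
- Analyzing genome-wide viability screens to identify essential genes required for cell survival
- Performing drug resistance screens to find genes whose knockout confers resistance
- Conducting drug sensitivity screens to identify synthetic lethal interactions
- Performing quality control assessment of CRISPR screen data before downstream analysis
- Comparing multiple treatment conditions (e.g., drug vs DMSO, hypoxia vs normoxia)
- Validating screen quality before publication or further experimental validation
- Generating hit lists for secondary screens or validation experiments
❌ Do NOT use when:
- Analyzing single-cell CRISPR data (Perturb-seq, CROP-seq) → Use specialized single-cell analysis tools
- Working with arrayed CRISPR screens (well-by-well format) → Use standard differential expression analysis
- Performing CRISPR activation (CRISPRa) or interference (CRISPRi) screens → May need adjusted normalization
- Requiring Bayesian or MAGeCK statistical analysis → This tool uses RRA; use MAGeCK if you need other algorithms
- Analyzing small custom libraries (<1000 sgRNAs) → Statistical power may be insufficient
- Time-course CRISPR screens → Requires specialized trajectory analysis methods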
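Related Skills:
- Upstream: `crispr-grna-designer`, `fastqc-report-interpreter`
- Downstream: `go-kegg-enrichment`, `pathway-visualization`, `hit-validation-planner`
Integration with Other Skills
Upstream Skills:
- `crispr-grna-designer`: Design sgRNA libraries before screening; validate library composition
- `fastqc-report-interpreter`: Assess sequencing quality before CRISPR screen analysis
- `alignment-quality-checker`: Verify sgRNA alignment rates and mapping quality
Downstream Skills:
- `go-kegg-enrichment`: Perform pathway enrichment on identified hit genes
- `pathway-visualization`: Visualize hits in pathway contexts
- `hit-validation-planner`: Design follow-up experiments for candidate genes
- `gene-essentiality-predictor`: Compare screen results with known essential gene databases
Complete Workflow:
Library Design (crispr-grna-designer) → Transduction → Sequencing → fastqc-report-interpreter → crispr-screen-analyzer → go-kegg-enrichment → Hit Validation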
Core Capabilities
1. Quality Control Metrics Calculation
Assess CRISPR screen quality using established metrics including Gini index, read depth, and sgRNA dropout rates.
```python
from scripts.main import CRISPRScreenAnalyzer

# Initialize the analyzer with the count matrix and sample annotation
analyzer = CRISPRScreenAnalyzer(
    counts_file="sgrna_counts.txt",
    samplesheet="samples.csv"
)

# Calculate QC metrics
qc_results = analyzer.qc_metrics()

# Review key metrics
print("Quality control metrics:")
print("Total reads per sample:")
for sample, reads in qc_results["total_reads"].items():
    print(f"  {sample}: {reads:,} reads")

print("\nGini index (library representation):")
for sample, gini in qc_results["gini_index"].items():
    status = "✅ Good" if gini < 0.3 else "⚠️ Check" if gini < 0.4 else "❌ Poor"
    print(f"  {sample}: {gini:.3f} {status}")

print("\nZero-count sgRNAs (potential dropout):")
for sample, zeros in qc_results["zero_count_sgrnas"].items():
    pct = (zeros / len(analyzer.counts)) * 100
    print(f"  {sample}: {zeros} ({pct:.1f}%)")
```
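QC Metrics Explained:
| Metric | Target Range | Interpretation |
|---|---|---|
| Gini Index | <0.3 | Measures library evenness; lower = more uniform |
| Total Reads | >10M per sample | Sufficient depth for statistical power |
| Zero-count sgRNAs | <5% | Acceptable dropout; higher indicates library loss |
| Read Distribution | Log-normal | Should follow expected distribution |
Best Practices:
- ✅ Check Gini index first: Values >0.4 indicate potential library bias or bottleneck
- ✅ Compare replicates: QC metrics should be consistent across replicates
- ✅ Assess time points: Later time points typically show higher dropout
- ✅ Validate early: Poor QC may require screen repetition
Common Issues and Solutions:
Issue: High Gini index (>0.4)
- Symptom: Uneven sgRNA representation suggesting library bottleneck
- Solution: Check MOI (multiplicity of infection); verify puromycin selection; consider repeating screen
Issue: Excessive zero-count sgRNAs (>10%)
- Symptom: Many sgRNAs not detected in final samples
- Causes: Low sequencing depth, library degradation, or strong selection
- Solution: Increase sequencing depth; verify library quality at transduction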
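2. Log Fold Change Calculation
Calculate log2 fold changes between treatment and control conditions to identify enriched or depleted sgRNAs.
```python
from scripts.main import CRISPRScreenAnalyzer

analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")

# Define sample groups
control_samples = ["Control_1", "Control_2", "Control_3"]
treatment_samples = ["Drug_1", "Drug_2", "Drug_3"]

# Calculate log fold changes
lfc = analyzer.calculate_lfc(control_samples, treatment_samples)

# Analyze the distribution
print("Log fold change statistics:")
print(f"  Mean: {lfc.mean():.3f}")
print(f"  Std dev: {lfc.std():.3f}")
print(f"  Max: {lfc.max():.3f}")
print(f"  Min: {lfc.min():.3f}")

# Identify extreme changes
strong_depletion = lfc[lfc < -2]   # strong negative selection
strong_enrichment = lfc[lfc > 2]   # strong positive selection
print(f"\nStrongly depleted sgRNAs: {len(strong_depletion)}")
print(f"Strongly enriched sgRNAs: {len(strong_enrichment)}")
```
LFC Calculation:
```
lfc = log2((treatment_mean + 1) / (control_mean + 1))
```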
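Interpretation:
| LFC Range | Interpretation | Biological Meaning |
|---|---|---|
| LFC < -2 | Strong depletion | Essential gene or drug sensitivity |
| LFC -2 to -1 | Moderate depletion | Moderate effect |
| LFC -1 to 1 | No change | No significant effect |
| LFC 1 to 2 | Moderate enrichment | Moderate resistance |
| LFC > 2 | Strong enrichment | Resistance gene or suppressor |
Best Practices:
- ✅ Use pseudocount of 1 to avoid log(0) issues
- ✅ Average replicates to reduce technical variance
- ✅ Visualize distribution to identify batch effects or outliers
- ✅ Check positive controls (known essential genes should have negative LFC)
Common Issues and Solutions:
Issue: Skewed LFC distribution
- Symptom: Mean LFC significantly different from 0
- Causes: Library size differences, batch effects, or strong selection
- Solution: Apply TMM or DESeq2 normalization; check for batch effects
Issue: Extreme outliers
- Symptom: Few sgRNAs with very large LFC values
- Solution: Winsorize extreme values; verify these are not technical artifacts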
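3. Robust Rank Aggregation (RRA) Statistical Analysis
Perform statistical analysis to identify significantly enriched or depleted sgRNAs using z-score and FDR correction.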
python
from scripts.main import CRISPRScreenAnalyzer
analyzer = CRISPRScreenAnalyzer(counts.txt, samples.csv)
# Calculate LFC first
lfc = analyzer.calculate_lfc(
control
samples=[Ctrl1, Ctrl_