CRISPR Screen Analyzer

Analyze pooled CRISPR screening data to identify essential genes, drug resistance/sensitivity candidates, and screen quality metrics. Supports Robust Rank Aggregation (RRA) analysis, quality control assessment, and hit identification for functional genomics studies.

Key Capabilities:

- Quality Control Assessment: Calculate Gini index, read depth, and dropout metrics to evaluate screen quality
- Log Fold Change Calculation: Compute sgRNA-level fold changes between treatment and control conditions
- Statistical Analysis: Perform Robust Rank Aggregation (RRA) to identify significantly enriched or depleted sgRNAs
- Hit Identification: Apply FDR and fold change thresholds to identify candidate genes
- Multi-Sample Support: Process multiple replicates and treatment conditions simultaneously

When to Use

✅ Use this skill when:

- Analyzing genome-wide viability screens to identify essential genes required for cell survival
- Performing drug resistance screens to find genes whose knockout confers resistance
- Conducting drug sensitivity screens to identify synthetic lethal interactions
- Performing quality control assessment of CRISPR screen data before downstream analysis
- Comparing multiple treatment conditions (e.g., drug vs DMSO, hypoxia vs normoxia)
- Validating screen quality before publication or further experimental validation
- Generating hit lists for secondary screens or validation experiments

❌ Do NOT use when:

- Analyzing single-cell CRISPR data (Perturb-seq, CROP-seq) → Use specialized single-cell analysis tools
- Working with arrayed CRISPR screens (well-by-well format) → Use standard differential expression analysis
- Performing CRISPR activation (CRISPRa) or interference (CRISPRi) screens → May need adjusted normalization
- Requiring Bayesian or MAGeCK statistical analysis → This tool uses RRA; use MAGeCK for alternative algorithms
- Analyzing small custom libraries (<1000 sgRNAs) → Statistical power may be insufficient
- Analyzing time-course CRISPR screens → Requires specialized trajectory analysis methods

Related Skills:

- Upstream: crispr-grna-designer, fastqc-report-interpreter
- Downstream: go-kegg-enrichment, pathway-visualization, hit-validation-planner

Integration with Other Skills

Upstream Skills:

- crispr-grna-designer: Design sgRNA libraries before screening; validate library composition
- fastqc-report-interpreter: Assess sequencing quality before CRISPR screen analysis
- alignment-quality-checker: Verify sgRNA alignment rates and mapping quality

Downstream Skills:

- go-kegg-enrichment: Perform pathway enrichment on identified hit genes
- pathway-visualization: Visualize hits in pathway contexts
- hit-validation-planner: Design follow-up experiments for candidate genes
- gene-essentiality-predictor: Compare screen results with known essential gene databases

Complete Workflow:

Library Design (crispr-grna-designer) → Transduction → Sequencing → fastqc-report-interpreter → crispr-screen-analyzer → go-kegg-enrichment → Hit Validation

Core Capabilities

1. Quality Control Metrics Calculation

Assess CRISPR screen quality using established metrics including Gini index, read depth, and sgRNA dropout rates.

CODEBLOCK1

QC Metrics Explained:

Metric	Target Range	Interpretation
Gini Index	<0.3	Measures library evenness; lower = more uniform
Total Reads

Best Practices:

- ✅ Check Gini index first: Values >0.4 indicate potential library bias or bottleneck
- ✅ Compare replicates: QC metrics should be consistent across replicates
- ✅ Assess time points: Later time points typically show higher dropout
- ✅ Validate early: Poor QC may require screen repetition
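The Gini thresholds above can be made concrete with a small sketch. It is not taken from scripts/main.py: the function names `gini_index`, `zero_count_fraction`, and `qc_status` are illustrative assumptions, and the index is computed on raw counts (some pipelines apply it to log-transformed counts instead). The sample counts are made up.

```python
def gini_index(counts):
    """Gini index of non-negative read counts: 0 = perfectly even, near 1 = highly skewed."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Rank-weighted form on ascending data: G = sum((2i - n - 1) * x_i) / (n * sum(x))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


def zero_count_fraction(counts):
    """Fraction of sgRNAs with no reads in a sample."""
    return sum(1 for c in counts if c == 0) / len(counts)


def qc_status(gini):
    """Map a Gini index onto the <0.3 target and the 0.4 warning level used in this section."""
    if gini < 0.3:
        return "good"
    if gini < 0.4:
        return "check"
    return "poor"


# Hypothetical counts for one sample (illustrative numbers only)
sample_counts = [120, 95, 0, 110, 130, 105, 98, 0, 115, 102]
g = gini_index(sample_counts)
print(f"Gini = {g:.3f} ({qc_status(g)}), zero-count = {zero_count_fraction(sample_counts):.0%}")
```

Running the same functions on every sample makes it easy to compare replicates side by side, which is the second best practice listed above.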

Common Issues and Solutions:

Issue: High Gini index (>0.4)

- Symptom: Uneven sgRNA representation suggesting library bottleneck
- Solution: Check MOI (multiplicity of infection); verify puromycin selection; consider repeating screen

Issue: Excessive zero-count sgRNAs (>10%)

- Symptom: Many sgRNAs not detected in final samples
- Causes: Low sequencing depth, library degradation, or strong selection
- Solution: Increase sequencing depth; verify library quality at transduction

2. Log Fold Change Calculation

Calculate log2 fold changes between treatment and control conditions to identify enriched or depleted sgRNAs.

CODEBLOCK2

LFC Calculation:
CODEBLOCK3

Interpretation:

LFC Range	Interpretation	Biological Meaning
LFC < -2	Strong depletion	Essential gene or drug sensitivity
LFC -2 to -1

Best Practices:

- ✅ Use pseudocount of 1 to avoid log(0) issues
- ✅ Average replicates to reduce technical variance
- ✅ Visualize distribution to identify batch effects or outliers
- ✅ Check positive controls (known essential genes should have negative LFC)
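As a rough illustration of the first two practices (pseudocount of 1, averaging replicates), the sketch below computes a per-sgRNA log2 fold change as log2((treatment mean + 1) / (control mean + 1)). The helper name `log2_fold_change` and the example counts are assumptions for illustration, and the sketch skips library-size normalization, which a real analysis would apply first.

```python
import math
from statistics import mean


def log2_fold_change(control_counts, treatment_counts, pseudocount=1.0):
    """log2((mean treatment + pseudocount) / (mean control + pseudocount)) for one sgRNA."""
    ctrl = mean(control_counts)
    trt = mean(treatment_counts)
    return math.log2((trt + pseudocount) / (ctrl + pseudocount))


# Hypothetical replicate counts per sgRNA (three control and three treatment replicates)
counts = {
    "sgRNA_depleted": {"control": [120, 100, 110], "treatment": [10, 14, 12]},
    "sgRNA_neutral": {"control": [90, 95, 100], "treatment": [92, 97, 96]},
    "sgRNA_dropout": {"control": [80, 85, 75], "treatment": [0, 0, 0]},
}

for name, groups in counts.items():
    lfc = log2_fold_change(groups["control"], groups["treatment"])
    print(f"{name}: LFC = {lfc:+.2f}")
```

The pseudocount keeps the dropout case finite: with all-zero treatment counts the ratio becomes 1 / (control mean + 1) rather than 0.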

Common Issues and Solutions:

Issue: Skewed LFC distribution

- Symptom: Mean LFC significantly different from 0
- Causes: Library size differences, batch effects, or strong selection
- Solution: Apply TMM or DESeq2 normalization; check for batch effects

Issue: Extreme outliers

- Symptom: Few sgRNAs with very large LFC values
- Solution: Winsorize extreme values; verify these are not technical artifacts

3. Robust Rank Aggregation (RRA) Statistical Analysis

Perform statistical analysis to identify significantly enriched or depleted sgRNAs using z-score and FDR correction.

CODEBLOCK4

RRA Analysis Steps:

1. Z-score calculation: INLINECODE12
2. P-value calculation: Two-tailed normal test
3. FDR correction: Benjamini-Hochberg procedure
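The three steps listed above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in scripts/main.py: the z-score is assumed to be (LFC - mean) / standard deviation across all sgRNAs, since the exact formula is not shown here, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def z_scores(lfcs):
    """Standardize LFC values against their own mean and sample standard deviation."""
    mu = mean(lfcs)
    sd = stdev(lfcs)
    return [(x - mu) / sd for x in lfcs]


def two_tailed_p(z):
    """Two-tailed p-value under a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone in rank
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical sgRNA-level LFC values
lfcs = [-3.2, -0.1, 0.2, 0.05, -0.3, 0.4, 2.9, -0.2, 0.1, 0.0]
pvals = [two_tailed_p(z) for z in z_scores(lfcs)]
fdrs = benjamini_hochberg(pvals)
for lfc, p, q in zip(lfcs, pvals, fdrs):
    print(f"LFC={lfc:+.2f}  p={p:.4f}  FDR={q:.4f}")
```

Because the z-score here is relative to the whole library, a screen with many true hits inflates the standard deviation and makes the test conservative; estimating the spread from non-targeting controls is one common alternative.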

Statistical Output:

Column	Description	Usage
INLINECODE13	sgRNA identifier	Mapping to genes
INLINECODE14

Best Practices:

- ✅ Use FDR < 0.05 as standard significance threshold
- ✅ Consider FDR < 0.01 for high-confidence hits
- ✅ Combine p-value and LFC for hit prioritization
- ✅ Validate top hits experimentally before publication

Common Issues and Solutions:

Issue: No significant hits despite visible effects

- Symptom: Biological effects present but no FDR-significant results
- Causes: High variance, insufficient replicates, or weak effects
- Solution: Increase replicate number; use more permissive FDR threshold; use gene-level aggregation

Issue: Too many significant hits

- Symptom: Hundreds or thousands of FDR-significant sgRNAs
- Causes: Low variance, strong selection, or batch effects
- Solution: Apply more stringent FDR threshold; add LFC cutoff; filter by effect size

4. Hit Identification with Thresholds

Apply statistical and biological thresholds to identify candidate genes for follow-up validation.

CODEBLOCK5

Hit Classification:

Category	Criteria	Biological Interpretation
Essential	FDR<0.05, LFC<-1	Required for cell viability
Drug Sensitive

Best Practices:

- ✅ Use consistent thresholds across related screens for comparability
- ✅ Require multiple sgRNAs per gene for confidence (≥2 recommended)
- ✅ Validate with orthogonal methods (siRNA, rescue experiments)
- ✅ Compare with known essential genes as positive controls
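A minimal sketch of threshold-based hit calling, combining the FDR < 0.05 and |LFC| > 1 cutoffs with the two-sgRNA requirement recommended above. The function `call_hits`, the row layout, and the gene names are hypothetical; the tool's own output columns may differ.

```python
from collections import defaultdict


def call_hits(rows, fdr_cutoff=0.05, lfc_cutoff=1.0, min_sgrnas=2):
    """Classify genes from (sgrna, gene, lfc, fdr) rows.

    A gene is 'depleted' or 'enriched' when at least `min_sgrnas` of its sgRNAs
    pass both cutoffs in that direction and none pass in the opposite direction.
    """
    depleted = defaultdict(int)
    enriched = defaultdict(int)
    for _sgrna, gene, lfc, fdr in rows:
        if fdr >= fdr_cutoff:
            continue
        if lfc < -lfc_cutoff:
            depleted[gene] += 1
        elif lfc > lfc_cutoff:
            enriched[gene] += 1

    hits = {}
    for gene in set(depleted) | set(enriched):
        n_down, n_up = depleted[gene], enriched[gene]
        if n_down >= min_sgrnas and n_up == 0:
            hits[gene] = "depleted"
        elif n_up >= min_sgrnas and n_down == 0:
            hits[gene] = "enriched"
    return hits


# Hypothetical sgRNA-level results
rows = [
    ("sg1", "GENE_A", -2.1, 0.001),
    ("sg2", "GENE_A", -1.5, 0.010),
    ("sg3", "GENE_B", -2.0, 0.001),  # only one significant sgRNA, so not called
    ("sg4", "GENE_C", 1.8, 0.020),
    ("sg5", "GENE_C", 2.2, 0.010),
    ("sg6", "GENE_D", -1.5, 0.200),  # fails the FDR cutoff
]
print(call_hits(rows))
```

Keeping the cutoffs as keyword arguments makes it straightforward to apply the same thresholds across related screens.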

Common Issues and Solutions:

Issue: Single sgRNA hits

- Symptom: Only one sgRNA per gene significant
- Solution: Require ≥2 significant sgRNAs per gene; check for off-target effects

Issue: Off-target effects dominating

- Symptom: Known essential genes not identified; unexpected hits prominent
- Solution: Use second-generation libraries with improved specificity; validate with rescue

5. Gene-Level Aggregation

Aggregate sgRNA-level results to gene-level statistics for biological interpretation.

CODEBLOCK6

Gene Aggregation Methods:

Method	Description	Best For
Mean LFC	Average across sgRNAs	General hit calling
Best FDR

Best Practices:

- ✅ Require ≥3 sgRNAs per gene for reliable gene-level calling
- ✅ Use mean LFC for primary analysis; best FDR for validation
- ✅ Check sgRNA concordance - all should show same direction
- ✅ Remove genes with conflicting sgRNAs from hit list
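The aggregation practices above (mean LFC, best FDR, a three-sgRNA minimum, and a direction check) can be sketched as follows. `aggregate_by_gene` and its output keys are illustrative assumptions rather than the tool's actual interface.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_gene(rows, min_sgrnas=3):
    """Summarize (gene, lfc, fdr) rows into per-gene statistics."""
    per_gene = defaultdict(list)
    for gene, lfc, fdr in rows:
        per_gene[gene].append((lfc, fdr))

    summary = {}
    for gene, values in per_gene.items():
        lfcs = [lfc for lfc, _ in values]
        summary[gene] = {
            "n_sgrnas": len(values),
            "mean_lfc": mean(lfcs),
            "best_fdr": min(fdr for _, fdr in values),
            # Concordant when every sgRNA moves in the same direction
            "concordant": all(x < 0 for x in lfcs) or all(x > 0 for x in lfcs),
            "enough_sgrnas": len(values) >= min_sgrnas,
        }
    return summary


# Hypothetical sgRNA-level results
rows = [
    ("GENE_A", -2.0, 0.010), ("GENE_A", -1.0, 0.200), ("GENE_A", -3.0, 0.001),
    ("GENE_B", 1.0, 0.500), ("GENE_B", -1.0, 0.500),
]
for gene, stats in aggregate_by_gene(rows).items():
    print(gene, stats)
```

Genes flagged as not concordant or with too few sgRNAs can then be filtered out before hit calling.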

Common Issues and Solutions:

Issue: Discordant sgRNAs for same gene

- Symptom: Some sgRNAs positive, others negative for same gene
- Causes: Off-target effects, library errors, or complex biology
- Solution: Exclude genes with discordant sgRNAs; investigate specific cases

6. Multi-Condition Comparison

Compare CRISPR screen results across multiple treatment conditions or time points.

CODEBLOCK7

Multi-Condition Analysis:

Comparison Type	Question Addressed	Interpretation
Drug vs Control	What genes mediate drug response?	Resistance/sensitivity mechanisms
Condition A vs B

Best Practices:

- ✅ Use same control across multiple treatments for comparability
- ✅ Check correlation between replicates and conditions
- ✅ Look for condition-specific hits for mechanism insights
- ✅ Validate common hits as robust findings
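Once each condition has its own hit list, separating common from condition-specific hits is a set operation. The sketch below assumes hit lists are available as Python sets keyed by condition name; `compare_hit_sets`, the condition labels, and the gene names are hypothetical.

```python
def compare_hit_sets(hits_by_condition):
    """Split hit genes into those shared by all conditions and those unique to one."""
    if not hits_by_condition:
        return set(), {}
    common = set.intersection(*hits_by_condition.values())
    specific = {}
    for name, hits in hits_by_condition.items():
        # Union of hits from every other condition
        others = set().union(
            *(h for other, h in hits_by_condition.items() if other != name)
        )
        specific[name] = hits - others
    return common, specific


# Hypothetical hit lists, each called against the same control
hits = {
    "drug_a": {"GENE_1", "GENE_2", "GENE_3"},
    "drug_b": {"GENE_2", "GENE_3", "GENE_4"},
}
common, specific = compare_hit_sets(hits)
print("Common:", sorted(common))
for condition, genes in specific.items():
    print(f"Specific to {condition}:", sorted(genes))
```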

Common Issues and Solutions:

Issue: High variability between replicates

- Symptom: Low correlation between replicates of same condition
- Solution: Increase replicate number; check for technical batch effects
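Replicate agreement can be checked before any statistics are run. The sketch below computes a Pearson correlation on log2(count + 1) values for two replicates; the 0.7 cutoff comes from the pre-analysis checklist in this document, while the function names and the counts are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def replicate_correlation(counts_a, counts_b):
    """Correlate two replicates on log2(count + 1) so a few large counts do not dominate."""
    log_a = [math.log2(c + 1) for c in counts_a]
    log_b = [math.log2(c + 1) for c in counts_b]
    return pearson(log_a, log_b)


# Hypothetical sgRNA counts for two replicates of the same condition
rep1 = [120, 95, 3, 110, 130, 0, 98, 45]
rep2 = [115, 102, 5, 98, 141, 1, 90, 52]
r = replicate_correlation(rep1, rep2)
print(f"Replicate correlation: {r:.3f} ({'ok' if r > 0.7 else 'investigate'})")
```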

Complete Workflow Example

From count matrix to hit identification:

CODEBLOCK8

Python API Usage:

CODEBLOCK9

Expected Output Files:

CODEBLOCK10

Common Patterns

Pattern 1: Viability Screen (Essential Gene Identification)

Scenario: Identify genes essential for cell survival by comparing T0 (transduction) vs T14 (14 days post-transduction).

CODEBLOCK11

Workflow:

1. Collect cells at T0 (immediately after transduction)
2. Maintain parallel culture for 14 days (T14)
3. Harvest T14 cells when control cells reach confluence
4. Sequence both T0 and T14 samples
5. Analyze depletion of sgRNAs at T14 relative to T0
6. Identify genes with significantly depleted sgRNAs (essential genes)
7. Validate top hits with individual sgRNA validation

Output Example:
CODEBLOCK12

Pattern 2: Drug Resistance Screen

Scenario: Identify genes whose knockout confers resistance to a cytotoxic drug (e.g., vemurafenib in BRAF-mutant melanoma).

CODEBLOCK13

Workflow:

1. Transduce cells with genome-wide sgRNA library
2. Split into drug-treated and DMSO control groups
3. Treat with drug at appropriate concentration (IC70-IC90)
4. Maintain for 2-3 weeks until control cells die
5. Harvest resistant colonies from drug-treated group
6. Compare sgRNA representation: Drug vs DMSO
7. Identify enriched sgRNAs (resistance genes)
8. Validate resistance with individual sgRNAs and drug dose-response

Output Example:
CODEBLOCK14

Pattern 3: Drug Sensitivity/Synthetic Lethality Screen

Scenario: Identify genes that, when knocked out, sensitize cells to drug treatment (synthetic lethal interactions).

CODEBLOCK15

Workflow:

1. Transduce cells with sgRNA library
2. Treat with sub-lethal drug concentration (IC30)
3. Maintain for 2 weeks under drug selection
4. Compare sgRNA representation: Drug-treated vs control
5. Identify depleted sgRNAs (synthetic lethal/sensitizer genes)
6. Validate with individual sgRNAs and combination assays
7. Compare with genetic dependency maps (DepMap)

Output Example:
CODEBLOCK16

Pattern 4: Comparative Screen (Cell Line vs Cell Line)

Scenario: Compare genetic dependencies between two cell lines to identify lineage-specific vulnerabilities.

CODEBLOCK17

Workflow:

1. Perform viability screens in multiple cell lines in parallel
2. Normalize each screen independently
3. Compare gene-level essentiality scores across lines
4. Identify genes essential in one lineage but not another
5. Validate lineage-specific dependencies
6. Explore therapeutic relevance (tumor-type specific targets)

Output Example:

Comparative Screen: Melanoma vs Lung Cancer
  Melanoma-specific essential: 127 genes
  Lung-specific essential: 203 genes
  Common essential: 1,847 genes
  
Top Melanoma-Specific Dependencies:
  MITF:   LFC diff = -4.5 (essential in melanoma, not lung)
  SOX10:  LFC diff = -3.8
  TYR:    LFC diff = -3.2
  
Top Lung-Specific Dependencies:
  NKX2-1: LFC diff = -3.9
  TP63:   LFC diff = -3.1
  
Therapeutic Implications:
  - Lineage-specific targets identified
  - Potential for tumor-type selective therapy

Quality Checklist

Pre-Analysis Checks:

- [ ] CRITICAL: Verify library composition matches expected sgRNA list
- [ ] Check sequencing depth (>10M reads per sample recommended)
- [ ] Confirm sample annotations match count matrix columns
- [ ] Verify control and treatment sample assignments are correct
- [ ] Check for batch effects (different sequencing runs, library preps)
- [ ] Review positive control performance (known essential genes)
- [ ] Confirm negative controls show no significant effects
- [ ] Validate replicate consistency (correlation >0.7 expected)

During Analysis:

- [ ] Calculate and review QC metrics (Gini, read depth, dropout)
- [ ] CRITICAL: Check Gini index <0.4 for library quality
- [ ] Examine LFC distribution for normality and outliers
- [ ] Verify positive controls are significantly depleted (viability screens)
- [ ] Check for batch effects using PCA or correlation heatmaps
- [ ] Apply appropriate statistical thresholds (FDR < 0.05 standard)
- [ ] Require multiple sgRNAs per gene for hit calling (≥2 recommended)
- [ ] Compare hit lists with published data for similar screens

Post-Analysis Verification:

- [ ] CRITICAL: Validate top hits show concordance across sgRNAs
- [ ] Check known positive controls are recovered
- [ ] Assess negative control performance (should not be significant)
- [ ] Compare replicate correlation for hits vs non-hits
- [ ] Review hit gene functions for biological plausibility
- [ ] Check for potential off-target effects (seed sequence analysis)
- [ ] Verify hit numbers are reasonable (10s-100s, not 1000s)
- [ ] Generate visualization (MA plots, volcano plots, heatmaps)

Before Validation or Publication:

- [ ] CRITICAL: Validate top 5-10 hits with individual sgRNAs
- [ ] Perform rescue experiments to confirm on-target effects
- [ ] Compare with orthogonal datasets (DepMap, published screens)
- [ ] Check for cell line-specific vs pan-essential classification
- [ ] Assess therapeutic relevance of identified hits
- [ ] Plan secondary screens if primary screen quality issues found
- [ ] Document all parameters and thresholds used
- [ ] Prepare data for public deposition (if applicable)

Common Pitfalls

Experimental Design Issues:

- ❌ Insufficient sequencing depth → Poor statistical power, missed hits

- ✅ Minimum 10M reads per sample; 20M+ for complex libraries

- ❌ Library bottleneck → Gini index >0.4, skewed representation

- ✅ Maintain MOI <0.3; use sufficient cell numbers (500-1000x library coverage)

- ❌ Inadequate replicates → High variance, irreproducible results

- ✅ Use ≥3 biological replicates per condition

- ❌ Wrong time point → Too early (no selection) or too late (extensive dropout)

- ✅ Optimize time point based on doubling time and selection pressure

Analysis Issues:

- ❌ Ignoring QC metrics → Analyzing poor quality data

- ✅ Always review Gini index, read depth, and dropout before analysis

- ❌ Incorrect sample assignment → Control/treatment mix-up

- ✅ Double-check sample annotation file; validate with positive controls

- ❌ Single sgRNA hits → Potential off-target effects

- ✅ Require ≥2 significant sgRNAs per gene; check concordance

- ❌ Over-reliance on p-values → Many false positives with large library

- ✅ Use FDR correction; add LFC threshold; validate experimentally

Interpretation Issues:

- ❌ Ignoring cell number effects → Different growth rates confound results

- ✅ Normalize for cell doublings; use appropriate controls

- ❌ Off-target effects dominating → False positive hits

- ✅ Use improved libraries (e.g., Brunello, Brie); validate with rescue

- ❌ Pan-essential vs selective → Misclassifying broadly essential genes

- ✅ Compare with DepMap data; use differential analysis for specificity

- ❌ Not validating hits → Publishing false positives

- ✅ Validate top hits with individual sgRNAs; perform rescue experiments

Technical Issues:

- ❌ Batch effects → Confounding by library prep or sequencing batch

- ✅ Randomize samples across batches; include batch in statistical model

- ❌ Contamination → Cross-sample contamination affects quantification

- ✅ Use unique molecular identifiers (UMIs); check for index hopping

- ❌ Reference genome mismatch → sgRNAs not mapping correctly

- ✅ Use same genome version as library design; check sgRNA sequences

- ❌ Incomplete annotation → sgRNAs missing gene mapping

- ✅ Verify library annotation file is complete and current

Troubleshooting

Problem: No significant hits despite strong biological effect

- Symptoms: Clear phenotype but no FDR-significant sgRNAs
- Causes:

- High variance between replicates
- Insufficient sequencing depth
- Weak effect sizes
- Stringent statistical thresholds

- Solutions:

- Increase replicate number
- Increase sequencing depth
- Use more permissive FDR threshold (0.1)
- Consider gene-level aggregation

Problem: Too many significant hits (1000s)

- Symptoms: Excessive number of hits, many likely false positives
- Causes:

- Low variance (overdispersion underestimated)
- Strong selection pressure
- Library quality issues
- Noisy data

- Solutions:

- Use more stringent FDR threshold (0.01)
- Increase LFC threshold (1.5 or 2.0)
- Filter by sgRNA concordance
- Review QC metrics and repeat if poor quality

Problem: High Gini index (>0.4)

- Symptoms: Library representation highly skewed
- Causes:

- Library bottleneck at transduction
- Insufficient cell numbers
- High MOI leading to multiple integrations

- Solutions:

- Use lower MOI (<0.3)
- Increase cell numbers (500-1000x library size)
- Improve transduction efficiency
- Consider repeating screen
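The MOI and coverage recommendations above translate into a cell-number estimate. The sketch below assumes integrations per cell follow a Poisson distribution with mean equal to the MOI, which is a common simplification; the function names and the 80,000-sgRNA library size are hypothetical.

```python
import math


def cells_to_transduce(library_size, coverage=500, moi=0.3):
    """Cells needed at transduction so that infected cells cover the library `coverage` times.

    Assumes Poisson-distributed integrations: P(at least one) = 1 - exp(-moi).
    """
    infected_fraction = 1 - math.exp(-moi)
    return math.ceil(library_size * coverage / infected_fraction)


def multi_integration_fraction(moi):
    """Among infected cells, the fraction expected to carry two or more sgRNAs."""
    p0 = math.exp(-moi)        # no integration
    p1 = moi * math.exp(-moi)  # exactly one integration
    return (1 - p0 - p1) / (1 - p0)


library_size = 80_000  # hypothetical genome-wide library
for coverage in (500, 1000):
    n = cells_to_transduce(library_size, coverage=coverage, moi=0.3)
    print(f"{coverage}x coverage at MOI 0.3: about {n:,} cells")
print(f"Multi-integration fraction at MOI 0.3: {multi_integration_fraction(0.3):.1%}")
```

Lowering the MOI reduces multiple integrations but raises the number of cells that must be transduced, so the two solutions above have to be balanced against each other.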

Problem: Known essential genes not identified

- Symptoms: Positive controls (RPL30, RPS19) not significantly depleted
- Causes:

- Insufficient selection time
- Library quality issues
- Analysis errors

- Solutions:

- Extend time point for viability screens
- Check library composition and representation
- Verify analysis parameters (control vs treatment assignment)

Problem: Discordant sgRNAs for same gene

- Symptoms: Only 1-2 of 5 sgRNAs significant for hit genes
- Causes:

- Off-target effects
- Variable sgRNA efficiency
- Library design issues

- Solutions:

- Require ≥3 significant sgRNAs for gene-level hits
- Check sgRNA sequences for off-target potential
- Use improved second-generation libraries
- Validate with independent sgRNAs

Problem: Batch effects between replicates

- Symptoms: Low correlation between replicates of same condition
- Causes:

- Different library prep batches
- Different sequencing runs
- Technical variation

- Solutions:

- Include batch as covariate in analysis
- Use ComBat or similar batch correction
- Re-sequence inconsistent replicates
- Randomize samples across batches in future

Problem: Negative controls showing significant effects

- Symptoms: Non-targeting controls (NTC) or safe-targeting sgRNAs in hit list
- Causes:

- Technical artifacts
- Random chance with large library
- Library design issues

- Solutions:

- Review NTC performance; should not be systematically enriched/depleted
- If systematic, investigate technical issues
- Use NTC distribution to set empirical thresholds
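One way to use the NTC distribution for empirical thresholds is to take its tail quantiles as LFC bounds. The sketch below is an assumption about how that could be done, not the tool's behavior; `quantile`, `empirical_lfc_bounds`, the 2.5% tail, and the NTC values are all illustrative.

```python
import math


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an ascending list, with 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def empirical_lfc_bounds(ntc_lfcs, tail=0.025):
    """Lower and upper LFC bounds covering the central (1 - 2 * tail) of NTC sgRNAs."""
    values = sorted(ntc_lfcs)
    if len(values) < 2:
        raise ValueError("need at least two non-targeting controls")
    return quantile(values, tail), quantile(values, 1 - tail)


def outside_bounds(lfc, bounds):
    """True when an sgRNA's LFC falls outside the NTC-derived range."""
    lower, upper = bounds
    return lfc < lower or lfc > upper


# Hypothetical LFC values for non-targeting control sgRNAs
ntc_lfcs = [-0.42, -0.31, -0.18, -0.10, -0.05, 0.00, 0.04, 0.09, 0.15, 0.22, 0.35, 0.48]
bounds = empirical_lfc_bounds(ntc_lfcs)
print(f"Empirical bounds: {bounds[0]:+.2f} to {bounds[1]:+.2f}")
print("LFC -1.8 outside NTC range:", outside_bounds(-1.8, bounds))
```

If the NTC median itself sits far from zero, that points to the systematic shift described above and is worth investigating before any thresholds are set.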

References

Available in references/ directory:

- (No reference files currently available for this skill)

External Resources:

- AddGene CRISPR Libraries: https://www.addgene.org/crispr/libraries/
- DepMap Portal: https://depmap.org/portal/
- MAGeCK Documentation: https://sourceforge.net/p/mageck/wiki/Home/
- BAGEL Algorithm: https://github.com/hart-lab/bagel
- CRISPR Screen Analysis Best Practices: https://pubmed.ncbi.nlm.nih.gov/29651053/

Scripts

Located in scripts/ directory:

- main.py - CRISPR screen analysis engine with QC, RRA, and hit identification

Common CRISPR Screen Types

Screen Type	Comparison	Expected Hits	Typical Duration
Viability	T14 vs T0	Essential genes depleted	10-14 days
Drug Resistance

Parameters

Parameter	Type	Default	Required	Description
INLINECODE20, INLINECODE21	string	-	Yes	sgRNA count matrix file
INLINECODE22, INLINECODE23	string	-	Yes	Sample annotation file
--control	string	-	No	Control samples (comma-separated)
--treatment, -t	string	-	No	Treatment samples (comma-separated)
--output, -o	string	-	No	Output directory
--fdr	float	0.05	No	FDR threshold

Usage

Basic Usage

CODEBLOCK19

Risk Assessment

Risk Indicator	Assessment	Level
Code Execution	Python script executed locally	Low
Network Access

Security Checklist

- [x] No hardcoded credentials or API keys
- [x] No unauthorized file system access
- [x] Input validation for file paths
- [x] Output directory restricted
- [x] Error messages sanitized
- [x] Script execution in sandboxed environment

Prerequisites

CODEBLOCK20

Evaluation Criteria

Success Metrics

- [x] Successfully loads sgRNA count matrices
- [x] Calculates QC metrics (Gini index, zero counts)
- [x] Performs RRA analysis
- [x] Identifies significant hits with FDR control

Test Cases

1. Basic Analysis: Count matrix + samplesheet → QC metrics + hit list
2. RRA Analysis: Control vs Treatment → Ranked gene list with p-values
3. QC Metrics: Count data → Gini scores, zero sgRNA counts

Lifecycle Status

- Current Stage: Active
- Next Review Date: 2026-03-09
- Known Issues: None
- Planned Improvements:
  - Add MAGeCK integration
  - Support for multiple analysis methods
  - Enhanced visualization

Last Updated: 2026-02-09 | Skill ID: 183 | Version: 2.0 (K-Dense Standard)

CRISPR Screen Analyzer

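Analyze pooled CRISPR screening data to identify essential genes, drug resistance/sensitivity candidates, and screen quality metrics. Supports Robust Rank Aggregation (RRA) analysis, quality control assessment, and hit identification for functional genomics studies.

Key Capabilities:

- Quality Control Assessment: Calculate Gini index, read depth, and dropout metrics to evaluate screen quality
- Log Fold Change Calculation: Compute sgRNA-level fold changes between treatment and control conditions
- Statistical Analysis: Perform Robust Rank Aggregation (RRA) to identify significantly enriched or depleted sgRNAs
- Hit Identification: Apply FDR and fold change thresholds to identify candidate genes
- Multi-Sample Support: Process multiple replicates and treatment conditions simultaneously

When to Use

✅ Use this skill when:

- Analyzing genome-wide viability screens to identify essential genes required for cell survival
- Performing drug resistance screens to find genes whose knockout confers resistance
- Conducting drug sensitivity screens to identify synthetic lethal interactions
- Performing quality control assessment of CRISPR screen data before downstream analysis
- Comparing multiple treatment conditions (e.g., drug vs DMSO, hypoxia vs normoxia)
- Validating screen quality before publication or further experimental validation
- Generating hit lists for secondary screens or validation experiments

❌ Do NOT use when:

- Analyzing single-cell CRISPR data (Perturb-seq, CROP-seq) → Use specialized single-cell analysis tools
- Working with arrayed CRISPR screens (well-by-well format) → Use standard differential expression analysis
- Performing CRISPR activation (CRISPRa) or interference (CRISPRi) screens → May need adjusted normalization
- Requiring Bayesian or MAGeCK statistical analysis → This tool uses RRA; use MAGeCK for alternative algorithms
- Analyzing small custom libraries (<1000 sgRNAs) → Statistical power may be insufficient
- Analyzing time-course CRISPR screens → Requires specialized trajectory analysis methods

Related Skills:

- Upstream: crispr-grna-designer, fastqc-report-interpreter
- Downstream: go-kegg-enrichment, pathway-visualization, hit-validation-planner

Integration with Other Skills

Upstream Skills:

- crispr-grna-designer: Design sgRNA libraries before screening; validate library composition
- fastqc-report-interpreter: Assess sequencing quality before CRISPR screen analysis
- alignment-quality-checker: Verify sgRNA alignment rates and mapping quality

Downstream Skills:

- go-kegg-enrichment: Perform pathway enrichment on identified hit genes
- pathway-visualization: Visualize hits in pathway contexts
- hit-validation-planner: Design follow-up experiments for candidate genes
- gene-essentiality-predictor: Compare screen results with known essential gene databases

Complete Workflow:

Library Design (crispr-grna-designer) → Transduction → Sequencing → fastqc-report-interpreter → crispr-screen-analyzer → go-kegg-enrichment → Hit Validation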

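Core Capabilities

1. Quality Control Metrics Calculation

Assess CRISPR screen quality using established metrics including Gini index, read depth, and sgRNA dropout rates.

```python
from scripts.main import CRISPRScreenAnalyzer

# Initialize the analyzer with the count matrix and sample annotation
analyzer = CRISPRScreenAnalyzer(
    counts_file="sgrna_counts.txt",
    sample_sheet="samples.csv",
)

# Calculate QC metrics
qc_results = analyzer.qc_metrics()

# Review key metrics
print("Quality control metrics:")
print("Total reads per sample:")
for sample, reads in qc_results["total_reads"].items():
    print(f"  {sample}: {reads:,} reads")

print("\nGini index (library representation):")
for sample, gini in qc_results["gini_index"].items():
    status = "✅ Good" if gini < 0.3 else "⚠️ Check" if gini < 0.4 else "❌ Poor"
    print(f"  {sample}: {gini:.3f} {status}")

print("\nZero-count sgRNAs (potential dropout):")
for sample, zeros in qc_results["zero_count_sgrnas"].items():
    pct = (zeros / len(analyzer.counts)) * 100
    print(f"  {sample}: {zeros} ({pct:.1f}%)")
```

QC Metrics Explained:

Metric	Target Range	Interpretation
Gini Index	<0.3	Measures library evenness; lower = more uniform
Total Reads

Best Practices:

- ✅ Check Gini index first: Values >0.4 indicate potential library bias or bottleneck
- ✅ Compare replicates: QC metrics should be consistent across replicates
- ✅ Assess time points: Later time points typically show higher dropout
- ✅ Validate early: Poor QC may require screen repetition

Common Issues and Solutions:

Issue: High Gini index (>0.4)

- Symptom: Uneven sgRNA representation suggesting library bottleneck
- Solution: Check MOI (multiplicity of infection); verify puromycin selection; consider repeating screen

Issue: Excessive zero-count sgRNAs (>10%)

- Symptom: Many sgRNAs not detected in final samples
- Causes: Low sequencing depth, library degradation, or strong selection
- Solution: Increase sequencing depth; verify library quality at transduction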

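2. Log Fold Change Calculation

Calculate log2 fold changes between treatment and control conditions to identify enriched or depleted sgRNAs.

```python
from scripts.main import CRISPRScreenAnalyzer

analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")

# Define sample groups
control_samples = ["Control1", "Control2", "Control3"]
treatment_samples = ["Drug1", "Drug2", "Drug3"]

# Calculate log fold changes
lfc = analyzer.calculate_lfc(control_samples, treatment_samples)

# Analyze the distribution
print("Log fold change statistics:")
print(f"  Mean: {lfc.mean():.3f}")
print(f"  Std dev: {lfc.std():.3f}")
print(f"  Max: {lfc.max():.3f}")
print(f"  Min: {lfc.min():.3f}")

# Identify extreme changes
strong_depletion = lfc[lfc < -2]   # strong negative selection
strong_enrichment = lfc[lfc > 2]   # strong positive selection

print(f"\nStrongly depleted sgRNAs: {len(strong_depletion)}")
print(f"Strongly enriched sgRNAs: {len(strong_enrichment)}")
```

LFC Calculation:

lfc = log2((treatment mean + 1) / (control mean + 1))

Interpretation:

LFC Range	Interpretation	Biological Meaning
LFC < -2	Strong depletion	Essential gene or drug sensitivity
LFC -2 to -1

Best Practices:

- ✅ Use pseudocount of 1 to avoid log(0) issues
- ✅ Average replicates to reduce technical variance
- ✅ Visualize distribution to identify batch effects or outliers
- ✅ Check positive controls (known essential genes should have negative LFC)

Common Issues and Solutions:

Issue: Skewed LFC distribution

- Symptom: Mean LFC significantly different from 0
- Causes: Library size differences, batch effects, or strong selection
- Solution: Apply TMM or DESeq2 normalization; check for batch effects

Issue: Extreme outliers

- Symptom: Few sgRNAs with very large LFC values
- Solution: Winsorize extreme values; verify these are not technical artifacts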

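3. Robust Rank Aggregation (RRA) Statistical Analysis

Perform statistical analysis using z-scores and FDR correction to identify significantly enriched or depleted sgRNAs.

```python
from scripts.main import CRISPRScreenAnalyzer

analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")

# Calculate LFC first
lfc = analyzer.calculate_lfc( controlsamples=[Ctrl1, Ctrl_
```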
