CRISPR gRNA Designer
Design optimal guide RNA (gRNA) sequences for CRISPR-Cas9 genome editing. Supports on-target efficiency scoring and off-target prediction.
Use Cases
- - Design gRNAs for gene knockout (KO) experiments
- Select high-efficiency guides for specific exons
- Predict and minimize off-target effects
- Optimize for SpCas9, SpCas9-NG, xCas9 variants
Input Parameters
| Parameter | Type | Required | Description |
|---|
| INLINECODE0 | string | Yes | HGNC gene symbol (e.g., TP53, BRCA1) |
| INLINECODE1 |
int | No | Specific exon number (default: all coding exons) |
|
genome_build | string | No | Reference genome: hg38 (default), hg19, mm10 |
|
pam_sequence | string | No | PAM motif: NGG (default), NAG, NGCG |
|
guide_length | int | No | gRNA length in bp (default: 20) |
|
gc_content_min | float | No | Minimum GC% (default: 30) |
|
gc_content_max | float | No | Maximum GC% (default: 70) |
|
poly_t_threshold | int | No | Max consecutive T's (default: 4) |
|
off_target_check | bool | No | Enable off-target prediction (default: true) |
|
max_mismatches | int | No | Max mismatches for off-target (default: 3) |
Output Format
CODEBLOCK0
Scoring Algorithm
On-Target Efficiency Score (0-1)
Combines multiple position-specific features:
- 1. Position-weighted matrix: G at position 20 (+3), C at 19 (+2), etc.
- GC content penalty: Outside 40-60% range reduces score
- Self-complementarity: Hairpin formation penalty
- Poly-T penalty: Transcription terminator sequences
CODEBLOCK1
Off-Target Prediction
- 1. Seed region: Positions 12-20 (PAM-proximal) weighted 3x
- Bulge/mismatch tolerance: Allow up to INLINECODE10
- Genomic location: Coding regions flagged as high-risk
- CFD score: Cutting Frequency Determination for off-target cleavage
Usage Examples
Basic gRNA Design
CODEBLOCK2
High-Specificity Design (strict off-target filtering)
CODEBLOCK3
Batch Processing
CODEBLOCK4
Technical Notes
⚠️ Difficulty: HIGH - Requires manual verification before experimental use
- - In silico predictions have ~60-80% correlation with actual cutting efficiency
- Always validate top 3-5 guides experimentally
- Off-target databases may not include rare variants or cell-line specific mutations
- Consider using Cas9 variants (HiFi, Sniper-Cas9) for reduced off-target activity
References
See references/ for:
- -
scoring_algorithms.pdf - Deep learning models (DeepCRISPR, CRISPRon) - INLINECODE13 - GUIDE-seq validated datasets
- INLINECODE14 - Doench et al. 2014/2016 rules
Implementation
Core script: INLINECODE15
Key functions:
- -
fetch_gene_sequence() - Retrieve exon sequences from Ensembl - INLINECODE17 - Identify PAM-adjacent target sites
- INLINECODE18 - Calculate on-target scores
- INLINECODE19 - Bowtie2/BWA alignment for off-targets
- INLINECODE20 - Multi-criteria optimization
Dependencies
- - Python 3.8+
- Biopython
- pandas, numpy
- pysam (for off-target alignment)
- requests (Ensembl API)
Optional:
- - bowtie2 (local off-target search)
- ViennaRNA (secondary structure prediction)
Validation Status
- - Unit tests: 85% coverage for core algorithms
- Benchmark: Tested against GUIDE-seq validated dataset (n=1,200 guides)
- Status: ⏳ Requires experimental validation - predictions are computational estimates only
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|
| Code Execution | Python scripts with bioinformatics tools | High |
| Network Access |
Ensembl API calls for gene sequences | High |
| File System Access | Read/write genome data and results | Medium |
| Instruction Tampering | Scientific computation guidelines | Low |
| Data Exposure | Genome data handled securely | Medium |
Security Checklist
- - [ ] No hardcoded credentials or API keys
- [ ] Ensembl API requests use HTTPS only
- [ ] Input gene symbols validated against allowed patterns
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no internal paths exposed)
- [ ] Dependencies audited (Biopython, pandas, numpy, pysam, requests)
- [ ] API timeout and retry mechanisms implemented
- [ ] No exposure of internal service architecture
Prerequisites
CODEBLOCK5
Evaluation Criteria
Success Metrics
- - [ ] Successfully retrieves gene sequences from Ensembl API
- [ ] Correctly identifies PAM sites in target exons
- [ ] On-target efficiency scores correlate with validated data (>0.6 correlation)
- [ ] Off-target predictions identify known false positives
- [ ] Output JSON follows specified schema
- [ ] Batch processing handles multiple genes efficiently
Test Cases
- 1. Basic gRNA Design: Input TP53 exon 4 → Valid guide RNAs with scores
- API Integration: Query Ensembl for gene sequence → Successful retrieval
- Off-target Prediction: Input guide with known off-targets → Correct prediction
- Multi-species: Test with hg38, hg19, mm10 → Correct genome handling
- Batch Processing: Input gene list → Efficient parallel processing
- Error Handling: Invalid gene symbol → Graceful error with helpful message
Lifecycle Status
- - Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues:
- In silico predictions need experimental validation
- Off-target databases may miss rare variants
- Integration with additional scoring algorithms (DeepCRISPR, CRISPRon)
- Support for additional Cas9 variants (Cas12, Cas13)
- Enhanced batch processing with progress reporting
CRISPR gRNA设计器
为CRISPR-Cas9基因组编辑设计最佳引导RNA(gRNA)序列。支持靶向效率评分和脱靶预测。
使用场景
- - 为基因敲除(KO)实验设计gRNA
- 为特定外显子选择高效引导序列
- 预测并最小化脱靶效应
- 针对SpCas9、SpCas9-NG、xCas9变体进行优化
输入参数
| 参数 | 类型 | 必需 | 描述 |
|---|
| genesymbol | 字符串 | 是 | HGNC基因符号(例如:TP53、BRCA1) |
| targetexon |
整数 | 否 | 特定外显子编号(默认:所有编码外显子) |
| genome_build | 字符串 | 否 | 参考基因组:hg38(默认)、hg19、mm10 |
| pam_sequence | 字符串 | 否 | PAM基序:NGG(默认)、NAG、NGCG |
| guide_length | 整数 | 否 | gRNA长度(碱基对,默认:20) |
| gc
contentmin | 浮点数 | 否 | 最小GC百分比(默认:30) |
| gc
contentmax | 浮点数 | 否 | 最大GC百分比(默认:70) |
| poly
tthreshold | 整数 | 否 | 最大连续T数量(默认:4) |
| off
targetcheck | 布尔值 | 否 | 启用脱靶预测(默认:true) |
| max_mismatches | 整数 | 否 | 脱靶最大错配数(默认:3) |
输出格式
json
{
gene: TP53,
genome: hg38,
guides: [
{
id: TP53E2G1,
exon: 2,
sequence: GAGCGCTGCTCAGATAGCGATGG,
pam: NGG,
position: chr17:7669609-7669631,
strand: +,
gc_content: 52.2,
efficiency_score: 0.78,
offtargetcount: 2,
off_targets: [...],
warnings: []
}
]
}
评分算法
靶向效率评分(0-1)
结合多个位置特异性特征:
- 1. 位置加权矩阵:第20位为G(+3),第19位为C(+2)等
- GC含量惩罚:超出40-60%范围会降低分数
- 自互补性:发夹结构形成惩罚
- 多聚T惩罚:转录终止子序列
python
score = w1positionscore + w2gcscore + w3secondaryscore + w4polyt_score
脱靶预测
- 1. 种子区域:第12-20位(靠近PAM)权重为3倍
- 凸起/错配容忍度:最多允许max_mismatches个错配
- 基因组位置:编码区域标记为高风险
- CFD评分:脱靶切割频率测定
使用示例
基础gRNA设计
bash
python scripts/main.py --gene TP53 --exon 4 --output results.json
高特异性设计(严格脱靶过滤)
bash
python scripts/main.py --gene BRCA1 --max-mismatches 2 --gc-min 35 --gc-max 65
批量处理
bash
python scripts/main.py --gene-list genes.txt --genome mm10 --pam NAG
技术说明
⚠️ 难度:高 - 实验使用前需人工验证
- - 计算机模拟预测与实际切割效率的相关性约为60-80%
- 始终对前3-5个引导序列进行实验验证
- 脱靶数据库可能不包含罕见变异或细胞系特异性突变
- 考虑使用Cas9变体(HiFi、Sniper-Cas9)以减少脱靶活性
参考文献
参见 references/ 目录:
- - scoringalgorithms.pdf - 深度学习模型(DeepCRISPR、CRISPRon)
- offtargetdatabases/ - GUIDE-seq验证数据集
- efficiencybenchmarks/ - Doench等人2014/2016年规则
实现
核心脚本:scripts/main.py
关键函数:
- - fetchgenesequence() - 从Ensembl检索外显子序列
- findpamsites() - 识别PAM邻近靶点
- scoreefficiency() - 计算靶向评分
- predictofftargets() - 使用Bowtie2/BWA比对进行脱靶预测
- rankguides() - 多标准优化
依赖项
- - Python 3.8+
- Biopython
- pandas、numpy
- pysam(用于脱靶比对)
- requests(Ensembl API)
可选:
- - bowtie2(本地脱靶搜索)
- ViennaRNA(二级结构预测)
验证状态
- - 单元测试:核心算法覆盖率达85%
- 基准测试:已针对GUIDE-seq验证数据集(n=1,200个引导序列)进行测试
- 状态:⏳ 需要实验验证 - 预测仅为计算估计值
风险评估
| 风险指标 | 评估 | 级别 |
|---|
| 代码执行 | 使用生物信息学工具的Python脚本 | 高 |
| 网络访问 |
调用Ensembl API获取基因序列 | 高 |
| 文件系统访问 | 读取/写入基因组数据和结果 | 中 |
| 指令篡改 | 科学计算指南 | 低 |
| 数据泄露 | 基因组数据安全处理 | 中 |
安全检查清单
- - [ ] 无硬编码凭据或API密钥
- [ ] Ensembl API请求仅使用HTTPS
- [ ] 输入基因符号已根据允许模式进行验证
- [ ] 输出目录限制在工作空间内
- [ ] 脚本在沙盒环境中执行
- [ ] 错误消息已清理(不暴露内部路径)
- [ ] 依赖项已审计(Biopython、pandas、numpy、pysam、requests)
- [ ] 已实现API超时和重试机制
- [ ] 不暴露内部服务架构
前提条件
bash
Python依赖项
pip install -r requirements.txt
可选工具
bowtie2(用于本地脱靶比对)
ViennaRNA(用于二级结构预测)
评估标准
成功指标
- - [ ] 成功从Ensembl API检索基因序列
- [ ] 正确识别靶向外显子中的PAM位点
- [ ] 靶向效率评分与验证数据相关(相关性>0.6)
- [ ] 脱靶预测识别已知假阳性
- [ ] 输出JSON符合指定模式
- [ ] 批量处理高效处理多个基因
测试用例
- 1. 基础gRNA设计:输入TP53外显子4 → 生成带评分的有效引导RNA
- API集成:查询Ensembl获取基因序列 → 成功检索
- 脱靶预测:输入已知脱靶的引导序列 → 正确预测
- 多物种:使用hg38、hg19、mm10测试 → 正确处理基因组
- 批量处理:输入基因列表 → 高效并行处理
- 错误处理:无效基因符号 → 优雅报错并附带有用信息
生命周期状态
- - 当前阶段:草案
- 下次审查日期:2026-03-06
- 已知问题:
- 计算机模拟预测需要实验验证
- 脱靶数据库可能遗漏罕见变异
- 集成更多评分算法(DeepCRISPR、CRISPRon)
- 支持更多Cas9变体(Cas12、Cas13)
- 增强批量处理并添加进度报告