Setup
On first use, read setup.md for integration guidelines. Create ~/bioinformatics/ with user consent to store project context and preferences.
When to Use
User needs to analyze biological sequences, run genomic pipelines, or interpret sequencing data. Agent handles sequence alignment, variant calling, expression analysis, and format conversions.
Architecture
Memory lives in ~/bioinformatics/. See memory-template.md for structure.
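```
~/bioinformatics/
├── memory.md          # Projects, preferences, reference genomes
├── pipelines/         # Saved pipeline configs
└── results/           # Analysis outputs and logs
```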
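The memory directory (a `memory.md` file plus `pipelines/` and `results/` subdirectories) could be created with a few lines of shell once the user has agreed. This is a minimal sketch: the function name and the `BIOINFO_HOME` override are assumptions introduced for illustration, and asking for consent is left to the caller.

```bash
# Sketch: create the memory directory layout after the user has consented.
# BIOINFO_HOME is a hypothetical override, not something the skill defines.
init_bioinfo_home() {
    local base="${1:-${BIOINFO_HOME:-$HOME/bioinformatics}}"
    mkdir -p "$base/pipelines" "$base/results"
    # Create memory.md only if it is missing, so existing notes are never overwritten.
    [ -e "$base/memory.md" ] || : > "$base/memory.md"
    printf '%s\n' "$base"
}
```

Calling `init_bioinfo_home` with no argument targets `~/bioinformatics`; passing a path is convenient for a dry run in a scratch directory.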
Quick Reference
| Topic | File |
|---|---|
| Setup process | `setup.md` |
| Memory template | `memory-template.md` |
| File formats | `formats.md` |
| Tool commands | `tools.md` |
| RNA-seq pipeline | `rnaseq.md` |
| Variant calling | `variants.md` |
Core Rules
1. Verify Input Quality First
Before any analysis, check input data quality:
- FASTQ: Run FastQC, check per-base quality, adapter content
- BAM: Verify the file is intact (`samtools quickcheck`), coordinate-sorted (`SO:coordinate` in the `@HD` header line), and indexed
- VCF: Validate format (`bcftools view -h`)
Bad input → garbage output. Always QC first.
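Before running a full QC tool, a cheap structural check can catch truncated or malformed FASTQ files early. The sketch below is not a substitute for FastQC; it only verifies the four-line record layout and that sequence and quality strings have equal length. The function name `fastq_sanity` is hypothetical.

```bash
# Sketch: structural check of an uncompressed FASTQ file (or "-" for stdin).
# It stops at the first malformed record and returns a non-zero status.
fastq_sanity() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "bad header at line " NR; bad = 1; exit }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "bad separator at line " NR; bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen { print "length mismatch at line " NR; bad = 1; exit }
        END {
            if (bad) exit 1
            if (NR % 4 != 0) { print "truncated record: " NR " lines"; exit 1 }
            print "ok: " NR / 4 " records"
        }
    ' "$1"
}
```

For compressed input, something like `gzip -dc sample.fastq.gz | fastq_sanity -` should work, since awk reads standard input when the file operand is `-`.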
2. Use Reference Genome Consistently
Track which reference is used per project:
- Human: GRCh38/hg38 (preferred) or GRCh37/hg19
- Mouse: GRCm39/mm39 or GRCm38/mm10
- Mixing references = invalid results
Store reference info in ~/bioinformatics/memory.md per project.
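One quick symptom of inputs built against differently styled references is mismatched contig naming (`chr1` vs `1`). The sketch below reports which style a tab-separated file such as BED or VCF uses, so two inputs can be compared before they are combined. The function name `contig_style` is an assumption for illustration, and matching styles do not prove that the genome builds match.

```bash
# Sketch: report the contig naming style found in column 1 of a tab-separated file.
# Lines starting with "#" (for example VCF headers) and blank lines are skipped.
contig_style() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }
        $1 ~ /^chr/ { prefixed++; next }
        { bare++ }
        END {
            if (prefixed && bare) print "mixed"
            else if (prefixed)    print "chr-prefixed"
            else if (bare)        print "bare"
            else                  print "empty"
        }
    ' "$1"
}
```

A comparison such as `[ "$(contig_style regions.bed)" = "$(contig_style variants.vcf)" ]` can then gate any step that intersects the two files; the file names here are placeholders.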
3. Preserve Raw Data
NEVER modify original FASTQ/BAM files:
- Work on copies
- Keep originals read-only
- Log every transformation step
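The three points above can be combined into one small helper. This is a sketch under stated assumptions: the function name `protect_and_copy` and the tab-separated log format are illustrative, not part of the skill.

```bash
# Sketch: make the original read-only, work on a copy, and log the step.
protect_and_copy() {
    local src="$1" workdir="$2" log="$3"
    mkdir -p "$workdir"
    chmod a-w "$src"                          # the original becomes read-only
    cp "$src" "$workdir/"
    chmod u+w "$workdir/$(basename "$src")"   # the working copy stays writable
    printf '%s\tcopied %s -> %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$src" "$workdir/" >> "$log"
}
```

Every later transformation would then read from the working directory and append its own line to the same log.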
4. Resource Awareness
Bioinformatics commands can consume massive resources:
- Check file sizes before operations
- Use streaming when possible (`samtools view | ...`)
- Estimate memory needs (BWA: ~6GB for human genome)
- Warn before operations >10 minutes
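Checking the input size before a heavy operation can be as simple as the sketch below. The function name `check_size` is hypothetical, and the 1 GiB default threshold is an arbitrary illustrative value, not a recommendation from this guide.

```bash
# Sketch: warn when an input file exceeds a size threshold given in bytes.
check_size() {
    local file="$1" limit="${2:-1073741824}" size
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -gt "$limit" ]; then
        echo "WARN: $file is $size bytes (limit $limit); consider streaming"
        return 1
    fi
    echo "OK: $file is $size bytes"
}
```

The non-zero return status on a warning lets a calling script pause and ask the user before continuing.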
5. Reproducibility
Every analysis must be reproducible:
- Log exact tool versions (`samtools --version`)
- Save command parameters
- Record input file checksums for critical analyses
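A lightweight way to capture checksums and exact command lines is sketched below. The function names and file layout are assumptions for illustration, and `sha256sum` is assumed to be available (GNU coreutils; other systems may provide a different checksum command).

```bash
# Sketch: record input checksums and the exact command line of each step.
log_inputs() {
    local sums="$1" f
    shift
    for f in "$@"; do
        sha256sum "$f" >> "$sums"     # one "<hash>  <path>" line per input
    done
}

log_cmd() {
    local log="$1"
    shift
    printf 'CMD\t%s\n' "$*" >> "$log" # record the command before running it
    "$@"
}
```

For example, `log_inputs inputs.sha256 R1.fq.gz R2.fq.gz` followed by `log_cmd run.log samtools --version` records both the inputs and the tool version, and `sha256sum -c inputs.sha256` can later confirm that the inputs are unchanged.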
Common Traps
- Wrong chromosome naming: `chr1` vs `1` causes silent failures. Check and convert with `sed 's/^chr//'`
- Unsorted BAM: Most tools expect sorted input. Symptoms: errors, or wrong results with no warning
- Index missing: BAM needs `.bai`, VCF needs `.tbi` (or `.csi`). Commands fail cryptically without them
- Memory exhaustion: Large BAM operations can kill the session. Stream, or use `--threads` wisely
- Stale indices: After modifying a BAM/VCF, regenerate its index. An old index can return wrong or missing records
- 0-based vs 1-based coordinates: BED is 0-based, VCF/GFF is 1-based. Off-by-one bugs are common
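The coordinate trap is easiest to see in a conversion. BED intervals are 0-based and half-open, while region strings of the form `chrom:start-end` are 1-based and closed, so only the start shifts by one. A minimal sketch, with the hypothetical function name `bed_to_region`:

```bash
# Sketch: convert BED intervals (0-based, half-open) to 1-based, closed regions.
# The end coordinate is numerically unchanged; only the start is incremented.
bed_to_region() {
    awk -F '\t' '!/^(#|track|browser)/ && NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }' "$1"
}
```

A BED line `chr1	0	100` therefore becomes `chr1:1-100`, which covers the same 100 bases.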
File Formats Quick Reference
| Format | Purpose | Key Tool |
|---|---|---|
| FASTA | Reference sequences | `samtools faidx` |
| FASTQ | Raw reads + quality | `seqtk`, `fastp` |
| SAM/BAM | Aligned reads | `samtools` |
| VCF/BCF | Variants | `bcftools` |
| BED | Genomic intervals | `bedtools` |
| GFF/GTF | Gene annotations | `gffread` |
| BigWig | Coverage tracks | `deepTools` |
Essential Commands
Quality Control
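```bash
# FASTQ quality report (FastQC expects the output directory to exist)
mkdir -p qc_reports
fastqc sample.fastq.gz -o qc_reports/

# Trim adapters + low-quality bases
fastp -i R1.fq.gz -I R2.fq.gz -o R1.clean.fq.gz -O R2.clean.fq.gz

# BAM statistics
samtools flagstat aligned.bam
samtools stats aligned.bam > stats.txt
```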
Alignment
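```bash
# Index the reference genome (once)
bwa index reference.fa

# Align paired-end reads and sort the output
bwa mem -t 8 reference.fa R1.fq.gz R2.fq.gz | \
  samtools sort -o aligned.bam -

# Index the BAM
samtools index aligned.bam
```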
Variant Calling
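```bash
# Call variants
bcftools mpileup -Ou -f reference.fa aligned.bam | \
  bcftools call -mv -Oz -o variants.vcf.gz

# Index the VCF
bcftools index variants.vcf.gz

# Filter variants (the expression is quoted so the shell does not treat < as a redirect)
bcftools filter -s LowQual -e 'QUAL<20' variants.vcf.gz
```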
Data Manipulation
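```bash
# Extract a region
samtools view -b aligned.bam chr1:1000000-2000000 > region.bam

# Convert BAM to FASTQ (for paired output, collate by read name first, e.g. with samtools collate)
samtools fastq -1 R1.fq.gz -2 R2.fq.gz aligned.bam

# Merge BAM files
samtools merge merged.bam sample1.bam sample2.bam

# Subset a VCF by region
bcftools view -r chr1:1000-2000 variants.vcf.gz
```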
Security & Privacy
Data access:
- Only reads files user explicitly provides as input
- Writes outputs to directories user specifies
- Stores preferences in ~/bioinformatics/ (with consent)
Data that stays local:
- All sequence data processed locally
- No external API calls for analysis
- Pipeline configs in ~/bioinformatics/
This skill does NOT:
- Upload sequence data anywhere
- Access files without explicit user instruction
- Infer or collect data beyond explicit inputs
- Make network requests during analysis
Note: Installing tools (conda, brew) and downloading reference genomes requires internet access. These are user-initiated actions.
Related Skills
Install with `clawhub install <slug>` if user confirms:
- `data-analysis`: statistical interpretation
- `statistics`: hypothesis testing
- `science`: research methodology
Feedback
- If useful: `clawhub star bioinformatics`
- Stay updated: `clawhub sync`
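Setup
On first use, read setup.md for integration guidelines. With the user's consent, create the ~/bioinformatics/ directory to store project context and preferences.
When to Use
The user needs to analyze biological sequences, run genomic analysis pipelines, or interpret sequencing data. This skill handles sequence alignment, variant calling, expression analysis, and format conversion.
Architecture
Data is stored in the ~/bioinformatics/ directory. See memory-template.md for the structure.
```
~/bioinformatics/
├── memory.md          # Projects, preferences, reference genomes
├── pipelines/         # Saved pipeline configs
└── results/           # Analysis outputs and logs
```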
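Quick Reference
| Topic | File |
|---|---|
| Setup process | `setup.md` |
| Memory template | `memory-template.md` |
| File formats | `formats.md` |
| Tool commands | `tools.md` |
| RNA-seq pipeline | `rnaseq.md` |
| Variant calling | `variants.md` |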
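Core Rules
1. Verify Input Quality First
Before any analysis, check input data quality:
- FASTQ: Run FastQC, check per-base quality, adapter content
- BAM: Verify the file is intact (`samtools quickcheck`), coordinate-sorted (`SO:coordinate` in the `@HD` header line), and indexed
- VCF: Validate format (`bcftools view -h`)
Poor input quality → unreliable output. Always run QC first.
2. Use Reference Genome Consistently
Track which reference genome each project uses:
- Human: GRCh38/hg38 (preferred) or GRCh37/hg19
- Mouse: GRCm39/mm39 or GRCm38/mm10
- Mixing references = invalid results
Store reference genome information per project in ~/bioinformatics/memory.md.
3. Preserve Raw Data
Never modify original FASTQ/BAM files:
- Work on copies
- Keep originals read-only
- Log every transformation step
4. Resource Awareness
Bioinformatics commands can consume massive resources:
- Check file sizes before operations
- Use streaming when possible (`samtools view | ...`)
- Estimate memory needs (BWA: ~6GB for human genome)
- Warn before operations >10 minutes
5. Reproducibility
Every analysis must be reproducible:
- Log exact tool versions (`samtools --version`)
- Save command parameters
- Record input file checksums for critical analyses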
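Common Traps
- Wrong chromosome naming: `chr1` vs `1` causes silent failures. Check and convert with `sed 's/^chr//'`
- Unsorted BAM: Most tools expect sorted input. Symptoms: errors, or wrong results with no warning
- Index missing: BAM needs `.bai`, VCF needs `.tbi` (or `.csi`). Commands fail cryptically without them
- Memory exhaustion: Large BAM operations can kill the session. Stream, or use `--threads` wisely
- Stale indices: After modifying a BAM/VCF, regenerate its index. An old index can return wrong or missing records
- 0-based vs 1-based coordinates: BED is 0-based, VCF/GFF is 1-based. Off-by-one bugs are common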
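File Formats Quick Reference
| Format | Purpose | Key Tool |
|---|---|---|
| FASTA | Reference sequences | `samtools faidx` |
| FASTQ | Raw reads + quality | `seqtk`, `fastp` |
| SAM/BAM | Aligned reads | `samtools` |
| VCF/BCF | Variants | `bcftools` |
| BED | Genomic intervals | `bedtools` |
| GFF/GTF | Gene annotations | `gffread` |
| BigWig | Coverage tracks | `deepTools` |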
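Essential Commands
Quality Control
```bash
# FASTQ quality report (FastQC expects the output directory to exist)
mkdir -p qc_reports
fastqc sample.fastq.gz -o qc_reports/

# Trim adapters + low-quality bases
fastp -i R1.fq.gz -I R2.fq.gz -o R1.clean.fq.gz -O R2.clean.fq.gz

# BAM statistics
samtools flagstat aligned.bam
samtools stats aligned.bam > stats.txt
```
Alignment
```bash
# Index the reference genome (once)
bwa index reference.fa

# Align paired-end reads and sort the output
bwa mem -t 8 reference.fa R1.fq.gz R2.fq.gz | \
  samtools sort -o aligned.bam -

# Index the BAM
samtools index aligned.bam
```
Variant Calling
```bash
# Call variants
bcftools mpileup -Ou -f reference.fa aligned.bam | \
  bcftools call -mv -Oz -o variants.vcf.gz

# Index the VCF
bcftools index variants.vcf.gz

# Filter variants (the expression is quoted so the shell does not treat < as a redirect)
bcftools filter -s LowQual -e 'QUAL<20' variants.vcf.gz
```
Data Manipulation
```bash
# Extract a region
samtools view -b aligned.bam chr1:1000000-2000000 > region.bam

# Convert BAM to FASTQ (for paired output, collate by read name first, e.g. with samtools collate)
samtools fastq -1 R1.fq.gz -2 R2.fq.gz aligned.bam

# Merge BAM files
samtools merge merged.bam sample1.bam sample2.bam

# Subset a VCF by region
bcftools view -r chr1:1000-2000 variants.vcf.gz
```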
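Security & Privacy
Data access:
- Only reads input files the user explicitly provides
- Writes outputs to directories the user specifies
- Stores preferences in ~/bioinformatics/ (with consent)
Data that stays local:
- All sequence data is processed locally
- No external API calls during analysis
- Pipeline configs are stored in ~/bioinformatics/
This skill does NOT:
- Upload sequence data anywhere
- Access files without explicit user instruction
- Infer or collect data beyond explicit inputs
- Make network requests during analysis
Note: Installing tools (conda, brew) and downloading reference genomes requires internet access. These are user-initiated actions.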
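Related Skills
Install with `clawhub install <slug>` if the user confirms:
- `data-analysis`: statistical interpretation
- `statistics`: hypothesis testing
- `science`: research methodology
Feedback
- If useful: `clawhub star bioinformatics`
- Stay updated: `clawhub sync`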