Setup

On first use, read setup.md for integration guidelines. Create ~/bioinformatics/ with user consent to store project context and preferences.

When to Use

User needs to analyze biological sequences, run genomic pipelines, or interpret sequencing data. Agent handles sequence alignment, variant calling, expression analysis, and format conversions.

Architecture

Memory lives in ~/bioinformatics/. See memory-template.md for structure.

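```
~/bioinformatics/
├── memory.md     # Projects, preferences, reference genomes
├── pipelines/    # Saved pipeline configurations
└── results/      # Analysis outputs and logs
```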
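A minimal sketch of the consent step from Setup, assuming an interactive prompt is acceptable. `init_workspace` is a hypothetical helper name, and the directory is passed as an argument so the function can be tried on a scratch path before pointing it at `~/bioinformatics`.

```bash
# init_workspace <base_dir>
# Hypothetical helper: creates the workspace only after an explicit "y".
init_workspace() {
  local base="$1" answer
  printf 'Create %s to store project context? [y/N] ' "$base" >&2
  read -r answer || answer=""
  case "$answer" in
    y|Y|yes|YES) ;;
    *) echo "Skipped: nothing was created." >&2; return 1 ;;
  esac
  mkdir -p "$base/pipelines" "$base/results"
  # Create memory.md only if it is missing, so reruns keep earlier notes.
  [ -e "$base/memory.md" ] || : > "$base/memory.md"
}
```

Typical use would be `init_workspace "$HOME/bioinformatics"`. Any answer other than yes leaves the filesystem untouched.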

Quick Reference

Topic	File
Setup process	setup.md
Memory template

Core Rules

1. Verify Input Quality First

Before any analysis, check input data quality:

- FASTQ: Run FastQC, check per-base quality and adapter content
- BAM: Check the file is intact (samtools quickcheck), coordinate-sorted (SO:coordinate in the samtools view -H header), and indexed
- VCF: Validate the header (bcftools view -h)

Bad input → garbage output. Always QC first.
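Before running FastQC, a cheap structural check can catch truncated or malformed files early. The sketch below is not a replacement for FastQC. `check_fastq_structure` is a hypothetical helper, and it assumes the common 4-line record layout (no wrapped sequences).

```bash
# check_fastq_structure <file.fastq|file.fastq.gz>
# Hypothetical helper: verifies 4-line records with an '@' header, a '+'
# separator, and a quality string as long as the sequence.
# Prints the first problem found and returns non-zero.
check_fastq_structure() {
  local fq="$1"
  case "$fq" in
    *.gz) gzip -dc -- "$fq" ;;
    *)    cat -- "$fq" ;;
  esac | awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header must start with @"; bad = 1; exit }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator must start with +"; bad = 1; exit }
    NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
    END {
      if (!bad && NR % 4 != 0) { print "truncated record: " NR " lines"; bad = 1 }
      if (!bad && NR == 0) { print "empty file"; bad = 1 }
      exit bad + 0
    }'
}
```

A passing check only says the file is well-formed. Per-base quality and adapter content still need the FastQC report.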

2. Use Reference Genome Consistently

Track which reference is used per project:

- Human: GRCh38/hg38 (preferred) or GRCh37/hg19
- Mouse: GRCm39/mm39 or GRCm38/mm10
- Mixing references = invalid results

Store reference info in ~/bioinformatics/memory.md per project.
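One way to enforce the rule above is to refuse a second, different build for the same project. This is a sketch under assumptions: `record_reference` is a hypothetical helper, and the tab-separated "project, build" line format is invented for illustration. The real layout of memory.md is defined in memory-template.md, which is not shown here.

```bash
# record_reference <registry_file> <project> <build>
# Hypothetical helper: keeps one "project<TAB>build" line per project and
# fails when a project is already tied to a different build.
record_reference() {
  local registry="$1" project="$2" build="$3" known
  touch "$registry"
  known=$(awk -F '\t' -v p="$project" '$1 == p { print $2; exit }' "$registry")
  if [ -z "$known" ]; then
    printf '%s\t%s\n' "$project" "$build" >> "$registry"
  elif [ "$known" != "$build" ]; then
    echo "Project $project already uses $known; refusing to mix with $build" >&2
    return 1
  fi
}
```

Calling it with the same project and build twice is a no-op, so it can run at the start of every pipeline step.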

3. Preserve Raw Data

NEVER modify original FASTQ/BAM files:

- Work on copies
- Keep originals read-only
- Log every transformation step
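The three points above can be combined into one staging step. This is a minimal sketch. `stage_input` and the log line format are hypothetical choices, not something the skill prescribes.

```bash
# stage_input <original> <workdir> <logfile>
# Hypothetical helper: copies the original into workdir, removes write
# permission from the original, logs the step, and prints the copy's path.
stage_input() {
  local original="$1" workdir="$2" logfile="$3" copy
  mkdir -p "$workdir"
  copy="$workdir/$(basename -- "$original")"
  cp -- "$original" "$copy" || return 1
  chmod a-w -- "$original"   # original becomes read-only
  chmod u+w -- "$copy"       # the working copy stays editable
  printf '%s\tcopied %s -> %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$original" "$copy" >> "$logfile"
  printf '%s\n' "$copy"
}
```

Downstream commands would then take the printed path, for example `work=$(stage_input raw/sample.fastq.gz work/ steps.log)`.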

4. Resource Awareness

Bioinformatics commands can consume massive resources:

- Check file sizes before operations
- Use streaming when possible (samtools view | ...)
- Estimate memory needs (BWA: ~6GB for human genome)
- Warn before operations >10 minutes
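Run time is hard to predict up front, so the sketch below uses input size as a rough proxy for "this may take a while". `needs_warning` is a hypothetical helper and the byte limit is whatever the caller considers large. Nothing here is a threshold the skill defines.

```bash
# needs_warning <file> <limit_bytes>
# Hypothetical helper: returns 0 and prints a warning when the file is larger
# than the limit, 1 when it is within the limit, 2 when the file is missing.
needs_warning() {
  local file="$1" limit="$2" size
  [ -f "$file" ] || return 2
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -gt "$limit" ]; then
    echo "Warning: $file is $size bytes (limit $limit); confirm or stream before continuing" >&2
    return 0
  fi
  return 1
}
```

For instance, `needs_warning aligned.bam 10000000000 && read -r -p "Continue? " reply` would pause before touching a BAM over roughly 10 GB.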

5. Reproducibility

Every analysis must be reproducible:

- Log exact tool versions (samtools --version)
- Save command parameters
- Record input file checksums for critical analyses
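A small wrapper can capture the checksum and the exact command in one place. This is a sketch. `log_run` and its log layout are hypothetical, and it falls back to `shasum -a 256` on systems without `sha256sum`.

```bash
# log_run <logfile> <input_file> <command> [args...]
# Hypothetical helper: records a SHA-256 checksum of the input and the exact
# command line, runs the command, then records its exit status.
log_run() {
  local logfile="$1" input="$2" sum status
  shift 2
  if command -v sha256sum >/dev/null 2>&1; then
    sum=$(sha256sum "$input" | awk '{ print $1 }')
  else
    sum=$(shasum -a 256 "$input" | awk '{ print $1 }')
  fi
  {
    printf 'date\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf 'input\t%s\tsha256=%s\n' "$input" "$sum"
    printf 'command\t%s\n' "$*"
  } >> "$logfile"
  status=0
  "$@" || status=$?
  printf 'exit\t%s\n' "$status" >> "$logfile"
  return "$status"
}
```

Usage might look like `log_run run.log aligned.bam samtools flagstat aligned.bam`. The tool version can be appended alongside with `samtools --version | head -n 1 >> run.log`.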

Common Traps

- Wrong chromosome naming: chr1 vs 1 causes silent failures, typically empty results rather than errors. Check which style each file uses and convert with sed 's/^chr//'
- Unsorted BAM: Most tools expect coordinate-sorted input. Symptoms: errors, or wrong results with no warning
- Missing index: BAM needs .bai (or .csi), compressed VCF needs .tbi (or .csi). Commands fail cryptically without them
- Memory exhaustion: Large BAM operations can kill the session. Stream, and set --threads with care
- Stale indices: Regenerate the index after modifying a BAM/VCF. An old index points at outdated offsets, so region queries can return wrong or missing records
- 0-based vs 1-based coordinates: BED is 0-based, VCF/GFF is 1-based. Off-by-one bugs are common
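Two of the traps above reduce to one-line stream filters. The sketch below assumes tab-separated input with no `track` or `browser` header lines. The function names are hypothetical, and only the `chr` prefix is handled (a `chrM` vs `MT` rename would need its own mapping).

```bash
# strip_chr_prefix: "chr1" -> "1" at the start of each line (stdin to stdout)
strip_chr_prefix() {
  sed 's/^chr//'
}

# bed_to_one_based: BED uses a 0-based start and an exclusive end, so the
# equivalent 1-based inclusive interval is (start + 1, end).
bed_to_one_based() {
  awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 + 1; print }'
}
```

For example, the BED interval `chr1  99  200` covers bases 100 through 200 in 1-based terms, so only the start column shifts.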

File Formats Quick Reference

Format	Purpose	Key Tool
FASTA	Reference sequences	samtools faidx
FASTQ

Essential Commands

Quality Control

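```bash
# FASTQ quality report (FastQC expects the output directory to exist)
mkdir -p qc_reports
fastqc sample.fastq.gz -o qc_reports/

# Trim adapters + low-quality bases
fastp -i R1.fq.gz -I R2.fq.gz -o R1.clean.fq.gz -O R2.clean.fq.gz

# BAM statistics
samtools flagstat aligned.bam
samtools stats aligned.bam > stats.txt
```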

Alignment

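```bash
# Index the reference genome (once)
bwa index reference.fa

# Align paired-end reads and sort by coordinate
bwa mem -t 8 reference.fa R1.fq.gz R2.fq.gz | \
  samtools sort -o aligned.bam -

# Index the BAM
samtools index aligned.bam
```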

Variant Calling

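```bash
# Call variants
bcftools mpileup -Ou -f reference.fa aligned.bam | \
  bcftools call -mv -Oz -o variants.vcf.gz

# Index the VCF
bcftools index variants.vcf.gz

# Filter variants (quote the expression so the shell does not treat < as a redirect)
bcftools filter -s LowQual -e 'QUAL<20' variants.vcf.gz
```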

Data Manipulation

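```bash
# Extract a region
samtools view -b aligned.bam chr1:1000000-2000000 > region.bam

# Convert BAM to FASTQ (collate by name first so mates are adjacent)
samtools collate -u -O aligned.bam | \
  samtools fastq -1 R1.fq.gz -2 R2.fq.gz -0 /dev/null -s /dev/null -n -

# Merge BAM files
samtools merge merged.bam sample1.bam sample2.bam

# Subset a VCF by region
bcftools view -r chr1:1000-2000 variants.vcf.gz
```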

Security & Privacy

Data access:

- Only reads files the user explicitly provides as input
- Writes outputs to directories the user specifies
- Stores preferences in ~/bioinformatics/ (with consent)

Data that stays local:

- All sequence data processed locally
- No external API calls for analysis
- Pipeline configs in ~/bioinformatics/

This skill does NOT:

- Upload sequence data anywhere
- Access files without explicit user instruction
- Infer or collect data beyond explicit inputs
- Make network requests during analysis

Note: Installing tools (conda, brew) and downloading reference genomes requires internet access. These are user-initiated actions.

Related Skills

Install with clawhub install <slug> if user confirms:

- data-analysis: statistical interpretation
- INLINECODE31: hypothesis testing
- INLINECODE32: research methodology

Feedback

- If useful: clawhub star bioinformatics
- Stay updated: clawhub sync

