Skill: Multi-Omics Integration Strategist (ID: 204)
Overview
Designs joint analysis schemes for multi-omics data (transcriptomics, RNA; proteomics, Pro; metabolomics, Met), cross-validates findings at the pathway level, and proposes systems-biology-level integration strategies.
Use Cases
- Systems biology mechanism research for complex diseases
- Biomarker discovery and validation
- Drug target identification and pathway validation
- Multi-omics data quality assessment and consistency analysis
Directory Structure
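```
.
├── SKILL.md                    # This file: skill documentation
├── config/
│   └── pathways.json           # Pathway database configuration
├── scripts/
│   └── main.py                 # Main analysis script
├── templates/
│   └── report_template.md      # Analysis report template
└── examples/
    └── sample_data/            # Sample datasets
```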
Input
Required Files
| File | Format | Description |
|---|---|---|
| `rna_data.csv` | CSV | Transcriptomics data: gene ID, expression value, differential analysis results |
| `pro_data.csv` | CSV | Proteomics data: protein ID, abundance value, differential analysis results |
| `met_data.csv` | CSV | Metabolomics data: metabolite ID, concentration value, differential analysis results |
Input Format Specifications
RNA Data (rna_data.csv)
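```csv
gene_id,gene_name,log2fc,pvalue,padj,sample_A,sample_B,...
ENSG00000012048,BRCA1,1.23,0.001,0.005,12.5,13.2,...
```
Protein Data (pro_data.csv)
```csv
protein_id,gene_name,log2fc,pvalue,padj,sample_A,sample_B,...
P38398,BRCA1,0.85,0.002,0.008,2450,2890,...
```
Metabolite Data (met_data.csv)
```csv
metabolite_id,metabolite_name,kegg_id,log2fc,pvalue,padj,...
C00187,Cholesterol,C00187,-1.45,0.003,0.012,...
```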
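A quick schema check before integration can catch malformed inputs early. The sketch below is illustrative only and is not taken from `scripts/main.py`: `REQUIRED_COLUMNS` and `validate_layer` are hypothetical names, and the column lists assume the header layout given in this document's input format specifications.

```python
import io

import pandas as pd

# Hypothetical mapping from omics layer to the columns each file is expected to carry.
REQUIRED_COLUMNS = {
    "rna": ["gene_id", "gene_name", "log2fc", "pvalue", "padj"],
    "pro": ["protein_id", "gene_name", "log2fc", "pvalue", "padj"],
    "met": ["metabolite_id", "metabolite_name", "kegg_id", "log2fc", "pvalue", "padj"],
}


def validate_layer(df, layer):
    """Raise ValueError if required columns are missing or p-values fall outside [0, 1]."""
    missing = [c for c in REQUIRED_COLUMNS[layer] if c not in df.columns]
    if missing:
        raise ValueError(f"{layer}: missing columns {missing}")
    for col in ("pvalue", "padj"):
        # Series.between is False for NaN, so missing p-values are rejected too.
        if not df[col].between(0, 1).all():
            raise ValueError(f"{layer}: {col} must lie in [0, 1]")
    return df


# Toy RNA table standing in for rna_data.csv.
rna_csv = io.StringIO(
    "gene_id,gene_name,log2fc,pvalue,padj,sample_A,sample_B\n"
    "ENSG00000012048,BRCA1,1.23,0.001,0.005,12.5,13.2\n"
)
rna = validate_layer(pd.read_csv(rna_csv), "rna")
```

Rejecting rows with missing p-values is a design choice of this sketch; a real pipeline might prefer to drop or flag them instead.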
Integration Strategy
1. ID Mapping Layer
- RNA → Protein: Mapping through Gene Symbol / UniProt ID
- Protein → Metabolite: Association through KEGG/Reactome enzyme-reaction-metabolite
- RNA → Metabolite: Indirect association through KEGG pathway
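The RNA → Protein step can be pictured as a join on gene symbol. The following is a minimal sketch under that assumption, using toy tables; `map_rna_to_protein` is a hypothetical helper, and the mapping rate computed here is only one possible definition.

```python
import pandas as pd

# Toy tables following the column layout described in this document.
rna = pd.DataFrame({
    "gene_id": ["ENSG00000012048", "ENSG00000141510", "HYPOTHETICAL_ID"],
    "gene_name": ["BRCA1", "TP53", "GENE_X"],
    "log2fc": [1.23, -0.80, 0.40],
})
pro = pd.DataFrame({
    "protein_id": ["P38398", "P04637"],
    "gene_name": ["BRCA1", "TP53"],
    "log2fc": [0.85, -0.30],
})


def map_rna_to_protein(rna_df, pro_df):
    """Outer-join RNA and protein rows on gene symbol, keeping unmatched rows."""
    return rna_df.merge(
        pro_df, on="gene_name", how="outer",
        suffixes=("_rna", "_pro"), indicator=True,
    )


mapped = map_rna_to_protein(rna, pro)

# Share of RNA features that found a protein partner.
n_rna = int((mapped["_merge"] != "right_only").sum())
n_both = int((mapped["_merge"] == "both").sum())
mapping_rate = n_both / n_rna
```

Keeping unmatched rows (the outer join plus `indicator=True`) makes it easy to report which features failed to map rather than silently dropping them.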
2. Pathway Mapping
Supported databases:
- KEGG (Kyoto Encyclopedia of Genes and Genomes)
- Reactome
- WikiPathways
- GO (Gene Ontology) - Biological Process
3. Cross-Validation Methods
3.1 Directional Consistency Validation
- Checks whether genes, proteins, and metabolites in the same pathway change in the same direction
- Score: +1 (consistent), -1 (opposite), 0 (no data)
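The +1 / -1 / 0 rule can be sketched as a small scoring function. How the per-pair scores are combined into a pathway score is not stated here, so the averaging in `pathway_score` is an assumption made for illustration; both function names are hypothetical.

```python
import math


def direction_score(log2fc_a, log2fc_b):
    """Return +1 if both changes share a sign, -1 if opposite, 0 if either is missing or zero."""
    for value in (log2fc_a, log2fc_b):
        if value is None or math.isnan(value) or value == 0:
            return 0
    return 1 if (log2fc_a > 0) == (log2fc_b > 0) else -1


def pathway_score(pairs):
    """Average the pairwise scores over pairs that have data (assumed aggregation)."""
    informative = [s for s in (direction_score(a, b) for a, b in pairs) if s != 0]
    return sum(informative) / len(informative) if informative else 0.0


# Toy (RNA log2fc, protein log2fc) pairs for one pathway.
pairs = [(1.23, 0.85), (-0.80, -0.30), (0.40, None), (0.90, -1.10)]
score = pathway_score(pairs)
```

Under this aggregation a pathway score near +1 means most measured pairs agree in direction, and a negative score means disagreement dominates.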
3.2 Correlation Validation
- Pearson/Spearman correlation analysis
- Cross-omics expression profile clustering
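As a small illustration of the correlation step, the sketch below compares log2 fold changes of matched RNA and protein features with `scipy.stats`. The values are toy data, and correlating fold changes (rather than per-sample profiles) is an assumption of this example.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy log2 fold changes for six features measured in both layers (same order).
rna_log2fc = np.array([1.2, -0.8, 0.4, 2.1, -1.5, 0.1])
pro_log2fc = np.array([0.9, -0.3, 0.2, 1.0, -1.1, 0.3])

# Pearson measures linear agreement; Spearman measures rank agreement.
r, r_pvalue = pearsonr(rna_log2fc, pro_log2fc)
rho, rho_pvalue = spearmanr(rna_log2fc, pro_log2fc)
```

Spearman is often the safer default across omics layers because abundance scales differ and the relationship between layers need not be linear.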
3.3 Pathway Enrichment Concordance
- Independent enrichment analysis for each omics
- Common enriched pathway identification
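One way to picture enrichment concordance is to run an over-representation test per omics layer and intersect the significant pathways. The sketch below uses a hypergeometric test from `scipy.stats` in place of a full enrichment tool such as gseapy; the feature IDs, pathway names, and both helper functions are hypothetical, and no multiple-testing correction is applied.

```python
from scipy.stats import hypergeom


def ora_pvalue(hits, members, background):
    """One-sided hypergeometric p-value for over-representation of `members` in `hits`."""
    background = set(background)
    hits = set(hits) & background
    members = set(members) & background
    overlap = len(hits & members)
    # P(X >= overlap) when drawing len(hits) features from the background.
    return float(hypergeom.sf(overlap - 1, len(background), len(members), len(hits)))


def enriched_pathways(hits, pathways, background, alpha=0.05):
    """Names of pathways whose raw p-value falls below alpha."""
    return {
        name for name, members in pathways.items()
        if ora_pvalue(hits, members, background) < alpha
    }


# Hypothetical feature universe and pathway memberships on one shared ID space.
background = [f"g{i}" for i in range(100)]
pathways = {
    "pathway_A": [f"g{i}" for i in range(0, 10)],
    "pathway_B": [f"g{i}" for i in range(20, 30)],
}
rna_hits = ["g0", "g1", "g2", "g3", "g4", "g5", "g50", "g51"]
pro_hits = ["g0", "g1", "g2", "g3", "g60"]

per_omics = {
    "rna": enriched_pathways(rna_hits, pathways, background),
    "pro": enriched_pathways(pro_hits, pathways, background),
}
common = set.intersection(*per_omics.values())
```

In practice each layer would use its own background (measured genes, proteins, or metabolites), and p-values would be adjusted before intersecting.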
3.4 Network Topology Validation
- Construct a cross-omics regulatory network
- Identify key nodes (Hub genes/proteins/metabolites)
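Hub identification can be illustrated with networkx, which is listed among the dependencies. This is a minimal sketch on a toy edge list with hypothetical node labels; using degree centrality as the hub criterion is an assumption, and other centrality measures may suit a given network better.

```python
import networkx as nx

# Toy cross-omics edges; the "RNA:", "PRO:", and "MET:" prefixes mark the layer.
edges = [
    ("RNA:geneA", "PRO:protA"),
    ("PRO:protA", "MET:met1"),
    ("PRO:protA", "MET:met2"),
    ("RNA:geneB", "PRO:protB"),
    ("PRO:protB", "MET:met1"),
]

graph = nx.Graph()
graph.add_edges_from(edges)

# Degree centrality: fraction of the other nodes each node is connected to.
centrality = nx.degree_centrality(graph)
top_hub = max(centrality, key=centrality.get)
```

The same edge list could be written out as `network_edges.csv` for inspection in an external tool.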
Output
1. Integration Report (integration_report.md)
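```markdown
# Multi-Omics Integration Analysis Report

## Executive Summary
- Sample counts: RNA=30, Pro=28, Met=25
- Mapping success rate: RNA-Pro=85%, Pro-Met=62%
- Pathway coverage: 342 KEGG pathways

## Cross-Validation Results

### Highly Consistent Pathways (score > 0.8)
1. Glycolysis / Gluconeogenesis (score=0.92)
2. Citrate cycle (TCA cycle) (score=0.88)

### Conflicting Pathways (score < -0.3)
1. Fatty acid biosynthesis (score=-0.45)

## Recommendations
- Focus on: energy metabolism pathways
- Needs verification: data quality of lipid metabolism pathways
```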
2. External Visualization Tools (Not Included)
This tool generates analysis results that can be visualized using external tools. Users may export results to:
| Chart Type | Purpose | External Tool Required |
|---|---|---|
| Circos Plot | Cross-omics relationship panorama | matplotlib/circlize (user-installed) |
| Pathway Heatmap | Pathway-level changes | seaborn/ComplexHeatmap (user-installed) |
| Sankey Diagram | Data flow mapping | plotly (user-installed) |
| Network Graph | Molecular interaction network | networkx/Cytoscape (networkx is included) |
| Correlation Matrix | Cross-omics correlation | seaborn (user-installed) |
| Bubble Plot | Integrated enrichment analysis | ggplot2/plotly (user-installed) |
Note: This skill focuses on data integration and analysis. Visualization requires separate installation of plotting libraries by the user.
3. Output Files
| File | Description |
|---|---|
| `mapped_ids.json` | ID mapping results |
| `pathway_scores.csv` | Pathway cross-validation scores |
| `consistency_matrix.csv` | Cross-omics consistency matrix |
| `network_edges.csv` | Network edge list |
| `report.html` | Interactive HTML report |
Usage
Basic Usage
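```bash
python scripts/main.py \
  --rna rna_data.csv \
  --pro pro_data.csv \
  --met met_data.csv \
  --output ./results
```
Advanced Options
```bash
python scripts/main.py \
  --rna rna_data.csv \
  --pro pro_data.csv \
  --met met_data.csv \
  --pathway-db KEGG,Reactome \
  --id-mapping config/mapping.json \
  --method correlation+enrichment+network \
  --output ./results \
  --format html,csv,json
```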
Configuration
config/pathways.json
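```json
{
  "databases": {
    "KEGG": {
      "enabled": true,
      "organism": "hsa",
      "min_genes": 3
    },
    "Reactome": {
      "enabled": true,
      "min_genes": 5
    }
  },
  "mapping": {
    "rna_to_protein": "gene_symbol",
    "protein_to_metabolite": "enzyme_commission"
  }
}
```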
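To show how such a configuration might be consumed, the sketch below parses an inline copy of a `pathways.json`-style document and applies a size filter. It assumes `min_genes` means the minimum number of member genes a pathway needs to be analyzed, which this document does not state; `filter_pathways` and the toy pathway names are hypothetical.

```python
import json

# Inline copy of a pathways.json-style config; in practice it would be read from disk.
CONFIG_TEXT = """
{
  "databases": {
    "KEGG": {"enabled": true, "organism": "hsa", "min_genes": 3},
    "Reactome": {"enabled": true, "min_genes": 5}
  }
}
"""


def filter_pathways(pathways, db_options):
    """Keep pathways with at least min_genes members (assumed meaning of min_genes)."""
    min_genes = db_options.get("min_genes", 1)
    return {name: genes for name, genes in pathways.items() if len(genes) >= min_genes}


config = json.loads(CONFIG_TEXT)
enabled = [name for name, opts in config["databases"].items() if opts.get("enabled")]

# Hypothetical pathway memberships used only to exercise the filter.
toy_pathways = {"path_small": ["G1", "G2"], "path_large": ["G1", "G2", "G3", "G4"]}
kept = filter_pathways(toy_pathways, config["databases"]["KEGG"])
```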
Dependencies
- Python >= 3.8
- pandas >= 1.3.0
- numpy >= 1.21.0
- scipy >= 1.7.0
- scikit-learn >= 1.0.0
- networkx >= 2.6.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- gseapy >= 1.0.0 (Pathway enrichment analysis)
Version
- Version: 1.0.0
- Last Updated: 2026-02-06
- Author: OpenClaw Bioinformatics Team
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
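The path-related checklist items (no `../` traversal, output restricted to the workspace) can be illustrated with a small guard built on `pathlib`. This is a sketch of one possible check, not the skill's actual implementation; `resolve_in_workspace` and the `workspace` directory name are hypothetical.

```python
from pathlib import Path


def resolve_in_workspace(workspace, user_path):
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


workspace = Path("workspace")

# A path inside the workspace resolves normally.
safe = resolve_in_workspace(workspace, "results/pathway_scores.csv")

# A traversal attempt is rejected.
try:
    resolve_in_workspace(workspace, "../secrets.csv")
    escaped = True
except ValueError:
    escaped = False
```

Resolving before comparing matters: a plain string check for `..` misses absolute paths and symlinks, whereas comparing resolved paths covers both.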
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--rna` | str | Required | |
| `--pro` | str | Required | |
| `--met` | str | Required | |
| `--output` | str | './results' | |
| `--databases` | str | 'KEGG' | |
| `--create-sample` | str | Required | Create sample data for testing |
| `--format` | str | 'md' | |
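Skill Name: Multi-Omics Integration Strategist
Detailed Description:
Skill: Multi-Omics Integration Strategist (ID: 204)
Overview
Designs joint analysis schemes for multi-omics data (transcriptomics, RNA; proteomics, Pro; metabolomics, Met), cross-validates findings at the pathway level, and proposes systems-biology-level integration strategies.
Use Cases
- Systems biology mechanism research for complex diseases
- Biomarker discovery and validation
- Drug target identification and pathway validation
- Multi-omics data quality assessment and consistency analysis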
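Directory Structure
```
.
├── SKILL.md                    # This file: skill documentation
├── config/
│   └── pathways.json           # Pathway database configuration
├── scripts/
│   └── main.py                 # Main analysis script
├── templates/
│   └── report_template.md      # Analysis report template
└── examples/
    └── sample_data/            # Sample datasets
```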
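Input
Required Files
| File | Format | Description |
|---|---|---|
| `rna_data.csv` | CSV | Transcriptomics data: gene ID, expression value, differential analysis results |
| `pro_data.csv` | CSV | Proteomics data: protein ID, abundance value, differential analysis results |
| `met_data.csv` | CSV | Metabolomics data: metabolite ID, concentration value, differential analysis results |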
Input Format Specifications
RNA Data (rna_data.csv)
```csv
gene_id,gene_name,log2fc,pvalue,padj,sample_A,sample_B,...
ENSG00000012048,BRCA1,1.23,0.001,0.005,12.5,13.2,...
```
Protein Data (pro_data.csv)
```csv
protein_id,gene_name,log2fc,pvalue,padj,sample_A,sample_B,...
P38398,BRCA1,0.85,0.002,0.008,2450,2890,...
```
Metabolite Data (met_data.csv)
```csv
metabolite_id,metabolite_name,kegg_id,log2fc,pvalue,padj,...
C00187,Cholesterol,C00187,-1.45,0.003,0.012,...
```
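Integration Strategy
1. ID Mapping Layer
- RNA → Protein: Mapping through Gene Symbol / UniProt ID
- Protein → Metabolite: Association through KEGG/Reactome enzyme-reaction-metabolite links
- RNA → Metabolite: Indirect association through KEGG pathways
2. Pathway Mapping
Supported databases:
- KEGG (Kyoto Encyclopedia of Genes and Genomes)
- Reactome
- WikiPathways
- GO (Gene Ontology) - Biological Process
3. Cross-Validation Methods
3.1 Directional Consistency Validation
- Checks whether genes, proteins, and metabolites in the same pathway change in the same direction
- Score: +1 (consistent), -1 (opposite), 0 (no data)
3.2 Correlation Validation
- Pearson/Spearman correlation analysis
- Cross-omics expression profile clustering
3.3 Pathway Enrichment Concordance
- Independent enrichment analysis for each omics
- Common enriched pathway identification
3.4 Network Topology Validation
- Construct a cross-omics regulatory network
- Identify key nodes (hub genes/proteins/metabolites)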
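Output
1. Integration Report (integration_report.md)
```markdown
# Multi-Omics Integration Analysis Report

## Executive Summary
- Sample counts: RNA=30, Pro=28, Met=25
- Mapping success rate: RNA-Pro=85%, Pro-Met=62%
- Pathway coverage: 342 KEGG pathways

## Cross-Validation Results

### Highly Consistent Pathways (score > 0.8)
1. Glycolysis / Gluconeogenesis (score=0.92)
2. Citrate cycle (TCA cycle) (score=0.88)

### Conflicting Pathways (score < -0.3)
1. Fatty acid biosynthesis (score=-0.45)

## Recommendations
- Focus on: energy metabolism pathways
- Needs verification: data quality of lipid metabolism pathways
```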
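2. External Visualization Tools (Not Included)
This tool generates analysis results that can be visualized using external tools. Users may export results to:
| Chart Type | Purpose | External Tool Required |
|---|---|---|
| Circos Plot | Cross-omics relationship panorama | matplotlib/circlize (user-installed) |
| Pathway Heatmap | Pathway-level changes | seaborn/ComplexHeatmap (user-installed) |
| Sankey Diagram | Data flow mapping | plotly (user-installed) |
| Network Graph | Molecular interaction network | networkx/Cytoscape (networkx is included) |
| Correlation Matrix | Cross-omics correlation | seaborn (user-installed) |
| Bubble Plot | Integrated enrichment analysis | ggplot2/plotly (user-installed) |
Note: This skill focuses on data integration and analysis. Visualization requires separate installation of plotting libraries by the user.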
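3. Output Files
| File | Description |
|---|---|
| `mapped_ids.json` | ID mapping results |
| `pathway_scores.csv` | Pathway cross-validation scores |
| `consistency_matrix.csv` | Cross-omics consistency matrix |
| `network_edges.csv` | Network edge list |
| `report.html` | Interactive HTML report |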
Usage
Basic Usage
```bash
python scripts/main.py \
  --rna rna_data.csv \
  --pro pro_data.csv \
  --met met_data.csv \
  --output ./results
```
Advanced Options
```bash
python scripts/main.py \
  --rna rna_data.csv \
  --pro pro_data.csv \
  --met met_data.csv \
  --pathway-db KEGG,Reactome \
  --id-mapping config/mapping.json \
  --method correlation+enrichment+network \
  --output ./results \
  --format html,csv,json
```
Configuration
config/pathways.json
```json
{
  "databases": {
    "KEGG": {
      "enabled": true,
      "organism": "hsa",
      "min_genes": 3
    },
    "Reactome": {
      "enabled": true,
      "min_genes": 5
    }
  },
  "mapping": {
    "rna_to_protein": "gene_symbol",
    "protein_to_metabolite": "enzyme_commission"
  }
}
```
Dependencies
- Python >= 3.8
- pandas >= 1.3.0
- numpy >= 1.21.0
- scipy >= 1.7.0
- scikit-learn >= 1.0.0
- networkx >= 2.6.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- gseapy >= 1.0.0 (Pathway enrichment analysis)
Version
- Version: 1.0.0
- Last Updated: 2026-02-06
- Author: OpenClaw Bioinformatics Team
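Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |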
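Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited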
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
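Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard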