Gene Structure Mapper
Generate exon-intron structure diagrams for any gene symbol using the Ensembl REST API. Optionally overlay protein domain annotations (UniProt) and mark mutation hotspot positions. Outputs publication-ready SVG, PNG, or PDF figures.
✅ IMPLEMENTED — scripts/main.py is fully functional. Ensembl REST API, caching, matplotlib visualization, --domains, --mutations, and --demo are all implemented.
Quick Check
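
    python -m py_compile scripts/main.py
    python scripts/main.py --help
    python scripts/main.py --demo --output demo.png
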
When to Use
- Creating gene structure figures for manuscripts or presentations
- Visualizing splice variants and isoform differences
- Marking mutation positions on a gene diagram for functional annotation
- Overlaying domain boundaries on exon-intron maps
Workflow
1. Confirm the user objective, required inputs, and non-negotiable constraints before doing detailed work.
2. Validate that the request matches the documented scope, and stop early if the task would require unsupported assumptions.
3. Use the packaged script path or the documented reasoning path with only the inputs that are actually available.
4. Return a structured result that separates assumptions, deliverables, risks, and unresolved items.
5. If execution fails or inputs are incomplete, switch to the fallback path and state exactly what blocked full completion.
Fallback template: If scripts/main.py fails or the gene symbol is unrecognized, report: (a) the failure point, (b) whether a manual Ensembl/UCSC lookup can substitute, (c) which output formats can still be generated.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--gene`, `-g` | string | Yes* | Gene symbol or Ensembl ID (e.g., TP53, BRCA1, ENSG00000141510) |
| `--species` | string | No | Species name for Ensembl lookup (default: `homo_sapiens`) |
| `--format` | string | No | Output format: `png`, `svg`, or `pdf` (default: `png`) |
| `--output`, `-o` | string | No | Output file path (default: `<gene>_structure.<format>`) |
| `--domains` | flag | No | Fetch and overlay UniProt protein domain annotations |
| `--mutations` | string | No | Comma-separated codon positions to mark (e.g., `248,273`) |
| `--demo` | flag | No | Use hardcoded TP53 GRCh38 data; no internet required |
*Required unless --demo is used.
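The parameter table can be expressed as an `argparse` parser. The following is a minimal sketch, not the actual contents of scripts/main.py: the helper names `build_parser` and `parse_cli` are hypothetical, and defaulting the gene to TP53 in demo mode is an assumption based on the demo data being TP53.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented parameter table (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Draw an exon-intron structure diagram for a gene.",
    )
    parser.add_argument("--gene", "-g", help="Gene symbol or Ensembl ID, e.g. TP53")
    parser.add_argument("--species", default="homo_sapiens")
    parser.add_argument("--format", choices=["png", "svg", "pdf"], default="png")
    parser.add_argument("--output", "-o", default=None)
    parser.add_argument("--domains", action="store_true")
    parser.add_argument("--mutations", default=None)
    parser.add_argument("--demo", action="store_true")
    return parser


def parse_cli(argv):
    """Parse arguments and apply the 'required unless --demo' rule."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.demo and not args.gene:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--gene is required unless --demo is used (example: --gene TP53)")
    if args.demo and not args.gene:
        # Assumption: demo mode draws the hardcoded TP53 data.
        args.gene = "TP53"
    if args.output is None:
        # Documented default: <gene>_structure.<format>
        args.output = f"{args.gene}_structure.{args.format}"
    return args
```

For example, `parse_cli(["--gene", "KRAS", "--format", "pdf"])` yields an output path of `KRAS_structure.pdf`.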
Usage
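
    python scripts/main.py --gene TP53 --format png
    python scripts/main.py --gene BRCA1 --format png --domains --output brca1_structure.png
    python scripts/main.py --gene KRAS --mutations 12,13,61 --format pdf
    python scripts/main.py --demo
    python scripts/main.py --demo --output demo.svg --format svg
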
Implementation Notes (for script developer)
The script must implement:
1. Gene lookup: `GET https://rest.ensembl.org/lookup/symbol/homo_sapiens/{gene}?expand=1` to fetch exon coordinates. Cache the response to `.cache/{gene}_ensembl.json` to avoid repeated API calls. Add a 0.1 s delay between requests for batch lookups; the unauthenticated rate limit is 15 requests/second.
2. Unknown gene handling: catch HTTP 400/404 from Ensembl, print `Error: Gene not found: {gene_name}. Check the gene symbol and try again.`, and exit with code 1.
3. SVG/PNG/PDF output: use matplotlib or svgwrite to draw exon blocks (filled rectangles) and intron lines scaled to genomic coordinates.
4. `--domains` flag: fetch UniProt domain annotations and overlay colored domain blocks on the gene structure.
5. `--mutations` flag: accept comma-separated codon positions, map them to exon coordinates, and draw vertical markers.
6. `--demo` flag: use hardcoded TP53 GRCh38 exon coordinates (no internet required) to generate a demo visualization.
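The lookup, caching, and unknown-gene steps above could be sketched as follows, using only the standard library. This is an illustration under stated assumptions rather than the script's actual code: `lookup_gene`, `http_get_json`, and `GeneNotFoundError` are hypothetical names, and the `fetch` parameter exists only so the function can be exercised without network access. The sketch covers symbol lookups only; Ensembl stable IDs go through a separate endpoint (`/lookup/id/{id}`).

```python
import json
import time
import urllib.error
import urllib.request
from pathlib import Path

ENSEMBL_LOOKUP = "https://rest.ensembl.org/lookup/symbol/{species}/{gene}?expand=1"


class GeneNotFoundError(Exception):
    """Raised when Ensembl answers a symbol lookup with HTTP 400 or 404."""


def http_get_json(url):
    """Fetch a URL and decode the JSON body (default fetcher)."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def lookup_gene(gene, species="homo_sapiens", cache_dir=".cache",
                fetch=http_get_json, delay=0.1):
    """Return the Ensembl lookup record for a gene symbol, using a file cache."""
    cache_file = Path(cache_dir) / f"{gene}_ensembl.json"
    if cache_file.exists():
        # Cache hit: no request is made, so no delay is needed.
        return json.loads(cache_file.read_text())
    url = ENSEMBL_LOOKUP.format(species=species, gene=gene)
    try:
        data = fetch(url)
    except urllib.error.HTTPError as err:
        if err.code in (400, 404):
            raise GeneNotFoundError(
                f"Error: Gene not found: {gene}. Check the gene symbol and try again."
            ) from err
        raise
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data))
    # Pause after a real request so batch lookups stay under 15 requests/second.
    time.sleep(delay)
    return data
```

A caller would typically catch `GeneNotFoundError`, print its message, and exit with code 1, as the unknown-gene step describes.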
Known Limitations
- For genes with multiple isoforms, the script uses the canonical transcript (Ensembl `is_canonical` flag). Other isoforms are not visualized.
- Domain overlay (`--domains`) maps UniProt amino acid positions to genomic coordinates using CDS length; accuracy may vary for genes with complex splicing.
- Ensembl API responses are cached to `.cache/{gene}_ensembl.json`. Delete the cache file to force a fresh lookup.
- The unauthenticated Ensembl REST API rate limit is 15 requests/second; a 0.1 s delay is applied between batch requests.
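To make the amino-acid-to-genome mapping concrete, here is one possible approach: walk the coding segments in transcript order and consume the codon's base offset. This is a sketch under assumptions, not necessarily how scripts/main.py does it. The function name `codon_to_genomic` and the input shape (a list of 1-based inclusive `(start, end)` coding intervals plus a strand of `1` or `-1`) are hypothetical.

```python
def codon_to_genomic(codon, cds_segments, strand):
    """Return the genomic coordinate of the first base of a 1-based codon.

    cds_segments: 1-based inclusive (start, end) genomic intervals of coding
    sequence, in any order. strand: 1 for forward, -1 for reverse.
    """
    if codon < 1:
        raise ValueError("codon positions are 1-based")
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    # 0-based offset of the codon's first base within the spliced CDS.
    offset = (codon - 1) * 3
    # Transcript order is ascending on the forward strand, descending on reverse.
    ordered = sorted(cds_segments, reverse=(strand == -1))
    for start, end in ordered:
        length = end - start + 1
        if offset < length:
            return start + offset if strand == 1 else end - offset
        offset -= length
    raise ValueError(f"codon {codon} lies beyond the end of the CDS")
```

With toy segments `[(100, 109), (200, 209)]` on the forward strand, codon 5 starts at CDS offset 12, which falls 2 bases into the second segment, so the result is 202.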
Features
- Exon-intron visualization scaled to genomic coordinates
- Protein domain annotation overlay via UniProt (optional, `--domains`)
- Mutation position markers with configurable labels (`--mutations`)
- Publication-ready output in SVG, PNG, or PDF
- Demo mode for offline testing (`--demo`)
- Ensembl API response caching to avoid rate-limit issues
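Scaling to genomic coordinates amounts to a linear map from base positions to drawing units. The sketch below shows only that layout step, under assumptions: `scale_exons` and `intron_segments` are hypothetical helper names, exons are 1-based inclusive `(start, end)` pairs, and the 1000-unit canvas width is arbitrary.

```python
def scale_exons(exons, gene_start, gene_end, width=1000.0):
    """Map exon intervals to (x, rect_width) pairs on a canvas of the given width."""
    span = gene_end - gene_start + 1
    if span <= 0:
        raise ValueError("gene_end must not be smaller than gene_start")
    scale = width / span  # drawing units per base
    rects = []
    for start, end in sorted(exons):
        x = (start - gene_start) * scale
        rect_width = (end - start + 1) * scale
        rects.append((x, rect_width))
    return rects


def intron_segments(rects):
    """Return (x_start, x_end) line segments for the gaps between consecutive exons."""
    segments = []
    for (x1, w1), (x2, _w2) in zip(rects, rects[1:]):
        segments.append((x1 + w1, x2))
    return segments
```

Each `(x, rect_width)` pair could then be drawn as a filled rectangle (for example with matplotlib's `matplotlib.patches.Rectangle`), and each intron segment as a thin horizontal line at the exons' vertical midpoint.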
Output Requirements
Every response must make these explicit:
- Objective and deliverable
- Inputs used and assumptions introduced (e.g., genome build, transcript isoform selected)
- Workflow or decision path taken
- Core result: gene structure figure file path
- Constraints, risks, caveats (e.g., multi-isoform genes, annotation version)
- Unresolved items and next-step checks
Input Validation
This skill accepts: gene symbol inputs for structure visualization, with optional domain and mutation overlays.
If the request does not involve gene structure visualization — for example, asking to perform sequence alignment, predict protein structure, or analyze expression data — do not proceed. Instead respond:
"gene-structure-mapper is designed to visualize gene exon-intron structure. Your request appears to be outside this scope. Please provide a gene symbol and desired output format, or use a more appropriate tool for your task."
Error Handling
- If `--gene` is missing, state that the gene symbol is required and provide an example.
- If the gene symbol is not found in Ensembl (HTTP 400/404), print `Error: Gene not found: {gene_name}. Check the gene symbol and try again.` and exit with code 1.
- If `--mutations` contains non-numeric values, reject with: `Error: --mutations must be comma-separated integers (codon positions).`
- If the task goes outside the documented scope, stop instead of guessing or silently widening the assignment.
- If scripts/main.py fails, report the failure point, summarize what can still be completed safely, and provide a manual fallback.
- Do not fabricate files, citations, data, search results, or execution outcomes.
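The `--mutations` check above could look like the following minimal sketch. The function name `parse_mutations` is hypothetical, and rejecting position 0 is an added assumption (codon positions are taken to be 1-based); the error text is the one documented above.

```python
import re

MUTATIONS_ERROR = "Error: --mutations must be comma-separated integers (codon positions)."


def parse_mutations(raw):
    """Parse a string such as '248,273' into a list of codon positions."""
    positions = []
    for token in raw.split(","):
        token = token.strip()
        # Accept plain ASCII digits only; this rejects '', 'abc', '-5', and '1.5'.
        if not re.fullmatch(r"[0-9]+", token):
            raise ValueError(MUTATIONS_ERROR)
        position = int(token)
        if position < 1:
            # Assumption: codon positions start at 1.
            raise ValueError(MUTATIONS_ERROR)
        positions.append(position)
    return positions
```

A caller can catch the `ValueError`, print its message, and stop before any network request is made.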
Response Template
1. Objective
2. Inputs Received
3. Assumptions
4. Workflow
5. Deliverable
6. Risks and Limits
7. Next Checks