Chemistry Query Agent v1.4.1
Overview
Full-stack chemistry toolkit combining PubChem data retrieval with RDKit molecule processing, visualization, analysis, retrosynthesis, and synthesis planning. All outputs are structured JSON for easy downstream chaining. Generates PNG/SVG images on demand.
Key capabilities:
- PubChem compound lookup (info, structure, synthesis refs, similarity search)
- RDKit molecular properties (MW, logP, TPSA, HBD/HBA, rotatable bonds, aromatic rings)
- 2D molecule visualization (PNG/SVG)
- BRICS retrosynthesis with recursive depth control
- Multi-step synthesis route planning
- Forward reaction simulation with SMARTS templates
- Morgan fingerprints and similarity/substructure search
- 21 named reaction templates (Suzuki, Heck, Grignard, Wittig, Diels-Alder, etc.)
Quick Start
CODEBLOCK0
Scripts
scripts/query_pubchem.py
PubChem REST API queries with automatic name→CID resolution and timeout handling.
CODEBLOCK1
- info: Formula, MW, IUPAC name, InChIKey (JSON)
- structure: SMILES, InChI, image URL, or full JSON
- synthesis: Synonyms/references for a compound
- similar: Similar compounds by 2D fingerprint (top 20)
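The name→CID resolution and timeout handling described above can be sketched against the public PubChem PUG REST URL scheme. This is a minimal illustration, not the contents of `query_pubchem.py`: the helper names `cid_url`, `property_url`, and `fetch_json` are hypothetical, and the 15-second default timeout is borrowed from the v1.4.0 changelog entry.

```python
import json
import urllib.parse
import urllib.request

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_url(name: str) -> str:
    """Build the PUG REST URL that resolves a compound name to CIDs."""
    encoded = urllib.parse.quote(name, safe="")
    return f"{PUG_BASE}/compound/name/{encoded}/cids/JSON"


def property_url(cid: int) -> str:
    """Build the PUG REST URL for the fields the `info` query type reports."""
    props = "MolecularFormula,MolecularWeight,IUPACName,InChIKey"
    return f"{PUG_BASE}/compound/cid/{int(cid)}/property/{props}/JSON"


def fetch_json(url: str, timeout: float = 15.0) -> dict:
    """Fetch a URL and decode JSON; network or parse failures become an error dict."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # malformed JSON raises a ValueError subclass.
        return {"status": "error", "error": str(exc)}
```

Returning an error dictionary instead of raising keeps the output JSON-shaped, which matches the toolkit's stated goal of structured output for downstream chaining.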
scripts/rdkit_mol.py
RDKit cheminformatics engine. Resolves names via PubChem automatically.
CODEBLOCK2
| Action | Description | Key Args |
|---|---|---|
| props | MW, logP, TPSA, HBD, HBA, rotB, aromRings | `--smiles` |
| draw | 2D PNG/SVG (300×300) | `--smiles --output file.png --format png\|svg` |
| retro | BRICS recursive retrosynthesis | `--target <SMILES\|name> --depth N` |
| plan | Multi-step retro route | `--target <SMILES\|name> --steps N` |
| react | Forward reaction via SMARTS | `--reactants "smi1 smi2" --smarts "<SMARTS>"` |
| fingerprint | Morgan fingerprint bitvector | `--smiles --radius 2` |
| similarity | Tanimoto similarity scoring | `--query_smiles --target_smiles "smi1,smi2"` |
| substruct | Substructure matching | `--query_smiles --target_smiles "smi1,smi2"` |
| xyz | 3D coordinates (MMFF optimized) | `--smiles` |
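As a reminder of what the `similarity` action's score measures: for two fingerprints with on-bit sets A and B, the Tanimoto coefficient is |A ∩ B| / |A ∪ B|. The sketch below computes it in pure Python on sets of on-bit indices. It does not call RDKit and is not taken from `rdkit_mol.py`; the function names are hypothetical, and returning 0.0 for two empty fingerprints is an assumed convention.

```python
def tanimoto(bits_a, bits_b) -> float:
    """Tanimoto coefficient between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / union


def rank_by_similarity(query_bits, targets: dict) -> list:
    """Score every named target against the query, most similar first."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` shares two bits out of four distinct ones and evaluates to 0.5.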
scripts/chain_entry.py
Standard agent chain interface. Accepts `{"smiles": "...", "context": "..."}` or `{"name": "...", "context": "..."}`. Returns unified JSON with props, visualization, and retrosynthesis.
CODEBLOCK3
Output schema:
CODEBLOCK4
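The input contract above, together with the "graceful JSON error" behaviour listed under Tested With, can be sketched as follows. This is an assumption-laden illustration rather than the code in `chain_entry.py`: `parse_chain_input` and `error_envelope` are hypothetical names, the envelope uses only a subset of the documented output fields, and placing the message in `warnings` is a guess at how errors are reported.

```python
import json
from datetime import datetime, timezone


def parse_chain_input(raw: str):
    """Return (payload, error_message); exactly one of the two is None."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError) as exc:
        return None, f"invalid JSON: {exc}"
    if not isinstance(payload, dict):
        return None, "input must be a JSON object"
    has_smiles = isinstance(payload.get("smiles"), str) and bool(payload["smiles"].strip())
    has_name = isinstance(payload.get("name"), str) and bool(payload["name"].strip())
    if not (has_smiles or has_name):
        return None, "input needs a non-empty 'smiles' or 'name' field"
    return payload, None


def error_envelope(message: str) -> dict:
    """Build an error response that is still valid, chainable JSON."""
    return {
        "agent": "chemistry-query",
        "status": "error",
        "report": {},
        "warnings": [message],  # assumed location for the error text
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

With this shape, an empty payload such as `{}` yields an error envelope instead of a traceback, so a downstream agent can branch on `status`.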
scripts/templates.json
21 named reaction templates with SMARTS, expected yields, conditions, and references. Includes: Suzuki, Heck, Buchwald-Hartwig, Grignard, Wittig, Diels-Alder, Click, Sonogashira, Negishi, and more.
Chaining
1. Name → Full Profile: `chain_entry.py` with `{"name": "ibuprofen"}` → props + draw + retro
2. Chemistry → Pharmacology: output feeds directly into `pharma-pharmacology-agent`
3. Retro + Viz: get precursors, then draw each one
4. Suzuki Test: `--action react --reactants "c1ccccc1Br c1ccccc1B(O)O" --smarts "[c:1][Br:2].[c:3]B(O)O>>[c:1][c:3]"`
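The "Retro + Viz" pattern can be sketched as a loop that issues one `draw` call per precursor. The flags come from the `rdkit_mol.py` action table; everything else is an assumption: the helper names, the `precursor_N.png` file naming, the 30-second timeout, and the idea that precursor SMILES have already been extracted from the retro output (whose exact JSON layout is not documented here).

```python
import subprocess
import sys


def draw_command(smiles: str, output: str, fmt: str = "png") -> list:
    """Build the argv list for one draw call; no shell is involved."""
    return [
        sys.executable, "scripts/rdkit_mol.py",
        "--action", "draw",
        "--smiles", smiles,
        "--output", output,
        "--format", fmt,
    ]


def draw_all(precursors, timeout: float = 30.0) -> list:
    """Draw each precursor SMILES and collect per-molecule results."""
    results = []
    for index, smiles in enumerate(precursors):
        output = f"precursor_{index}.png"  # hypothetical naming scheme
        proc = subprocess.run(
            draw_command(smiles, output),
            capture_output=True, text=True, timeout=timeout,
        )
        results.append({"smiles": smiles, "output": output,
                        "returncode": proc.returncode})
    return results
```

Passing arguments as a list rather than a shell string means SMILES characters such as parentheses and brackets need no quoting.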
Tested With
All features verified end-to-end with RDKit 2024.03+:
| Molecule | SMILES | Tests Passed |
|---|---|---|
| Caffeine | `CN1C=NC2=C1C(=O)N(C(=O)N2C)C` | info, structure, props, draw, retro, plan, chain |
| Aspirin | `CC(=O)Oc1ccccc1C(=O)O` | info, structure, props, draw, retro, plan, chain |
| Sotorasib | PubChem name lookup | info, structure, props, draw, retro, chain |
| Ibuprofen | PubChem name lookup | info, structure, props, chain |
| Invalid SMILES | `XXXINVALID` | Graceful JSON error |
| Empty input | `{}` | Graceful JSON error |
Resources
- `references/api_endpoints.md`: PubChem API endpoint reference and rate limits
- INLINECODE24: Legacy reaction module
- INLINECODE25, `scripts/pubmed_search.py`, `scripts/admet_predict.py`: Additional query modules
scripts/advanced_chem.py
Advanced cheminformatics engine with 6 Tier 1 capabilities.
CODEBLOCK5
| Action | Description | Key Args |
|---|---|---|
| standardize | Salt stripping, charge normalization, tautomer enumeration | `--smiles` |
| descriptors | 217+ molecular descriptors (RDKit full set), QED, SA Score, Lipinski/Veber rules | `--smiles --descriptor_set all\|druglike\|physical\|topological` |
| scaffold | Murcko scaffold extraction, generic scaffolds, diversity analysis, R-group decomposition | `--smiles` or `--target_smiles "smi1,smi2,..." --rgroup_core <SMARTS>` |
| mcs | Maximum Common Substructure across 2+ molecules | `--target_smiles "smi1,smi2,..."` |
| mmpa | Matched Molecular Pair Analysis: find single-point transformations | `--target_smiles "smi1,smi2,..."` |
| chemspace | Chemical space visualization (PCA/t-SNE/UMAP scatter plot PNG) | `--target_smiles "smi1,smi2,..." --method pca\|tsne\|umap --output plot.png` |
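The Lipinski and Veber rules mentioned for the `descriptors` action are simple threshold checks on properties that the toolkit already reports. The sketch below applies the commonly cited cut-offs (Lipinski: MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10; Veber: rotatable bonds ≤ 10, TPSA ≤ 140 Å²) to a plain dictionary. The dictionary keys and function names are hypothetical, and whether `advanced_chem.py` uses exactly these thresholds is an assumption.

```python
def lipinski_violations(props: dict) -> list:
    """Return the labels of the Lipinski rule-of-five criteria that fail."""
    checks = [
        ("MW <= 500", props["mw"] <= 500),
        ("logP <= 5", props["logp"] <= 5),
        ("HBD <= 5", props["hbd"] <= 5),
        ("HBA <= 10", props["hba"] <= 10),
    ]
    return [label for label, ok in checks if not ok]


def veber_pass(props: dict) -> bool:
    """Veber criteria: at most 10 rotatable bonds and TPSA of at most 140."""
    return props["rotb"] <= 10 and props["tpsa"] <= 140


# Illustrative values for a small drug-like molecule, not tool output.
example = {"mw": 180.16, "logp": 1.3, "hbd": 1, "hba": 3, "rotb": 2, "tpsa": 63.6}
print(lipinski_violations(example), veber_pass(example))
```

Returning the failed criteria by label, rather than a bare count, makes the result easier to surface in a JSON report.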
Examples:
CODEBLOCK6
Changelog
v2.0.0 (2026-02-28)
- NEW: `advanced_chem.py` with 6 Tier 1 cheminformatics capabilities
  - Molecular Standardization & Tautomer Enumeration (salt stripping, charge normalization, canonical tautomers)
  - Extended Descriptors (217+ RDKit descriptors, QED, SA Score, Lipinski, Veber)
  - Scaffold Analysis (Murcko, generic scaffolds, diversity ratio, R-group decomposition)
  - Maximum Common Substructure (rdFMCS with coverage per molecule)
  - Matched Molecular Pair Analysis (rdMMPA fragmentation, transformation detection)
  - Chemical Space Visualization (PCA/t-SNE/UMAP with matplotlib scatter plots)
- Dependencies: scikit-learn, matplotlib (added)
v1.4.1 (2026-02-25)
- Security hardening: input sanitization for all subprocess calls (SMILES, compound names, output paths)
- Added `_sanitize_input()`: length limits, null-byte rejection for all user inputs
- Added `_sanitize_output_path()`: prevents path traversal, restricts extensions, blocks arbitrary file writes
- Added shell metacharacter rejection in INLINECODE40
- Added SMILES validation via RDKit in `chem_ui.py` before subprocess calls
- Added compound input validation in `query_pubchem.py` (length/null-byte checks)
- Added timeout to `resolve_target()` PubChem subprocess call
- Addresses VirusTotal "suspicious" classification for argument injection vectors
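The kinds of checks listed for v1.4.1 can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the shipped `_sanitize_input()` or `_sanitize_output_path()`: the function names, the 500-character limit, the allowed extension set, and the base-directory confinement strategy are all hypothetical.

```python
import os

MAX_INPUT_LEN = 500                    # assumed limit
ALLOWED_EXTENSIONS = {".png", ".svg"}  # assumed whitelist


def sanitize_input(value: str, max_len: int = MAX_INPUT_LEN) -> str:
    """Reject empty, over-long, or null-byte-containing strings."""
    if not isinstance(value, str) or not value:
        raise ValueError("input must be a non-empty string")
    if "\x00" in value:
        raise ValueError("null byte in input")
    if len(value) > max_len:
        raise ValueError(f"input longer than {max_len} characters")
    return value


def sanitize_output_path(path: str, base_dir: str = ".") -> str:
    """Resolve path under base_dir; reject traversal and unknown extensions."""
    sanitize_input(path)
    base = os.path.realpath(base_dir)
    resolved = os.path.realpath(os.path.join(base, path))
    # After resolving symlinks and "..", the result must still sit inside base.
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError("output path escapes the base directory")
    if os.path.splitext(resolved)[1].lower() not in ALLOWED_EXTENSIONS:
        raise ValueError("unsupported output extension")
    return resolved
```

Resolving the path before comparing it with the base directory is what catches inputs such as `../evil.png` or an absolute path, which a plain substring check for `..` would miss in the second case.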
v1.4.0 (2026-02-14)
- Fixed PubChem SMILES/InChI endpoint (property/CanonicalSMILES/TXT)
- Fixed `chain_entry.py` HTML entity corruption
- Fixed bricsretro to handle BRICSDecompose string output correctly
- Added request timeouts (15s) to all PubChem calls
- Graceful error handling for invalid SMILES and empty input
- Updated chain output version and schema
- Comprehensive end-to-end testing
v1.3.0
- RDKit props NoneType fixes, invalid SMILES graceful errors
- React fix: ReactionFromSmarts import
- Name resolution via PubChem for all RDKit actions
v1.2.0
- BRICS retrosynthesis + 21 reaction templates library
- Multi-step synthesis planning
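Chemistry Query Agent v1.4.1
Overview
Full-stack chemistry toolkit combining PubChem data retrieval with RDKit molecule processing, visualization, analysis, retrosynthesis, and synthesis planning. All outputs are structured JSON for easy downstream chaining. Generates PNG/SVG images on demand.
Key capabilities:
- PubChem compound lookup (info, structure, synthesis references, similarity search)
- RDKit molecular properties (molecular weight, logP, TPSA, hydrogen-bond donors/acceptors, rotatable bonds, aromatic rings)
- 2D molecule visualization (PNG/SVG)
- BRICS retrosynthesis with recursive depth control
- Multi-step synthesis route planning
- Forward reaction simulation with SMARTS templates
- Morgan fingerprints and similarity/substructure search
- 21 named reaction templates (Suzuki, Heck, Grignard, Wittig, Diels-Alder, etc.)
Quick Start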
```bash
# PubChem compound info
exec python scripts/query_pubchem.py --compound aspirin --type info
# Molecular properties from SMILES
exec python scripts/rdkit_mol.py --smiles "CC(=O)Oc1ccccc1C(=O)O" --action props
# Retrosynthesis
exec python scripts/rdkit_mol.py --target "CC(=O)Oc1ccccc1C(=O)O" --action retro --depth 2
# Full chain (name → props + draw + retro)
exec python scripts/chain_entry.py --input-json '{"name": "caffeine", "context": "user"}'
```
Scripts
scripts/query_pubchem.py
PubChem REST API queries with automatic name→CID resolution and timeout handling.
`--compound <name|CID> --type <type> [--format smiles|inchi|image|json] [--threshold 80]`
- info: Formula, MW, IUPAC name, InChIKey (JSON)
- structure: SMILES, InChI, image URL, or full JSON
- synthesis: Synonyms/references for a compound
- similar: Similar compounds by 2D fingerprint (top 20)
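scripts/rdkit_mol.py
RDKit cheminformatics engine. Resolves names via PubChem automatically.
`--smiles <SMILES> --action <action>`
| Action | Description | Key Args |
|---|---|---|
| props | MW, logP, TPSA, HBD, HBA, rotatable bonds, aromatic rings | `--smiles` |
| draw | 2D PNG/SVG image (300×300) | `--smiles --output file.png --format png\|svg` |
| retro | BRICS recursive retrosynthesis | `--target <SMILES\|name> --depth N` |
| plan | Multi-step retro route | `--target <SMILES\|name> --steps N` |
| react | Forward reaction via SMARTS | `--reactants "smi1 smi2" --smarts "<SMARTS>"` |
| fingerprint | Morgan fingerprint bitvector | `--smiles --radius 2` |
| similarity | Tanimoto similarity scoring | `--query_smiles --target_smiles "smi1,smi2"` |
| substruct | Substructure matching | `--query_smiles --target_smiles "smi1,smi2"` |
| xyz | 3D coordinates (MMFF optimized) | `--smiles` |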
scripts/chain_entry.py
Standard agent chain interface. Accepts `{"smiles": "...", "context": "..."}` or `{"name": "...", "context": "..."}`. Returns unified JSON with props, visualization, and retrosynthesis.
```bash
python scripts/chain_entry.py --input-json '{"name": "sotorasib", "context": "user"}'
```
Output schema:
```json
{
  "agent": "chemistry-query",
  "version": "1.4.0",
  "smiles": "<canonical SMILES>",
  "status": "success|error",
  "report": {"props": {...}, "draw": {...}, "retro": {...}},
  "risks": [],
  "viz": ["path/to/image.png"],
  "recommend_next": ["pharmacology", "toxicology"],
  "confidence": 0.95,
  "warnings": [],
  "timestamp": "ISO8601"
}
```
scripts/templates.json
21 named reaction templates with SMARTS, expected yields, conditions, and references. Includes: Suzuki, Heck, Buchwald-Hartwig, Grignard, Wittig, Diels-Alder, Click, Sonogashira, Negishi, and more.
Chaining
1. Name → Full Profile: `chain_entry.py` with `{"name": "ibuprofen"}` → props + draw + retro
2. Chemistry → Pharmacology: output feeds directly into `pharma-pharmacology-agent`
3. Retro + Viz: get precursors, then draw each one
4. Suzuki Test: `--action react --reactants "c1ccccc1Br c1ccccc1B(O)O" --smarts "[c:1][Br:2].[c:3]B(O)O>>[c:1][c:3]"`
Tested With
All features verified end-to-end with RDKit 2024.03+:
| Molecule | SMILES | Tests Passed |
|---|---|---|
| Caffeine | `CN1C=NC2=C1C(=O)N(C(=O)N2C)C` | info, structure, props, draw, retro, plan, chain |
| Aspirin | `CC(=O)Oc1ccccc1C(=O)O` | info, structure, props, draw, retro, plan, chain |
| Sotorasib | PubChem name lookup | info, structure, props, draw, retro, chain |
| Ibuprofen | PubChem name lookup | info, structure, props, chain |
| Invalid SMILES | `XXXINVALID` | Graceful JSON error |
| Empty input | `{}` | Graceful JSON error |
Resources
- `references/api_endpoints.md`: PubChem API endpoint reference and rate limits
- `scripts/rdkitreaction.py`: Legacy reaction module
- `scripts/chemblquery.py`, `scripts/pubmed_search.py`, `scripts/admet_predict.py`: Additional query modules
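scripts/advanced_chem.py
Advanced cheminformatics engine with 6 Tier 1 capabilities.
`--action <action> --smiles <SMILES> [options]`
| Action | Description | Key Args |
|---|---|---|
| standardize | Salt stripping, charge normalization, tautomer enumeration | `--smiles` |
| descriptors | 217+ molecular descriptors (RDKit full set), QED, SA Score, Lipinski/Veber rules | `--smiles --descriptor_set all\|druglike\|physical\|topological` |
| scaffold | Murcko scaffold extraction, generic scaffolds, diversity analysis, R-group decomposition | `--smiles` or `--target_smiles "smi1,smi2,..." --rgroup_core <SMARTS>` |
| mcs | Maximum Common Substructure across 2+ molecules | `--target_smiles "smi1,smi2,..."` |
| mmpa | Matched Molecular Pair Analysis: find single-point transformations | `--target_smiles "smi1,smi2,..."` |
| chemspace | Chemical space visualization (PCA/t-SNE/UMAP scatter plot PNG) | `--target_smiles "smi1,smi2,..." --method pca\|tsne\|umap --output plot.png` |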
Examples:
bash
# Standardize a salt form
python scripts/advanced_chem.py --action standardize --smiles "[Na+].CC(=O)[O-]"
# Full descriptors (217+)
python scripts/advanced_chem.py --action descriptors --smiles "CC(=O)Oc1ccccc1C(=O)O" --descriptor_set all
# Scaffold diversity of a set
python scripts/advanced_chem.py --action scaffold --target_smiles "CC(=O)Oc1ccccc1C(=O)O,CN1C=NC2=C1C(=O)N(C(=O)N2C)C,CC(C)Cc1ccc(cc1)C(C)C(=O)O"
# MCS of aspirin and salicylic acid