Bio-Ontology Mapper
Overview
Biomedical terminology normalization tool that maps free-text clinical and scientific concepts to standardized ontologies for semantic interoperability and data harmonization.
Key Capabilities:
- Multi-Ontology Support: SNOMED CT, MeSH, ICD-10, LOINC, RxNorm
- Entity Extraction: NER for diseases, symptoms, procedures, drugs
- Fuzzy Matching: Handle typos, abbreviations, and synonyms
- Confidence Scoring: Reliability metrics for each mapping
- Batch Processing: Normalize large datasets efficiently
- Cross-Mapping: Translate between ontology systems
When to Use
✅ Use this skill when:
- Normalizing clinical notes for EHR integration
- Standardizing terminology for multi-site studies
- Mapping legacy data to modern ontologies
- Preparing data for clinical data warehouses
- Converting free-text to coded data for analysis
- Building semantic search for biomedical literature
- Teaching biomedical informatics principles
❌ Do NOT use when:
- Clinical diagnosis or decision support → Use clinical decision tools
- Real-time patient care → Latency too high for acute settings
- Replacing expert coding → Use for pre-coding, final review needed
- Processing PHI without de-identification → Ensure HIPAA compliance
Integration:
- Upstream: clinical-data-cleaner (data preparation), ehr-semantic-compressor (text extraction)
- Downstream: clinical-data-cleaner (SDTM mapping), unstructured-medical-text-miner (NLP pipelines)
Core Capabilities
1. Entity Recognition and Mapping
Extract and map biomedical entities to ontologies:
CODEBLOCK0
Supported Ontologies:
| Ontology | Domain | Use Case |
|---|---|---|
| SNOMED CT | Clinical | EHR interoperability |
| MeSH | Literature | PubMed indexing |
| ICD-10 | Billing | Diagnosis codes |
| LOINC | Labs | Test result standardization |
| RxNorm | Drugs | Medication normalization |
| HGNC | Genes | Gene name standardization |
2. Cross-Ontology Translation
Map concepts between different ontologies:
CODEBLOCK1
Cross-Mapping Coverage:
- SNOMED CT ↔ ICD-10-CM (clinical modifications)
- MeSH ↔ SNOMED CT (literature to clinical)
- RxNorm ↔ ATC (drug classifications)
- LOINC ↔ SNOMED (lab to clinical)
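Because a single source concept can correspond to several target codes, a crosswalk behaves like a many-to-many lookup. The following is a minimal sketch under stated assumptions: the `Crosswalk` class is hypothetical and is not the skill's own cross-mapping engine, and the `SRC-A`/`TGT-*` identifiers are placeholders rather than real codes. Only the 22298006 to I21.9 pair comes from this document.

```python
from collections import defaultdict


class Crosswalk:
    """Hypothetical bidirectional, many-to-many concept crosswalk."""

    def __init__(self) -> None:
        self._forward = defaultdict(set)   # source ID -> target IDs
        self._reverse = defaultdict(set)   # target ID -> source IDs

    def add(self, source_id: str, target_id: str) -> None:
        self._forward[source_id].add(target_id)
        self._reverse[target_id].add(source_id)

    def to_target(self, source_id: str) -> list:
        # .get avoids creating empty entries for unknown IDs
        return sorted(self._forward.get(source_id, ()))

    def to_source(self, target_id: str) -> list:
        return sorted(self._reverse.get(target_id, ()))


snomed_to_icd10 = Crosswalk()
snomed_to_icd10.add("22298006", "I21.9")  # pair taken from this document
snomed_to_icd10.add("SRC-A", "TGT-1")     # placeholder IDs, not real codes
snomed_to_icd10.add("SRC-A", "TGT-2")

print(snomed_to_icd10.to_target("22298006"))
print(snomed_to_icd10.to_target("SRC-A"))
print(snomed_to_icd10.to_source("I21.9"))
```

Returning every candidate, rather than a single code, leaves the choice among one-to-many targets to a later disambiguation or review step.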
3. Batch Normalization
Process large datasets:
CODEBLOCK2
Performance:
- ~100 terms/second (with caching)
- ~20 terms/second (API lookup)
- Parallel processing for large datasets
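The gap between cached and uncached throughput suggests memoizing lookups and spreading the remaining work across workers. The sketch below shows one way this could look, using only the standard library. `lookup_term` and `LOCAL_TABLE` are hypothetical stand-ins for a slow ontology lookup, not functions from this skill, and the concept IDs are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import List, Optional

# Hypothetical stand-in for a slow lookup such as a remote API call.
LOCAL_TABLE = {"heart attack": "22298006", "high blood pressure": "38341003"}


def normalize(term: str) -> str:
    """Normalize before caching so spelling variants share one cache entry."""
    return term.strip().lower()


@lru_cache(maxsize=10_000)
def lookup_term(normalized_term: str) -> Optional[str]:
    """Return a concept ID for a normalized term, or None if unmapped."""
    return LOCAL_TABLE.get(normalized_term)


def map_terms(terms: List[str], max_workers: int = 4) -> List[Optional[str]]:
    """Map terms in parallel; Executor.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda t: lookup_term(normalize(t)), terms))


results = map_terms(["Heart attack", " heart attack ", "unknown term"])
print(results)
```

Threads suit I/O-bound lookups such as API calls. A CPU-bound matcher would more likely need a process pool instead.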
4. Confidence Scoring and Validation
Assess mapping reliability:
CODEBLOCK3
Scoring Factors:
- String similarity: Levenshtein distance, n-grams
- Context match: Surrounding words alignment
- Frequency: Common usage in corpus
- Semantic similarity: Vector embeddings
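The string-similarity factor can be illustrated in isolation. This sketch combines a normalized Levenshtein similarity with character trigram overlap (Jaccard). The equal weights and the function names are assumptions made for illustration; the skill's actual scoring formula, and how it folds in context, frequency, and embeddings, is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance onto [0, 1], where 1 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def ngrams(s: str, n: int = 3) -> set:
    padded = f" {s} "  # pad so word boundaries contribute n-grams
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def string_confidence(term: str, candidate: str,
                      w_edit: float = 0.5, w_ngram: float = 0.5) -> float:
    """Weighted blend of the two measures; weights are illustrative."""
    t, c = term.strip().lower(), candidate.strip().lower()
    return w_edit * edit_similarity(t, c) + w_ngram * ngram_similarity(t, c)


print(round(string_confidence("diabetis", "diabetes"), 2))
print(round(string_confidence("diabetis", "hypertension"), 2))
```

String measures alone cannot link "heart attack" to "myocardial infarction", which is why the context, frequency, and semantic factors are listed alongside them.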
Common Patterns
Pattern 1: Clinical Note Normalization
Scenario: Convert free-text diagnoses to SNOMED codes.
CODEBLOCK4
Post-Processing:
- Review low-confidence mappings (<0.8)
- Handle ambiguous terms manually
- Validate against clinical context
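These post-processing steps amount to splitting results into an accepted set and a review queue. A minimal sketch, assuming a hypothetical `Mapping` record with a `confidence` field; the output format that `scripts/main.py` actually writes may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mapping:
    """Hypothetical mapping record; field names are illustrative."""
    term: str
    concept_id: str
    confidence: float


def split_for_review(mappings: List[Mapping],
                     threshold: float = 0.8) -> Tuple[List[Mapping], List[Mapping]]:
    """Return (accepted, needs_review); the review queue is least confident first."""
    accepted = [m for m in mappings if m.confidence >= threshold]
    needs_review = sorted(
        (m for m in mappings if m.confidence < threshold),
        key=lambda m: m.confidence,
    )
    return accepted, needs_review


mappings = [
    Mapping("heart attack", "22298006", 0.95),
    Mapping("MI", "22298006", 0.55),       # ambiguous abbreviation
    Mapping("hart atack", "22298006", 0.72),
]
accepted, needs_review = split_for_review(mappings)
print([m.term for m in accepted])
print([m.term for m in needs_review])
```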
Pattern 2: Literature Indexing
Scenario: Map research paper keywords to MeSH.
CODEBLOCK5
Pattern 3: Drug Name Normalization
Scenario: Standardize medication names across datasets.
CODEBLOCK6
Pattern 4: EHR Data Harmonization
Scenario: Merge data from multiple hospital systems.
CODEBLOCK7
Complete Workflow Example
From free-text to coded database:
CODEBLOCK8
Quality Checklist
Pre-Mapping:
- [ ] Text preprocessed (lowercase, punctuation handled)
- [ ] Abbreviations expanded where possible
- [ ] Language identified (multilingual support)
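As a rough illustration of the first two pre-mapping checks, the sketch below lowercases text, strips punctuation, and expands a few abbreviations. The `ABBREVIATIONS` table and the `preprocess` function are assumptions for illustration only. A real pipeline would need a curated, context-aware abbreviation list, since context-free expansion can be wrong for ambiguous abbreviations.

```python
import re

# Illustrative abbreviation table; not shipped with this skill.
ABBREVIATIONS = {
    "htn": "hypertension",
    "dm": "diabetes mellitus",
    "sob": "shortness of breath",
}


def preprocess(text: str) -> str:
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    text = text.lower()
    text = re.sub(r"[^\w\s-]", " ", text)  # keep word characters, whitespace, hyphens
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


print(preprocess("HTN, DM."))
```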
During Mapping:
- [ ] Confidence threshold appropriate (>0.7 for clinical)
- [ ] Multiple candidates considered for ambiguous terms
- [ ] Context used for disambiguation
Post-Mapping:
- [ ] Low-confidence mappings flagged for review
- [ ] Unmapped terms logged
- [ ] CRITICAL: Clinical expert validation for high-stakes use
Before Production:
- [ ] Mapping accuracy validated on gold standard
- [ ] False positive rate acceptable (<5%)
- [ ] Recall acceptable for use case (>90%)
- [ ] API rate limits respected
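The accuracy checks can be computed directly against a gold-standard set. This sketch makes two assumptions. First, "false positive rate" is read as the share of emitted mappings that disagree with the gold standard; other definitions are possible. Second, the `evaluate_mappings` helper and the placeholder concept IDs are hypothetical and are not the skill's validator.

```python
from typing import Dict, Optional


def evaluate_mappings(predicted: Dict[str, Optional[str]],
                      gold: Dict[str, str]) -> Dict[str, float]:
    """Compare predicted concept IDs with gold-standard IDs, term by term."""
    emitted = {t: c for t, c in predicted.items() if c is not None and t in gold}
    correct = sum(1 for t, c in emitted.items() if gold[t] == c)
    wrong = len(emitted) - correct
    return {
        # share of gold terms mapped to the right concept
        "recall": correct / len(gold) if gold else 0.0,
        # share of emitted mappings that are wrong (assumed reading of "false positive rate")
        "wrong_mapping_rate": wrong / len(emitted) if emitted else 0.0,
    }


gold = {"a": "C1", "b": "C2", "c": "C3", "d": "C4"}           # placeholder IDs
predicted = {"a": "C1", "b": "C2", "c": "C9", "d": None}      # one wrong, one unmapped
metrics = evaluate_mappings(predicted, gold)
ready = metrics["recall"] > 0.90 and metrics["wrong_mapping_rate"] < 0.05
print(metrics, ready)
```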
Common Pitfalls
Mapping Errors:
- ❌ Abbreviation ambiguity → "MI" = Myocardial infarction OR Michigan
  - ✅ Use context; flag for manual review
- ❌ Outdated terms → Old terminology not in current ontology
  - ✅ Use historical mappings; update terminology
- ❌ False confidence → High score for wrong concept
  - ✅ Always review top-3 candidates
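The "use context; flag for manual review" advice can be sketched as a cue-word overlap check. Everything here is an illustrative assumption: the `SENSES` table, its cue words, and the `disambiguate` function are not part of this skill. A production system would more likely rely on a trained model or embeddings.

```python
import re
from typing import Optional, Tuple

# Hypothetical sense inventory with hand-picked context cues.
SENSES = {
    "mi": {
        "Myocardial infarction": {"chest", "pain", "troponin", "cardiac", "ecg"},
        "Michigan": {"state", "resident", "address", "detroit"},
    }
}


def disambiguate(abbrev: str, context: str) -> Tuple[Optional[str], bool]:
    """Return (chosen sense or None, needs_manual_review)."""
    senses = SENSES.get(abbrev.lower())
    if not senses:
        return None, True
    tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    scores = {sense: len(cues & tokens) for sense, cues in senses.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_sense, best_score = ranked[0]
    tie = len(ranked) > 1 and ranked[1][1] == best_score
    needs_review = best_score == 0 or tie  # no evidence, or no clear winner
    return (None if needs_review else best_sense), needs_review


print(disambiguate("MI", "Chest pain with elevated troponin, rule out MI"))
print(disambiguate("MI", "Seen for follow-up, MI noted"))
```

When no cue word appears, or two senses tie, the function returns no sense and flags the term rather than guessing.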
Technical Issues:
- ❌ API failures → No local fallback
  - ✅ Implement caching; use local reference files
- ❌ Version mismatches → Different ontology versions
  - ✅ Track ontology version used
- ❌ PHI exposure → Sending patient data to external APIs
  - ✅ De-identify before API calls; use local processing when possible
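Several of these remedies can be combined in one lookup wrapper: check a cache, try the remote service, fall back to a local table on network failure, and record the ontology version with each result. The sketch is hedged throughout. `lookup_with_fallback`, `failing_api`, and `ONTOLOGY_VERSION` are hypothetical names, and the input is assumed to be de-identified already.

```python
from typing import Callable, Dict, Optional

ONTOLOGY_VERSION = "example-release"  # assumption: set to the release actually loaded


def lookup_with_fallback(
    term: str,
    remote_lookup: Callable[[str], Optional[str]],
    local_table: Dict[str, str],
    cache: Dict[str, dict],
) -> dict:
    """Try the cache, then the remote service, then a local reference table."""
    key = term.strip().lower()
    if key in cache:
        return cache[key]
    try:
        concept_id = remote_lookup(key)
        source = "api"
    except OSError:  # ConnectionError and TimeoutError are OSError subclasses
        concept_id = local_table.get(key)
        source = "local"
    record = {
        "term": key,
        "concept_id": concept_id,
        "source": source,
        "ontology_version": ONTOLOGY_VERSION,  # stored so results stay traceable
    }
    if source == "api":
        cache[key] = record  # only cache authoritative answers
    return record


def failing_api(term: str) -> Optional[str]:
    """Simulates an outage for demonstration purposes."""
    raise ConnectionError("simulated outage")


cache: Dict[str, dict] = {}
record = lookup_with_fallback(
    "Heart Attack", failing_api, {"heart attack": "22298006"}, cache
)
print(record["concept_id"], record["source"])
```

Fallback results are deliberately left out of the cache so that a transient outage does not pin a possibly stale local answer.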
References
Available in references/ directory:
- snomed_ct_guide.md - SNOMED CT hierarchy and relationships
- INLINECODE6 - MeSH tree structure and qualifiers
- INLINECODE7 - Crosswalks between systems
- INLINECODE8 - Biomedical text processing
- INLINECODE9 - External service integration
- INLINECODE10 - Gold standard test sets
Scripts
Located in scripts/ directory:
- main.py - CLI interface for mapping
- INLINECODE13 - Core ontology mapping engine
- INLINECODE14 - Named entity recognition
- INLINECODE15 - Ontology-to-ontology translation
- INLINECODE16 - Confidence calculation
- INLINECODE17 - Large dataset handling
- INLINECODE18 - Mapping quality checks
- INLINECODE19 - Local storage for frequent lookups
Limitations
- Ambiguity: Many-to-many mappings common; context required
- Coverage: Rare diseases and new concepts may not be in ontologies
- Versioning: Ontology updates can change mappings over time
- Language: Best support for English; other languages limited
- Real-time: Not suitable for time-critical clinical applications
- API Dependency: Requires internet for most lookups (caching helps)
⚠️ Critical: Ontology mapping is for research and data integration, not clinical decision-making. Always validate mappings with domain experts before use in patient care contexts. Never process PHI without appropriate de-identification and compliance measures.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| INLINECODE20 | str | Required | Single term to map |
| INLINECODE21 | str | Required | Input file path |
| --output | str | Required | Output file path |
| --ontology | str | 'both' | |
| --threshold | float | 0.7 | |
| --format | str | 'json' | |
| --use-api | str | Required | Use UMLS/MeSH APIs |
| --api-key | str | Required | |
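Bio-Ontology Mapper
Overview
A biomedical terminology normalization tool that maps free-text clinical and scientific concepts to standardized ontologies, enabling semantic interoperability and data harmonization.
Key Capabilities:
- Multi-Ontology Support: SNOMED CT, MeSH, ICD-10, LOINC, RxNorm
- Entity Extraction: Named entity recognition for diseases, symptoms, procedures, and drugs
- Fuzzy Matching: Handles typos, abbreviations, and synonyms
- Confidence Scoring: Reliability metrics for each mapping
- Batch Processing: Normalizes large datasets efficiently
- Cross-Mapping: Translates between ontology systems
When to Use
✅ Use this skill when:
- Normalizing clinical notes for EHR integration
- Standardizing terminology for multi-site studies
- Mapping legacy data to modern ontologies
- Preparing data for clinical data warehouses
- Converting free text to coded data for analysis
- Building semantic search for biomedical literature
- Teaching biomedical informatics principles
❌ Do NOT use when:
- Clinical diagnosis or decision support → Use clinical decision tools
- Real-time patient care → Latency too high for acute settings
- Replacing expert coding → Use for pre-coding; final review needed
- Processing PHI without de-identification → Ensure HIPAA compliance
Integration:
- Upstream: clinical-data-cleaner (data preparation), ehr-semantic-compressor (text extraction)
- Downstream: clinical-data-cleaner (SDTM mapping), unstructured-medical-text-miner (NLP pipelines)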
Core Capabilities
1. Entity Recognition and Mapping
Extract and map biomedical entities to ontologies:
```python
from scripts.mapper import BioOntologyMapper

mapper = BioOntologyMapper()

# Map clinical text
result = mapper.map_text(
    text="Patient has diabetes and hypertension, taking metformin",
    ontologies=["snomed", "mesh", "rxnorm"],
    confidence_threshold=0.7,
)

for entity in result.entities:
    print(f"{entity.text} → {entity.concept_id} ({entity.ontology})")
    print(f"  Preferred term: {entity.preferred_term}")
    print(f"  Confidence: {entity.confidence:.2f}")
```
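Supported Ontologies:
| Ontology | Domain | Use Case |
|---|---|---|
| SNOMED CT | Clinical | EHR interoperability |
| MeSH | Literature | PubMed indexing |
| ICD-10 | Billing | Diagnosis codes |
| LOINC | Labs | Test result standardization |
| RxNorm | Drugs | Medication normalization |
| HGNC | Genes | Gene name standardization |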
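2. Cross-Ontology Translation
Map concepts between different ontologies:
```python
# Cross-map SNOMED to ICD-10
translation = mapper.cross_map(
    source_id="22298006",  # SNOMED: Myocardial infarction
    source_ontology="snomed",
    target_ontology="icd10",
)
print(f"ICD-10: {translation.target_id} - {translation.target_term}")
# Output: I21.9 - Acute myocardial infarction, unspecified
```
Cross-Mapping Coverage:
- SNOMED CT ↔ ICD-10-CM (clinical modifications)
- MeSH ↔ SNOMED CT (literature to clinical)
- RxNorm ↔ ATC (drug classifications)
- LOINC ↔ SNOMED (lab to clinical)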
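3. Batch Normalization
Process large datasets:
```python
# Batch process a CSV file
results = mapper.batch_map(
    input_file="clinical_terms.csv",
    text_column="diagnosis_description",
    ontologies=["snomed", "icd10"],
    output_format="csv",
    max_workers=4,
)
# Results include:
# - Original term
# - Mapped concept ID
# - Confidence score
# - Alternative mappings (if ambiguous)
```
Performance:
- ~100 terms/second (with caching)
- ~20 terms/second (API lookup)
- Parallel processing for large datasets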
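4. Confidence Scoring and Validation
Assess mapping reliability:
```python
scoring = mapper.score_mapping(
    term="heart attack",
    candidate="22298006",  # Myocardial infarction
    factors=["string_similarity", "context_match", "frequency"],
)
print(f"Overall confidence: {scoring.confidence:.2f}")
print(f"Breakdown: {scoring.factors}")
```
Scoring Factors:
- String similarity: Levenshtein distance, n-grams
- Context match: Surrounding words alignment
- Frequency: Common usage in corpus
- Semantic similarity: Vector embeddings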
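Common Patterns
Pattern 1: Clinical Note Normalization
Scenario: Convert free-text diagnoses to SNOMED codes.
```bash
# Normalize clinical notes
python scripts/main.py \
  --input notes.csv \
  --column diagnosis_text \
  --ontology snomed \
  --threshold 0.8 \
  --output coded_diagnoses.csv
# Result: "heart attack" → 22298006 (Myocardial infarction)
```
Post-Processing:
- Review low-confidence mappings (<0.8)
- Handle ambiguous terms manually
- Validate against clinical context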
Pattern 2: Literature Indexing
Scenario: Map research paper keywords to MeSH.
```python
# Map keywords to MeSH
mesh_terms = mapper.map_to_mesh(
    keywords=["cancer immunotherapy", "checkpoint inhibitors", "PD-1"],
    include_tree_numbers=True,
    include_qualifiers=True,
)
for term in mesh_terms:
    print(f"{term.input} → {term.descriptor}")
    print(f"  Tree numbers: {term.tree_numbers}")
    print(f"  Entry terms: {term.synonyms}")
```
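Pattern 3: Drug Name Normalization
Scenario: Standardize medication names across datasets.
```python
# Standardize drug names
drugs = ["Tylenol", "Advil", "Motrin", "acetaminophen"]
for drug in drugs:
    result = mapper.map_to_rxnorm(drug)
    print(f"{drug} → {result.rxcui}: {result.name}")
# Tylenol → 161: Acetaminophen
# Advil → 5640: Ibuprofen
# Motrin → 5640: Ibuprofen
```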
Pattern 4: EHR Data Harmonization
Scenario: Merge data from multiple hospital systems.
```bash
# Harmonize diagnosis data from 3 hospitals
python scripts/main.py \
  --batch \
  --inputs hospital_a.csv,hospital_b.csv,hospital_c.csv \
  --target-ontology snomed \
  --cross-map-to icd10 \
  --output harmonized_data.csv
```
Complete Workflow Example
From free text to coded database:
```python
from scripts.mapper import BioOntologyMapper
from scripts.validator import MappingValidator

# Initialize
mapper = BioOntologyMapper()
validator = MappingValidator()

# Step 1: Extract entities from text
clinical_note = "Patient has type 2 diabetes and hypertension..."
entities = mapper.extract_entities(clinical_note)

# Step 2: Map to SNOMED
mappings = []
for entity in entities:
    mapping = mapper.map_to_snomed(
        entity.text,
        context=clinical_note,
        top_n=3,
    )
    mappings.append(mapping)

# Step 3: Validate mappings
for mapping in mappings:
    validation = validator.validate(
        mapping,
        check_clinical_plausibility=True,
    )
    if not validation.is_valid:
        print(f"Review needed: {mapping}")

# Step 4: Export to database format
db_records = [m.to_database_record() for m in mappings]
```
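Quality Checklist
Pre-Mapping:
- [ ] Text preprocessed (lowercase, punctuation handled)
- [ ] Abbreviations expanded where possible
- [ ] Language identified (multilingual support)
During Mapping:
- [ ] Confidence threshold appropriate (>0.7 for clinical)
- [ ] Multiple candidates considered for ambiguous terms
- [ ] Context used for disambiguation
Post-Mapping:
- [ ] Low-confidence mappings flagged for review
- [ ] Unmapped terms logged
- [ ] CRITICAL: Clinical expert validation for high-stakes use
Before Production:
- [ ] Mapping accuracy validated on gold standard
- [ ] False positive rate acceptable (<5%)
- [ ] Recall acceptable for use case (>90%)
- [ ] API rate limits respected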
Common Pitfalls
Mapping Errors:
- ❌ Abbreviation ambiguity → "MI" = Myocardial infarction OR Michigan
  - ✅ Use context; flag for manual review
-