Bio-Ontology Mapper

Overview

Biomedical terminology normalization tool that maps free-text clinical and scientific concepts to standardized ontologies for semantic interoperability and data harmonization.

Key Capabilities:

- Multi-Ontology Support: SNOMED CT, MeSH, ICD-10, LOINC, RxNorm
- Entity Extraction: NER for diseases, symptoms, procedures, drugs
- Fuzzy Matching: Handle typos, abbreviations, and synonyms
- Confidence Scoring: Reliability metrics for each mapping
- Batch Processing: Normalize large datasets efficiently
- Cross-Mapping: Translate between ontology systems

When to Use

✅ Use this skill when:

- Normalizing clinical notes for EHR integration
- Standardizing terminology for multi-site studies
- Mapping legacy data to modern ontologies
- Preparing data for clinical data warehouses
- Converting free-text to coded data for analysis
- Building semantic search for biomedical literature
- Teaching biomedical informatics principles

❌ Do NOT use when:

- Clinical diagnosis or decision support → Use clinical decision tools
- Real-time patient care → Latency too high for acute settings
- Replacing expert coding → Use for pre-coding, final review needed
- Processing PHI without de-identification → Ensure HIPAA compliance

Integration:

- Upstream: clinical-data-cleaner (data preparation), ehr-semantic-compressor (text extraction)
- Downstream: clinical-data-cleaner (SDTM mapping), unstructured-medical-text-miner (NLP pipelines)

Core Capabilities

1. Entity Recognition and Mapping

Extract and map biomedical entities to ontologies:

CODEBLOCK0

Supported Ontologies:

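| Ontology | Domain | Use Case |
|---|---|---|
| SNOMED CT | Clinical | EHR interoperability |
| MeSH | Literature | PubMed indexing |
| ICD-10 | Billing | Diagnosis coding |
| LOINC | Laboratory | Test result standardization |
| RxNorm | Drugs | Medication name normalization |
| HGNC | Genes | Gene name standardization |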

2. Cross-Ontology Translation

Map concepts between different ontologies:

CODEBLOCK1

Cross-Mapping Coverage:

- SNOMED CT ↔ ICD-10-CM (clinical modifications)
- MeSH ↔ SNOMED CT (literature to clinical)
- RxNorm ↔ ATC (drug classifications)
- LOINC ↔ SNOMED (lab to clinical)
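A cross-mapping of this kind can be pictured as a lookup in a crosswalk table. The sketch below is only an illustration under stated assumptions: the `CROSSWALK` table, `cross_map`, and `invert` are hypothetical names rather than the tool's API, and the single entry reuses the SNOMED CT to ICD-10 pair shown elsewhere in this document. Real crosswalks are directional and often one-to-many, so an inverted table should be read as a list of candidates, not as validated mappings.

```python
from typing import Dict, List, Tuple

# Key: (source ontology, target ontology, source concept ID) -> target IDs.
# Hypothetical in-memory table; a real crosswalk would be loaded from a
# licensed reference file.
Crosswalk = Dict[Tuple[str, str, str], List[str]]

CROSSWALK: Crosswalk = {
    ("snomed", "icd10", "22298006"): ["I21.9"],  # myocardial infarction
}


def cross_map(source_id: str, source_ontology: str, target_ontology: str,
              table: Crosswalk = CROSSWALK) -> List[str]:
    """Return every target ID for a source concept (empty list if unmapped)."""
    return list(table.get((source_ontology, target_ontology, source_id), []))


def invert(table: Crosswalk) -> Crosswalk:
    """Build the reverse-direction candidate table from a crosswalk."""
    inverse: Crosswalk = {}
    for (src, dst, src_id), targets in table.items():
        for target_id in targets:
            inverse.setdefault((dst, src, target_id), []).append(src_id)
    return inverse
```

Returning a list rather than a single ID keeps one-to-many mappings visible to the caller instead of silently picking one.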

3. Batch Normalization

Process large datasets:

CODEBLOCK2

Performance:

- ~100 terms/second (with caching)
- ~20 terms/second (API lookup)
- Parallel processing for large datasets
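The gap between cached and uncached throughput suggests a memoized lookup combined with a worker pool. The following is a minimal sketch of that idea using only the standard library; `LOCAL_INDEX`, `lookup_term`, and `batch_lookup` are hypothetical stand-ins, and the dictionary lookup takes the place of whatever API call the tool actually makes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import List, Optional, Tuple

# Hypothetical stand-in for a remote terminology service.
LOCAL_INDEX = {
    "heart attack": "22298006",
    "myocardial infarction": "22298006",
}


@lru_cache(maxsize=10_000)
def lookup_term(term: str) -> Optional[str]:
    """Resolve one normalized term; repeated terms are served from the cache."""
    return LOCAL_INDEX.get(term)


def batch_lookup(terms: List[str],
                 max_workers: int = 4) -> List[Tuple[str, Optional[str]]]:
    """Look up many terms in parallel, preserving the input order."""
    normalized = [t.strip().lower() for t in terms]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lookup_term, normalized))
    return list(zip(terms, codes))
```

Threads suit this case because the work is assumed to be I/O-bound (network lookups); a CPU-bound matcher would more likely call for processes.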

4. Confidence Scoring and Validation

Assess mapping reliability:

CODEBLOCK3

Scoring Factors:

- String similarity: Levenshtein distance, n-grams
- Context match: Surrounding words alignment
- Frequency: Common usage in corpus
- Semantic similarity: Vector embeddings
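To make the string-similarity factor concrete, here is a small sketch that combines a normalized Levenshtein distance with character n-gram overlap. The function names and the equal weights are assumptions chosen for illustration, not the tool's actual scoring formula, and the context, frequency, and embedding factors are left out.

```python
from typing import Dict, Optional, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance onto [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams (strings padded with spaces)."""
    def grams(s: str) -> Set[str]:
        padded = f" {s} "
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def combined_confidence(term: str, candidate: str,
                        weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mix of both similarities; the weights are illustrative."""
    weights = weights or {"edit": 0.5, "ngram": 0.5}
    t, c = term.lower().strip(), candidate.lower().strip()
    score = (weights["edit"] * edit_similarity(t, c)
             + weights["ngram"] * ngram_similarity(t, c))
    return round(score, 4)
```

A purely string-based score cannot link "heart attack" to "myocardial infarction", which is why the list above also names context, frequency, and semantic similarity.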

Common Patterns

Pattern 1: Clinical Note Normalization

Scenario: Convert free-text diagnoses to SNOMED codes.

CODEBLOCK4

Post-Processing:

- Review low-confidence mappings (<0.8)
- Handle ambiguous terms manually
- Validate against clinical context
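The review step above amounts to splitting mappings by a confidence threshold. A minimal sketch, assuming a hypothetical `Mapping` record with `term`, `concept_id`, and `confidence` fields (the tool's real result objects may look different):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mapping:
    """Hypothetical mapping record used only for this illustration."""
    term: str
    concept_id: str
    confidence: float


def triage(mappings: List[Mapping],
           threshold: float = 0.8) -> Tuple[List[Mapping], List[Mapping]]:
    """Split mappings into auto-accepted and needs-review groups."""
    accepted = [m for m in mappings if m.confidence >= threshold]
    review = [m for m in mappings if m.confidence < threshold]
    return accepted, review
```

For example, a mapping of "heart attack" at 0.93 would be accepted, while "MI" at 0.55 would be routed to manual review.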

Pattern 2: Literature Indexing

Scenario: Map research paper keywords to MeSH.

CODEBLOCK5

Pattern 3: Drug Name Normalization

Scenario: Standardize medication names across datasets.

CODEBLOCK6

Pattern 4: EHR Data Harmonization

Scenario: Merge data from multiple hospital systems.

CODEBLOCK7

Complete Workflow Example

From free-text to coded database:

CODEBLOCK8

Quality Checklist

Pre-Mapping:

- [ ] Text preprocessed (lowercase, punctuation handled)
- [ ] Abbreviations expanded where possible
- [ ] Language identified (multilingual support)
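The first two pre-mapping checks could be implemented along the following lines. This is a sketch under assumptions: the `ABBREVIATIONS` table is a tiny illustrative sample, and `preprocess` is a hypothetical helper rather than part of the shipped scripts. Ambiguous abbreviations such as "MI" are deliberately left out of the table so they are not expanded blindly.

```python
import re
from typing import Dict

# Illustrative sample only; a real table would be curated per site.
ABBREVIATIONS: Dict[str, str] = {
    "htn": "hypertension",
    "dm": "diabetes mellitus",
}


def preprocess(text: str, abbreviations: Dict[str, str] = ABBREVIATIONS) -> str:
    """Lowercase, drop punctuation (keeping hyphens), expand abbreviations."""
    lowered = text.lower()
    cleaned = re.sub(r"[^\w\s-]", " ", lowered)
    tokens = cleaned.split()
    expanded = [abbreviations.get(token, token) for token in tokens]
    return " ".join(expanded)
```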

During Mapping:

- [ ] Confidence threshold appropriate (>0.7 for clinical)
- [ ] Multiple candidates considered for ambiguous terms
- [ ] Context used for disambiguation

Post-Mapping:

- [ ] Low-confidence mappings flagged for review
- [ ] Unmapped terms logged
- [ ] CRITICAL: Clinical expert validation for high-stakes use

Before Production:

- [ ] Mapping accuracy validated on gold standard
- [ ] False positive rate acceptable (<5%)
- [ ] Recall acceptable for use case (>90%)
- [ ] API rate limits respected
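The accuracy checks in this list can be computed against a gold-standard set roughly as follows. The checklist does not define "false positive rate", so this sketch takes it to mean the share of emitted mappings that disagree with the gold standard, which is only one possible reading. `evaluate` and `meets_targets` are hypothetical helpers.

```python
from typing import Dict, Optional


def evaluate(predicted: Dict[str, Optional[str]],
             gold: Dict[str, str]) -> Dict[str, float]:
    """Compare predicted concept IDs with gold-standard IDs, term by term."""
    emitted = {term: code for term, code in predicted.items()
               if code is not None and term in gold}
    correct = sum(1 for term, code in emitted.items() if gold[term] == code)
    wrong = len(emitted) - correct
    return {
        "precision": correct / len(emitted) if emitted else 0.0,
        "recall": correct / len(gold) if gold else 0.0,
        # Assumed definition: wrong mappings among all emitted mappings.
        "false_positive_rate": wrong / len(emitted) if emitted else 0.0,
    }


def meets_targets(metrics: Dict[str, float],
                  max_false_positive_rate: float = 0.05,
                  min_recall: float = 0.90) -> bool:
    """Apply the checklist thresholds (<5% false positives, >90% recall)."""
    return (metrics["false_positive_rate"] < max_false_positive_rate
            and metrics["recall"] > min_recall)
```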

Common Pitfalls

Mapping Errors:

- ❌ Abbreviation ambiguity → "MI" = Myocardial infarction OR Michigan

- ✅ Use context; flag for manual review

- ❌ Outdated terms → Old terminology not in current ontology

- ✅ Use historical mappings; update terminology

- ❌ False confidence → High score for wrong concept

- ✅ Always review top-3 candidates
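As an illustration of using context for an ambiguous abbreviation, the sketch below scores each candidate sense by counting cue words found in the surrounding text and flags the term for review when there is no evidence or a tie. The `SENSES` cue lists are invented for this example and would need clinical curation; `disambiguate` is a hypothetical helper.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Invented cue words per sense, for illustration only.
SENSES: Dict[str, Dict[str, Set[str]]] = {
    "mi": {
        "myocardial infarction": {"chest", "troponin", "cardiac", "ecg"},
        "Michigan": {"state", "address", "resident", "detroit"},
    },
}


def disambiguate(abbreviation: str, context: str) -> Tuple[Optional[str], bool]:
    """Return (chosen sense or None, needs_manual_review)."""
    senses = SENSES.get(abbreviation.lower())
    if not senses:
        return None, True
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {sense: len(cues & words) for sense, cues in senses.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_sense, best_score = ranked[0]
    tie = len(ranked) > 1 and ranked[1][1] == best_score
    needs_review = best_score == 0 or tie
    return (None if needs_review else best_sense), needs_review
```

Returning the review flag alongside the sense keeps the "flag for manual review" path explicit instead of forcing a guess.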

Technical Issues:

- ❌ API failures → No local fallback

- ✅ Implement caching; use local reference files

- ❌ Version mismatches → Different ontology versions

- ✅ Track ontology version used

- ❌ PHI exposure → Sending patient data to external APIs

- ✅ De-identify before API calls; use local processing when possible
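The caching and local-fallback advice can be sketched as a wrapper around the remote lookup. Everything here is an assumption made for illustration: `remote_lookup` stands for whichever external call is used, `local_table` for a local reference file loaded into memory, and `ontology_version` for whatever version label the deployment tracks. De-identification is assumed to have happened before this function is called.

```python
from typing import Callable, Dict, Optional


def lookup_with_fallback(term: str,
                         remote_lookup: Callable[[str], Optional[str]],
                         local_table: Dict[str, str],
                         cache: Dict[str, str],
                         ontology_version: str = "unknown") -> Dict[str, Optional[str]]:
    """Try cache, then the remote service, then the local reference table."""
    key = term.strip().lower()
    if key in cache:
        concept_id: Optional[str] = cache[key]
        source = "cache"
    else:
        try:
            concept_id = remote_lookup(key)
            source = "api"
        except (ConnectionError, TimeoutError):
            # Service unreachable: fall back to the local reference data.
            concept_id = local_table.get(key)
            source = "local"
        if concept_id is not None:
            cache[key] = concept_id
    # Record the ontology version so later version mismatches are traceable.
    return {"concept_id": concept_id, "source": source,
            "ontology_version": ontology_version}
```

Only successful lookups are cached, so a term that failed during an outage is retried once the service is back.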

References

Available in references/ directory:

- snomed_ct_guide.md - SNOMED CT hierarchy and relationships
- INLINECODE6 - MeSH tree structure and qualifiers
- INLINECODE7 - Crosswalks between systems
- INLINECODE8 - Biomedical text processing
- INLINECODE9 - External service integration
- INLINECODE10 - Gold standard test sets

Scripts

Located in scripts/ directory:

- main.py - CLI interface for mapping
- INLINECODE13 - Core ontology mapping engine
- INLINECODE14 - Named entity recognition
- INLINECODE15 - Ontology-to-ontology translation
- INLINECODE16 - Confidence calculation
- INLINECODE17 - Large dataset handling
- INLINECODE18 - Mapping quality checks
- INLINECODE19 - Local storage for frequent lookups

Limitations

- Ambiguity: Many-to-many mappings common; context required
- Coverage: Rare diseases and new concepts may not be in ontologies
- Versioning: Ontology updates can change mappings over time
- Language: Best support for English; other languages limited
- Real-time: Not suitable for time-critical clinical applications
- API Dependency: Requires internet for most lookups (caching helps)

⚠️ Critical: Ontology mapping is for research and data integration, not clinical decision-making. Always validate mappings with domain experts before use in patient care contexts. Never process PHI without appropriate de-identification and compliance measures.

Parameters

Parameter	Type	Default	Description
INLINECODE20	str	Required	Single term to map
INLINECODE21

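Bio-Ontology Mapper

Overview

Biomedical terminology normalization tool that maps free-text clinical and scientific concepts to standardized ontologies for semantic interoperability and data harmonization.

Key Capabilities:

- Multi-Ontology Support: SNOMED CT, MeSH, ICD-10, LOINC, RxNorm
- Entity Extraction: Named entity recognition for diseases, symptoms, procedures, and drugs
- Fuzzy Matching: Handles typos, abbreviations, and synonyms
- Confidence Scoring: Reliability metrics for each mapping
- Batch Processing: Normalizes large datasets efficiently
- Cross-Mapping: Translates between ontology systems

When to Use

✅ Use this skill when:

- Normalizing clinical notes for integration into electronic health records
- Standardizing terminology for multi-site studies
- Mapping legacy data to modern ontologies
- Preparing data for clinical data warehouses
- Converting free text to coded data for analysis
- Building semantic search for biomedical literature
- Teaching biomedical informatics principles

❌ Do NOT use when:

- Clinical diagnosis or decision support → Use clinical decision tools
- Real-time patient care → Latency too high for acute settings
- Replacing expert coding → Use for pre-coding; final review needed
- Processing protected health information that has not been de-identified → Ensure HIPAA compliance

Integration:

- Upstream: clinical-data-cleaner (data preparation), ehr-semantic-compressor (text extraction)
- Downstream: clinical-data-cleaner (SDTM mapping), unstructured-medical-text-miner (NLP pipelines)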

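Core Capabilities

1. Entity Recognition and Mapping

Extract biomedical entities and map them to ontologies:

```python
from scripts.mapper import BioOntologyMapper

mapper = BioOntologyMapper()

# Map clinical text
result = mapper.map_text(
    text="Patient has diabetes and hypertension and is taking metformin",
    ontologies=["snomed", "mesh", "rxnorm"],
    confidence_threshold=0.7,
)

for entity in result.entities:
    print(f"{entity.text} → {entity.concept_id} ({entity.ontology})")
    print(f"  Preferred term: {entity.preferred_term}")
    print(f"  Confidence: {entity.confidence:.2f}")
```

Supported Ontologies:

| Ontology | Domain | Use Case |
|---|---|---|
| SNOMED CT | Clinical | EHR interoperability |
| MeSH | Literature | PubMed indexing |
| ICD-10 | Billing | Diagnosis coding |
| LOINC | Laboratory | Test result standardization |
| RxNorm | Drugs | Medication name normalization |
| HGNC | Genes | Gene name standardization |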

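2. Cross-Ontology Translation

Map concepts between different ontologies:

```python
# Cross-map SNOMED to ICD-10
translation = mapper.cross_map(
    source_id="22298006",  # SNOMED: Myocardial infarction
    source_ontology="snomed",
    target_ontology="icd10",
)

print(f"ICD-10: {translation.target_id} - {translation.target_term}")
# Output: I21.9 - Acute myocardial infarction, unspecified
```

Cross-Mapping Coverage:

- SNOMED CT ↔ ICD-10-CM (clinical modifications)
- MeSH ↔ SNOMED CT (literature to clinical)
- RxNorm ↔ ATC (drug classifications)
- LOINC ↔ SNOMED (lab to clinical)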

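3. Batch Normalization

Process large datasets:

```python
# Batch-process a CSV file
results = mapper.batch_map(
    input_file="clinical_terms.csv",
    text_column="diagnosis_description",
    ontologies=["snomed", "icd10"],
    output_format="csv",
    max_workers=4,
)

# Results include:
# - Original term
# - Mapped concept ID
# - Confidence score
# - Alternative mappings (if ambiguous)
```

Performance:

- ~100 terms/second (with caching)
- ~20 terms/second (API lookup)
- Parallel processing for large datasets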

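4. Confidence Scoring and Validation

Assess mapping reliability:

```python
scoring = mapper.score_mapping(
    term="heart attack",
    candidate="22298006",  # Myocardial infarction
    factors=["string_similarity", "context_match", "frequency"],
)

print(f"Overall confidence: {scoring.confidence:.2f}")
print(f"Breakdown: {scoring.factors}")
```

Scoring Factors:

- String similarity: Levenshtein distance, n-grams
- Context match: Surrounding words alignment
- Frequency: Common usage in corpus
- Semantic similarity: Vector embeddings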

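Common Patterns

Pattern 1: Clinical Note Normalization

Scenario: Convert free-text diagnoses to SNOMED codes.

```bash
# Normalize clinical notes
python scripts/main.py \
  --input notes.csv \
  --column diagnosis_text \
  --ontology snomed \
  --threshold 0.8 \
  --output coded_diagnoses.csv

# Result: heart attack → 22298006 (Myocardial infarction)
```

Post-Processing:

- Review low-confidence mappings (<0.8)
- Handle ambiguous terms manually
- Validate against clinical context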

Pattern 2: Literature Indexing

Scenario: Map research paper keywords to MeSH.

```python
# Map keywords to MeSH
mesh_terms = mapper.map_to_mesh(
    keywords=["cancer immunotherapy", "checkpoint inhibitors", "PD-1"],
    include_tree_numbers=True,
    include_qualifiers=True,
)

for term in mesh_terms:
    print(f"{term.input} → {term.descriptor}")
    print(f"  Tree numbers: {term.tree_numbers}")
    print(f"  Entry terms: {term.synonyms}")
```

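Pattern 3: Drug Name Normalization

Scenario: Standardize medication names across datasets.

```python
# Standardize drug names
drugs = ["Tylenol", "Advil", "Motrin", "acetaminophen"]

for drug in drugs:
    result = mapper.map_to_rxnorm(drug)
    print(f"{drug} → {result.rxcui}: {result.name}")
    # Tylenol → 161: Acetaminophen
    # Advil → 5640: Ibuprofen
    # Motrin → 5640: Ibuprofen
```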

Pattern 4: EHR Data Harmonization

Scenario: Merge data from multiple hospital systems.

```bash
# Harmonize diagnosis data from 3 hospitals
python scripts/main.py \
  --batch \
  --inputs hospital_a.csv,hospital_b.csv,hospital_c.csv \
  --target-ontology snomed \
  --cross-map-to icd10 \
  --output harmonized_data.csv
```

Complete Workflow Example

From free-text to coded database:

```python
from scripts.mapper import BioOntologyMapper
from scripts.validator import MappingValidator

# Initialize
mapper = BioOntologyMapper()
validator = MappingValidator()

# Step 1: Extract entities from text
clinical_note = "Patient has type 2 diabetes and hypertension..."
entities = mapper.extract_entities(clinical_note)

# Step 2: Map to SNOMED
mappings = []
for entity in entities:
    mapping = mapper.map_to_snomed(
        entity.text,
        context=clinical_note,
        top_n=3,
    )
    mappings.append(mapping)

# Step 3: Validate mappings
for mapping in mappings:
    validation = validator.validate(
        mapping,
        check_clinical_plausibility=True,
    )
    if not validation.is_valid:
        print(f"Needs review: {mapping}")

# Step 4: Export to database format
db_records = [m.to_database_record() for m in mappings]
```
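What the exported records look like is not specified here. As one possible shape, the sketch below stores flat tuples in SQLite using the standard `sqlite3` module; the table layout, the `save_records` and `low_confidence` helpers, and the version label in the usage example are all assumptions, not the format produced by the tool.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Assumed record shape:
# (term, ontology, concept_id, confidence, ontology_version)
Record = Tuple[str, str, str, float, str]


def save_records(conn: sqlite3.Connection, records: Iterable[Record]) -> int:
    """Create the table if needed and insert all records in one transaction."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS mappings (
               term TEXT NOT NULL,
               ontology TEXT NOT NULL,
               concept_id TEXT NOT NULL,
               confidence REAL NOT NULL,
               ontology_version TEXT NOT NULL
           )"""
    )
    rows = list(records)
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO mappings VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


def low_confidence(conn: sqlite3.Connection,
                   threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """List stored mappings that still need manual review."""
    cursor = conn.execute(
        "SELECT term, concept_id, confidence FROM mappings "
        "WHERE confidence < ? ORDER BY confidence",
        (threshold,),
    )
    return cursor.fetchall()
```

A call such as `save_records(sqlite3.connect(":memory:"), [("heart attack", "snomed", "22298006", 0.93, "example-version")])` would store one row; keeping the ontology version in each row makes later re-mapping after an ontology update easier to audit.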

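Quality Checklist

Pre-Mapping:

- [ ] Text preprocessed (lowercase, punctuation handled)
- [ ] Abbreviations expanded where possible
- [ ] Language identified (multilingual support)

During Mapping:

- [ ] Confidence threshold appropriate (>0.7 for clinical)
- [ ] Multiple candidates considered for ambiguous terms
- [ ] Context used for disambiguation

Post-Mapping:

- [ ] Low-confidence mappings flagged for review
- [ ] Unmapped terms logged
- [ ] CRITICAL: Clinical expert validation for high-stakes use

Before Production:

- [ ] Mapping accuracy validated on gold standard
- [ ] False positive rate acceptable (<5%)
- [ ] Recall acceptable for use case (>90%)
- [ ] API rate limits respected

Common Pitfalls

Mapping Errors:

- ❌ Abbreviation ambiguity → "MI" = Myocardial infarction OR Michigan

- ✅ Use context; flag for manual review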
