EHR Semantic Compressor

Overview

AI-powered EHR summarization built on a Transformer architecture. This skill processes lengthy Electronic Health Record (EHR) documents, extracts the key clinical information, and generates structured, clinically accurate summaries.

Technical Difficulty: High

When to Use

- Input contains lengthy EHR documents (1600+ words) requiring summarization
- Clinical records need structured extraction of key information
- Quick review of patient history, medications, allergies, or diagnoses is needed
- Medical documentation requires compression while maintaining accuracy

Core Features

1. Fast Processing: Process lengthy EHR documents (1600+ words) in 10-20 seconds
2. Structured Summaries: Generate bullet-point summaries (200-300 words)
3. Critical Information Extraction:
   - Patient allergies and adverse reactions
   - Family medical history
   - Current and past medications
   - Diagnoses and conditions
   - Vital signs and lab results
   - Procedures and surgeries
4. Clinical Accuracy: Maintains completeness of medical information
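The rule-based side of critical information extraction could look roughly like the sketch below. It is not the skill's actual extractor: the header patterns, the `SECTION_HEADERS` table, and the `extract_sections` function are hypothetical, and the sketch assumes the record contains simple `Header: item, item` lines.

```python
import re

# Hypothetical header patterns; a real clinical lexicon would be far richer.
SECTION_HEADERS = {
    "allergies": r"allerg(?:y|ies)",
    "medications": r"(?:current\s+)?medications?",
    "diagnoses": r"diagnos[ie]s",
    "family_history": r"family\s+(?:medical\s+)?history",
}


def extract_sections(ehr_text: str) -> dict[str, list[str]]:
    """Collect comma- or semicolon-separated items from 'Header: a, b' lines."""
    sections: dict[str, list[str]] = {name: [] for name in SECTION_HEADERS}
    for line in ehr_text.splitlines():
        for name, pattern in SECTION_HEADERS.items():
            match = re.match(rf"\s*{pattern}\s*:\s*(.+)", line, flags=re.IGNORECASE)
            if match:
                items = re.split(r"[;,]", match.group(1))
                sections[name].extend(i.strip() for i in items if i.strip())
                break  # one header per line is assumed
    return sections


sample = (
    "Allergies: penicillin, latex\n"
    "Medications: metformin 500 mg; lisinopril\n"
    "Family history: diabetes\n"
)
result = extract_sections(sample)
```

Free-text narrative would not match these patterns, which is presumably where the ML side of a hybrid approach would take over.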

Usage

Basic Usage

```bash
python scripts/main.py --input ehr_document.txt --output summary.json
```

Input Format

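```json
{
  "ehr_text": "Full EHR document text...",
  "max_length": 300,
  "extract_sections": ["allergies", "medications", "diagnoses", "family_history"]
}
```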

Output Format

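```json
{
  "status": "success",
  "data": {
    "summary": "Structured bullet-point summary...",
    "extracted_sections": {
      "allergies": [...],
      "medications": [...],
      "diagnoses": [...],
      "family_history": [...]
    },
    "metadata": {
      "original_length": 2500,
      "summary_length": 280,
      "compression_ratio": 0.89
    }
  }
}
```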

Parameters

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input`, `-i` | string | - | Yes | Path to the input EHR document text file |
| `--output`, `-o` | string | - | No | Output JSON file path |
| `--max-length` | int | 300 | No | Maximum summary length in words |
| `--extract-sections` | string | all | No | Comma-separated sections to extract |
| `--format` | string | json | No | Output format (json, markdown, text) |
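For orientation, the documented flags map naturally onto Python's standard `argparse` module. The sketch below is an assumption about how such a parser might be declared, not the contents of `scripts/main.py`; `build_parser` and `parse_sections` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Summarize an EHR document.")
    parser.add_argument("--input", "-i", required=True,
                        help="Path to the input EHR document text file")
    parser.add_argument("--output", "-o", default=None,
                        help="Output JSON file path")
    parser.add_argument("--max-length", type=int, default=300,
                        help="Maximum summary length in words")
    parser.add_argument("--extract-sections", default="all",
                        help="Comma-separated sections to extract")
    parser.add_argument("--format", choices=["json", "markdown", "text"],
                        default="json", help="Output format")
    return parser


def parse_sections(value: str) -> list[str]:
    """Split a comma-separated section list, dropping empty entries."""
    return [s.strip() for s in value.split(",") if s.strip()]


args = build_parser().parse_args(
    ["-i", "ehr_document.txt", "--extract-sections", "allergies, medications"]
)
sections = parse_sections(args.extract_sections)
```

Restricting `--format` with `choices` lets the parser reject unsupported formats before any record is read.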

Technical Details

Architecture

- Base Model: Transformer-based encoder-decoder architecture
- Medical Domain Adaptation: Fine-tuned on clinical text corpora
- Section Extraction: Rule-based + ML hybrid approach for structured data
- Processing Pipeline: Text segmentation -> Summarization -> Section extraction -> Output formatting
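The first pipeline stage, text segmentation, exists because encoder-decoder models accept a bounded input length. The sketch below illustrates one common approach, fixed-size word windows with overlap. The skill's actual segmentation strategy is not documented here, so the function name `segment_words` and the window sizes are assumptions.

```python
def segment_words(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows of at most max_words words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("max_words must be positive and overlap smaller than it")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# A synthetic 1000-word document yields three windows.
document = " ".join(f"w{i}" for i in range(1000))
chunks = segment_words(document)
```

The overlap keeps a sentence that straddles a window boundary visible to both neighbouring chunks, at the cost of some repeated work during summarization.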

Dependencies

See references/requirements.txt for the complete list.

Key dependencies:

- transformers >= 4.30.0
- torch >= 2.0.0
- spacy >= 3.6.0
- scispacy >= 0.5.3

Performance

- Processing Time: 10-20 seconds for 1600+ word documents
- Memory: Requires ~2GB RAM
- Output Length: 200-300 words (configurable)
- Compression Ratio: ~85-90%
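The compression ratio is not formally defined in this document. A plausible reading, assumed here, is the fraction of words removed: one minus summary length over original length. That reading reproduces the sample metadata figures (2500 words in, 280 words out, ratio 0.89). The helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(original_words: int, summary_words: int) -> float:
    """Fraction of the original removed by summarization, to two decimals."""
    if original_words <= 0:
        raise ValueError("original_words must be positive")
    return round(1 - summary_words / original_words, 2)


sample_ratio = compression_ratio(2500, 280)   # matches the sample metadata
long_doc_ratio = compression_ratio(2000, 300)
```

Under this definition, a 300-word summary of a 1600-word record works out to roughly 0.81, so the quoted 85-90% range presumably refers to longer inputs or shorter summaries.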

References

- references/requirements.txt - Python dependencies
- references/guidelines.md - Clinical summarization guidelines
- references/sampleinput.json - Example input format
- references/sampleoutput.json - Example output format

Safety & Compliance

- No external API calls or service dependencies
- All processing performed locally
- No patient data transmitted outside the system
- Error messages are semantic and do not expose technical details

Testing

Run unit tests:
```bash
cd scripts
python test_main.py
```

Error Handling

All errors return semantic messages:

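```json
{
  "status": "error",
  "error": {
    "type": "input_validation_error",
    "message": "EHR text is empty or too short",
    "suggestion": "Please provide EHR text of at least 100 words"
  }
}
```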
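A validator that produces the error shape documented for this skill (`status`, then `error.type`, `error.message`, and `error.suggestion`) might be sketched as follows. The function name `validate_ehr_text` and the `MIN_WORDS` constant are hypothetical; the 100-word threshold is taken from the sample suggestion text.

```python
from typing import Optional

MIN_WORDS = 100  # assumed from the sample suggestion message


def validate_ehr_text(ehr_text: str) -> Optional[dict]:
    """Return a semantic error payload, or None when the text is acceptable."""
    if len(ehr_text.split()) < MIN_WORDS:
        return {
            "status": "error",
            "error": {
                "type": "input_validation_error",
                "message": "EHR text is empty or too short",
                "suggestion": f"Please provide EHR text of at least {MIN_WORDS} words",
            },
        }
    return None


empty_result = validate_ehr_text("")
ok_result = validate_ehr_text("word " * 150)
```

Returning a plain dictionary rather than raising keeps stack traces out of the output, in line with the goal of not exposing technical details.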

Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | | |

Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
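The two path-related checklist items (no `../` traversal, output restricted to the workspace) can be enforced by resolving every user-supplied path against a workspace root. The sketch below uses only `pathlib`; the function name `resolve_in_workspace` and the notion of a single workspace directory are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def resolve_in_workspace(user_path: str, workspace: str) -> Path:
    """Resolve user_path under workspace, rejecting anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()  # collapses ../ and follows symlinks
    if candidate != root and root not in candidate.parents:
        # Deliberately generic message: no paths or internals are echoed back.
        raise ValueError("path is outside the permitted workspace")
    return candidate


with tempfile.TemporaryDirectory() as workspace:
    inside = resolve_in_workspace("out/summary.json", workspace)
    inside_ok = Path(workspace).resolve() in inside.parents
    try:
        resolve_in_workspace("../secret.txt", workspace)
        traversal_blocked = False
    except ValueError:
        traversal_blocked = True
```

Checking the resolved path, rather than searching the raw string for `..`, also covers absolute paths and symlinks that point outside the workspace.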

Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

Evaluation Criteria

Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

Test Cases

1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support

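EHR Semantic Compressor

Overview

AI-powered EHR summarization built on a Transformer architecture. This skill processes lengthy Electronic Health Record (EHR) documents, extracts the key clinical information, and generates structured, clinically accurate summaries.

Technical Difficulty: High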

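When to Use

- Input contains lengthy EHR documents (1600+ words) requiring summarization
- Clinical records need structured extraction of key information
- Quick review of patient history, medications, allergies, or diagnoses is needed
- Medical documentation requires compression while maintaining accuracy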

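Core Features

1. Fast Processing: Process lengthy EHR documents (1600+ words) in 10-20 seconds
2. Structured Summaries: Generate bullet-point summaries (200-300 words)
3. Critical Information Extraction:
   - Patient allergies and adverse reactions
   - Family medical history
   - Current and past medications
   - Diagnoses and conditions
   - Vital signs and lab results
   - Procedures and surgeries
4. Clinical Accuracy: Maintains completeness of medical information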

Usage

Basic Usage

```bash
python scripts/main.py --input ehr_document.txt --output summary.json
```

Input Format

```json
{
  "ehr_text": "Full EHR document text...",
  "max_length": 300,
  "extract_sections": ["allergies", "medications", "diagnoses", "family_history"]
}
```

Output Format

```json
{
  "status": "success",
  "data": {
    "summary": "Structured bullet-point summary...",
    "extracted_sections": {
      "allergies": [...],
      "medications": [...],
      "diagnoses": [...],
      "family_history": [...]
    },
    "metadata": {
      "original_length": 2500,
      "summary_length": 280,
      "compression_ratio": 0.89
    }
  }
}
```

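Parameters

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input`, `-i` | string | - | Yes | Path to the input EHR document text file |
| `--output`, `-o` | string | - | No | Output JSON file path |
| `--max-length` | int | 300 | No | Maximum summary length in words |
| `--extract-sections` | string | all | No | Comma-separated sections to extract |
| `--format` | string | json | No | Output format (json, markdown, text) |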

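Technical Details

Architecture

- Base Model: Transformer-based encoder-decoder architecture
- Medical Domain Adaptation: Fine-tuned on clinical text corpora
- Section Extraction: Rule-based + ML hybrid approach for structured data
- Processing Pipeline: Text segmentation -> Summarization -> Section extraction -> Output formatting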

Dependencies

See references/requirements.txt for the complete list.

Key dependencies:

- transformers >= 4.30.0
- torch >= 2.0.0
- spacy >= 3.6.0
- scispacy >= 0.5.3

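Performance

- Processing Time: 10-20 seconds for 1600+ word documents
- Memory: Requires ~2GB RAM
- Output Length: 200-300 words (configurable)
- Compression Ratio: ~85-90%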

References

- references/requirements.txt - Python dependencies
- references/guidelines.md - Clinical summarization guidelines
- references/sampleinput.json - Example input format
- references/sampleoutput.json - Example output format

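Safety & Compliance

- No external API calls or service dependencies
- All processing performed locally
- No patient data transmitted outside the system
- Error messages are semantic and do not expose technical details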

Testing

Run unit tests:

```bash
cd scripts
python test_main.py
```

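Error Handling

All errors return semantic messages:

```json
{
  "status": "error",
  "error": {
    "type": "input_validation_error",
    "message": "EHR text is empty or too short",
    "suggestion": "Please provide EHR text of at least 100 words"
  }
}
```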

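Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | | |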

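Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited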

Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

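Evaluation Criteria

Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

Test Cases

1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time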

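Lifecycle Status

- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support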

ehr-semantic-compressor: EHR Semantic Compression