EHR Semantic Compressor
Overview
AI-powered summarization of Electronic Health Record (EHR) documents. This skill uses a Transformer-based model to extract key clinical information from lengthy records and generate structured, clinically accurate summaries.
Technical Difficulty: High
When to Use
- Input contains lengthy EHR documents (1600+ words) requiring summarization
- Clinical records need structured extraction of key information
- Quick review of patient history, medications, allergies, or diagnoses is needed
- Medical documentation requires compression while maintaining accuracy
Core Features
1. Fast Processing: Process lengthy EHR documents (1600+ words) in 10-20 seconds
2. Structured Summaries: Generate bullet-point summaries (200-300 words)
3. Critical Information Extraction:
   - Patient allergies and adverse reactions
   - Family medical history
   - Current and past medications
   - Diagnoses and conditions
   - Vital signs and lab results
   - Procedures and surgeries
4. Clinical Accuracy: Maintains completeness of medical information
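To make the extraction step concrete, here is a minimal rule-based sketch in Python. It is not the skill's actual extractor: the header keywords, the `SECTION_PATTERNS` table, and the `extract_sections` helper are all hypothetical, and it assumes each section appears as a single `Header: item, item` line.

```python
import re

# Hypothetical header keywords; a real EHR needs a much richer vocabulary.
FLAGS = re.IGNORECASE | re.MULTILINE
SECTION_PATTERNS = {
    "allergies": re.compile(r"^\s*allerg(?:y|ies)\s*:\s*(.+)$", FLAGS),
    "medications": re.compile(r"^\s*medications?\s*:\s*(.+)$", FLAGS),
    "diagnoses": re.compile(r"^\s*diagnos[ie]s\s*:\s*(.+)$", FLAGS),
    "family_history": re.compile(r"^\s*family history\s*:\s*(.+)$", FLAGS),
}

def extract_sections(ehr_text, wanted=None):
    """Return {section: [items]} for each requested section."""
    wanted = wanted or list(SECTION_PATTERNS)
    result = {}
    for name in wanted:
        items = []
        for match in SECTION_PATTERNS[name].finditer(ehr_text):
            # Split "a, b; c" style lists into individual entries.
            parts = re.split(r"[,;]", match.group(1))
            items.extend(p.strip() for p in parts if p.strip())
        result[name] = items
    return result

sample = """Allergies: penicillin, latex
Medications: metformin 500 mg; lisinopril 10 mg
Diagnosis: type 2 diabetes
Family History: father with hypertension"""

sections = extract_sections(sample)
print(sections["allergies"])
print(sections["medications"])
```

Sections that are not found come back as empty lists, so a caller can tell "nothing documented" apart from "section not requested".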
Usage
Basic Usage
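```bash
python scripts/main.py --input ehr_document.txt --output summary.json
```
Input Format
```json
{
  "ehr_text": "Full EHR document text...",
  "max_length": 300,
  "extract_sections": ["allergies", "medications", "diagnoses", "family_history"]
}
```
Output Format
```json
{
  "status": "success",
  "data": {
    "summary": "Structured bullet-point summary...",
    "extracted_sections": {
      "allergies": [...],
      "medications": [...],
      "diagnoses": [...],
      "family_history": [...]
    },
    "metadata": {
      "original_length": 2500,
      "summary_length": 280,
      "compression_ratio": 0.89
    }
  }
}
```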
Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input`, `-i` | string | - | Yes | Input EHR document text file path |
| `--output`, `-o` | string | - | No | Output JSON file path |
| `--max-length` | int | 300 | No | Maximum summary length in words |
| `--extract-sections` | string | all | No | Comma-separated sections to extract |
| `--format` | string | json | No | Output format (json, markdown, text) |
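The parameter table maps naturally onto a command-line parser. The sketch below shows one way to declare it with the standard-library `argparse` module. How `scripts/main.py` actually parses its arguments is not shown in this document, so `build_parser` and the help strings are assumptions.

```python
import argparse

def build_parser():
    """Declare the flags from the parameter table (illustrative only)."""
    parser = argparse.ArgumentParser(description="Summarize an EHR document.")
    parser.add_argument("--input", "-i", required=True,
                        help="Input EHR document text file path")
    parser.add_argument("--output", "-o", default=None,
                        help="Output JSON file path")
    parser.add_argument("--max-length", type=int, default=300,
                        help="Maximum summary length in words")
    parser.add_argument("--extract-sections", default="all",
                        help="Comma-separated sections to extract")
    parser.add_argument("--format", choices=["json", "markdown", "text"],
                        default="json", help="Output format")
    return parser

# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["-i", "ehr_document.txt", "--max-length", "250"])
print(args.input, args.max_length, args.extract_sections, args.format)
```

Restricting `--format` with `choices` rejects unsupported formats at parse time rather than deep inside the pipeline.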
Technical Details
Architecture
- Base Model: Transformer-based encoder-decoder architecture
- Medical Domain Adaptation: Fine-tuned on clinical text corpora
- Section Extraction: Rule-based + ML hybrid approach for structured data
- Processing Pipeline: Text segmentation -> Summarization -> Section extraction -> Output formatting
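The first two pipeline stages (segmentation, then summarization) can be sketched as follows. This is a toy illustration under stated assumptions: `segment`, `summarize_chunk`, and `run_pipeline` are hypothetical names, the 400-word chunk size is arbitrary, and the summarizer is a truncating placeholder standing in for the fine-tuned encoder-decoder model.

```python
def segment(text, chunk_words=400):
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

def summarize_chunk(chunk, budget):
    """Placeholder summarizer: keep the first `budget` words.
    A real implementation would call the summarization model here."""
    return " ".join(chunk.split()[:budget])

def run_pipeline(text, max_length=300, chunk_words=400):
    """Segment the text, summarize each chunk, and cap the total length."""
    chunks = segment(text, chunk_words)
    if not chunks:
        return {"summary": "", "chunks": 0}
    # Share the word budget evenly across chunks.
    budget = max(1, max_length // len(chunks))
    partial = [summarize_chunk(c, budget) for c in chunks]
    summary_words = " ".join(partial).split()[:max_length]
    return {"summary": " ".join(summary_words), "chunks": len(chunks)}

doc = " ".join(f"w{i}" for i in range(1600))  # synthetic 1600-word document
out = run_pipeline(doc)
print(out["chunks"], len(out["summary"].split()))
```

Chunking keeps each model call within a fixed input size, and the final cap guarantees the combined summary never exceeds `max_length` words.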
Dependencies
See references/requirements.txt for complete list.
Key dependencies:
- transformers >= 4.30.0
- torch >= 2.0.0
- spacy >= 3.6.0
- scispacy >= 0.5.3
Performance
- Processing Time: 10-20 seconds for 1600+ word documents
- Memory: Requires ~2GB RAM
- Output Length: 200-300 words (configurable)
- Compression Ratio: ~85-90%
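The compression ratio appears to be the fraction of the original text removed. That reading is an assumption, but it matches the sample output metadata, where `original_length` 2500 and `summary_length` 280 give 1 - 280/2500 = 0.888, reported as 0.89. The helper name below is hypothetical.

```python
def compression_ratio(original_length, summary_length):
    """Fraction of the original removed: 1 - summary / original."""
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    return round(1 - summary_length / original_length, 2)

print(compression_ratio(2500, 280))  # sample metadata values
print(compression_ratio(1600, 240))  # 1600-word input, 240-word summary
```

Under this definition, a 1600-word record needs a summary of roughly 160 to 240 words to land in the stated 85-90% range.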
References
- `references/requirements.txt` - Python dependencies
- `references/guidelines.md` - Clinical summarization guidelines
- `references/sampleinput.json` - Example input format
- `references/sampleoutput.json` - Example output format
Safety & Compliance
- No external API calls or service dependencies
- All processing performed locally
- No patient data transmitted outside the system
- Error messages are descriptive and do not expose technical details
Testing
Run unit tests:
```bash
cd scripts
python test_main.py
```
Error Handling
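All errors return a structured, descriptive message:
```json
{
  "status": "error",
  "error": {
    "type": "inputvalidationerror",
    "message": "EHR text is empty or too short",
    "suggestion": "Please provide EHR text of at least 100 words"
  }
}
```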
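A small helper can keep error payloads uniform: a `status` field plus an `error` object carrying `type`, `message`, and `suggestion`. The sketch below is illustrative only. `make_error`, `validate_ehr_text`, and the `MIN_WORDS` threshold are assumed names, and the 100-word minimum is taken from the sample suggestion text.

```python
import json

MIN_WORDS = 100  # assumed threshold, taken from the sample suggestion text

def make_error(error_type, message, suggestion):
    """Build an error payload in the documented shape."""
    return {
        "status": "error",
        "error": {"type": error_type, "message": message, "suggestion": suggestion},
    }

def validate_ehr_text(ehr_text):
    """Return None if the text is long enough, else an error payload."""
    if len(ehr_text.split()) < MIN_WORDS:
        return make_error(
            "inputvalidationerror",
            "EHR text is empty or too short",
            f"Please provide EHR text of at least {MIN_WORDS} words",
        )
    return None

print(json.dumps(validate_ehr_text("too short"), indent=2))
```

Returning a payload instead of raising keeps stack traces out of the output, which fits the goal of not exposing technical details.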
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
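The path-related checklist items (no `../` traversal, output restricted to the workspace) could be enforced with a check like the one below. This is a sketch, not the skill's actual validation: `resolve_inside` is a hypothetical helper, and a temporary directory stands in for the workspace.

```python
import tempfile
from pathlib import Path

def resolve_inside(workspace, user_path):
    """Resolve user_path against workspace; reject anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError("path is outside the workspace")
    return candidate

workspace = tempfile.mkdtemp()  # stand-in workspace for the demo
print(resolve_inside(workspace, "out/summary.json").name)

for bad in ("../secrets.txt", "/etc/passwd"):
    try:
        resolve_inside(workspace, bad)
    except ValueError as exc:
        print(f"rejected {bad!r}: {exc}")
```

Resolving before comparing matters: a plain string check for `../` misses absolute paths and symlinks, while comparing resolved paths covers both.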
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support
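EHR Semantic Compressor
Overview
AI-powered summarization of Electronic Health Record (EHR) documents. This skill uses a Transformer-based model to extract key clinical information from lengthy records and generate structured, clinically accurate summaries.
Technical Difficulty: High
When to Use
- Input contains lengthy EHR documents (1600+ words) requiring summarization
- Clinical records need structured extraction of key information
- Quick review of patient history, medications, allergies, or diagnoses is needed
- Medical documentation requires compression while maintaining accuracy
Core Features
1. Fast Processing: Process lengthy EHR documents (1600+ words) in 10-20 seconds
2. Structured Summaries: Generate bullet-point summaries (200-300 words)
3. Critical Information Extraction:
   - Patient allergies and adverse reactions
   - Family medical history
   - Current and past medications
   - Diagnoses and conditions
   - Vital signs and lab results
   - Procedures and surgeries
4. Clinical Accuracy: Maintains completeness of medical information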
Usage
Basic Usage
```bash
python scripts/main.py --input ehr_document.txt --output summary.json
```
Input Format
```json
{
  "ehr_text": "Full EHR document text...",
  "max_length": 300,
  "extract_sections": ["allergies", "medications", "diagnoses", "family_history"]
}
```
Output Format
```json
{
  "status": "success",
  "data": {
    "summary": "Structured bullet-point summary...",
    "extracted_sections": {
      "allergies": [...],
      "medications": [...],
      "diagnoses": [...],
      "family_history": [...]
    },
    "metadata": {
      "original_length": 2500,
      "summary_length": 280,
      "compression_ratio": 0.89
    }
  }
}
```
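Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input`, `-i` | string | - | Yes | Input EHR document text file path |
| `--output`, `-o` | string | - | No | Output JSON file path |
| `--max-length` | int | 300 | No | Maximum summary length in words |
| `--extract-sections` | string | all | No | Comma-separated sections to extract |
| `--format` | string | json | No | Output format (json, markdown, text) |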
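Technical Details
Architecture
- Base Model: Transformer-based encoder-decoder architecture
- Medical Domain Adaptation: Fine-tuned on clinical text corpora
- Section Extraction: Rule-based + ML hybrid approach for structured data
- Processing Pipeline: Text segmentation -> Summarization -> Section extraction -> Output formatting
Dependencies
See references/requirements.txt for complete list.
Key dependencies:
- transformers >= 4.30.0
- torch >= 2.0.0
- spacy >= 3.6.0
- scispacy >= 0.5.3
Performance
- Processing Time: 10-20 seconds for 1600+ word documents
- Memory: Requires ~2GB RAM
- Output Length: 200-300 words (configurable)
- Compression Ratio: ~85-90%
References
- `references/requirements.txt` - Python dependencies
- `references/guidelines.md` - Clinical summarization guidelines
- `references/sampleinput.json` - Example input format
- `references/sampleoutput.json` - Example output format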
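Safety & Compliance
- No external API calls or service dependencies
- All processing performed locally
- No patient data transmitted outside the system
- Error messages are descriptive and do not expose technical details
Testing
Run unit tests:
```bash
cd scripts
python test_main.py
```
Error Handling
All errors return a structured, descriptive message:
```json
{
  "status": "error",
  "error": {
    "type": "inputvalidationerror",
    "message": "EHR text is empty or too short",
    "suggestion": "Please provide EHR text of at least 100 words"
  }
}
```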
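Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited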
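Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support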