Clinical Data Cleaner
Clean, validate, and standardize clinical trial data to meet CDISC SDTM standards for regulatory submissions to FDA or EMA.
Quick Start
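```python
from scripts.main import ClinicalDataCleaner

# Initialize the cleaner for the Demographics (DM) domain
cleaner = ClinicalDataCleaner(domain="DM")

# Clean the data with default settings
cleaned = cleaner.clean(raw_data)

# Save the cleaned data together with the audit trail
cleaner.save_report("output.csv")
```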
Core Capabilities
1. SDTM Domain Validation
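```python
cleaner = ClinicalDataCleaner(domain="DM")  # or "LB", "VS"
is_valid, missing = cleaner.validate_domain(data)
```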
Required Fields:
- DM: STUDYID, USUBJID, SUBJID, RFSTDTC, RFENDTC, SITEID, AGE, SEX, RACE
- LB: STUDYID, USUBJID, LBTESTCD, LBCAT, LBORRES, LBORRESU, LBSTRESC, LBDTC
- VS: STUDYID, USUBJID, VSTESTCD, VSORRES, VSORRESU, VSSTRESC, VSDTC
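The required-field check listed above can be pictured with a small standalone sketch. It is not the tool's actual implementation: `REQUIRED_FIELDS` and `find_missing_fields` are hypothetical names introduced here, and the check only compares column names against the lists in this section.

```python
# Required columns per SDTM domain, copied from the lists above.
REQUIRED_FIELDS = {
    "DM": ["STUDYID", "USUBJID", "SUBJID", "RFSTDTC", "RFENDTC",
           "SITEID", "AGE", "SEX", "RACE"],
    "LB": ["STUDYID", "USUBJID", "LBTESTCD", "LBCAT", "LBORRES",
           "LBORRESU", "LBSTRESC", "LBDTC"],
    "VS": ["STUDYID", "USUBJID", "VSTESTCD", "VSORRES", "VSORRESU",
           "VSSTRESC", "VSDTC"],
}


def find_missing_fields(domain, columns):
    """Return the required fields of `domain` that are absent from `columns`."""
    present = set(columns)
    return [field for field in REQUIRED_FIELDS[domain] if field not in present]


# Hypothetical VS dataset that lacks the unit and standardized-result columns.
vs_columns = ["STUDYID", "USUBJID", "VSTESTCD", "VSORRES", "VSDTC"]
print(find_missing_fields("VS", vs_columns))  # ['VSORRESU', 'VSSTRESC']
```

An empty list means every required column is present, which corresponds to a passing validation in this simplified picture.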
2. Missing Value Handling
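```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median"  # mean, median, mode, forward, drop
)
cleaned = cleaner.handle_missing_values(data)
```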
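As a rough illustration of what a `median` strategy amounts to for one numeric column, the sketch below fills gaps with the median of the observed values and records which positions were filled. The function name `impute_median` and the use of `None` for a missing value are assumptions made for this example; the tool's internals may differ.

```python
from statistics import median


def impute_median(values):
    """Fill None entries with the median of the observed values.

    Returns the filled list and the indices that were imputed, so the
    change can be recorded in an audit trail.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to compute a median from; leave the column untouched.
        return list(values), []
    fill = median(observed)
    imputed = [i for i, v in enumerate(values) if v is None]
    filled = [fill if v is None else v for v in values]
    return filled, imputed


ages = [34, None, 51, 47, None]
filled, imputed = impute_median(ages)
print(filled)   # [34, 47, 51, 47, 47]
print(imputed)  # [1, 4]
```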
3. Outlier Detection
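```python
cleaner = ClinicalDataCleaner(
    domain="LB",
    outlier_method="domain",  # iqr, zscore, domain
    outlier_action="flag"     # flag, remove, cap
)
flagged = cleaner.detect_outliers(data)
```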
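For readers unfamiliar with the `iqr` option, here is a minimal sketch of the standard interquartile-range rule: a value is flagged when it falls more than 1.5 times the IQR outside the first or third quartile. The function name and the 1.5 multiplier are assumptions for illustration; the tool may use different quartile definitions or cutoffs.

```python
from statistics import quantiles


def iqr_outlier_indices(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        # Too few points for meaningful quartiles.
        return []
    q1, _q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical glucose results in mg/dL; 410 is far from the rest.
glucose = [92, 88, 101, 97, 95, 410, 90, 99]
print(iqr_outlier_indices(glucose))  # [5]
```

Returning indices rather than dropping rows matches the `flag` action: the suspicious value stays in the data and is only marked for review.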
Clinical Thresholds:
| Parameter | Range | Unit |
|---|---|---|
| Glucose | 50-500 | mg/dL |
| Hemoglobin | 5-20 | g/dL |
| Systolic BP | 70-220 | mmHg |
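The thresholds in the table can be applied with a simple range check. The sketch below is an assumption about how a `domain` method might work, not the tool's code: `CLINICAL_RANGES` and `is_out_of_range` are hypothetical names, and treating the bounds as inclusive is a choice made for this example.

```python
# Plausibility ranges copied from the table above: (low, high, unit).
CLINICAL_RANGES = {
    "Glucose": (50, 500, "mg/dL"),
    "Hemoglobin": (5, 20, "g/dL"),
    "Systolic BP": (70, 220, "mmHg"),
}


def is_out_of_range(parameter, value):
    """Return True when `value` lies outside the range for `parameter`.

    Bounds are treated as inclusive (an assumption for this sketch).
    """
    low, high, _unit = CLINICAL_RANGES[parameter]
    return not (low <= value <= high)


print(is_out_of_range("Glucose", 620))      # True
print(is_out_of_range("Hemoglobin", 13.5))  # False
print(is_out_of_range("Systolic BP", 70))   # False (boundary is in range)
```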
4. Date Standardization
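```python
standardized = cleaner.standardize_dates(data)
# Converts to ISO 8601 format: 2023-01-15T09:30:00
```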
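To make the conversion concrete, the sketch below normalizes a few input formats to ISO 8601 using only the standard library. The list of accepted formats is an assumption for illustration (in particular, slash-separated dates are read as month/day/year), and `to_iso8601` is a hypothetical helper rather than part of the tool.

```python
from datetime import datetime

# (input format, output format) pairs tried in order. Assumed for this sketch.
KNOWN_FORMATS = [
    ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"),
    ("%m/%d/%Y %H:%M", "%Y-%m-%dT%H:%M:%S"),
    ("%Y-%m-%d", "%Y-%m-%d"),  # date only stays date only
]


def to_iso8601(raw):
    """Return `raw` as an ISO 8601 string, or None if no format matches."""
    text = raw.strip()
    for in_fmt, out_fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, in_fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.strftime(out_fmt)
    return None


print(to_iso8601("01/15/2023 09:30"))  # 2023-01-15T09:30:00
print(to_iso8601("2023-01-15"))        # 2023-01-15
print(to_iso8601("not a date"))        # None
```

Returning `None` for unparseable input, instead of guessing, leaves the decision to a reviewer and keeps the failure visible.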
5. Complete Pipeline
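```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median",
    outlier_method="iqr",
    outlier_action="flag"
)
cleaned_data = cleaner.clean(data)
cleaner.save_report("output.csv")
```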
Output Files:
- `output.csv` - Cleaned SDTM data
- `output.report.json` - Audit trail for regulatory submission
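The structure of the audit trail file is not described here. Purely as an illustration of what such a report could record, the sketch below serializes a list of cleaning actions to JSON. Every field name (`domain`, `generated_at`, `action_count`, `actions`, and the keys inside each action) is hypothetical and should not be read as the tool's real schema.

```python
import json
from datetime import datetime, timezone


def build_audit_report(domain, actions):
    """Serialize cleaning actions to a JSON string (hypothetical layout)."""
    report = {
        "domain": domain,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "action_count": len(actions),
        "actions": actions,
    }
    return json.dumps(report, indent=2)


# Two made-up actions: one imputation and one date conversion.
actions = [
    {"row": 3, "column": "AGE", "action": "imputed",
     "old": None, "new": 47},
    {"row": 5, "column": "RFSTDTC", "action": "date_standardized",
     "old": "01/15/2023 09:30", "new": "2023-01-15T09:30:00"},
]
report_json = build_audit_report("DM", actions)
print(report_json)
```

Keeping the old and new value for every change is what lets a reviewer trace each cleaned cell back to the raw data.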
CLI Usage
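```bash
# Clean demographics data
python scripts/main.py \
  --input dm_raw.csv \
  --domain DM \
  --output dm_clean.csv \
  --missing-strategy median \
  --outlier-method iqr \
  --outlier-action flag

# Clean lab data using clinical thresholds
python scripts/main.py \
  --input lb_raw.csv \
  --domain LB \
  --output lb_clean.csv \
  --outlier-method domain
```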
Common Patterns
See references/common-patterns.md for detailed examples:
- Regulatory Submission Preparation
- Interim Analysis Data Preparation
- Database Migration Cleanup
- External Lab Data Integration
Troubleshooting
See references/troubleshooting.md for solutions to:
- Validation failures
- Date parsing errors
- Memory errors with large datasets
- Outlier detection issues
Quality Checklist
Pre-Cleaning:
- [ ] IACUC approval obtained (animal studies)
- [ ] Sample size adequately powered
- [ ] Randomization method documented
Post-Cleaning:
- [ ] Validate against CDISC SDTM IG
- [ ] Review all cleaning actions in audit trail
- [ ] Test import to analysis software
References
- `references/sdtm_ig_guide.md` - CDISC SDTM Implementation Guide
- `references/domain_specs.json` - Domain-specific field requirements
- `references/outlier_thresholds.json` - Clinical outlier thresholds
- `references/common-patterns.md` - Detailed usage patterns
- `references/troubleshooting.md` - Problem-solving guide
Skill ID: 189 | Version: 2.0 | License: MIT
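Clinical Data Cleaner
Clean, validate, and standardize clinical trial data to meet CDISC SDTM standards for regulatory submissions to FDA or EMA.
Quick Start
```python
from scripts.main import ClinicalDataCleaner

# Initialize the cleaner for the Demographics (DM) domain
cleaner = ClinicalDataCleaner(domain="DM")

# Clean the data with default settings
cleaned = cleaner.clean(raw_data)

# Save the cleaned data together with the audit trail
cleaner.save_report("output.csv")
```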
Core Capabilities
1. SDTM Domain Validation
```python
cleaner = ClinicalDataCleaner(domain="DM")  # or "LB", "VS"
is_valid, missing = cleaner.validate_domain(data)
```
Required Fields:
- DM: STUDYID, USUBJID, SUBJID, RFSTDTC, RFENDTC, SITEID, AGE, SEX, RACE
- LB: STUDYID, USUBJID, LBTESTCD, LBCAT, LBORRES, LBORRESU, LBSTRESC, LBDTC
- VS: STUDYID, USUBJID, VSTESTCD, VSORRES, VSORRESU, VSSTRESC, VSDTC
2. Missing Value Handling
```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median"  # mean, median, mode, forward, drop
)
cleaned = cleaner.handle_missing_values(data)
```
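3. Outlier Detection
```python
cleaner = ClinicalDataCleaner(
    domain="LB",
    outlier_method="domain",  # iqr, zscore, domain
    outlier_action="flag"     # flag, remove, cap
)
flagged = cleaner.detect_outliers(data)
```
Clinical Thresholds:
| Parameter | Range | Unit |
|---|---|---|
| Glucose | 50-500 | mg/dL |
| Hemoglobin | 5-20 | g/dL |
| Systolic BP | 70-220 | mmHg |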
4. Date Standardization
```python
standardized = cleaner.standardize_dates(data)
# Converts to ISO 8601 format: 2023-01-15T09:30:00
```
5. Complete Pipeline
```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median",
    outlier_method="iqr",
    outlier_action="flag"
)
cleaned_data = cleaner.clean(data)
cleaner.save_report("output.csv")
```
Output Files:
- `output.csv` - Cleaned SDTM data
- `output.report.json` - Audit trail for regulatory submission
CLI Usage
```bash
# Clean demographics data
python scripts/main.py \
  --input dm_raw.csv \
  --domain DM \
  --output dm_clean.csv \
  --missing-strategy median \
  --outlier-method iqr \
  --outlier-action flag

# Clean lab data using clinical thresholds
python scripts/main.py \
  --input lb_raw.csv \
  --domain LB \
  --output lb_clean.csv \
  --outlier-method domain
```
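Common Patterns
See references/common-patterns.md for detailed examples:
- Regulatory Submission Preparation
- Interim Analysis Data Preparation
- Database Migration Cleanup
- External Lab Data Integration
Troubleshooting
See references/troubleshooting.md for solutions to:
- Validation failures
- Date parsing errors
- Memory errors with large datasets
- Outlier detection issues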
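Quality Checklist
Pre-Cleaning:
- [ ] IACUC approval obtained (animal studies)
- [ ] Sample size adequately powered
- [ ] Randomization method documented
Post-Cleaning:
- [ ] Validate against CDISC SDTM IG
- [ ] Review all cleaning actions in audit trail
- [ ] Test import to analysis software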
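References
- `references/sdtm_ig_guide.md` - CDISC SDTM Implementation Guide
- `references/domain_specs.json` - Domain-specific field requirements
- `references/outlier_thresholds.json` - Clinical outlier thresholds
- `references/common-patterns.md` - Detailed usage patterns
- `references/troubleshooting.md` - Problem-solving guide
Skill ID: 189 | Version: 2.0 | License: MIT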