Survival Analysis (Kaplan-Meier)
Kaplan-Meier survival analysis tool for clinical and biological research. Generates publication-ready survival curves with statistical tests.
Features
- Kaplan-Meier Curve Generation: Publication-quality survival plots with confidence intervals
- Statistical Tests: Log-rank test, Wilcoxon test, Peto-Peto test
- Hazard Ratios: Cox proportional hazards regression with 95% CI
- Summary Statistics: Median survival time, restricted mean survival time (RMST)
- Multi-group Analysis: Supports 2+ comparison groups
- Risk Tables: Optional at-risk table below curves
Usage
Python Script

    python scripts/main.py --input data.csv --time time_col --event event_col --group group_col --output results/

Arguments
| Argument | Description | Required |
|---|---|---|
| `--input` | Input CSV file path | Yes |
| `--time` | Column name for survival time | Yes |
| `--event` | Column name for event indicator (1=event, 0=censored) | Yes |
| `--group` | Column name for grouping variable | Optional |
| `--output` | Output directory for results | Yes |
| `--conf-level` | Confidence level (default: 0.95) | Optional |
| `--risk-table` | Include risk table in plot | Optional |
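
The documented arguments could map onto an `argparse` parser along the following lines. This is a minimal sketch, not the actual contents of `scripts/main.py`: the function name `build_parser` is hypothetical, and treating `--risk-table` as a bare on/off flag is an assumption based on how it appears in the usage example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line arguments."""
    parser = argparse.ArgumentParser(description="Kaplan-Meier survival analysis")
    parser.add_argument("--input", required=True, help="Input CSV file path")
    parser.add_argument("--time", required=True, help="Column name for survival time")
    parser.add_argument("--event", required=True,
                        help="Column name for event indicator (1=event, 0=censored)")
    parser.add_argument("--group", default=None, help="Column name for grouping variable")
    parser.add_argument("--output", required=True, help="Output directory for results")
    parser.add_argument("--conf-level", type=float, default=0.95, help="Confidence level")
    # Assumption: --risk-table takes no value and simply switches the table on.
    parser.add_argument("--risk-table", action="store_true",
                        help="Include risk table in plot")
    return parser


# Parse a command line that omits every optional argument.
args = build_parser().parse_args(
    ["--input", "data.csv", "--time", "time_col",
     "--event", "event_col", "--output", "results/"]
)
print(args.conf_level, args.group, args.risk_table)  # → 0.95 None False
```

Note that `argparse` exposes `--conf-level` and `--risk-table` as `args.conf_level` and `args.risk_table`.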
Input Format
CSV with columns:
- Time column: Numeric, time to event or censoring
- Event column: Binary (1 = event occurred, 0 = censored/right-censored)
- Group column: Categorical variable for stratification
Example:

    patient_id,time_months,death,treatment_group
    P001,24.5,1,Drug_A
    P002,36.2,0,Drug_A
    P003,18.7,1,Placebo

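
Before fitting anything, it may help to check that the time and event columns satisfy the format described above. The sketch below uses only the standard library; the helper name `validate_rows` is hypothetical, and the sample rows and column names (`time_months`, `death`) are illustrative rather than a required schema.

```python
import csv
import io


def validate_rows(rows, time_col, event_col):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for i, row in enumerate(rows, start=1):
        # Time must be present, numeric, and non-negative.
        try:
            t = float(row[time_col])
            if t < 0:
                problems.append(f"row {i}: negative time {t}")
        except (KeyError, TypeError, ValueError):
            problems.append(f"row {i}: time is missing or not numeric")
        # Event indicator must be exactly 0 (censored) or 1 (event).
        if row.get(event_col) not in ("0", "1"):
            problems.append(f"row {i}: event must be 0 or 1")
    return problems


sample = """patient_id,time_months,death,treatment_group
P001,24.5,1,Drug_A
P002,36.2,0,Drug_A
P003,18.7,1,Placebo
"""
rows = list(csv.DictReader(io.StringIO(sample)))
print(validate_rows(rows, "time_months", "death"))  # → []
```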
Output Files
- `km_curve.png`: Kaplan-Meier survival curve
- `km_curve.pdf`: Vector version for publications
- `survival_stats.csv`: Statistical summary (median survival, confidence intervals)
- `hazard_ratios.csv`: Cox regression results with HR and 95% CI
- `logrank_test.csv`: Pairwise comparison p-values
- `report.txt`: Human-readable summary report
Technical Details
Statistical Methods
1. Kaplan-Meier Estimator: Non-parametric maximum likelihood estimate of the survival function
   - Product-limit estimator: Ŝ(t) = Π_{tᵢ ≤ t} (1 - dᵢ/nᵢ), where dᵢ is the number of events and nᵢ the number at risk at event time tᵢ
   - Greenwood's formula for variance estimation
2. Log-Rank Test: Most widely used test for comparing survival curves
   - Null hypothesis: No difference in survival between groups
   - Gives equal weight to every event time; the Wilcoxon (Gehan-Breslow) variant instead weights each event time by the number at risk
3. Cox Proportional Hazards: Semi-parametric regression model
   - h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...)
   - Proportional hazards assumption checked via Schoenfeld residuals
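
To make the product-limit estimator and Greenwood's formula concrete, here is a small from-scratch sketch in pure Python. It is illustrative only and is not how the tool itself computes the curve (the tool relies on `lifelines`); the function name `kaplan_meier` and the five-subject data set are made up for the example. At each distinct event time tᵢ with dᵢ events among nᵢ subjects at risk, the survival estimate is multiplied by (1 - dᵢ/nᵢ), and the Greenwood standard error is Ŝ(t) × sqrt(Σ dᵢ / (nᵢ(nᵢ - dᵢ))).

```python
import math
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t), se)] at each distinct event time.

    times  : durations (time to event or to censoring)
    events : 1 = event observed, 0 = right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # everyone who leaves the risk set at time t
    n_at_risk = len(times)
    surv, greenwood_sum, curve = 1.0, 0.0, []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            if n_at_risk > d:  # term is undefined once the curve hits zero
                greenwood_sum += d / (n_at_risk * (n_at_risk - d))
            curve.append((t, surv, surv * math.sqrt(greenwood_sum)))
        # Subjects censored at t still count as at risk at t, so remove after.
        n_at_risk -= leaving[t]
    return curve


# Toy data: events at 2, 3, 5; censored at 3 and 8.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s, se in curve:
    print(f"t={t}: S={s:.3f} se={se:.3f}")
# t=2: S=0.800 se=0.179
# t=3: S=0.600 se=0.219
# t=5: S=0.300 se=0.239
```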
Dependencies
- `lifelines`: Core survival analysis library
- `matplotlib`, `seaborn`: Visualization
- `pandas`, `numpy`: Data handling
- `scipy`: Statistical tests
Technical Difficulty: High ⚠️
This skill involves advanced statistical modeling. Results should be reviewed by a biostatistician, especially for:
- Proportional hazards assumption violations
- Small sample sizes (< 30 per group)
- Heavy censoring (> 50%)
- Time-varying covariates
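
Two of these conditions, small groups and heavy censoring, can be screened for automatically before the analysis runs. The sketch below is one possible way to do so; the helper name `review_flags` is hypothetical, and applying the 50% censoring threshold per group (rather than overall) is an assumption.

```python
from collections import defaultdict


def review_flags(groups, events, min_n=30, max_censored=0.5):
    """List groups that are small or heavily censored (thresholds from the text)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n, n_censored]
    for g, e in zip(groups, events):
        counts[g][0] += 1
        counts[g][1] += (e == 0)
    flags = []
    for g, (n, c) in sorted(counts.items()):
        if n < min_n:
            flags.append(f"{g}: small sample (n={n})")
        if c / n > max_censored:
            flags.append(f"{g}: heavy censoring ({c / n:.0%})")
    return flags


# Group A: 40 subjects, 10 censored. Group B: 10 subjects, 6 censored.
groups = ["A"] * 40 + ["B"] * 10
events = [1] * 30 + [0] * 10 + [0] * 6 + [1] * 4
flags = review_flags(groups, events)
print(flags)
# → ['B: small sample (n=10)', 'B: heavy censoring (60%)']
```

A check like this only flags data for review; it does not replace inspection by a biostatistician.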
References
See references/ folder for:
- Kaplan EL, Meier P (1958) original paper
- Cox DR (1972) regression models paper
- Sample datasets for testing
- Clinical reporting guidelines (ATN, CONSORT)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--input` | str | Required | Input CSV file path |
| `--time` | str | Required | Column name for survival time |
| `--event` | str | Required | Column name for event indicator (1=event, 0=censored) |
| `--group` | str | Optional | Column name for grouping variable |
| `--output` | str | Required | Output directory for results |
| `--conf-level` | float | 0.95 | Confidence level |
| `--risk-table` | flag | Optional | Include risk table in plot |
| `--figsize` | str | 10,8 | Figure size (width,height) |
| `--dpi` | int | 300 | Figure resolution |
Example
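
    # Basic survival curve
    python scripts/main.py \
      --input clinical_data.csv \
      --time overall_survival_months \
      --event death \
      --group treatment_arm \
      --output ./results/ \
      --risk-table
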
Output includes:
- Survival curves with 95% confidence bands
- Median survival: Drug A = 28.4 months (95% CI: 24.1-32.7), Placebo = 18.2 months (95% CI: 15.3-21.1)
- Log-rank test p-value: 0.0023
- Hazard ratio: 0.62 (95% CI: 0.45-0.85), p = 0.003
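
The hazard ratio, its interval, and its p-value can be cross-checked against one another, because a Wald interval is symmetric on the log scale. The sketch below assumes the reported interval is a 95% Wald interval from the Cox model, which the example output does not state explicitly. It recovers the standard error of log(HR) from the interval width, then recomputes the z statistic and two-sided p-value.

```python
import math

hr, ci_low, ci_high = 0.62, 0.45, 0.85  # values from the example output

log_hr = math.log(hr)
# A 95% Wald interval spans log(HR) ± 1.96 × SE, so SE = width / (2 × 1.96).
se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.959964)
z = log_hr / se
# Two-sided p-value from the standard normal: 2 × (1 - Φ(|z|)) = erfc(|z| / √2).
p = math.erfc(abs(z) / math.sqrt(2))

print(f"SE = {se:.3f}, z = {z:.2f}, p = {p:.4f}")
```

Under that assumption the recomputed p-value comes out close to 0.003, in line with the value reported for the hazard ratio.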
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
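
The path-related items in this checklist (no `../` traversal, output restricted to the workspace) could be enforced with a check along these lines. This is a sketch of one common approach, not code taken from the tool; the helper name `resolve_inside` and the `/tmp/workspace` location are hypothetical.

```python
from pathlib import Path


def resolve_inside(workspace: str, user_path: str) -> Path:
    """Resolve user_path relative to workspace and refuse anything outside it."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so that case is caught too.
    target = (root / user_path).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return target


ok = resolve_inside("/tmp/workspace", "results/km_curve.png")
print(ok.name)  # → km_curve.png

try:
    resolve_inside("/tmp/workspace", "../etc/passwd")
except ValueError as exc:
    print("rejected:", exc)
```

Resolving both paths before comparing them means `..` segments and symlinks are normalized first, rather than relying on string matching against `../`.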
Prerequisites

    # Python dependencies
    pip install -r requirements.txt

Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support
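Survival Analysis (Kaplan-Meier)
Kaplan-Meier survival analysis tool for clinical and biological research. Generates publication-ready survival curves with statistical tests.
Features
- Kaplan-Meier Curve Generation: Publication-quality survival plots with confidence intervals
- Statistical Tests: Log-rank test, Wilcoxon test, Peto-Peto test
- Hazard Ratios: Cox proportional hazards regression with 95% CI
- Summary Statistics: Median survival time, restricted mean survival time (RMST)
- Multi-group Analysis: Supports 2+ comparison groups
- Risk Tables: Optional at-risk table below curves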
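Usage
Python Script

    python scripts/main.py --input data.csv --time time_col --event event_col --group group_col --output results/

Arguments
| Argument | Description | Required |
|---|---|---|
| `--input` | Input CSV file path | Yes |
| `--time` | Column name for survival time | Yes |
| `--event` | Column name for event indicator (1=event, 0=censored) | Yes |
| `--group` | Column name for grouping variable | Optional |
| `--output` | Output directory for results | Yes |
| `--conf-level` | Confidence level (default: 0.95) | Optional |
| `--risk-table` | Include risk table in plot | Optional |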
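Input Format
CSV with columns:
- Time column: Numeric, time to event or censoring
- Event column: Binary (1 = event occurred, 0 = censored/right-censored)
- Group column: Categorical variable for stratification
Example:

    patient_id,time_months,death,treatment_group
    P001,24.5,1,Drug_A
    P002,36.2,0,Drug_A
    P003,18.7,1,Placebo

Output Files
- `km_curve.png`: Kaplan-Meier survival curve
- `km_curve.pdf`: Vector version for publications
- `survival_stats.csv`: Statistical summary (median survival, confidence intervals)
- `hazard_ratios.csv`: Cox regression results with HR and 95% CI
- `logrank_test.csv`: Pairwise comparison p-values
- `report.txt`: Human-readable summary report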
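Technical Details
Statistical Methods
1. Kaplan-Meier Estimator: Non-parametric maximum likelihood estimate of the survival function
   - Product-limit estimator: Ŝ(t) = Π_{tᵢ ≤ t} (1 - dᵢ/nᵢ), where dᵢ is the number of events and nᵢ the number at risk at event time tᵢ
   - Greenwood's formula for variance estimation
2. Log-Rank Test: Most widely used test for comparing survival curves
   - Null hypothesis: No difference in survival between groups
   - Gives equal weight to every event time; the Wilcoxon (Gehan-Breslow) variant instead weights each event time by the number at risk
3. Cox Proportional Hazards: Semi-parametric regression model
   - h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ...)
   - Proportional hazards assumption checked via Schoenfeld residuals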
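Dependencies
- `lifelines`: Core survival analysis library
- `matplotlib`, `seaborn`: Visualization
- `pandas`, `numpy`: Data handling
- `scipy`: Statistical tests
Technical Difficulty: High ⚠️
This skill involves advanced statistical modeling. Results should be reviewed by a biostatistician, especially for:
- Proportional hazards assumption violations
- Small sample sizes (< 30 per group)
- Heavy censoring (> 50%)
- Time-varying covariates
References
See references/ folder for:
- Kaplan EL, Meier P (1958) original paper
- Cox DR (1972) regression models paper
- Sample datasets for testing
- Clinical reporting guidelines (ATN, CONSORT)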
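Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--input` | str | Required | Input CSV file path |
| `--time` | str | Required | Column name for survival time |
| `--event` | str | Required | Column name for event indicator (1=event, 0=censored) |
| `--group` | str | Optional | Column name for grouping variable |
| `--output` | str | Required | Output directory for results |
| `--conf-level` | float | 0.95 | Confidence level |
| `--risk-table` | flag | Optional | Include risk table in plot |
| `--figsize` | str | 10,8 | Figure size (width,height) |
| `--dpi` | int | 300 | Figure resolution |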
Example

    # Basic survival curve
    python scripts/main.py \
      --input clinical_data.csv \
      --time overall_survival_months \
      --event death \
      --group treatment_arm \
      --output ./results/ \
      --risk-table

Output includes:
- Survival curves with 95% confidence bands
- Median survival: Drug A = 28.4 months (95% CI: 24.1-32.7), Placebo = 18.2 months (95% CI: 15.3-21.1)
- Log-rank test p-value: 0.0023
- Hazard ratio: 0.62 (95% CI: 0.45-0.85), p = 0.003
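Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited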
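Prerequisites

    # Python dependencies
    pip install -r requirements.txt

Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support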