Research Code Reproducibility Refactoring Tool
Workflow Overview
Follow this sequence when refactoring a research codebase:
1. Analyze: identify reproducibility issues in the existing code
2. Refactor: apply documentation, parameterization, and error handling
3. Specify environment: pin dependencies and create environment files
4. Document: write a README with complete reproduction instructions
5. Validate: run tests and verify that behaviour is unchanged
Step 1: Analyze Code for Reproducibility Issues
Read each source file and check for the following problems. Document findings before making any changes.
Checklist: missing docstrings · hardcoded absolute paths · missing random seeds · bare except: clauses · imports of packages with unpinned versions · unexplained magic numbers
Example — detecting issues manually:
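python
    import ast
    import pathlib

    def find_hardcoded_paths(source: str) -> list[str]:
        """Return string literals that look like absolute paths."""
        tree = ast.parse(source)
        return [
            node.value for node in ast.walk(tree)
            if isinstance(node, ast.Constant)
            and isinstance(node.value, str)
            and node.value.startswith("/")
        ]

    source = pathlib.Path("analysis.py").read_text()
    print(find_hardcoded_paths(source))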
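Another checklist item, bare `except:` clauses, can be located with the standard-library `ast` module. The sketch below is one possible approach, not part of the tool itself; the helper name `find_bare_excepts` and the `SAMPLE` snippet are hypothetical and exist only for illustration.

```python
import ast


def find_bare_excepts(source: str) -> list[int]:
    """Return the line numbers of bare `except:` clauses in Python source."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        # A bare `except:` is an ExceptHandler whose `type` is None.
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


# Hypothetical snippet used only to exercise the helper; it is parsed, never run.
SAMPLE = '''
try:
    risky()
except:
    pass

try:
    risky()
except ValueError:
    pass
'''

print(find_bare_excepts(SAMPLE))
```

Only the first handler is reported, because the second one names `ValueError`. Recording line numbers rather than fixing the clauses automatically matches the instruction to document findings before changing anything.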
Step 2: Refactor for Best Practices
Apply improvements in place. Always back up originals first.
2a. Add docstrings
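python
    import pandas as pd

    # Before
    def load_data(path):
        return pd.read_csv(path)

    # After
    def load_data(path: str) -> pd.DataFrame:
        """Load a CSV dataset from disk.

        Parameters
        ----------
        path : str
            Path to the CSV file, relative to the project root.

        Returns
        -------
        pd.DataFrame
            The raw dataset with its original column names.
        """
        return pd.read_csv(path)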
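Functions that still lack a docstring can be listed mechanically before writing any. This is a minimal sketch under the assumption that "public" means "name does not start with an underscore"; `find_undocumented_functions` and `SAMPLE` are hypothetical names introduced here.

```python
import ast


def find_undocumented_functions(source: str) -> list[str]:
    """Return names of public functions and methods that have no docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Assumption: a leading underscore marks a private helper.
            is_public = not node.name.startswith("_")
            if is_public and ast.get_docstring(node) is None:
                missing.append(node.name)
    return sorted(missing)


# Hypothetical module text used only to exercise the helper.
SAMPLE = '''
def load_data(path):
    return path

def clean(df):
    """Drop empty rows."""
    return df

def _helper():
    return None
'''

print(find_undocumented_functions(SAMPLE))
```

The result can be turned into a to-do list for step 2a, and rerun at the end to tick the "all public functions have docstrings" box.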
2b. Parameterize hardcoded values
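python
    import argparse
    from pathlib import Path
    import pandas as pd

    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--data", type=Path, default=Path("data/raw.csv"))
        parser.add_argument("--output", type=Path, default=Path("results/"))
        return parser.parse_args()

    args = parse_args()
    df = pd.read_csv(args.data)
    args.output.mkdir(parents=True, exist_ok=True)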
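Unexplained magic numbers can be treated like hardcoded paths: give each one a name and expose it as a command-line option. The sketch below assumes two illustrative parameters, `--threshold` and `--seed`; neither the option names nor the default values come from the original project.

```python
import argparse
from pathlib import Path

# Hypothetical defaults; names and values are illustrative only.
DEFAULT_THRESHOLD = 0.5  # e.g. a classification cut-off
DEFAULT_SEED = 42


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that exposes paths and tunable constants."""
    parser = argparse.ArgumentParser(description="Run the analysis.")
    parser.add_argument("--data", type=Path, default=Path("data/raw.csv"))
    parser.add_argument("--output", type=Path, default=Path("results/"))
    parser.add_argument("--threshold", type=float, default=DEFAULT_THRESHOLD)
    parser.add_argument("--seed", type=int, default=DEFAULT_SEED)
    return parser


# Passing an explicit list keeps the parser testable without touching sys.argv.
args = build_parser().parse_args(["--threshold", "0.7"])
print(args.data, args.threshold, args.seed)
```

Keeping the defaults as module-level constants means the values used for a published result stay visible in one place even when nobody passes the flags.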
2c. Set random seeds
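python
    import random
    import numpy as np

    SEED = 42  # document this constant at module level
    random.seed(SEED)
    np.random.seed(SEED)

    # scikit-learn
    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(random_state=SEED)

    # PyTorch
    import torch
    torch.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True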
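A quick way to confirm that seeding takes effect is to reseed and check that the same numbers come back. The helper `set_global_seed` below is a hypothetical convenience wrapper, not an API of any library; it seeds the standard-library generator and, only if NumPy happens to be installed, NumPy's legacy global generator as well.

```python
import random


def set_global_seed(seed: int) -> None:
    """Seed the standard-library RNG and, if available, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional in this sketch
    np.random.seed(seed)


def draw(n: int = 3) -> list[float]:
    """Draw n pseudo-random floats from the standard-library generator."""
    return [random.random() for _ in range(n)]


set_global_seed(42)
first = draw()
set_global_seed(42)
second = draw()
print(first == second)
```

Seeding does not cover every source of nondeterminism. For example, the iteration order of sets of strings depends on Python's hash randomization, which is controlled by the `PYTHONHASHSEED` environment variable and has to be set before the interpreter starts.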
2d. Add error handling and logging
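python
    import logging
    from pathlib import Path
    import pandas as pd

    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
    logger = logging.getLogger(__name__)

    def load_data(path: Path) -> pd.DataFrame:
        """Load a dataset with validation."""
        if not path.exists():
            raise FileNotFoundError(f"Data file not found: {path}")
        logger.info("Loading data from %s", path)
        df = pd.read_csv(path)
        if df.empty:
            raise ValueError(f"Loaded dataframe is empty: {path}")
        logger.info("Loaded %d rows, %d columns", *df.shape)
        return df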
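Bare `except:` clauses from the Step 1 checklist are typically replaced by handlers for the specific exceptions that can occur. The following standard-library sketch shows one way to do that for a JSON configuration file; `load_config` and the file names are hypothetical and only serve the demonstration.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def load_config(path: Path) -> dict:
    """Load a JSON config file, failing loudly with a specific exception."""
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        logger.error("Config file not found: %s", path)
        raise
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Re-raise with the file name attached so the message names the culprit.
        raise ValueError(f"Invalid JSON in {path}: {exc}") from exc


# Demonstration on temporary files so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "config.json"
    good.write_text('{"seed": 42}', encoding="utf-8")
    config = load_config(good)

    bad = Path(tmp) / "broken.json"
    bad.write_text("{not json", encoding="utf-8")
    try:
        load_config(bad)
    except ValueError as exc:
        error_message = str(exc)

print(config)
```

Catching only `FileNotFoundError` and `json.JSONDecodeError` lets unexpected failures, such as a `KeyboardInterrupt`, propagate instead of being silently swallowed.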
Step 3: Generate Environment Specifications
See references/environment-setup.md for full Dockerfile and Conda environment templates.
requirements.txt (pip)
bash
    pip install pipreqs
    pipreqs src/ --savepath requirements.txt --force
Verify resolution:
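bash
    python -m venv .venv_test && source .venv_test/bin/activate
    pip install -r requirements.txt
    python -c "import pandas, numpy, sklearn"
    deactivate && rm -rf .venv_test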
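Before installing, it can help to confirm that every entry in `requirements.txt` is actually pinned. The sketch below is a deliberately simple check, assuming plain `name==version` lines; it would also flag valid entries that use environment markers or URLs, so treat its output as a prompt for review. `find_unpinned` and `SAMPLE` are hypothetical names.

```python
import re

# Matches "name==version", optionally with extras such as "pkg[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+$")


def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with `==`."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical file content used only to exercise the helper.
SAMPLE = """\
numpy==1.24.3
pandas>=2.0
scikit-learn
matplotlib==3.7.1  # plotting
"""

print(find_unpinned(SAMPLE))
```

In practice the text would come from `Path("requirements.txt").read_text()`; an empty result means every line carries an exact version.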
environment.yml (Conda)
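yaml
    name: my-research-env
    channels:
      - conda-forge
      - defaults
    dependencies:
      - python=3.9
      - numpy=1.24.3
      - pandas=2.0.1
      - scikit-learn=1.2.2
      - matplotlib=3.7.1
      - pip
      - pip:
          - some-pip-only-package==0.5.0
bash
    conda env create -f environment.yml
    conda activate my-research-env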
Step 4: Create Documentation
README structure
Generate a README.md containing at minimum:
markdown
## Requirements
## Installation
bash
conda env create -f environment.yml
conda activate my-research-env
## Data
<!-- Describe input data format, source, and where to place files -->
## Running the Analysis
bash
python main.py --data data/raw.csv --output results/
## Expected Outputs
<!-- Describe files created and how to interpret them -->
## Reproducing Results
- Random seed: 42 (set in `config.py`)
- Hardware: results validated on CPU; GPU results may differ slightly
Step 5: Validate Reproducibility
After all changes, verify that behaviour is unchanged:
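bash
    # 1. Run the full pipeline and capture output checksums
    #    (checksums_original.md5 is the baseline captured before refactoring)
    python main.py --data data/raw.csv --output results/
    md5sum results/*.csv > checksums_refactored.md5
    diff checksums_original.md5 checksums_refactored.md5
    # 2. Run the unit tests
    pytest tests/ -v --tb=short
    # 3. Confirm determinism across two clean runs
    python main.py --output results_run1/
    python main.py --output results_run2/
    diff -r results_run1/ results_run2/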
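Where `md5sum` and `diff` are unavailable, for example on Windows, the comparison of two runs can be done in Python. This sketch swaps in SHA-256 from `hashlib` instead of MD5; the function names and the `metrics.csv` placeholder contents are hypothetical and only demonstrate the idea.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file's path relative to `root` to its SHA-256 hex digest."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_runs(run_a: Path, run_b: Path) -> list[str]:
    """Return relative paths that are missing from one run or differ in content."""
    a, b = checksum_tree(run_a), checksum_tree(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Demonstration with placeholder output files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    run1, run2 = Path(tmp) / "results_run1", Path(tmp) / "results_run2"
    for run in (run1, run2):
        run.mkdir()
        (run / "metrics.csv").write_text("accuracy\n0.91\n")
    identical = diff_runs(run1, run2)

    # Simulate a non-deterministic second run by changing one output.
    (run2 / "metrics.csv").write_text("accuracy\n0.90\n")
    changed = diff_runs(run1, run2)

print(identical, changed)
```

An empty list means the two output directories are byte-for-byte identical. Outputs that embed timestamps or unordered collections will differ even when the analysis itself is deterministic, so those may need to be normalized before comparison.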
Reproducibility verification checklist:
- [ ] Output checksums match the pre-refactor baseline
- [ ] All tests pass
- [ ] Pipeline runs twice and produces identical outputs
- [ ] requirements.txt / environment.yml installs cleanly in a fresh environment
- [ ] No absolute paths remain in source files
- [ ] Random seeds are set and documented
- [ ] All public functions have docstrings
- [ ] README contains complete reproduction instructions
Best Practices Summary
| Practice |
|---|
| Relative paths only |
| Pin dependency versions |
| Set random seeds |
| Docstrings on all public functions |
| Validate outputs against a baseline |
| Automate environment setup |
References
- references/guide.md: Comprehensive user guide
- references/environment-setup.md: Dockerfile and full environment templates
- references/examples/: Working code examples
- references/api-docs/: Complete API documentation
Skill ID: 455 | Version: 1.0 | License: MIT
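Research Code Reproducibility Refactoring Tool
Workflow Overview
Follow this sequence when refactoring a research codebase:
1. Analyze: identify reproducibility issues in the existing code
2. Refactor: apply documentation, parameterization, and error handling
3. Specify environment: pin dependencies and create environment files
4. Document: write a README with complete reproduction instructions
5. Validate: run tests and confirm that behaviour is unchanged
Step 1: Analyze Code for Reproducibility Issues
Read each source file and check for the following problems. Document findings before making any changes.
Checklist: missing docstrings · hardcoded absolute paths · missing random seeds · bare except: clauses · imports of packages with unpinned versions · unexplained magic numbers
Example of detecting issues manually:
python
    import ast
    import pathlib

    def find_hardcoded_paths(source: str) -> list[str]:
        """Return string literals that look like absolute paths."""
        tree = ast.parse(source)
        return [
            node.value for node in ast.walk(tree)
            if isinstance(node, ast.Constant)
            and isinstance(node.value, str)
            and node.value.startswith("/")
        ]

    source = pathlib.Path("analysis.py").read_text()
    print(find_hardcoded_paths(source))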
Step 2: Refactor for Best Practices
Apply improvements in place. Always back up the original files first.
2a. Add docstrings
python
    import pandas as pd

    # Before
    def load_data(path):
        return pd.read_csv(path)

    # After
    def load_data(path: str) -> pd.DataFrame:
        """Load a CSV dataset from disk.

        Parameters
        ----------
        path : str
            Path to the CSV file, relative to the project root.

        Returns
        -------
        pd.DataFrame
            The raw dataset with its original column names.
        """
        return pd.read_csv(path)
2b. Parameterize hardcoded values
python
    import argparse
    from pathlib import Path
    import pandas as pd

    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--data", type=Path, default=Path("data/raw.csv"))
        parser.add_argument("--output", type=Path, default=Path("results/"))
        return parser.parse_args()

    args = parse_args()
    df = pd.read_csv(args.data)
    args.output.mkdir(parents=True, exist_ok=True)
2c. Set random seeds
python
    import random
    import numpy as np

    SEED = 42  # document this constant at module level
    random.seed(SEED)
    np.random.seed(SEED)

    # scikit-learn
    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(random_state=SEED)

    # PyTorch
    import torch
    torch.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True
2d. Add error handling and logging
python
    import logging
    from pathlib import Path
    import pandas as pd

    logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
    logger = logging.getLogger(__name__)

    def load_data(path: Path) -> pd.DataFrame:
        """Load a dataset with validation."""
        if not path.exists():
            raise FileNotFoundError(f"Data file not found: {path}")
        logger.info("Loading data from %s", path)
        df = pd.read_csv(path)
        if df.empty:
            raise ValueError(f"Loaded dataframe is empty: {path}")
        logger.info("Loaded %d rows, %d columns", *df.shape)
        return df
Step 3: Generate Environment Specifications
See references/environment-setup.md for full Dockerfile and Conda environment templates.
requirements.txt (pip)
bash
    pip install pipreqs
    pipreqs src/ --savepath requirements.txt --force
Verify dependency resolution:
bash
    python -m venv .venv_test && source .venv_test/bin/activate
    pip install -r requirements.txt
    python -c "import pandas, numpy, sklearn"
    deactivate && rm -rf .venv_test
environment.yml (Conda)
yaml
    name: my-research-env
    channels:
      - conda-forge
      - defaults
    dependencies:
      - python=3.9
      - numpy=1.24.3
      - pandas=2.0.1
      - scikit-learn=1.2.2
      - matplotlib=3.7.1
      - pip
      - pip:
          - some-pip-only-package==0.5.0
bash
    conda env create -f environment.yml
    conda activate my-research-env
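Step 4: Create Documentation
README structure
Generate a README.md containing at minimum:
markdown
## Requirements
## Installation
bash
conda env create -f environment.yml
conda activate my-research-env
## Data
<!-- Describe input data format, source, and where to place files -->
## Running the Analysis
bash
python main.py --data data/raw.csv --output results/
## Expected Outputs
<!-- Describe files created and how to interpret them -->
## Reproducing Results
- Random seed: 42 (set in `config.py`)
- Hardware: results validated on CPU; GPU results may differ slightly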
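Step 5: Validate Reproducibility
After all changes, confirm that behaviour is unchanged:
bash
    # 1. Run the full pipeline and capture output checksums
    #    (checksums_original.md5 is the baseline captured before refactoring)
    python main.py --data data/raw.csv --output results/
    md5sum results/*.csv > checksums_refactored.md5
    diff checksums_original.md5 checksums_refactored.md5
    # 2. Run the unit tests
    pytest tests/ -v --tb=short
    # 3. Confirm determinism across two clean runs
    python main.py --output results_run1/
    python main.py --output results_run2/
    diff -r results_run1/ results_run2/
Reproducibility verification checklist:
- [ ] Output checksums match the pre-refactor baseline
- [ ] All tests pass
- [ ] Pipeline runs twice and produces identical outputs
- [ ] requirements.txt / environment.yml installs cleanly in a fresh environment
- [ ] No absolute paths remain in source files
- [ ] Random seeds are set and documented
- [ ] All public functions have docstrings
- [ ] README contains complete reproduction instructions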
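Best Practices Summary
| Practice |
|---|
| Relative paths only |
| Pin dependency versions |
| Set random seeds |
| Docstrings on all public functions |
| Validate outputs against a baseline |
| Automate environment setup |
References
- references/guide.md: Comprehensive user guide
- references/environment-setup.md: Dockerfile and full environment templates
- references/examples/: Working code examples
- references/api-docs/: Complete API documentation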