Research Code Reproducibility Refactoring Tool

Workflow Overview

Follow this sequence when refactoring a research codebase:

1. Analyze: identify reproducibility issues in existing code
2. Refactor: apply documentation, parameterization, and error handling
3. Specify environment: pin dependencies and create environment files
4. Document: write a README with complete reproduction instructions
5. Validate: run tests and verify behaviour is unchanged

Step 1: Analyze Code for Reproducibility Issues

Read each source file and check for the following problems. Document findings before making any changes.

Checklist: missing docstrings · hardcoded absolute paths · missing random seeds · bare except: clauses · unpinned imports · unexplained magic numbers

Step 2: Refactor for Best Practices

Apply improvements in place. Always back up originals first.

2a. Add docstrings

2b. Parameterize hardcoded values

2c. Set random seeds

2d. Add error handling and logging

Step 3: Generate Environment Specifications

See references/environment-setup.md for full Dockerfile and Conda environment templates.

requirements.txt (pip)

environment.yml (Conda)

Step 4: Create Documentation

README structure

Generate a README.md containing at minimum:

````markdown
## Requirements

## Installation
```bash
conda env create -f environment.yml
conda activate my-research-env
```

## Data
<!-- Describe input data format, source, and where to place files -->

## Running the Analysis
```bash
python main.py --data data/raw.csv --output results/
```

## Expected Outputs
<!-- Describe files created and how to interpret them -->

## Reproducing Results
- Random seed: 42 (set in `config.py`)
- Hardware: results validated on CPU; GPU results may differ slightly
````

Step 5: Validate Reproducibility

After all changes, verify that behaviour is unchanged.

Reproducibility verification checklist:

- [ ] Output checksums match pre-refactor baseline
- [ ] All tests pass
- [ ] Pipeline runs twice and produces identical outputs
- [ ] requirements.txt / environment.yml installs cleanly in a fresh environment
- [ ] No absolute paths remain in source files
- [ ] Random seeds are set and documented
- [ ] All public functions have docstrings
- [ ] README contains complete reproduction instructions

Best Practices Summary

- Relative paths only
- Pin dependency versions
- Set random seeds
- Docstrings on all public functions
- Validate outputs against a baseline
- Automate environment setup

References

- references/guide.md: Comprehensive user guide
- references/environment-setup.md: Dockerfile and full environment templates
- references/examples/: Working code examples
- references/api-docs/: Complete API documentation

Skill ID: 455 | Version: 1.0 | License: MIT

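Research Code Reproducibility Refactoring Tool

Workflow Overview

When refactoring a research codebase, follow this sequence:

1. Analyze: identify reproducibility issues in existing code
2. Refactor: apply documentation, parameterization, and error handling
3. Specify environment: pin dependencies and create environment files
4. Document: write a README with complete reproduction instructions
5. Validate: run tests and verify behaviour is unchanged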

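Step 1: Analyze Code for Reproducibility Issues

Read each source file and check for the following problems. Document findings before making any changes.

Checklist: missing docstrings · hardcoded absolute paths · missing random seeds · bare except: clauses · unpinned imports · unexplained magic numbers

Example of detecting issues manually:

```python
import ast
import pathlib


def find_hardcoded_paths(source: str) -> list[str]:
    """Return string literals that look like absolute paths."""
    tree = ast.parse(source)
    return [
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and node.value.startswith("/")  # POSIX-style absolute path
    ]


source = pathlib.Path("analysis.py").read_text()
print(find_hardcoded_paths(source))
```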
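The same `ast`-based approach can cover other checklist items. As a minimal sketch, the following flags bare `except:` handlers; the function name `find_bare_excepts` and the sample source string are illustrative assumptions, not part of the tool itself.

```python
import ast


def find_bare_excepts(source: str) -> list[int]:
    """Return line numbers of bare ``except:`` handlers in the given source."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        # A handler without an exception type is a bare ``except:``.
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


# Hypothetical snippet; it is only parsed, never executed.
sample = (
    "try:\n"
    "    risky()\n"
    "except:\n"
    "    pass\n"
)
print(find_bare_excepts(sample))  # prints [3]
```

Reporting line numbers rather than rewriting the handlers keeps this step in line with the instruction to document findings before changing anything.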

Step 2: Refactor for Best Practices

Apply improvements in place. Always back up the original files first.
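One way to take that backup is a timestamped copy of the source directory. This is only a sketch under assumptions: `backup_sources` and the directory names are hypothetical, and a version-control commit would serve the same purpose.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_sources(src_dir: Path, backup_root: Path) -> Path:
    """Copy src_dir into a timestamped folder under backup_root and return it."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{src_dir.name}-{stamp}"
    shutil.copytree(src_dir, target)  # raises if the target already exists
    return target


# Demonstration on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "analysis.py").write_text("x = 1\n")
    copy = backup_sources(src, Path(tmp) / "backups")
    print(sorted(p.name for p in copy.iterdir()))  # prints ['analysis.py']
```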

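2a. Add docstrings

```python
# Imported at module level so the pd.DataFrame annotation resolves
# when the function is defined.
import pandas as pd


# Before
def load_data(path):
    return pd.read_csv(path)


# After
def load_data(path: str) -> pd.DataFrame:
    """Load a CSV dataset from disk.

    Parameters
    ----------
    path : str
        Path to the CSV file (relative to the project root).

    Returns
    -------
    pd.DataFrame
        The raw dataset with its original column names.
    """
    return pd.read_csv(path)
```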

2b. Parameterize hardcoded values

```python
import argparse
from pathlib import Path

import pandas as pd


def parse_args() -> argparse.Namespace:
    """Read the input and output locations from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=Path, default=Path("data/raw.csv"))
    parser.add_argument("--output", type=Path, default=Path("results/"))
    return parser.parse_args()


args = parse_args()
df = pd.read_csv(args.data)
args.output.mkdir(parents=True, exist_ok=True)
```

2c. Set random seeds

```python
import random

import numpy as np

SEED = 42  # document this constant at module level

random.seed(SEED)
np.random.seed(SEED)

# scikit-learn
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=SEED)

# PyTorch
import torch
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
```
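Global seeding depends on the order in which random draws happen, so an unrelated call elsewhere can shift later results. One possible alternative, sketched here with the standard library only, is a generator private to each function; `sample_indices` is a hypothetical helper, and NumPy's `np.random.default_rng(seed)` follows the same idea.

```python
import random

SEED = 42  # illustrative seed value


def sample_indices(n_items: int, k: int, seed: int = SEED) -> list[int]:
    """Draw k distinct indices using a generator private to this call."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(range(n_items), k)


first = sample_indices(100, 5)
random.random()  # unrelated global draw between the two calls
second = sample_indices(100, 5)
print(first == second)  # prints True
```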

2d. Add error handling and logging

```python
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger(__name__)


def load_data(path: Path) -> pd.DataFrame:
    """Load a dataset with validation."""
    if not path.exists():
        raise FileNotFoundError(f"Data file not found: {path}")
    logger.info("Loading data from %s", path)
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Loaded DataFrame is empty: {path}")
    logger.info("Loaded %d rows, %d columns", *df.shape)
    return df
```
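The checklist in Step 1 also flags bare `except:` clauses. A rough sketch of one replacement pattern is to catch only the expected exception, log it, and re-raise with context. The `parse_config` function and its JSON input are assumptions made for illustration.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_config(text: str) -> dict:
    """Parse a JSON config string, failing loudly on malformed input."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:  # catch only the expected failure
        logger.error("Config is not valid JSON: %s", exc)
        raise ValueError("Invalid configuration file") from exc
    if not isinstance(config, dict):
        raise ValueError("Configuration must be a JSON object")
    return config


print(parse_config('{"seed": 42}'))  # prints {'seed': 42}
```

Chaining with `from exc` keeps the original parse error in the traceback, which helps when someone else reruns the analysis.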

Step 3: Generate Environment Specifications

See references/environment-setup.md for full Dockerfile and Conda environment templates.

requirements.txt (pip)

```bash
pip install pipreqs
pipreqs src/ --savepath requirements.txt --force
```

Verify resolution:
```bash
python -m venv .venv_test && source .venv_test/bin/activate
pip install -r requirements.txt
python -c "import pandas, numpy, sklearn"
deactivate && rm -rf .venv_test
```
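A clean install does not by itself show that every dependency is pinned. As a heuristic sketch, the following lists requirement lines that lack an exact `==` version. The helper name, the regular expression, and the sample text are assumptions, and the pattern does not cover every pip specifier form.

```python
import re

# Matches "name==version", optionally with extras such as "name[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^=\s]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with an exact ``==`` version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as -r
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


sample = "numpy==1.24.3\npandas>=2.0\nrequests\n# a comment\n"
print(unpinned_requirements(sample))  # prints ['pandas>=2.0', 'requests']
```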

environment.yml (Conda)

```yaml
name: my-research-env
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - numpy=1.24.3
  - pandas=2.0.1
  - scikit-learn=1.2.2
  - matplotlib=3.7.1
  - pip:
      - some-pip-only-package==0.5.0
```

```bash
conda env create -f environment.yml
conda activate my-research-env
```

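Step 4: Create Documentation

README structure

Generate a README.md containing at minimum:

````markdown
## Requirements

## Installation
```bash
conda env create -f environment.yml
conda activate my-research-env
```

## Data
<!-- Describe input data format, source, and where to place files -->

## Running the Analysis
```bash
python main.py --data data/raw.csv --output results/
```

## Expected Outputs
<!-- Describe files created and how to interpret them -->

## Reproducing Results
- Random seed: 42 (set in `config.py`)
- Hardware: results validated on CPU; GPU results may differ slightly
````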
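The seed and hardware notes in the README can also be recorded automatically with each run. This is a minimal sketch, assuming a hypothetical `run_manifest` helper whose output would be saved alongside the results. The field names are illustrative.

```python
import json
import platform


def run_manifest(seed: int) -> dict:
    """Collect the run details a reader needs to reproduce the results."""
    return {
        "seed": seed,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


# Print the manifest as JSON; a pipeline could write it into its output folder.
print(json.dumps(run_manifest(42), indent=2))
```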

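Step 5: Validate Reproducibility

After all changes, verify that behaviour is unchanged:

```bash
# 1. Run the full pipeline and capture output checksums
#    (checksums_original.md5 is the baseline captured before refactoring)
python main.py --data data/raw.csv --output results/
md5sum results/*.csv > checksums_refactored.md5
diff checksums_original.md5 checksums_refactored.md5

# 2. Run the unit tests
pytest tests/ -v --tb=short

# 3. Confirm determinism across two clean runs
python main.py --output results_run1/
python main.py --output results_run2/
diff -r results_run1/ results_run2/
```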
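Where `md5sum` is unavailable (for example on Windows), the comparison can be approximated in Python. The sketch below hashes every file under two output directories and reports the differences. The function names and the throwaway directories are assumptions, and SHA-256 stands in for MD5.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum_tree(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(baseline: Path, candidate: Path) -> list[str]:
    """List files that are missing from one side or differ in content."""
    old, new = checksum_tree(baseline), checksum_tree(candidate)
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))


# Demonstration with two throwaway output directories.
with tempfile.TemporaryDirectory() as tmp:
    run1, run2 = Path(tmp) / "run1", Path(tmp) / "run2"
    for run in (run1, run2):
        run.mkdir()
        (run / "metrics.csv").write_text("acc\n0.9\n")
    (run2 / "extra.csv").write_text("x\n")
    print(changed_files(run1, run2))  # prints ['extra.csv']
```

An empty list means the two runs produced byte-identical outputs.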

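Reproducibility verification checklist:

- [ ] Output checksums match pre-refactor baseline
- [ ] All tests pass
- [ ] Pipeline runs twice and produces identical outputs
- [ ] requirements.txt / environment.yml installs cleanly in a fresh environment
- [ ] No absolute paths remain in source files
- [ ] Random seeds are set and documented
- [ ] All public functions have docstrings
- [ ] README contains complete reproduction instructions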
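Some checklist items can be checked mechanically. As one hedged example, the docstring item can be approximated with `ast`, treating any name without a leading underscore as public. `missing_docstrings` and the sample source are illustrative assumptions.

```python
import ast


def missing_docstrings(source: str) -> list[str]:
    """Return names of public functions and classes that lack a docstring."""
    tree = ast.parse(source)
    return sorted(
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and not node.name.startswith("_")  # skip private helpers
        and ast.get_docstring(node) is None
    )


# Hypothetical module source; it is only parsed, never executed.
sample = (
    "def documented():\n"
    '    """Has a docstring."""\n'
    "\n"
    "def undocumented():\n"
    "    pass\n"
    "\n"
    "def _private():\n"
    "    pass\n"
)
print(missing_docstrings(sample))  # prints ['undocumented']
```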

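Best Practices Summary

- Relative paths only
- Pin dependency versions
- Set random seeds
- Docstrings on all public functions
- Validate outputs against a baseline
- Automate environment setup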

References

- references/guide.md: Comprehensive user guide
- references/environment-setup.md: Dockerfile and full environment templates
- references/examples/: Working code examples
- references/api-docs/: Complete API documentation

Skill ID: 455 | Version: 1.0 | License: MIT

