Simulacrum Data Annotation Workflow
Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).
What This Skill Does
This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:
1. Find Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data
2. Download: Get CSV files via browser or Kaggle CLI
3. Clean: Run the Python/pandas script to handle missing values, duplicates, and formatting
4. Upload RAW: Upload the original CSV with metadata (name, domain, source URL, description)
5. Configure Headers: Set column types (Time, Target, Covariate, Group) and units
6. Assign Groups: Select ALL variables (target + covariates), apply ALL group tags
7. Upload Cleaned: Final upload → CLEAN status
Supported Domains
- Energy: Power consumption, utilities, renewable energy, grid data
- Manufacturing: Industrial processes, steel production, emissions, equipment data
- Climate: CO2 emissions, environmental monitoring, weather correlation data
Quick Start
For the full pipeline from Kaggle to annotated dataset:
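1. Find a dataset on Kaggle
2. Download it (browser or kaggle CLI)
3. Clean it with scripts/clean_dataset.py
4. Upload the RAW dataset to data.smlcrm.com (with metadata)
5. Click Clean and upload the cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Upload the cleaned dataset → CLEAN status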
Workflow Steps
Step 1: Find and Download Dataset
From Kaggle (Browser Method):
1. Navigate to kaggle.com/datasets
2. Search for a relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
3. Review the data description, file list, and preview
4. Click the "Download" button
5. Extract the CSV file from the downloaded zip
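The extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; `extract_csvs` is a hypothetical helper name, and the archive layout (CSV files stored as ordinary zip members) is an assumption about what Kaggle delivers for a given dataset.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, out_dir):
    """Extract only the .csv members of a zip archive and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".csv"):
                # extract() returns the path of the file it wrote
                extracted.append(Path(archive.extract(member, path=out_dir)))
    return extracted
```

Calling `extract_csvs("archive.zip", "data")` would leave any non-CSV files (licenses, notebooks) inside the archive and return the list of CSV paths to pass to the cleaning step.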
Alternative: Kaggle CLI
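    # Install if needed: pip install kaggle
    # Configure: kaggle competitions list
    scripts/download_kaggle.sh <dataset-name> [output-dir]
    # Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption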
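The contents of scripts/download_kaggle.sh are not reproduced in this guide. As a rough equivalent, the sketch below builds the standard `kaggle datasets download` command from Python; `kaggle_download_command` and `download` are hypothetical names, and whether the shell script uses the same flags is an assumption.

```python
import subprocess


def kaggle_download_command(dataset_slug, out_dir="data"):
    """Build the argument list for downloading and unzipping a Kaggle dataset."""
    if dataset_slug.count("/") != 1:
        raise ValueError("expected a slug of the form 'owner/dataset-name'")
    return ["kaggle", "datasets", "download", "-d", dataset_slug, "-p", out_dir, "--unzip"]


def download(dataset_slug, out_dir="data"):
    # Requires the kaggle package to be installed and API credentials configured.
    subprocess.run(kaggle_download_command(dataset_slug, out_dir), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to inspect or log the command without touching the network.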
Step 2: Clean the Dataset
Always run the cleaning script before upload:
    python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]
What the script does:
- Strips whitespace from column names
- Removes duplicate rows
- Fills missing numeric values with median
- Fills missing categorical values with mode or 'Unknown'
- Converts timestamp columns to datetime format
- Outputs column summary for metadata configuration
Output:
- Cleaned CSV file ready for upload
- Column summary printed to console (save this for metadata config)
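The cleaning script itself is not shown in this guide. The pandas sketch below illustrates the listed steps under stated assumptions: timestamp columns are named explicitly by the caller rather than detected, and 'Unknown' is used only when a categorical column has no mode. `clean_frame` and `summarize` are hypothetical names, not the script's actual functions.

```python
import pandas as pd


def clean_frame(df, time_columns=()):
    """Apply the cleaning steps listed above to a copy of df."""
    df = df.copy()
    df.columns = [str(c).strip() for c in df.columns]
    df = df.drop_duplicates().reset_index(drop=True)

    for col in df.columns:
        if col in time_columns:
            # Unparseable timestamps become NaT instead of raising.
            df[col] = pd.to_datetime(df[col], errors="coerce")
        elif pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            # Assumption: fall back to 'Unknown' only when no mode exists.
            mode = df[col].mode(dropna=True)
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")
    return df


def summarize(df):
    """Return (column, dtype, missing count) rows for metadata configuration."""
    return [(c, str(df[c].dtype), int(df[c].isna().sum())) for c in df.columns]
```

Median filling is robust to outliers but flattens gaps in a time series, so it is worth checking the missing counts from `summarize` on the raw frame before relying on the filled values.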
Step 3: Upload Raw Dataset to Platform
1. Navigate to data.smlcrm.com/dashboard
2. Click the "Upload Dataset" button
3. Fill in metadata for the RAW dataset:
   - Name: Descriptive dataset name
   - Domain: Category (Energy, Manufacturing, Climate, etc.)
   - Source URL: Kaggle or original source URL
   - Description: Brief summary of the dataset
4. Upload the original/raw CSV file (not the cleaned one yet)
5. Click Upload
Result: Dataset appears in list with RAW status
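Before filling in the form, it can help to hold the four metadata fields in one record and check them locally. This is a sketch only: `RawDatasetMetadata` and `KNOWN_DOMAINS` are hypothetical, the platform exposes no API in this guide, and its real validation rules are not documented here.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Domains named in this guide; the form may accept others ("etc.").
KNOWN_DOMAINS = {"Energy", "Manufacturing", "Climate"}


@dataclass
class RawDatasetMetadata:
    name: str
    domain: str
    source_url: str
    description: str

    def problems(self):
        """Return a list of human-readable issues; empty means nothing was flagged."""
        issues = []
        for field_name in ("name", "domain", "source_url", "description"):
            if not getattr(self, field_name).strip():
                issues.append(f"{field_name} is empty")
        parsed = urlparse(self.source_url)
        if self.source_url.strip() and (parsed.scheme not in ("http", "https") or not parsed.netloc):
            issues.append("source_url is not an http(s) URL")
        if self.domain.strip() and self.domain not in KNOWN_DOMAINS:
            issues.append(f"domain {self.domain!r} is not one of the domains listed in this guide")
        return issues
```

An unknown domain is reported as an issue rather than raised as an error, since the form's domain list may be longer than the three named here.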
Step 4: Upload Cleaned File & Configure Metadata
1. Find the RAW dataset in the list
2. Click the "Clean" button
3. Upload the cleaned CSV file (from Step 2)
4. Configure headers for each column:
| Setting | Description |
|---|---|
| Name | Column name (editable) |
| Units | Measurement units (kWh, °C, %, ratio, tCO2, etc.) |
| Type | Time / Target / Covariate / Group |
Column Type Guide:
- Time: Timestamp/datetime columns (usually required)
- Target: Variable to predict (at least one required)
- Covariate: Input features/independent variables
- Group: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)
Bulk Configuration:
- Select multiple rows via their checkboxes
- Use the "Apply" dropdown to set the type for the selected columns
- Set units individually or in bulk
Common Unit Patterns:
- Energy: kWh, MWh
- Power: kW, MW
- Reactive energy: kVarh
- Emissions: tCO2, kgCO2
- Ratios: ratio, %
- Time: seconds, minutes, hours
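The column type rules and the bulk "Apply" action can be mirrored offline to catch configuration gaps before opening the platform. The sketch below is illustrative: `apply_type` and `validate` are hypothetical helpers, the `{column: {"type", "units"}}` layout is an assumed representation, and a missing Time column is treated as a problem even though the guide says Time is only "usually required".

```python
VALID_TYPES = {"Time", "Target", "Covariate", "Group"}


def apply_type(config, selected, column_type):
    """Mimic the bulk "Apply" action: set one type on every selected column."""
    if column_type not in VALID_TYPES:
        raise ValueError(f"unknown column type: {column_type}")
    for name in selected:
        config[name]["type"] = column_type
    return config


def validate(config):
    """Return a list of problems with a {column: {"type", "units"}} mapping."""
    problems = []
    for name, settings in config.items():
        if settings.get("type") not in VALID_TYPES:
            problems.append(f"{name}: no valid type assigned")
    types = [settings.get("type") for settings in config.values()]
    if "Time" not in types:
        problems.append("no Time column set")
    if "Target" not in types:
        problems.append("no Target column set")
    return problems
```

An empty list from `validate` corresponds to the three conditions listed under Troubleshooting for a disabled "Next" button.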
Step 5: Assign Groups to Variables
Purpose: Group variables define how data is segmented for analysis.
Exact Workflow:
1. Select ALL variables by checking their checkboxes:
   - Target variable(s)
   - ALL covariate variables
2. Apply ALL group tags to the selected variables:
   - Click the first group tag (e.g., WeekStatus) → all selected variables get this group
   - Click the second group tag (e.g., Day_of_week) → all selected variables get this group
   - Click the third group tag (e.g., Load_Type) → all selected variables get this group
   - Continue for all available group tags
3. Result: All variables have all groups assigned (e.g., "WeekStatus × Day_of_week × Load_Type")
Important: Assign groups to BOTH target variables AND all covariates.
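The expected end state of this step can be written down as a small function, which is useful for checking the platform's display against what was intended. This is a sketch: `assign_groups` is a hypothetical helper, and the " × " label format is taken from the example string above rather than from any documented platform output.

```python
def assign_groups(column_types):
    """Attach every Group column to every Target and Covariate column.

    column_types maps column name -> "Time" | "Target" | "Covariate" | "Group".
    Returns {variable name: "GroupA × GroupB × ..."}.
    """
    groups = [name for name, kind in column_types.items() if kind == "Group"]
    variables = [name for name, kind in column_types.items() if kind in ("Target", "Covariate")]
    label = " × ".join(groups)
    return {variable: label for variable in variables}
```

Time and Group columns are deliberately left out of the result, since only targets and covariates receive group tags in this workflow.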
Step 6: Final Upload
1. Click the "Upload Cleaned Dataset" button
2. Wait for processing
3. Confirm the dataset status changes from RAW → CLEAN
4. Verify the data points count is correct
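How the platform counts "data points" is not stated in this guide. One reading that fits the steel example later in this guide is rows multiplied by non-time columns, since 350,400 equals 35,040 × 10. The sketch below computes that figure under that assumption; `count_data_points` is a hypothetical helper, and a mismatch should prompt a closer look rather than be taken as proof of an error.

```python
import csv
import io


def count_data_points(csv_text, time_column):
    """Return (rows, non-time columns, rows * non-time columns) for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Reading fieldnames consumes only the header row.
    value_columns = [c for c in reader.fieldnames if c != time_column]
    rows = sum(1 for _ in reader)
    return rows, len(value_columns), rows * len(value_columns)
```

For a file on disk, pass `Path("cleaned.csv").read_text()` as `csv_text`; the file name here is only a placeholder.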
Example: Steel Industry Energy Dataset
Source: https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption
Metadata:
- Name: Steel Industry Energy Consumption (South Korea)
- Domain: Energy
- Data Points: 350,400
Column Configuration:
| Column | Type | Units |
|---|---|---|
| Timestamps | Time | - |
| Usage_kWh | Target | kWh |
| Lagging_Current_Reactive.Power_kVarh | Covariate | kVarh |
| Leading_Current_Reactive_Power_kVarh | Covariate | kVarh |
| CO2(tCO2) | Covariate | tCO2 |
| Lagging_Current_Power_Factor | Covariate | ratio |
| Leading_Current_Power_Factor | Covariate | ratio |
| NSM | Covariate | seconds |
| WeekStatus | Group | - |
| Day_of_week | Group | - |
| Load_Type | Group | - |
Group Assignment:
1. Select: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM
2. Click: WeekStatus → all selected variables get WeekStatus
3. Click: Day_of_week → all selected variables get Day_of_week
4. Click: Load_Type → all selected variables get Load_Type
5. Final: All variables show "WeekStatus × Day_of_week × Load_Type"
Reference Materials
For detailed platform configuration guidance, see references/platform_guide.md.
Troubleshooting
"Next" button disabled:
- Check that at least one Time column is set
- Check that at least one Target column is set
- Verify that all columns have types assigned
Groups not appearing:
- Columns must be marked as "Group" type first
- Proceed to the next step after setting Group types
Upload fails:
- Re-run the cleaning script
- Check the CSV format (comma-delimited)
- Verify there are no empty column names
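The last two upload checks can be run locally before retrying. The sketch below inspects only the header row, using the standard `csv` module; `preflight` is a hypothetical helper, and the delimiter guess (whichever candidate splits the header into the most fields) is a heuristic, not the platform's own parser.

```python
import csv


def preflight(csv_text):
    """Return a list of problems that may block an upload; empty means none found."""
    lines = csv_text.splitlines()
    if not lines or not lines[0].strip():
        return ["file is empty or has a blank header line"]
    problems = []
    header_line = lines[0]
    # Compare candidate delimiters by how many fields they split the header into.
    counts = {d: len(next(csv.reader([header_line], delimiter=d))) for d in ",;\t|"}
    best = max(counts, key=counts.get)  # ties resolve to the comma, listed first
    if best != ",":
        problems.append(f"header looks {best!r}-delimited, not comma-delimited")
    header = next(csv.reader([header_line]))
    for index, name in enumerate(header):
        if not name.strip():
            problems.append(f"column {index} has an empty name")
    return problems
```

A trailing comma in the header is a common source of an empty column name, and this check reports it with the zero-based column index.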
Scripts
| Script | Purpose |
|---|---|
| scripts/clean_dataset.py | Clean and prepare CSV for upload |
| scripts/download_kaggle.sh | Download datasets via Kaggle CLI |
Platform URL
Data Annotation Platform: https://data.smlcrm.com