web-scraping (Web Scraping)

Extract structured information from websites using web_fetch for simple pages and browser automation for dynamic sites, login-gated flows, pagination, infinite scroll, or pages that require clicking/searching/filtering. Use when the user asks to scrape a site, collect listings, gather fields from many pages, monitor website changes, or turn webpage content into structured summaries/JSON/CSV.

Author: admin | Source: ClawHub

Web Scraping

Extract data with the lightest reliable method first.

Choose the approach

1. Use web_fetch for simple public pages when the needed content is already in the HTML.
2. Use browser when the site is dynamic or needs clicking, infinite scroll, filters, tabs, or login/session state.
3. Use web_search only to discover candidate pages when the target URL is unknown.
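
The decision above can be sketched as a small function. This is only an illustration: the `Target` class and its flags are hypothetical names introduced here, and the returned strings simply echo the tool names used in this skill.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Target:
    """Hypothetical description of a scraping target (an assumption of this sketch)."""
    url: Optional[str] = None         # None when the target URL is not yet known
    content_in_html: bool = True      # the needed content is already in the served HTML
    needs_interaction: bool = False   # clicking, filters, tabs, infinite scroll
    needs_login: bool = False         # login or session state required


def choose_tool(target: Target) -> str:
    """Return the name of the lightest tool that should work for this target."""
    if target.url is None:
        # Discovery only: re-evaluate once a candidate URL has been found.
        return "web_search"
    if target.needs_interaction or target.needs_login or not target.content_in_html:
        return "browser"
    return "web_fetch"
```

Checking the cheap case last keeps web_fetch as the default, so the heavier tools are chosen only when a specific condition calls for them.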

Default workflow

1. Identify the target site and exact fields to collect.
2. Test one page first.
3. Decide the extraction method:
   - web_fetch for readable article/listing text
   - browser snapshot for dynamic DOM inspection
4. Normalize the output into a stable schema.
5. If scraping multiple pages, avoid tight loops and serialize requests.
6. Deduplicate by URL or stable item id.
7. Save results in the workspace when the task is larger than a quick one-off.
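
A minimal sketch of steps 4 to 6, assuming the per-page fetch is supplied by the caller as a function. `fetch_one`, `delay_s`, and the field names in `SCHEMA` are illustrative assumptions, not part of the skill.

```python
import time
from typing import Callable, Iterable

# Illustrative schema; substitute the fields the user actually asked for.
SCHEMA = ("title", "url", "source", "date", "summary")


def normalize(raw: dict) -> dict:
    """Map a raw record onto the fixed schema; missing fields stay None."""
    return {key: raw.get(key) for key in SCHEMA}


def scrape_all(urls: Iterable[str],
               fetch_one: Callable[[str], dict],
               delay_s: float = 1.0) -> list:
    """Fetch URLs one at a time, pausing between requests, deduplicating by URL."""
    seen = set()
    results = []
    for url in urls:
        if url in seen:
            continue                  # skip duplicates before spending a request
        if seen:
            time.sleep(delay_s)       # pause between requests, never before the first
        seen.add(url)
        record = normalize(fetch_one(url))
        record["url"] = record["url"] or url   # fall back to the URL that was requested
        results.append(record)
    return results
```

Passing the fetcher in as a parameter keeps the loop independent of the tool chosen, so the same code can be exercised with a stub on a single page before a larger run.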

Browser scraping pattern

1. Open the page.
2. Take a snapshot.
3. Interact only as needed: search, click filters, pagination, expand sections.
4. Re-snapshot after each meaningful state change.
5. Extract only the fields the user asked for.
6. Close tabs when finished.
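
The pattern above might look like the following loop for a paginated listing. The browser tool's real interface is not described here, so `BrowserTab` and its `snapshot`, `click`, and `close` methods are hypothetical stand-ins, as are the `extract` and `next_ref` callbacks.

```python
from typing import Callable, Optional, Protocol


class BrowserTab(Protocol):
    """Hypothetical tab interface; the method names are assumptions of this sketch."""
    def snapshot(self) -> dict: ...
    def click(self, ref: str) -> None: ...
    def close(self) -> None: ...


def scrape_paginated(tab: BrowserTab,
                     extract: Callable[[dict], list],
                     next_ref: Callable[[dict], Optional[str]],
                     max_pages: int = 10) -> list:
    """Snapshot, extract, click 'next', re-snapshot; always close the tab."""
    rows = []
    try:
        snap = tab.snapshot()
        for _ in range(max_pages):
            rows.extend(extract(snap))     # pull only the requested fields
            ref = next_ref(snap)           # reference of the 'next page' control, if any
            if ref is None:
                break
            tab.click(ref)
            snap = tab.snapshot()          # re-snapshot after the state change
    finally:
        tab.close()                        # close the tab even if extraction fails
    return rows
```

The `max_pages` cap is a guard against pagination that never ends, and the `finally` clause covers the last step of the list even when an error interrupts the run.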

Output guidance

Prefer one of these formats:

- concise bullet summary
- JSON array of objects
- CSV/TSV when the user wants exportable rows

Use explicit keys, for example:

    [
      {
        "title": "...",
        "url": "...",
        "source": "...",
        "date": "...",
        "summary": "..."
      }
    ]
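
As one way to produce the JSON and CSV/TSV formats, the sketch below serializes a list of records with Python's standard `json` and `csv` modules. The `FIELDS` list is illustrative and should match whatever keys the task actually uses.

```python
import csv
import io
import json

# Illustrative column order; adjust to the fields collected for the task.
FIELDS = ["title", "url", "source", "date", "summary"]


def to_json(records: list) -> str:
    """Serialize records as a JSON array of objects, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records: list, delimiter: str = ",") -> str:
    """Serialize records as CSV (or TSV with delimiter='\\t') with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter=delimiter,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)   # None values are written as empty cells
    return buf.getvalue()
```

Calling `to_csv(records, delimiter="\t")` yields TSV. Fields that were not found stay empty in the output rather than being filled in.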

Reliability rules

- Do not invent missing fields.
- If a site blocks access, say so and switch sources when appropriate.
- For news/results pages, capture source + title + link at minimum.
- For large jobs, checkpoint partial results to a workspace file.
- Prefer fewer larger writes over many tiny writes.
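
The last two rules can be combined by buffering records and appending them to a file in batches. The sketch below assumes a JSON Lines checkpoint file; the `Checkpointer` class, its `batch_size` default, and the file format are choices made for illustration, not requirements of the skill.

```python
import json
from pathlib import Path


class Checkpointer:
    """Buffer records in memory and append them to a JSON Lines file in batches."""

    def __init__(self, path: Path, batch_size: int = 50):
        self.path = path
        self.batch_size = batch_size
        self._buffer = []

    def add(self, record: dict) -> None:
        """Queue a record; write to disk only once a full batch has accumulated."""
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Append all buffered records in a single write, then clear the buffer."""
        if not self._buffer:
            return
        self.path.parent.mkdir(parents=True, exist_ok=True)
        lines = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in self._buffer)
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(lines)
        self._buffer.clear()
```

Call `flush()` once more at the end of the job so a final partial batch is not lost. Because the file is append-only, an interrupted run leaves every completed batch on disk.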

Cleanup

- Close browser tabs opened for scraping.
- If you create state/output files, store them under the workspace and name them clearly.
