Sitemap Content Scraper

Use this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.

Workflow

1. Ask for the website or URL scope if it is not already provided.
2. Run python3 {baseDir}/scripts/discover_sitemaps.py <site-or-url>.
3. Summarize the discovered sitemap inventory in plain language.
4. If the user gave a scoped URL (for example https://example.com/docs), use scope_hint_substring from the discovery output as the default filter.
5. Ask which content family the user wants, such as documentation, knowledge base, blog, academy, changelog, or another category.
6. Map the user request to the most relevant sitemap by name and sample URL patterns.
7. If multiple sitemaps still match, ask the user to choose one or give a tighter scope.
8. Ask for the destination folder if it is missing.
9. Run python3 {baseDir}/scripts/scrape_sitemap.py --sitemap-url <chosen-sitemap> --output-dir <destination>. When a scoped URL was provided, add --include-substring <scope_hint_substring> unless the user overrides the scope.
10. Report what was scraped, where it was saved, and any skipped or failed pages.
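
The scrape step of the workflow can be sketched as a small helper that assembles the command line. This is only an illustration: the function name `build_scrape_command` and its parameters are hypothetical, and the only flags used are the ones named in this document.

```python
from typing import List, Optional


def build_scrape_command(
    base_dir: str,
    sitemap_url: str,
    output_dir: str,
    include_substring: Optional[str] = None,
) -> List[str]:
    """Assemble the argv for scrape_sitemap.py (hypothetical helper)."""
    argv = [
        "python3",
        f"{base_dir}/scripts/scrape_sitemap.py",
        "--sitemap-url", sitemap_url,
        "--output-dir", output_dir,
    ]
    # Add the filter only when discovery produced a scope hint and the
    # user has not overridden the scope.
    if include_substring:
        argv += ["--include-substring", include_substring]
    return argv
```

Building the command as a list rather than a single string keeps URLs and paths from being reinterpreted by a shell when the list is passed to something like `subprocess.run(argv, check=True)`.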

Quick Commands

Discover sitemap inventory:

    python3 {baseDir}/scripts/discover_sitemaps.py https://example.com

Discover and preserve scope hint from a direct URL prompt:

    python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs
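
For intuition, a scope hint could be derived from the path of the URL the user typed. The sketch below is an assumption about what such a hint might look like, not a description of how discover_sitemaps.py computes scope_hint_substring; the function name `scope_hint_from_url` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def scope_hint_from_url(site_or_url: str) -> Optional[str]:
    """Return the URL path as a substring hint, or None for a bare site."""
    # Accept inputs such as "example.com/docs" by assuming https (assumption).
    if "://" not in site_or_url:
        site_or_url = "https://" + site_or_url
    path = urlparse(site_or_url).path.rstrip("/")
    # An empty path means the user gave the whole site, so there is no hint.
    return path or None
```

With this sketch, `https://example.com/docs` yields `/docs`, while `https://example.com` yields no hint at all.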

Scrape one sitemap into a chosen folder:

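    python3 {baseDir}/scripts/scrape_sitemap.py \
      --sitemap-url https://example.com/docs-sitemap.xml \
      --output-dir /tmp/example-docs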

Filter to a subset of URLs when the sitemap mixes sections:

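    python3 {baseDir}/scripts/scrape_sitemap.py \
      --sitemap-url https://example.com/sitemap.xml \
      --output-dir /tmp/example-docs \
      --include-substring /docs/ \
      --exclude-substring /tag/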
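
To make the filtering idea concrete, here is a minimal sketch of substring filtering over a sitemap's `<loc>` entries. It assumes that "include" keeps URLs containing any listed substring and "exclude" drops URLs containing any listed substring; the bundled script may apply its flags differently. The function name `filter_sitemap_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def filter_sitemap_urls(
    xml_text: str,
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> List[str]:
    """Return <loc> URLs that pass the include and exclude filters."""
    include = list(include)
    exclude = list(exclude)
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    kept = []
    for url in urls:
        # With no include filter, every URL is a candidate.
        if include and not any(s in url for s in include):
            continue
        if any(s in url for s in exclude):
            continue
        kept.append(url)
    return kept
```

Applied to a mixed sitemap with `include=["/docs/"]` and `exclude=["/tag/"]`, this keeps documentation pages and drops tag listings and unrelated sections.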

Selection Rules

- Prefer sitemaps explicitly named for the requested content family, such as docs-sitemap.xml, post-sitemap.xml, kb-sitemap.xml, or academy-sitemap.xml.
- Use the sample URLs returned by discover_sitemaps.py to explain why a sitemap looks like docs, blog, help center, or another category.
- If the request is broad, offer the discovered choices instead of scraping everything by default.
- If no sitemap exists, stop and ask whether the user wants a bounded crawl workflow instead. Do not silently switch strategies.
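
The name-based preference above could be approximated with a keyword lookup. This is a rough sketch under stated assumptions: the `FAMILY_KEYWORDS` table and the function `match_sitemaps` are hypothetical and are not taken from the bundled scripts or from references/sitemap-selection.md.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword table; adjust it for the sites you work with.
FAMILY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "documentation": ("docs", "documentation"),
    "knowledge base": ("kb", "help", "support"),
    "blog": ("post", "blog"),
    "academy": ("academy",),
}


def match_sitemaps(sitemap_urls: List[str], family: str) -> List[str]:
    """Return sitemaps whose file name suggests the requested family."""
    key = family.lower()
    # Fall back to the family name itself when it is not in the table.
    keywords = FAMILY_KEYWORDS.get(key, (key,))
    matches = []
    for url in sitemap_urls:
        name = url.rsplit("/", 1)[-1].lower()
        if any(keyword in name for keyword in keywords):
            matches.append(url)
    return matches
```

A result with exactly one entry is a reasonable default choice. Zero or several entries are the cases where the rules above say to go back to the user rather than guess.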

Output Contract

- Save one Markdown file per scraped page.
- Save manifest.json at the output root with success and failure details.
- Keep source URLs in the Markdown header so the corpus remains traceable.
- Preserve a stable folder structure derived from the source URL path.
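
One way to picture the layout is a pure function from page URL to relative file path, plus a header line carrying the source URL. The sketch below is illustrative only: the exact file naming and the `Source:` header format are assumptions, and the names `markdown_path_for` and `render_page` are hypothetical.

```python
from urllib.parse import urlparse


def markdown_path_for(url: str) -> str:
    """Map a page URL to a relative .md path that mirrors the URL path."""
    path = urlparse(url).path.strip("/")
    if not path:
        # The site root has no path segments to mirror.
        return "index.md"
    # Drop a trailing HTML extension so the file ends in .md only.
    for ext in (".html", ".htm"):
        if path.endswith(ext):
            path = path[: -len(ext)]
            break
    return path + ".md"


def render_page(url: str, body_markdown: str) -> str:
    """Prefix the page body with its source URL (assumed header format)."""
    return f"Source: {url}\n\n{body_markdown}\n"
```

Because the path depends only on the URL, re-running a scrape writes each page to the same location, which is what keeps the folder structure stable. A production version would also need to sanitize unusual path segments before writing to disk.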

Read {baseDir}/references/sitemap-selection.md when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.

Trigger Examples

- "Scrape example.com/docs content into ./out/docs."
- "Pull the help center pages from https://example.com/help."
- "Find blog sitemaps for example.com and scrape only posts."

Guardrails

- Scrape only public content.
- Accept only http and https targets.
- Reject localhost, private IP ranges, and internal-only hostnames.
- Enforce public-only targets using both hostname resolution checks and redirect-target checks at request time.
- Respect the chosen sitemap scope instead of broad site crawling.
- Avoid login flows, private dashboards, carts, checkout paths, or user-specific pages.
- Do not use authentication headers, cookies, or tokens.
- Ask before writing outside the intended working area.
- Tell the user when extraction quality looks weak on JavaScript-heavy pages. The bundled scraper is HTML-first and may miss client-rendered content.
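
The scheme and address guardrails can be illustrated with a small check built on the standard library. This is a minimal sketch, not the bundled scripts' implementation: the function name `is_public_target` and the list of rejected hostname suffixes are assumptions.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Hypothetical list of suffixes treated as internal-only.
INTERNAL_SUFFIXES = (".localhost", ".local", ".internal")


def is_public_target(url: str) -> bool:
    """Return True only for http(s) URLs whose host resolves to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname
    if not host or host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        # Unresolvable hosts are rejected rather than retried.
        return False
    # Strip any IPv6 zone identifier before parsing the address.
    addresses = {info[4][0].split("%")[0] for info in infos}
    # Every resolved address must be globally routable.
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_global for address in addresses
    )
```

To cover the redirect-target guardrail, the same check would have to be repeated on each redirect hop before following it, since a public URL can redirect to an internal address.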
