Sitemap Content Scraper
Use this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.
Workflow
1. Ask for the website or URL scope if it is not already provided.
2. Run python3 {baseDir}/scripts/discover_sitemaps.py <site-or-url>.
3. Summarize the discovered sitemap inventory in plain language.
4. If the user gave a scoped URL (for example https://example.com/docs), use scope_hint_substring from the discovery output as the default filter guidance.
5. Ask which content family the user wants, such as documentation, knowledge base, blog, academy, changelog, or another category.
6. Map the user request to the most relevant sitemap by name and sample URL patterns.
7. If multiple sitemaps still match, ask the user to choose one or give a tighter scope.
8. Ask for the destination folder if it is missing.
9. Run python3 {baseDir}/scripts/scrape_sitemap.py --sitemap-url <chosen-sitemap> --output-dir <destination>. When a scoped URL was provided, add --include-substring <scope_hint_substring> unless the user overrides the scope.
10. Report what was scraped, where it was saved, and any skipped or failed pages.
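To make the discovery step more concrete, here is a minimal sketch of reading a sitemap document with the Python standard library. It is not the bundled discover_sitemaps.py; the function name parse_sitemap is hypothetical, and the sketch assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, locs): kind is "index" or "urlset", locs are <loc> values.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    # Tags look like "{namespace}urlset"; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        kind = "index"
        locs = root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
    elif local_name == "urlset":
        kind = "urlset"
        locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    else:
        raise ValueError(f"unexpected root element: {local_name}")
    return kind, [loc.text.strip() for loc in locs if loc.text]
```

A sitemap index lists child sitemaps (the inventory to summarize for the user), while a urlset lists the pages themselves.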
Quick Commands
Discover sitemap inventory:
    python3 {baseDir}/scripts/discover_sitemaps.py https://example.com
Discover and preserve scope hint from a direct URL prompt:
    python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs
Scrape one sitemap into a chosen folder:
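    python3 {baseDir}/scripts/scrape_sitemap.py \
      --sitemap-url https://example.com/docs-sitemap.xml \
      --output-dir /tmp/example-docs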
Filter to a subset of URLs when the sitemap mixes sections:
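    python3 {baseDir}/scripts/scrape_sitemap.py \
      --sitemap-url https://example.com/sitemap.xml \
      --output-dir /tmp/example-docs \
      --include-substring /docs/ \
      --exclude-substring /tag/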
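The include and exclude filters can be pictured with a short sketch. The exact matching rules of scrape_sitemap.py are not spelled out here, so this assumes plain substring matching: a URL is kept when it contains at least one include substring (if any are given) and none of the exclude substrings. The name filter_urls is hypothetical.

```python
def filter_urls(urls, include=(), exclude=()):
    """Keep URLs that match the include list and avoid the exclude list.

    Assumed semantics (not confirmed against the bundled script):
    - empty include means "keep everything" before exclusions apply;
    - any single exclude match drops the URL.
    """
    kept = []
    for url in urls:
        if include and not any(part in url for part in include):
            continue
        if any(part in url for part in exclude):
            continue
        kept.append(url)
    return kept
```

With include set to /docs/ and exclude set to /tag/, a mixed sitemap would be narrowed to documentation pages that are not tag listings.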
Selection Rules
- Prefer sitemaps explicitly named for the requested content family, such as docs-sitemap.xml, post-sitemap.xml, kb-sitemap.xml, or academy-sitemap.xml.
- Use the sample URLs returned by discover_sitemaps.py to explain why a sitemap looks like docs, blog, help center, or another category.
- If the request is broad, offer the discovered choices instead of scraping everything by default.
- If no sitemap exists, stop and ask whether the user wants a bounded crawl workflow instead. Do not silently switch strategies.
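One way to reason about these rules is a small scoring pass over the discovered inventory. The sketch below is illustrative only: the keyword table, the weighting, and the name rank_sitemaps are all assumptions, and the inventory shape (sitemap URL mapped to sample URLs) is a guess at what discovery output could look like.

```python
from urllib.parse import urlparse

# Hypothetical keyword hints per content family; adjust to the site at hand.
FAMILY_KEYWORDS = {
    "documentation": ("docs", "documentation"),
    "knowledge base": ("kb", "help", "support"),
    "blog": ("post", "blog"),
    "academy": ("academy", "learn"),
    "changelog": ("changelog", "release"),
}


def rank_sitemaps(family, inventory):
    """Return sitemap URLs that look relevant to `family`, best match first.

    inventory: dict mapping sitemap URL -> list of sample page URLs.
    """
    keywords = FAMILY_KEYWORDS.get(family.lower(), (family.lower(),))
    scored = []
    for sitemap_url, samples in inventory.items():
        name = urlparse(sitemap_url).path.lower()
        name_hits = sum(keyword in name for keyword in keywords)
        sample_hits = sum(
            any(keyword in urlparse(sample).path.lower() for keyword in keywords)
            for sample in samples
        )
        # Assumed weighting: an explicit sitemap name counts double.
        score = 2 * name_hits + sample_hits
        if score:
            scored.append((score, sitemap_url))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sitemap_url for _, sitemap_url in scored]
```

If more than one candidate comes back, ask the user to choose rather than picking the top score silently. An empty result is a cue to show the full inventory.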
Output Contract
- Save one Markdown file per scraped page.
- Save manifest.json at the output root with success and failure details.
- Keep source URLs in the Markdown header so the corpus remains traceable.
- Preserve a stable folder structure derived from the source URL path.
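The contract above could be realized along these lines. This is a sketch under stated assumptions, not the bundled scraper's layout: the helper names markdown_path_for and build_manifest, the index.md convention for directory-style URLs, and the manifest field names are all hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def markdown_path_for(url):
    """Map a page URL to a relative Markdown path derived from its URL path."""
    parsed = urlparse(url)
    # Drop empty and dot segments so the path cannot escape the output root.
    parts = [p for p in parsed.path.split("/") if p not in ("", ".", "..")]
    if not parts or parsed.path.endswith("/"):
        parts.append("index")
    else:
        for ext in (".html", ".htm"):
            if parts[-1].lower().endswith(ext):
                parts[-1] = parts[-1][: -len(ext)]
                break
    name = parts[-1] or "index"
    return PurePosixPath(parsed.netloc, *parts[:-1], name + ".md")


def build_manifest(results):
    """Summarize results given as (url, error_message_or_None) pairs."""
    manifest = {"succeeded": [], "failed": []}
    for url, error in results:
        if error is None:
            entry = {"url": url, "path": str(markdown_path_for(url))}
            manifest["succeeded"].append(entry)
        else:
            manifest["failed"].append({"url": url, "error": error})
    return manifest
```

Because the path depends only on the URL, re-running a scrape writes each page to the same location, which keeps the folder structure stable.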
Read {baseDir}/references/sitemap-selection.md when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.
Trigger Examples
- "Scrape example.com/docs content into ./out/docs."
- "Pull the help center pages from https://example.com/help."
- "Find blog sitemaps for example.com and scrape only posts."
Guardrails
- Scrape only public content.
- Accept only http and https targets.
- Reject localhost, private IP ranges, and internal-only hostnames.
- Enforce public-only targets using both hostname resolution checks and redirect-target checks at request time.
- Respect the chosen sitemap scope instead of broad site crawling.
- Avoid login flows, private dashboards, carts, checkout paths, or user-specific pages.
- Do not use authentication headers, cookies, or tokens.
- Ask before writing outside the intended working area.
- Tell the user when extraction quality looks weak on JavaScript-heavy pages. The bundled scraper is HTML-first and may miss client-rendered content.
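The public-only target check can be sketched with the standard library. This is an illustration, not the bundled implementation: is_public_target and resolve_host are hypothetical names, and the sketch covers only the scheme and address checks. A real scraper would need to run the same check on every redirect hop at request time.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def resolve_host(host):
    """Resolve a hostname to the set of IP address strings it maps to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def _is_global_ip(text):
    """True only for addresses that ipaddress classifies as globally routable."""
    try:
        return ipaddress.ip_address(text).is_global
    except ValueError:
        return False


def is_public_target(url, resolver=resolve_host):
    """Return True if url is http(s) and every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname
    if not host or host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        # IP literals are checked directly, without a DNS lookup.
        ipaddress.ip_address(host)
        addresses = {host}
    except ValueError:
        try:
            addresses = resolver(host)
        except OSError:
            return False
    return bool(addresses) and all(_is_global_ip(a) for a in addresses)
```

The resolver parameter exists so the check can be exercised without network access. Requiring every resolved address to be public, rather than just one, is a deliberately conservative choice.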