Webpage Export
Use this skill to turn a webpage URL into local files that downstream agents can archive, send, or reference.
Core workflow
1. Run `scripts/export_webpage.py <url>` to create a TXT snapshot first.
2. Treat TXT as the baseline extracted record.
3. Add `--docx` when the user wants a Word document.
4. Add `--pdf` when Chrome/Chromium is available and the user wants a PDF.
5. Keep the generated JSON metadata file; it records extraction quality, paths, warnings, and partial-failure status for downstream agents.
6. Save outputs to an explicit `--outdir` when the user provides one; otherwise let the script use its local default export folder under the current working directory.
7. For accuracy-sensitive work, keep the original title, original URL, and extracted source metadata.
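The workflow above maps user requests onto command-line flags. A minimal sketch of that mapping follows; the helper name `build_export_command` is hypothetical and not part of the skill, and only the `--docx`, `--pdf`, and `--outdir` flags described above are used.

```python
from typing import List, Optional


def build_export_command(
    url: str,
    docx: bool = False,
    pdf: bool = False,
    outdir: Optional[str] = None,
) -> List[str]:
    """Return the argv list for one export run (hypothetical helper).

    TXT is always produced, so it needs no flag of its own.
    """
    cmd = ["python3", "scripts/export_webpage.py", url]
    if docx:
        cmd.append("--docx")  # user asked for a Word document
    if pdf:
        cmd.append("--pdf")  # only sensible when Chrome/Chromium is available
    if outdir is not None:
        cmd.extend(["--outdir", outdir])  # explicit output folder
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=False)`. Leaving `outdir` as `None` lets the script fall back to its default export folder.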
Commands
TXT only
```bash
python3 scripts/export_webpage.py <url>
```
TXT + DOCX
```bash
python3 scripts/export_webpage.py <url> --docx
```
TXT + PDF
```bash
python3 scripts/export_webpage.py <url> --pdf
```
TXT + DOCX + PDF with explicit output folder
```bash
python3 scripts/export_webpage.py <url> --docx --pdf --outdir ./exports/temp
```
Runtime requirements
- Requires `python3`.
- Requires `curl` for baseline webpage fetching.
- PDF export requires Chrome or Chromium.
- Browser-assisted fallback requires `node` and the `playwright` package.
- DOCX export on macOS requires `textutil`.
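Before choosing flags, it can help to check which of these tools are actually present. The sketch below is an illustration only: the function name and the Chrome executable names in `CHROME_CANDIDATES` are assumptions, and on macOS Chrome is often installed as an app bundle that is not on `PATH`, so a negative result is not conclusive. It does not check whether the `playwright` package is installed.

```python
import shutil
from typing import Callable, Dict, Optional

# Assumed executable names for Chrome/Chromium; real installs may differ.
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def check_requirements(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report which of the listed tools can be found on PATH."""
    return {
        "python3": which("python3") is not None,
        "curl": which("curl") is not None,
        "chrome": any(which(name) is not None for name in CHROME_CANDIDATES),
        "node": which("node") is not None,
        "textutil": which("textutil") is not None,  # macOS only
    }
```

The `which` parameter exists so the lookup can be replaced in tests; in normal use, call `check_requirements()` with no arguments.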
Safety and execution notes
- This skill fetches arbitrary URLs and may use a headless browser for difficult pages.
- Browser-assisted fallback executes page JavaScript and should be used only when needed.
- Prefer explicit `--outdir` values for production or shared environments.
What the script does
- Fetch the page with `curl`
- Extract title/source/publish-time when available
- Try multiple body candidates before falling back to a full-page text snapshot
- Score extraction quality and emit warnings for suspicious/partial results
- Strip HTML into readable text for a TXT snapshot
- Convert TXT to DOCX using `textutil` on macOS
- Render webpage to PDF using Chrome/Chromium headless printing when available
- Emit a JSON metadata file with status, paths, word count, quality, and warnings
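To illustrate the "strip HTML into readable text" step, here is a minimal sketch built on the standard library's `html.parser`. It is not the script's actual implementation, which is not shown here; the class and function names are hypothetical, and real extraction would also need the body-candidate selection and quality scoring listed above.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Each run of text between tags becomes its own line, which keeps the snapshot readable but does not preserve the page's paragraph structure.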
Format choice
- Prefer TXT as the baseline extracted record.
- Prefer DOCX when the user wants an editable or shareable document.
- Prefer PDF when the user wants page-like rendering or easier direct viewing.
- For important work, do not treat PDF as the only source of truth.
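The preferences above can be read as a small decision rule. The following sketch encodes one possible reading; the function and parameter names are hypothetical, and the references/delivery-rules.md file mentioned elsewhere in this document remains the authority on delivery choices.

```python
from typing import List


def choose_formats(
    wants_editable: bool,
    wants_page_view: bool,
    chrome_available: bool,
) -> List[str]:
    """Pick output formats; TXT is always included as the baseline record."""
    formats = ["txt"]
    if wants_editable:
        formats.append("docx")
    # PDF is only produced when Chrome/Chromium can render it, and it is
    # never the sole output because TXT stays in the list.
    if wants_page_view and chrome_available:
        formats.append("pdf")
    return formats
```

Because TXT is always first in the result, a PDF never becomes the only record of the page.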
Chrome/Chromium PDF path
When the user wants PDF, prefer Chrome/Chromium headless printing because it preserves Chinese text and webpage layout better than ad-hoc PDF generation.
Read references/chrome-pdf-guide.md when:
- you need the exact Chrome PDF logic
- PDF output is incomplete or suspicious
- Chrome emits warnings and you need to judge whether the result is still usable
- you need fallback decisions
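For orientation, headless printing with Chrome is typically driven by the `--headless` and `--print-to-pdf` command-line switches. The sketch below builds such a command; the function name and the `chrome_path` argument are hypothetical, and whether the script uses exactly these switches is an assumption, since the exact logic lives in references/chrome-pdf-guide.md.

```python
from typing import List


def chrome_pdf_command(chrome_path: str, url: str, pdf_path: str) -> List[str]:
    """Build an argv list that asks headless Chrome to print a page to PDF."""
    return [
        chrome_path,                     # path to a Chrome/Chromium binary (assumed)
        "--headless",                    # run without a visible window
        f"--print-to-pdf={pdf_path}",    # write the rendered page to this file
        url,
    ]
```

The list can be handed to `subprocess.run`. Checking afterwards that the PDF file exists and is non-empty is a reasonable first test of whether the output is usable.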
Accuracy and fallbacks
Read references/accuracy-and-fallbacks.md when:
- source accuracy matters
- webpage metadata is incomplete
- a field cannot be extracted cleanly
- you need fallback behavior after a partial extraction
Delivery decisions
Read references/delivery-rules.md when:
- deciding whether to deliver TXT, DOCX, PDF, or a combination
- preparing files for downstream agents or user delivery
- choosing archive placement under the local workspace
Limitations
- Some highly dynamic or anti-bot pages may extract only partially.
- PDF depends on Chrome/Chromium being installed.
- DOCX depends on macOS `textutil`.
- If a page is blocked in lightweight fetch mode, use this skill's curl-based extraction path before giving up.
Accuracy rule
Accuracy is the top standard. Keep original title, original URL, and extracted source metadata. If any field is uncertain, mark it as missing instead of guessing.
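A minimal sketch of the "mark as missing instead of guessing" rule follows. The field names in `FIELDS` are hypothetical, chosen to mirror the title, URL, source, and publish-time fields this document mentions; the real metadata schema is not shown here.

```python
from typing import Dict, Optional

# Hypothetical field names; the script's real metadata keys may differ.
FIELDS = ("title", "url", "source", "publish_time")


def normalize_metadata(raw: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Keep extracted values (trimmed); record absent or blank fields as None."""
    cleaned: Dict[str, Optional[str]] = {}
    for field in FIELDS:
        value = raw.get(field)
        if value is None or not value.strip():
            cleaned[field] = None  # missing: never filled with a guess
        else:
            cleaned[field] = value.strip()
    return cleaned
```

Using `None` for missing fields lets downstream agents tell a field that was never found apart from one that was found and happens to be short.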