Webpage Export

Use this skill to turn a webpage URL into local files that downstream agents can archive, send, or reference.

Core workflow

1. Run scripts/export_webpage.py <url> to create a TXT snapshot first.
2. Treat TXT as the baseline extracted record.
3. Add --docx when the user wants a Word document.
4. Add --pdf when Chrome/Chromium is available and the user wants a PDF.
5. Keep the generated JSON metadata file; it records extraction quality, paths, warnings, and partial-failure status for downstream agents.
6. Save outputs to an explicit --outdir when the user provides one; otherwise let the script use its local default export folder under the current working directory.
7. For accuracy-sensitive work, keep the original title, original URL, and extracted source metadata.
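
The workflow above can be sketched as a small wrapper that assembles the script's arguments from the user's request. This is a minimal sketch, not part of the skill itself: the function names build_export_command and run_export are hypothetical, and only the script path and the --docx, --pdf, and --outdir flags come from this document.

```python
import subprocess
from typing import List, Optional


def build_export_command(url: str, docx: bool = False, pdf: bool = False,
                         outdir: Optional[str] = None) -> List[str]:
    """Assemble the argument list for scripts/export_webpage.py (hypothetical helper)."""
    # TXT is the baseline output, so the URL alone is the minimal invocation.
    cmd = ["python3", "scripts/export_webpage.py", url]
    if docx:
        cmd.append("--docx")
    if pdf:
        cmd.append("--pdf")
    if outdir:
        # Only pass --outdir when the user supplied one; otherwise the script
        # falls back to its default export folder.
        cmd.extend(["--outdir", outdir])
    return cmd


def run_export(url: str, **options) -> int:
    """Run the export and return the process exit code (hypothetical helper)."""
    # Passing a list instead of a shell string keeps the URL from being
    # interpreted by the shell.
    completed = subprocess.run(build_export_command(url, **options))
    return completed.returncode
```

Building the command as a list keeps the optional flags explicit and makes the call easy to log before it runs.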

Commands

TXT only

    python3 scripts/export_webpage.py <url>

TXT + DOCX

    python3 scripts/export_webpage.py <url> --docx

TXT + PDF

    python3 scripts/export_webpage.py <url> --pdf

TXT + DOCX + PDF with explicit output folder

    python3 scripts/export_webpage.py <url> --docx --pdf --outdir ./exports/temp

Runtime requirements

- Requires python3.
- Requires curl for baseline webpage fetching.
- PDF export requires Chrome or Chromium.
- Browser-assisted fallback requires node and the playwright package.
- DOCX export on macOS requires textutil.
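
Before choosing output formats, an agent could check which of these tools are on PATH. The sketch below is one possible approach, not the script's own logic: check_requirements is a hypothetical name, and the Chrome/Chromium executable names are assumptions that vary by platform (on macOS, Chrome often lives in an application bundle outside PATH). It also does not verify that the playwright package is installed, only that node is present.

```python
import shutil
from typing import Callable, Dict, Optional

# Assumed executable names for Chrome/Chromium; adjust for the target platform.
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def check_requirements(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report which export paths look usable, based on tools found on PATH."""
    chrome_found = any(which(name) for name in CHROME_CANDIDATES)
    return {
        # Baseline TXT export needs python3 and curl.
        "txt": bool(which("python3")) and bool(which("curl")),
        # PDF export needs a Chrome or Chromium binary.
        "pdf": chrome_found,
        # DOCX export relies on the macOS textutil tool.
        "docx": bool(which("textutil")),
        # Browser-assisted fallback needs node (playwright is not checked here).
        "browser_fallback": bool(which("node")),
    }
```

The which parameter is injectable so the check can be exercised without touching the real environment.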

Safety and execution notes

- This skill fetches arbitrary URLs and may use a headless browser for difficult pages.
- Browser-assisted fallback executes page JavaScript and should be used only when needed.
- Prefer explicit --outdir values for production or shared environments.
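
Because the skill fetches arbitrary URLs, a caller may want to reject anything that is not a plain web address before invoking the script. The following is a minimal sketch of such a pre-check; is_fetchable_url is a hypothetical helper, and this document does not say whether the script performs a similar check itself.

```python
from urllib.parse import urlsplit


def is_fetchable_url(url: str) -> bool:
    """Accept only http/https URLs that include a host name."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs (e.g. bad IPv6 literals).
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

This filters out schemes such as file: and inputs with no host, but it is not a substitute for network-level restrictions in shared environments.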

What the script does

- Fetch the page with curl
- Extract title/source/publish-time when available
- Try multiple body candidates before falling back to a full-page text snapshot
- Score extraction quality and emit warnings for suspicious/partial results
- Strip HTML into readable text for a TXT snapshot
- Convert TXT to DOCX using textutil on macOS
- Render webpage to PDF using Chrome/Chromium headless printing when available
- Emit a JSON metadata file with status, paths, word count, quality, and warnings
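
Two of the steps above, stripping HTML into text and trying body candidates before falling back to the full page, can be illustrated with the standard library. This is a sketch under stated assumptions rather than the script's actual implementation: html_to_text, pick_body, the min_words threshold, and the warning text are all hypothetical. Counting words by whitespace also undercounts Chinese text, so a real quality score would likely need a different measure.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Strip tags from an HTML fragment and return one text chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def pick_body(candidates: List[str], full_page_html: str,
              min_words: int = 50) -> Tuple[str, List[str]]:
    """Return (text, warnings), using the first candidate that is long enough."""
    for fragment in candidates:
        text = html_to_text(fragment)
        if len(text.split()) >= min_words:
            return text, []
    # No candidate passed the threshold: fall back and record a warning.
    return html_to_text(full_page_html), ["fell back to full-page text snapshot"]
```

Returning the warnings alongside the text mirrors the idea that partial results should be flagged for downstream agents rather than silently accepted.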

Format choice

- Prefer TXT as the baseline extracted record.
- Prefer DOCX when the user wants an editable or shareable document.
- Prefer PDF when the user wants page-like rendering or easier direct viewing.
- For important work, do not treat PDF as the only source of truth.

Chrome/Chromium PDF path

When the user wants PDF, prefer Chrome/Chromium headless printing because it preserves Chinese text and webpage layout better than ad-hoc PDF generation.

Read references/chrome-pdf-guide.md when:

- you need the exact Chrome PDF logic
- PDF output is incomplete or suspicious
- Chrome emits warnings and you need to judge whether the result is still usable
- you need fallback decisions
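
For orientation, headless printing with Chrome/Chromium generally looks like the sketch below. It is not the exact logic from references/chrome-pdf-guide.md: the helper names, the candidate executable names, and the timeout are assumptions, while --headless, --disable-gpu, and --print-to-pdf are standard Chrome command-line switches.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# Assumed executable names; the real binary name depends on the platform.
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser")


def find_chrome() -> Optional[str]:
    """Return the first Chrome/Chromium executable found on PATH, if any."""
    for name in CHROME_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_pdf_command(chrome: str, url: str, pdf_path: str) -> List[str]:
    """Build a headless print-to-PDF command line."""
    return [chrome, "--headless", "--disable-gpu", f"--print-to-pdf={pdf_path}", url]


def print_to_pdf(url: str, pdf_path: str, timeout: int = 60) -> bool:
    """Try to render url to pdf_path; return False when Chrome is missing or fails."""
    chrome = find_chrome()
    if chrome is None:
        return False
    result = subprocess.run(
        build_pdf_command(chrome, url, pdf_path),
        capture_output=True,
        timeout=timeout,
    )
    # Chrome can print warnings and still succeed, so check the file as well.
    return result.returncode == 0 and Path(pdf_path).is_file()
```

A True result only means a file was written; whether the PDF is complete still needs the checks described in the guide.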

Accuracy and fallbacks

Read references/accuracy-and-fallbacks.md when:

- source accuracy matters
- webpage metadata is incomplete
- a field cannot be extracted cleanly
- you need fallback behavior after a partial extraction

Delivery decisions

Read references/delivery-rules.md when:

- deciding whether to deliver TXT, DOCX, PDF, or a combination
- preparing files for downstream agents or user delivery
- choosing archive placement under the local workspace

Limitations

- Some highly dynamic or anti-bot pages may extract only partially.
- PDF depends on Chrome/Chromium being installed.
- DOCX depends on macOS textutil.
- If a page is blocked in lightweight fetch mode, use this skill's curl-based extraction path before giving up.

Accuracy rule

Accuracy is the top standard. Keep original title, original URL, and extracted source metadata. If any field is uncertain, mark it as missing instead of guessing.
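
One way to apply this rule when assembling a record for downstream agents is to store uncertain fields as explicit nulls and list them. The sketch below assumes hypothetical field names (url, title, source, publish_time, missing_fields); the actual schema of the script's JSON metadata file is not shown in this document.

```python
from typing import Dict, Optional


def clean_field(value: Optional[str]) -> Optional[str]:
    """Return the stripped value, or None when it is absent or blank."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def build_source_record(url: str, title: Optional[str] = None,
                        source: Optional[str] = None,
                        publish_time: Optional[str] = None) -> Dict[str, object]:
    """Keep original values as extracted and mark anything uncertain as missing."""
    record: Dict[str, object] = {
        "url": url,  # the original URL is kept exactly as given
        "title": clean_field(title),
        "source": clean_field(source),
        "publish_time": clean_field(publish_time),
    }
    # List the gaps explicitly so downstream agents do not guess at them.
    record["missing_fields"] = [key for key, value in record.items() if value is None]
    return record
```

Serialized with json.dumps, the None values become JSON null, which keeps "unknown" distinct from an empty string.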
