docs-scraper

CLI tool that scrapes documents from various sources into local PDF files using browser automation.

Installation

```bash
npm install -g docs-scraper
```

Quick start

Scrape any document URL to PDF:

```bash
docs-scraper scrape https://example.com/document
```

Returns the local path: `~/.docs-scraper/output/1706123456-abc123.pdf`

Basic scraping

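Scrape with the daemon (recommended, keeps the browser warm):
```bash
docs-scraper scrape <url>
```

Scrape with a named profile (for authenticated sites):
```bash
docs-scraper scrape <url> -p <profile>
```

Scrape with pre-filled data (e.g., email for DocSend):
```bash
docs-scraper scrape <url> -D email=user@example.com
```

Direct mode (single-shot, no daemon):
```bash
docs-scraper scrape <url> --no-daemon
```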

Authentication workflow

When a document requires authentication (login, email verification, passcode):

1. Initial scrape returns a job ID:

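```bash
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
# Job ID: abc123
```

2. Retry with data:

```bash
docs-scraper update abc123 -D email=user@example.com
# or with a password
docs-scraper update abc123 -D email=user@example.com -D password=1234
```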
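
The two steps above can be chained in a script. The following is a minimal sketch, assuming the blocked output contains a line of the form `Job ID: abc123` as in the sample output; `extract_job_id` is a hypothetical helper and not part of docs-scraper.

```bash
# Hypothetical helper: print the first job ID found in docs-scraper output
# read from stdin. Assumes a line like "Job ID: abc123"; adjust the pattern
# if the real output differs.
extract_job_id() {
  sed -n 's/^.*Job ID:[[:space:]]*\([A-Za-z0-9_-]*\).*$/\1/p' | head -n 1
}

# Example with canned output (no scraper call is made here).
sample_output='Scrape blocked
Job ID: abc123'
job_id=$(printf '%s\n' "$sample_output" | extract_job_id)

# Print the retry command that would be run next.
echo "docs-scraper update $job_id -D email=user@example.com"
```

In a real script the canned text would be replaced by the captured output of `docs-scraper scrape`.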

Profile management

Profiles store session cookies for authenticated sites.

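```bash
docs-scraper profiles list              # List saved profiles
docs-scraper profiles clear             # Clear all profiles
docs-scraper scrape <url> -p myprofile  # Use a profile
```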

Daemon management

The daemon keeps browser instances warm for faster scraping.

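```bash
docs-scraper daemon status  # Check status
docs-scraper daemon start   # Start manually
docs-scraper daemon stop    # Stop the daemon
```

Note: The daemon starts automatically when you run a scrape command.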

Cleanup

PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.

Manual cleanup:
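```bash
docs-scraper cleanup                  # Delete all PDFs
docs-scraper cleanup --older-than 1h  # Delete PDFs older than 1 hour
```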
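
To preview which files an age-based cleanup would affect, standard `find` can list PDFs older than a cutoff. This sketch only lists files and makes no claim about how `docs-scraper cleanup` works internally; `list_stale_pdfs` is a hypothetical helper, and `-maxdepth` and `-mmin` are assumed to be available (GNU and BSD `find` both provide them).

```bash
# Hypothetical helper: list PDFs in a directory that were last modified more
# than N minutes ago (default 60, matching the daemon's 1-hour window).
list_stale_pdfs() {
  local dir=${1:-"$HOME/.docs-scraper/output"}
  local minutes=${2:-60}
  [ -d "$dir" ] || return 0   # nothing to list if the directory is absent
  find "$dir" -maxdepth 1 -type f -name '*.pdf' -mmin +"$minutes"
}

list_stale_pdfs   # uses the default output directory
```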

Job management

```bash
docs-scraper jobs list  # List blocked jobs awaiting authentication
```

Supported sources

- Direct PDF links - Downloads PDF directly
- Notion pages - Exports Notion page to PDF
- DocSend documents - Handles DocSend viewer
- LLM fallback - Uses Claude API for any other webpage

Scraper Reference

Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.

DirectPdfScraper

Handles: URLs ending in `.pdf`

Data fields: None (downloads directly)

Example:
```bash
docs-scraper scrape https://example.com/document.pdf
```

DocsendScraper

Handles: docsend.com/view/*, docsend.com/v/*, and subdomains (e.g., org-a.docsend.com)

URL patterns:

- Documents: `https://docsend.com/view/{id}` or `https://docsend.com/v/{id}`
- Folders: `https://docsend.com/view/s/{id}`
- Subdomains: `https://{subdomain}.docsend.com/view/{id}`
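
The patterns above can be told apart with a shell `case` statement. This is a sketch for illustration only: `docsend_kind` is a hypothetical helper, and treating `/view/s/` and `/v/` paths on subdomains the same way as on `docsend.com` is an assumption, since the list does not spell those combinations out.

```bash
# Hypothetical helper: classify a DocSend URL as "folder" or "document".
# Folder patterns are checked first because /view/s/... also matches /view/*.
docsend_kind() {
  case "$1" in
    https://docsend.com/view/s/*|https://*.docsend.com/view/s/*) echo folder ;;
    https://docsend.com/view/*|https://docsend.com/v/*)          echo document ;;
    https://*.docsend.com/view/*|https://*.docsend.com/v/*)      echo document ;;
    *)                                                           echo unknown ;;
  esac
}

docsend_kind "https://docsend.com/view/s/abc123"      # prints: folder
docsend_kind "https://org-a.docsend.com/view/abc123"  # prints: document
```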

Data fields:

| Field | Type | Description |
|---|---|---|
| `email` | email | Email address for document access |
| `password` | password | Password for document access |

Examples:
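```bash
# Pre-fill the email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With an NDA name requirement (quote values that contain spaces)
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"

# Retry a blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
```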

Notes:

- DocSend may require any combination of email, password, and name
- Folders are scraped as a table of contents PDF with document links
- The scraper auto-checks NDA checkboxes when name is provided

NotionScraper

Handles: `notion.so/*`, `*.notion.site/*`

Data fields:

| Field | Type | Description |
|---|---|---|
| `email` | email | Notion account email |
| `password` | password | Notion account password |

Examples:
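```bash
# Public page (no authentication needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page that requires login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
```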

Notes:

- Public Notion pages don't require authentication
- Toggle blocks are automatically expanded before PDF generation
- Uses session profiles to persist login across scrapes

LlmFallbackScraper

Handles: Any URL not matched by other scrapers (automatic fallback)

Data fields: Dynamic - determined by Claude analyzing the page

The LLM scraper uses Claude to analyze the page HTML and detect:

- Login forms (extracts field names dynamically)
- Cookie banners (auto-dismisses)
- Expandable content (auto-expands)
- CAPTCHAs (reports as blocked)
- Paywalls (reports as blocked)

Common dynamic fields:

| Field | Type | Description |
|---|---|---|
| `email` | email | Login email (if detected) |
| `password` | password | Login password (if detected) |

Examples:
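```bash
# Generic webpage (no authentication needed)
docs-scraper scrape https://example.com/article

# Webpage that requires login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# If blocked, check which fields the job requires
docs-scraper jobs list

# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
```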

Notes:

- Requires ANTHROPIC_API_KEY environment variable
- Field names are extracted from the page's actual form fields
- Limited to 2 login attempts before failing
- CAPTCHAs require manual intervention

Data field summary

| Scraper | email | password | name | Other |
|---|---|---|---|---|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓ | ✓ | - | Dynamic* |

*Fields detected dynamically from page analysis
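
To see which row of the table applies to a given URL, the "Handles" rules from the scraper reference can be approximated with a `case` statement. This is a sketch, not the tool's actual dispatch logic: `pick_scraper` is a hypothetical helper, and checking the `.pdf` suffix before the site-specific patterns is an assumption.

```bash
# Hypothetical helper: name the scraper that the "Handles" rules suggest for
# a URL. The matching order (PDF suffix first) is an assumption of this sketch.
pick_scraper() {
  case "$1" in
    *.pdf)
      echo DirectPdfScraper ;;
    *://docsend.com/view/*|*://docsend.com/v/*|*://*.docsend.com/view/*|*://*.docsend.com/v/*)
      echo DocsendScraper ;;
    *://notion.so/*|*://*.notion.site/*)
      echo NotionScraper ;;
    *)
      echo LlmFallbackScraper ;;   # anything not matched above
  esac
}

pick_scraper "https://example.com/document.pdf"   # prints: DirectPdfScraper
pick_scraper "https://notion.so/My-Page-abc123"   # prints: NotionScraper
pick_scraper "https://example.com/article"        # prints: LlmFallbackScraper
```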

Environment setup (optional)

Only needed for LLM fallback scraper:

```bash
export ANTHROPIC_API_KEY=your_key
```
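
A script that relies on the LLM fallback can check for the key before scraping. This is a minimal sketch; `require_api_key` is a hypothetical helper and not part of docs-scraper.

```bash
# Hypothetical preflight check: fail early if the key needed by the LLM
# fallback scraper is missing or empty.
require_api_key() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set; the LLM fallback scraper needs it" >&2
    return 1
  fi
}

# Intended use (left as a comment so this block has no side effects):
#   require_api_key && docs-scraper scrape https://example.com/article
```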

Optional browser settings:
```bash
export BROWSER_HEADLESS=true  # Set to false for debugging
```

Common patterns

Archive a Notion page:
```bash
docs-scraper scrape https://notion.so/My-Page-abc123
```

Download protected DocSend:
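```bash
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
```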

Batch scraping with profiles:
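```bash
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
```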
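
For longer URL lists, the same pattern can be wrapped in a loop. The sketch below is one possible approach; `scrape_all` and the `DOCS_SCRAPER` variable are hypothetical names introduced here so the command can be swapped for `echo` in a dry run.

```bash
# Hypothetical helper: scrape every URL read from stdin with one profile.
# DOCS_SCRAPER is invented for this sketch; it defaults to docs-scraper.
scrape_all() {
  local profile=$1 url
  while IFS= read -r url; do
    case "$url" in ''|'#'*) continue ;; esac   # skip blank lines and comments
    # </dev/null keeps the command from consuming the URL list on stdin.
    "${DOCS_SCRAPER:-docs-scraper}" scrape "$url" -p "$profile" </dev/null
  done
}

# Dry run: print the arguments instead of running the scraper.
printf '%s\n' https://site.com/doc1 https://site.com/doc2 |
  DOCS_SCRAPER=echo scrape_all mysite
```

Without the `DOCS_SCRAPER=echo` prefix, the loop runs the real `docs-scraper scrape <url> -p mysite` for each line.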

Output

Success: Local file path (e.g., ~/.docs-scraper/output/1706123456-abc123.pdf)
Blocked: Job ID + required credential types

Troubleshooting

- Timeout: INLINECODE23
- Auth fails: run `docs-scraper jobs list` to check pending jobs
- Disk full: run `docs-scraper cleanup` to remove old PDFs

