docs-scraper
CLI tool that scrapes documents from various sources into local PDF files using browser automation.
Installation
```bash
npm install -g docs-scraper
```
Quick start
Scrape any document URL to PDF:
```bash
docs-scraper scrape https://example.com/document
```
Returns the local path: `~/.docs-scraper/output/1706123456-abc123.pdf`
Basic scraping
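Scrape with the daemon (recommended, keeps the browser warm):
```bash
docs-scraper scrape <url>
```
Scrape with a named profile (for authenticated sites):
```bash
docs-scraper scrape <url> -p <profile>
```
Scrape with pre-filled data (e.g., email for DocSend):
```bash
docs-scraper scrape <url> -D email=user@example.com
```
Direct mode (single-shot, no daemon):
```bash
docs-scraper scrape <url> --no-daemon
```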
Authentication workflow
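When a document requires authentication (login, email verification, passcode):
1. The initial scrape returns a job ID:
```bash
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
# Job ID: abc123
```
2. Retry with the required data:
```bash
docs-scraper update abc123 -D email=user@example.com
# Or with a password
docs-scraper update abc123 -D email=user@example.com -D password=1234
```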
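The two-step flow above can be scripted. The following is a minimal sketch, assuming the blocked scrape prints a line of the form `Job ID: abc123` on standard output; the exact output format is not confirmed here, and `extract_job_id` is a hypothetical helper, not part of the CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read scrape output on stdin and print the first job ID found.
# Assumes a line containing "Job ID: <id>"; the real output format may differ.
extract_job_id() {
  sed -n 's/^.*Job ID:[[:space:]]*\([A-Za-z0-9_-][A-Za-z0-9_-]*\).*$/\1/p' | head -n 1
}

# Usage sketch (not executed here):
#   output=$(docs-scraper scrape "https://docsend.com/view/xxx")
#   job_id=$(printf '%s\n' "$output" | extract_job_id)
#   [ -n "$job_id" ] && docs-scraper update "$job_id" -D email=user@example.com
```

If no job ID is found the helper prints nothing, so a script can treat an empty result as "not blocked" and use the output as a file path instead.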
Profile management
Profiles store session cookies for authenticated sites.
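```bash
docs-scraper profiles list              # List saved profiles
docs-scraper profiles clear             # Clear all profiles
docs-scraper scrape <url> -p myprofile  # Use a profile
```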
Daemon management
The daemon keeps browser instances warm for faster scraping.
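```bash
docs-scraper daemon status   # Check status
docs-scraper daemon start    # Start manually
docs-scraper daemon stop     # Stop the daemon
```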
Note: Daemon auto-starts when running scrape commands.
Cleanup
PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.
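Manual cleanup:
```bash
docs-scraper cleanup                  # Delete all PDFs
docs-scraper cleanup --older-than 1h  # Delete PDFs older than 1 hour
```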
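Before deleting anything, it can help to see which files fall outside the one-hour window. The sketch below only lists candidates with `find`; it assumes file age is judged by modification time, which may not match how the daemon decides, and `list_stale_pdfs` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list PDFs in a directory last modified more than N minutes ago.
# It never deletes; removal is left to `docs-scraper cleanup`.
list_stale_pdfs() {
  local dir="${1:-$HOME/.docs-scraper/output}"
  local minutes="${2:-60}"
  # A missing directory simply means there is nothing to list.
  [ -d "$dir" ] || return 0
  find "$dir" -maxdepth 1 -type f -name '*.pdf' -mmin "+$minutes"
}

# Usage sketch: list_stale_pdfs ~/.docs-scraper/output 60
```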
Job management
```bash
docs-scraper jobs list   # List blocked jobs awaiting authentication
```
Supported sources
- Direct PDF links - Downloads PDF directly
- Notion pages - Exports Notion page to PDF
- DocSend documents - Handles DocSend viewer
- LLM fallback - Uses Claude API for any other webpage
Scraper Reference
Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.
DirectPdfScraper
Handles: URLs ending in `.pdf`
Data fields: None (downloads directly)
Example:
```bash
docs-scraper scrape https://example.com/document.pdf
```
DocsendScraper
Handles: `docsend.com/view/*`, `docsend.com/v/*`, and subdomains (e.g., `org-a.docsend.com`)
URL patterns:
- Documents: `https://docsend.com/view/{id}` or `https://docsend.com/v/{id}`
- Folders: `https://docsend.com/view/s/{id}`
- Subdomains: `https://{subdomain}.docsend.com/view/{id}`
Data fields:
| Field | Type | Description |
|---|---|---|
| `email` | email | Email address for document access |
| `password` | password | Passcode/password for protected documents |
| `name` | text | Your name (required for NDA-gated documents) |
Examples:
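```bash
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With NDA name requirement (quoted so the space stays inside one argument)
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D "name=John Doe"

# Retry a blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
```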
Notes:
- DocSend may require any combination of email, password, and name
- Folders are scraped as a table of contents PDF with document links
- The scraper auto-checks NDA checkboxes when name is provided
NotionScraper
Handles: `notion.so/*`, `*.notion.site/*`
Data fields:
| Field | Type | Description |
|---|---|---|
| `email` | email | Notion account email |
| `password` | password | Notion account password |
Examples:
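```bash
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page requiring login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
```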
Notes:
- Public Notion pages don't require authentication
- Toggle blocks are automatically expanded before PDF generation
- Uses session profiles to persist login across scrapes
LlmFallbackScraper
Handles: Any URL not matched by other scrapers (automatic fallback)
Data fields: Dynamic - determined by Claude analyzing the page
The LLM scraper uses Claude to analyze the page HTML and detect:
- Login forms (extracts field names dynamically)
- Cookie banners (auto-dismisses)
- Expandable content (auto-expands)
- CAPTCHAs (reports as blocked)
- Paywalls (reports as blocked)
Common dynamic fields:
| Field | Type | Description |
|---|---|---|
| `email` | email | Login email (if detected) |
| `password` | password | Login password (if detected) |
| `username` | text | Username (if login uses username) |
Examples:
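```bash
# Generic webpage (no auth needed)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# When blocked, check which fields the job needs
docs-scraper jobs list

# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
```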
Notes:
- Requires the `ANTHROPIC_API_KEY` environment variable
- Field names are extracted from the page's actual form fields
- Limited to 2 login attempts before failing
- CAPTCHAs require manual intervention
Data field summary
| Scraper | email | password | name | Other |
|---|---|---|---|---|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓ | ✓ | - | Dynamic* |
*Fields detected dynamically from page analysis
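To decide which `-D` fields apply before scraping, a script can guess which scraper a URL is likely to reach. The sketch below mirrors the URL patterns listed in the scraper reference. It is not the tool's own routing logic, and both the `classify_url` name and the order of the checks (the `.pdf` check first) are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the scraper name a URL would likely be handled by,
# based only on the documented URL patterns.
classify_url() {
  local url="$1" rest host path
  rest="${url#*://}"            # drop the scheme
  host="${rest%%/*}"            # everything before the first slash
  case "$rest" in
    */*) path="/${rest#*/}" ;;  # everything from the first slash onward
    *)   path="/" ;;
  esac
  path="${path%%[?#]*}"         # ignore query string and fragment

  case "$path" in
    *.pdf) echo "DirectPdfScraper"; return 0 ;;
  esac
  case "$host" in
    docsend.com|*.docsend.com)
      case "$path" in
        /view/*|/v/*) echo "DocsendScraper"; return 0 ;;
      esac ;;
    notion.so|*.notion.site)
      echo "NotionScraper"; return 0 ;;
  esac
  echo "LlmFallbackScraper"
}
```

For example, `classify_url https://org-a.docsend.com/v/abc123` prints `DocsendScraper`, which suggests that `email`, `password`, and `name` are the fields worth preparing.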
Environment setup (optional)
Only needed for the LLM fallback scraper:
```bash
export ANTHROPIC_API_KEY=your_key
```
Optional browser settings:
```bash
export BROWSER_HEADLESS=true   # Set to false for debugging
```
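A wrapper script may want to fail early when the key is missing rather than discover it mid-scrape. The following is a small sketch of such a guard; `require_anthropic_key` is a hypothetical helper, and whether the CLI itself reports a missing key is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only if ANTHROPIC_API_KEY is set and non-empty.
require_anthropic_key() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set; the LLM fallback scraper needs it" >&2
    return 1
  fi
}

# Usage sketch (not executed here):
#   require_anthropic_key && docs-scraper scrape https://example.com/article
```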
Common patterns
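Archive a Notion page:
```bash
docs-scraper scrape https://notion.so/My-Page-abc123
```
Download a protected DocSend document:
```bash
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
```
Batch scraping with profiles:
```bash
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
```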
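For more than a handful of URLs, the batch pattern above can be wrapped in a loop. This is a minimal sketch that uses only the `scrape` and `-p` forms shown in this document. `batch_scrape` and the `DOCS_SCRAPER_BIN` variable are hypothetical conveniences of the sketch (the variable lets the command be swapped out for testing) and are not features of the CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helper: scrape every URL argument with one shared profile.
# Usage: batch_scrape <profile> <url>...
batch_scrape() {
  local profile="$1"; shift
  local bin="${DOCS_SCRAPER_BIN:-docs-scraper}"
  local url failed=0
  for url in "$@"; do
    # Keep going after a failure so one bad URL does not stop the batch.
    "$bin" scrape "$url" -p "$profile" || failed=$((failed + 1))
  done
  # Succeed only if every scrape succeeded.
  [ "$failed" -eq 0 ]
}

# Usage sketch: batch_scrape mysite https://site.com/doc1 https://site.com/doc2
```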
Output
Success: Local file path (e.g., ~/.docs-scraper/output/1706123456-abc123.pdf)
Blocked: Job ID + required credential types
Troubleshooting
- Timeout: INLINECODE23
- Auth fails: run `docs-scraper jobs list` to check pending jobs
- Disk full: run `docs-scraper cleanup` to remove old PDFs
docs-scraper
CLI tool that uses browser automation to scrape documents from various sources and save them as local PDF files.
Installation
```bash
npm install -g docs-scraper
```
Quick start
Scrape any document URL to PDF:
```bash
docs-scraper scrape https://example.com/document
```
Returns the local path: `~/.docs-scraper/output/1706123456-abc123.pdf`
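Basic scraping
Scrape with the daemon (recommended, keeps the browser warm):
```bash
docs-scraper scrape <url>
```
Scrape with a named profile (for sites that require authentication):
```bash
docs-scraper scrape <url> -p <profile>
```
Scrape with pre-filled data (e.g., email for DocSend):
```bash
docs-scraper scrape <url> -D email=user@example.com
```
Direct mode (single run, no daemon):
```bash
docs-scraper scrape <url> --no-daemon
```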
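Authentication workflow
When a document requires authentication (login, email verification, passcode):
1. The initial scrape returns a job ID:
```bash
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
# Job ID: abc123
```
2. Retry with the required data:
```bash
docs-scraper update abc123 -D email=user@example.com
# Or with a password
docs-scraper update abc123 -D email=user@example.com -D password=1234
```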
Profile management
Profiles store session cookies for authenticated sites.
```bash
docs-scraper profiles list              # List saved profiles
docs-scraper profiles clear             # Clear all profiles
docs-scraper scrape <url> -p myprofile  # Use a profile
```
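Daemon management
The daemon keeps browser instances warm for faster scraping.
```bash
docs-scraper daemon status   # Check status
docs-scraper daemon start    # Start manually
docs-scraper daemon stop     # Stop the daemon
```
Note: The daemon starts automatically when you run a scrape command.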
Cleanup
PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.
Manual cleanup:
```bash
docs-scraper cleanup                  # Delete all PDFs
docs-scraper cleanup --older-than 1h  # Delete PDFs older than 1 hour
```
Job management
```bash
docs-scraper jobs list   # List blocked jobs awaiting authentication
```
Supported sources
- Direct PDF links - Downloads PDF directly
- Notion pages - Exports Notion page to PDF
- DocSend documents - Handles DocSend viewer
- LLM fallback - Uses Claude API for any other webpage
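Scraper Reference
Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.
DirectPdfScraper
Handles: URLs ending in `.pdf`
Data fields: None (downloads directly)
Example:
```bash
docs-scraper scrape https://example.com/document.pdf
```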
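DocsendScraper
Handles: `docsend.com/view/*`, `docsend.com/v/*`, and subdomains (e.g., `org-a.docsend.com`)
URL patterns:
- Documents: `https://docsend.com/view/{id}` or `https://docsend.com/v/{id}`
- Folders: `https://docsend.com/view/s/{id}`
- Subdomains: `https://{subdomain}.docsend.com/view/{id}`
Data fields:
| Field | Type | Description |
|---|---|---|
| `email` | email | Email address for document access |
| `password` | password | Passcode/password for protected documents |
| `name` | text | Your name (required for NDA-gated documents) |
Examples:
```bash
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With NDA name requirement (quoted so the space stays inside one argument)
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D "name=John Doe"

# Retry a blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
```
Notes:
- DocSend may require any combination of email, password, and name
- Folders are scraped as a table of contents PDF with document links
- The scraper auto-checks NDA checkboxes when name is provided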
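NotionScraper
Handles: `notion.so/*`, `*.notion.site/*`
Data fields:
| Field | Type | Description |
|---|---|---|
| `email` | email | Notion account email |
| `password` | password | Notion account password |
Examples:
```bash
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page requiring login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
```
Notes:
- Public Notion pages don't require authentication
- Toggle blocks are automatically expanded before PDF generation
- Uses session profiles to persist login across scrapes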
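LlmFallbackScraper
Handles: Any URL not matched by other scrapers (automatic fallback)
Data fields: Dynamic - determined by Claude analyzing the page
The LLM scraper uses Claude to analyze the page HTML and detect:
- Login forms (extracts field names dynamically)
- Cookie banners (auto-dismisses)
- Expandable content (auto-expands)
- CAPTCHAs (reports as blocked)
- Paywalls (reports as blocked)
Common dynamic fields:
| Field | Type | Description |
|---|---|---|
| `email` | email | Login email (if detected) |
| `password` | password | Login password (if detected) |
| `username` | text | Username (if login uses username) |
Examples:
```bash
# Generic webpage (no auth needed)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# When blocked, check which fields the job needs
docs-scraper jobs list

# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
```
Notes:
- Requires the `ANTHROPIC_API_KEY` environment variable
- Field names are extracted from the page's actual form fields
- Limited to 2 login attempts before failing
- CAPTCHAs require manual intervention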
Data field summary
| Scraper | email | password | name | Other |
|---|---|---|---|---|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓ | ✓ | - | Dynamic* |
*Fields detected dynamically from page analysis
Environment setup (optional)
Only needed for the LLM fallback scraper:
```bash
export ANTHROPIC_API_KEY=your_key
```
Optional browser settings:
```bash
export BROWSER_HEADLESS=true   # Set to false for debugging
```
Common patterns
Archive a Notion page:
```bash
docs-scraper scrape https://notion.so/My-Page-abc123
```
Download a protected DocSend document:
```bash
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
```
Batch scraping with profiles:
```bash
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
```
Output
Success: Local file path (e.g., ~/.docs-scraper/output/1706123456-abc123.pdf)
Blocked: Job ID + required credential types
Troubleshooting