Parallel Extract
Extract clean, LLM-ready content from URLs. Handles webpages, articles, PDFs, and JavaScript-heavy sites that need rendering.
When to Use
Trigger this skill when the user asks for:
- "read this URL", "fetch this page", "extract from..."
- "get the content from [URL]"
- "what does this article say?"
- Reading PDFs, JS-heavy pages, or paywalled content
- Getting clean markdown from messy web pages
Use Search to discover; use Extract to read.
Quick Start
```bash
parallel-cli extract "https://example.com/article" --json
```
CLI Reference
Basic Usage
```bash
parallel-cli extract <url> [options]
```
Common Flags
| Flag | Description |
|---|---|
| `--url <url>` | URL to extract (repeatable, max 10) |
| `--objective <focus>` | Focus extraction on specific content |
| `--json` | Output as JSON |
| `--excerpts` / `--no-excerpts` | Include relevant excerpts (default: on) |
| `--full-content` / `--no-full-content` | Include full page content |
| `--excerpts-max-chars N` | Max chars per excerpt |
| `--excerpts-max-total-chars N` | Max total excerpt chars |
| `--full-max-chars N` | Max full content chars |
| `-o <file>` | Save output to file |
Examples
Basic extraction:
```bash
parallel-cli extract "https://example.com/article" --json
```
Focused extraction:
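```bash
parallel-cli extract "https://example.com/pricing" \
  --objective "pricing tiers and features" \
  --json
```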
Full content for PDFs:
```bash
parallel-cli extract "https://example.com/whitepaper.pdf" \
  --full-content \
  --json
```
Multiple URLs:
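```bash
parallel-cli extract \
  --url "https://example.com/page1" \
  --url "https://example.com/page2" \
  --json
```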
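When the URL list comes from a script, the repeated `--url` flags can be generated rather than typed by hand. The following is a minimal sketch, not part of the CLI: `extract_urls` and the `PARALLEL_CLI` override are hypothetical names introduced here for illustration, and the cap reflects the documented maximum of 10 URLs per call.

```bash
# Hypothetical wrapper: pass each URL as its own --url flag and
# refuse more than 10 URLs (the documented per-call maximum).
extract_urls() {
  if [ "$#" -eq 0 ] || [ "$#" -gt 10 ]; then
    echo "expected 1 to 10 URLs, got $#" >&2
    return 2
  fi
  # Rotate the positional parameters into: --url u1 --url u2 ...
  n=$#
  i=0
  while [ "$i" -lt "$n" ]; do
    u="$1"
    shift
    set -- "$@" --url "$u"
    i=$((i + 1))
  done
  # PARALLEL_CLI is an assumed override (for example "echo" for a dry run);
  # it defaults to the real CLI.
  "${PARALLEL_CLI:-parallel-cli}" extract "$@" --json
}
```

Calling `extract_urls "https://example.com/page1" "https://example.com/page2"` should be equivalent to the command above. Setting `PARALLEL_CLI=echo` first prints the arguments instead of sending a request, which is a cheap way to check the flags.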
Default Workflow
1. Search with an objective + keyword queries
2. Inspect titles/URLs/dates; choose the best sources
3. Extract the specific pages you need (top N URLs)
4. Answer using the extracted excerpts/content
Best-Practice Prompting
Objective
When extracting, provide context:
- What specific information you're looking for
- Why you need it (helps focus extraction)
Good: `--objective "Find the installation steps and system requirements"`
Poor: `--objective "Read the page"`
Response Format
Returns structured JSON with:
- `url`: source URL
- `title`: page title
- `excerpts[]`: relevant text excerpts (if enabled)
- `full_content`: complete page content (if enabled)
- `publish_date`: publication date, when available
Output Handling
When turning extracted content into a user-facing answer:
- Keep content verbatim; do not paraphrase unnecessarily
- Extract ALL list items exhaustively
- Strip noise: nav menus, footers, ads, "click here" links
- Preserve all facts, names, numbers, dates, quotes
- Include URL + publish_date for transparency
Running Out of Context?
For long conversations, save results and use sessions_spawn:
```bash
parallel-cli extract <url> --json -o /tmp/extract-.json
```
Then spawn a sub-agent:
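```json
{
  "tool": "sessions_spawn",
  "task": "Read /tmp/extract-.json and summarize the key content.",
  "label": "extract-summary"
}
```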
Error Handling
| Exit Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Unexpected error (network, parse) |
| 2 | Invalid arguments |
| 3 | API error (non-2xx) |
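A calling script can branch on these exit codes instead of parsing error text. The sketch below is one possible way to do that, under stated assumptions: `describe_exit_code`, `run_extract`, and the `PARALLEL_CLI` override are hypothetical helpers introduced here, and only the four codes from the table are mapped.

```bash
# Map the documented exit codes to short messages.
describe_exit_code() {
  case "$1" in
    0) echo "success" ;;
    1) echo "unexpected error (network, parse)" ;;
    2) echo "invalid arguments" ;;
    3) echo "API error (non-2xx)" ;;
    *) echo "unknown exit code $1" ;;
  esac
}

# Hypothetical wrapper: run an extract, save the JSON to a file,
# and report any failure on stderr while preserving the exit code.
run_extract() {
  out_file="$1"
  shift
  exit_code=0
  # PARALLEL_CLI is an assumed override; it defaults to the real CLI.
  "${PARALLEL_CLI:-parallel-cli}" extract "$@" --json -o "$out_file" || exit_code=$?
  if [ "$exit_code" -ne 0 ]; then
    echo "extract failed: $(describe_exit_code "$exit_code")" >&2
  fi
  return "$exit_code"
}
```

For example, `run_extract /tmp/article.json "https://example.com/article" || exit $?` stops a script on the first failed extraction and passes the original code to the caller.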
Prerequisites
1. Get an API key at parallel.ai
2. Install the CLI:
```bash
curl -fsSL https://parallel.ai/install.sh | bash
export PARALLEL_API_KEY="your-key"
```
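Before the first extraction it can help to confirm that both prerequisites are in place. The check below is a minimal sketch: `check_parallel_setup` and the `PARALLEL_CLI` override are hypothetical names, and it assumes the key is read from an environment variable named `PARALLEL_API_KEY`.

```bash
# Hypothetical preflight check: confirm the CLI is on PATH and the key is set.
check_parallel_setup() {
  # PARALLEL_CLI is an assumed override; it defaults to the real CLI name.
  if ! command -v "${PARALLEL_CLI:-parallel-cli}" >/dev/null 2>&1; then
    echo "parallel-cli not found on PATH; install it first" >&2
    return 1
  fi
  # Assumes the API key lives in PARALLEL_API_KEY.
  if [ -z "${PARALLEL_API_KEY:-}" ]; then
    echo "PARALLEL_API_KEY is not set" >&2
    return 1
  fi
  return 0
}
```

Running `check_parallel_setup && parallel-cli extract "https://example.com/article" --json` skips the request when either prerequisite is missing.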