Parallel Extract

Extract clean, LLM-ready content from URLs. Handles webpages, articles, PDFs, and JavaScript-heavy sites that need rendering.

When to Use

Trigger this skill when the user asks for:

- "read this URL", "fetch this page", "extract from..."
- "get the content from [URL]"
- "what does this article say?"
- Reading PDFs, JS-heavy pages, or paywalled content
- Getting clean markdown from messy web pages

Use Search to discover; use Extract to read.

Quick Start

    parallel-cli extract https://example.com/article --json

CLI Reference

Basic Usage

    parallel-cli extract [options]

Common Flags

Flag	Description
--url <url>	URL to extract (repeatable, max 10)
--objective <focus>

Examples

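Basic extraction:

    parallel-cli extract https://example.com/article --json

Focused extraction:

    parallel-cli extract https://example.com/pricing \
      --objective "pricing tiers and features" \
      --json

Full content for PDFs:

    parallel-cli extract https://example.com/whitepaper.pdf \
      --full-content \
      --json

Multiple URLs:

    parallel-cli extract \
      --url https://example.com/page1 \
      --url https://example.com/page2 \
      --json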
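
Because --url is repeatable up to 10 times, a batch call can be assembled from a list of URLs. The following is a minimal sketch, not part of the CLI: build_extract_cmd is a hypothetical helper that only prints the command so it can be inspected before it is run, and it does not quote URLs that contain shell metacharacters.

```bash
# Hypothetical helper (not part of parallel-cli): print an extract command
# for 1 to 10 URLs. It echoes the command instead of running it.
build_extract_cmd() {
  if [ "$#" -eq 0 ]; then
    echo "error: at least one URL is required" >&2
    return 1
  fi
  if [ "$#" -gt 10 ]; then
    echo "error: --url accepts at most 10 URLs, got $#" >&2
    return 1
  fi
  cmd="parallel-cli extract"
  for url in "$@"; do
    cmd="$cmd --url $url"
  done
  printf '%s --json\n' "$cmd"
}
```

Printing first and running second keeps the 10-URL limit check separate from the network call.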

Default Workflow

1. Search with an objective + keyword queries
2. Inspect titles/URLs/dates; choose the best sources
3. Extract the specific pages you need (top N URLs)
4. Answer using the extracted excerpts/content
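
The "top N URLs" selection in step 3 can be sketched as a small filter. This assumes the shortlisted URLs from step 2 are available as plain text, one per line, in order of preference; top_n_urls is a hypothetical helper, not a CLI feature.

```bash
# Hypothetical helper: read URLs on stdin, skip blank lines and repeats,
# and print only the first N that remain.
top_n_urls() {
  awk -v n="$1" 'NF && !seen[$0]++ && count < n { print; count++ }'
}

# Usage (assumes urls.txt holds the shortlisted URLs):
#   top_n_urls 3 < urls.txt
```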

Best-Practice Prompting

Objective

When extracting, provide context:

- What specific information you're looking for
- Why you need it (helps focus extraction)

Good: --objective "Find the installation steps and system requirements"

Poor: --objective "Read the page"

Response Format

Returns structured JSON with:

- url: source URL
- title: page title
- excerpts[]: relevant text excerpts (if enabled)
- full_content: complete page content (if enabled)
- publish_date: publication date, when available
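
The fields above can be pulled out for citation with a JSON tool. The sketch below assumes jq is installed and that the input is a single result object carrying these fields; the top-level shape of the CLI's --json output is not specified here, so the filter may need adjusting. summarize_result is a hypothetical helper.

```bash
# Hypothetical helper: read one result object on stdin and print
# title, url, and publish_date as a tab-separated line.
# Assumption: the object has the field names listed above.
summarize_result() {
  jq -r '[.title, .url, (.publish_date // "unknown")] | @tsv'
}

# Usage (result.json is an assumed file holding one result object):
#   summarize_result < result.json
```

A missing publish_date falls back to the literal string "unknown" rather than failing.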

Output Handling

When turning extracted content into a user-facing answer:

- Keep content verbatim — do not paraphrase unnecessarily
- Extract ALL list items exhaustively
- Strip noise: nav menus, footers, ads, "click here" links
- Preserve all facts, names, numbers, dates, quotes
- Include URL + publish_date for transparency

Running Out of Context?

For long conversations, save results and use sessions_spawn:

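    parallel-cli extract --json -o /tmp/extract-.json

Then spawn a sub-agent:

    {
      "tool": "sessions_spawn",
      "task": "Read /tmp/extract-.json and summarize the key content.",
      "label": "extract-summary"
    }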

Error Handling

Exit Code	Meaning
0	Success
1
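
Only exit code 0 is documented as success, so a script can treat any other status as a failure. A minimal sketch under that assumption; run_checked is a hypothetical wrapper that works with any command, not something the CLI provides.

```bash
# Hypothetical wrapper: run a command, report a non-zero exit code on
# stderr, and pass the same code back to the caller.
run_checked() {
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "command failed with exit code $status: $*" >&2
  fi
  return "$status"
}

# Usage:
#   run_checked parallel-cli extract https://example.com/article --json
```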

Prerequisites

1. Get an API key at parallel.ai
2. Install the CLI:

    curl -fsSL https://parallel.ai/install.sh | bash
    export PARALLEL_API_KEY=your-key
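
Both prerequisites can be verified before the first extraction. This is a sketch only, assuming the binary is named parallel-cli and the key is expected in the PARALLEL_API_KEY environment variable; check_prereqs is a hypothetical helper.

```bash
# Hypothetical preflight check: confirm the CLI is on PATH and the
# API key variable is set and non-empty.
check_prereqs() {
  if ! command -v parallel-cli >/dev/null 2>&1; then
    echo "parallel-cli not found on PATH; run the install step first" >&2
    return 1
  fi
  if [ -z "${PARALLEL_API_KEY:-}" ]; then
    echo "PARALLEL_API_KEY is not set" >&2
    return 1
  fi
  return 0
}
```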


Parallel Extract

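Extract clean, LLM-ready content from URLs. Supports webpages, articles, PDFs, and JavaScript-heavy sites that need rendering.

When to Use

Trigger this skill when the user asks for:

- "read this URL", "fetch this page", "extract from..."
- "get the content from [URL]"
- "what does this article say?"
- Reading PDFs, JS-heavy pages, or paywalled content
- Getting clean Markdown from messy web pages

Use Search to discover content; use Extract to read it.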

Quick Start

    parallel-cli extract https://example.com/article --json

CLI Reference

Basic Usage

    parallel-cli extract [options]

Common Flags

Flag	Description
--url <url>	URL to extract (repeatable, max 10)
--objective <focus>

Examples

Basic extraction:

    parallel-cli extract https://example.com/article --json

Focused extraction:

    parallel-cli extract https://example.com/pricing \
      --objective "pricing tiers and features" \
      --json

Full content for PDFs:

    parallel-cli extract https://example.com/whitepaper.pdf \
      --full-content \
      --json

Multiple URLs:

    parallel-cli extract \
      --url https://example.com/page1 \
      --url https://example.com/page2 \
      --json

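Default Workflow

1. Search: query with an objective plus keywords
2. Inspect: review titles/URLs/dates; choose the best sources
3. Extract: extract the specific pages you need (top N URLs)
4. Answer: respond using the extracted excerpts/content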

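Best-Practice Prompting

Objective

When extracting, provide context:

- What specific information you're looking for
- Why you need it (helps focus extraction)

Good: --objective "Find the installation steps and system requirements"

Poor: --objective "Read the page"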

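Response Format

Returns structured JSON with:

- url: source URL
- title: page title
- excerpts[]: relevant text excerpts (if enabled)
- full_content: complete page content (if enabled)
- publish_date: publication date (if available)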

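Output Handling

When turning extracted content into a user-facing answer:

- Keep content verbatim; do not paraphrase unnecessarily
- Extract all list items exhaustively
- Strip noise: nav menus, footers, ads, "click here" links
- Preserve all facts, names, numbers, dates, quotes
- Include URL + publish date for transparency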

Running Out of Context?

For long conversations, save the results and use sessions_spawn:

    parallel-cli extract --json -o /tmp/extract-.json

Then spawn a sub-agent:

    {
      "tool": "sessions_spawn",
      "task": "Read /tmp/extract-.json and summarize the key content.",
      "label": "extract-summary"
    }

Error Handling

Exit Code	Meaning
0	Success
1

Prerequisites

1. Get an API key at parallel.ai
2. Install the CLI:

    curl -fsSL https://parallel.ai/install.sh | bash
    export PARALLEL_API_KEY=your-key
