Anakin - Web Data Extraction

Convert websites into clean data at scale using the anakin-cli. Supports single URL scraping, batch scraping, AI-powered search, and autonomous deep research.

Installation & Authentication

Check status and authentication:

```bash
anakin status
```

Output when ready:
```
✓ Authenticated
Endpoint: https://api.anakin.io
Account: user@example.com
```

If not installed: pip install anakin-cli

Always refer to the installation rules in rules/install.md for more information if the user is not logged in.

If not authenticated, run:

```bash
anakin login --api-key "ak-your-key-here"
```

Get your API key from anakin.io/dashboard.

Organization

Store results in a .anakin/ folder in the working directory, creating it if it does not already exist. Add .anakin/ to .gitignore if it is not already listed. Always use -o to write output directly to a file, which avoids flooding the context:

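```bash
mkdir -p .anakin
echo ".anakin/" >> .gitignore
anakin scrape "https://example.com" -o .anakin/output.md
```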
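The setup above can be made safe to repeat, so that .anakin/ is added to .gitignore only when it is missing. The following is a minimal sketch; the function name ensure_anakin_dir and its optional directory argument are hypothetical and not part of anakin-cli.

```bash
# Hypothetical helper (not part of anakin-cli): prepare the results folder
# and ignore it in git. Calling it more than once has no further effect.
ensure_anakin_dir() {
  # Optional first argument: project directory (defaults to the current one).
  project_dir="${1:-.}"
  mkdir -p "$project_dir/.anakin"
  # Make sure .gitignore exists so grep has a file to read.
  touch "$project_dir/.gitignore"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
  if ! grep -qxF '.anakin/' "$project_dir/.gitignore"; then
    printf '%s\n' '.anakin/' >> "$project_dir/.gitignore"
  fi
}
```

Run ensure_anakin_dir once at the start of a session, then write every result under .anakin/ with -o.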

Capabilities

1. Scrape a Single URL

Extract content from a single web page in multiple formats.

When to use:

- Extracting content from a single web page
- Converting a webpage to clean markdown
- Extracting structured data from one URL
- Getting the full raw API response with metadata

Basic usage:

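```bash
# Clean, readable text (default markdown format)
anakin scrape "https://example.com" -o output.md

# Structured data (JSON)
anakin scrape "https://example.com" --format json -o output.json

# Full API response with HTML and metadata
anakin scrape "https://example.com" --format raw -o output.json
```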

Advanced options:

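```bash
# JavaScript-heavy or single-page-app sites
anakin scrape "https://example.com" --browser -o output.md

# Geo-targeted scraping (country code)
anakin scrape "https://example.com" --country gb -o output.md

# Custom timeout for slow pages (seconds)
anakin scrape "https://example.com" --timeout 300 -o output.md
```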

2. Batch Scrape Multiple URLs

Scrape up to 10 URLs at once for efficient parallel processing.

When to use:

- Scraping multiple web pages simultaneously
- Comparing products across different sites
- Collecting multiple articles or pages
- Gathering data from several sources at once

Basic usage:

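```bash
# Batch scrape multiple URLs (up to 10)
anakin scrape-batch "https://example.com/page1" "https://example.com/page2" "https://example.com/page3" -o batch-results.json
```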

For large lists (>10 URLs):

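```bash
# First batch (URLs 1-10)
anakin scrape-batch "https://url1.com" ... "https://url10.com" -o batch-1.json

# Second batch (URLs 11-20)
anakin scrape-batch "https://url11.com" ... "https://url20.com" -o batch-2.json
```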
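Splitting a long list by hand is error-prone, so the batching can be scripted. The sketch below requires bash (it uses positional-parameter slicing). The function name scrape_in_batches and the ANAKIN_CMD override are hypothetical, as is the .anakin/batch-N.json naming; only the scrape-batch subcommand, the -o flag, and the 10-URL limit come from this document.

```bash
# Hypothetical helper: run scrape-batch over any number of URLs,
# at most 10 per call, writing one result file per batch.
# ANAKIN_CMD is an assumption of this sketch: set it to "echo anakin"
# to print the commands instead of running them (dry run).
scrape_in_batches() {
  local size=10 batch=1 count
  while [ "$#" -gt 0 ]; do
    # Take the remaining URLs if fewer than 10 are left.
    count=$(( $# < size ? $# : size ))
    # "${@:1:count}" expands to the first $count URLs, each kept as one argument.
    ${ANAKIN_CMD:-anakin} scrape-batch "${@:1:count}" -o ".anakin/batch-${batch}.json"
    shift "$count"
    batch=$(( batch + 1 ))
  done
}
```

With 12 URLs this issues two calls: one with the first 10 URLs and one with the remaining 2. The .anakin/ folder must already exist when the commands are run for real.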

Output format: JSON file with combined results, each URL's status (success/failure), content, metadata, and any errors.

3. AI-Powered Web Search

Run intelligent web searches to find pages, answer questions, and discover sources.

When to use:

- Finding pages on a specific topic
- Answering questions with web sources
- Discovering relevant sources for research
- Gathering links before scraping specific pages
- Quick factual lookups

Basic usage:

```bash
# AI-powered web search
anakin search "your search query" -o search-results.json
```

Follow-up workflow:

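```bash
# 1. Search for relevant pages
anakin search "machine learning tutorials" -o search-results.json

# 2. Scrape a specific result for its full content
anakin scrape "https://result-url-from-search.com" -o page.md
```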

Output format: JSON file with search results including titles, URLs, snippets, relevance scores, and metadata.

4. Deep Agentic Research

Run comprehensive autonomous research that explores the web and returns detailed reports.

When to use:

- Comprehensive research on complex topics
- Market analysis requiring multiple sources
- Technical deep-dives across documentation and articles
- Comparison research (products, technologies, approaches)
- Questions requiring synthesis from many sources

Basic usage:

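```bash
# Deep agentic research (takes 1-5 minutes)
anakin research "your research topic or question" -o research-report.json

# Extended timeout for complex topics
anakin research "comprehensive analysis of quantum computing" --timeout 600 -o research-report.json
```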

⏱️ Important: Deep research takes 1-5 minutes and runs autonomously. Always inform the user about this duration before starting.

What it does:

- Autonomously searches for relevant sources
- Scrapes and analyzes multiple pages
- Synthesizes information across sources
- Generates comprehensive reports with citations
- Provides key insights and conclusions

Output format: JSON file with executive summary, detailed report by subtopics, key insights, citations with URLs, confidence scores, and related topics.

Decision Guide

Use anakin scrape when:

- You have a single specific URL to extract
- You need content in markdown, JSON, or raw format
- The page is static, or it is JavaScript-heavy (in which case add --browser)

Use anakin scrape-batch when:

- You have 2-10 URLs to scrape simultaneously
- You need efficient parallel processing
- You want combined results in one file

Use anakin search when:

- You need to find relevant URLs first
- You want quick factual lookups
- You need results in under 30 seconds
- You know what you're looking for

Use anakin research when:

- You need comprehensive analysis across 5+ sources
- The topic is complex and requires deep exploration
- You want a synthesized report with insights
- You can wait 1-5 minutes for autonomous research
- The question requires comparing multiple perspectives
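The decision guide can be condensed into a small selector. This is only a sketch of the rules above: the function name choose_anakin_command and its two arguments (the number of URLs already known, and the word deep when a synthesized report is wanted) are hypothetical.

```bash
# Hypothetical helper: pick a subcommand from the decision guide.
#   $1 = number of URLs already known (0 if none)
#   $2 = "deep" when a synthesized, multi-source report is wanted (optional)
choose_anakin_command() {
  url_count="$1"
  depth="${2:-quick}"
  if [ "$depth" = "deep" ]; then
    echo "research"        # comprehensive analysis, takes 1-5 minutes
  elif [ "$url_count" -eq 0 ]; then
    echo "search"          # no URLs yet: find them first
  elif [ "$url_count" -eq 1 ]; then
    echo "scrape"          # one specific page
  else
    echo "scrape-batch"    # several pages; split lists longer than 10
  fi
}
```

For example, choose_anakin_command 3 prints scrape-batch, and choose_anakin_command 0 deep prints research.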

Guardrails

URL Handling

- Always quote URLs to prevent shell interpretation of ?, &, # characters
- Example: anakin scrape "https://example.com?param=value" not anakin scrape https://example.com?param=value

Output Management

- Always use -o <file> to save output to a file rather than flooding the terminal
- Choose appropriate output filenames based on content type

Format Selection

- Default to markdown for readability unless the user explicitly asks for JSON or raw
- Use --format json for structured data processing
- Use --format raw for the full API response with HTML

Special Cases

- Use --browser only when a standard scrape returns empty or incomplete content
- For batch scraping: maximum 10 URLs per command; split larger lists
- For research: always warn about the 1-5 minute duration before starting

Rate Limiting

- On HTTP 429 errors (rate limit), wait before retrying
- Do not loop immediately on rate limit errors
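One way to wait before retrying is exponential backoff: pause, then double the pause after each failed attempt. The wrapper below is a generic sketch, not an anakin-cli feature. The name retry_with_backoff is hypothetical, and it assumes the CLI exits with a non-zero status when rate limited. It cannot tell a 429 from other failures, so do not use it for HTTP 401 errors, which call for anakin login instead.

```bash
# Hypothetical wrapper: retry a command, doubling the wait after each failure.
#   $1 = maximum number of attempts
#   $2 = initial delay in seconds
#   remaining arguments = the command to run
# Assumption: a rate-limited call exits with a non-zero status.
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is spent.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

A call such as retry_with_backoff 4 5 anakin scrape "https://example.com" -o .anakin/output.md would wait 5, 10, and then 20 seconds between its four attempts.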

Authentication

- On HTTP 401 errors, re-run anakin login rather than retrying the same command

Error Handling

| Error | Solution |
|---|---|
| HTTP 401 (Unauthorized) | Re-run anakin login --api-key "your-key" |
| HTTP 429 (Rate Limited) | Wait before retrying; do not loop immediately |

Output Formats

Markdown (default for scrape)

- Clean, readable text stripped of navigation and ads
- Best for human reading and summarization
- File extension: .md

JSON (structured)

- Structured data with title, content, metadata
- Best for processing and parsing
- File extension: .json

Raw (full response)

- Full API response including HTML, links, images, metadata
- Best for debugging or accessing all available data
- File extension: .json
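To keep output filenames consistent with the chosen format, the extension can be derived from the format name. This is a sketch only: output_path_for is a hypothetical helper, the format labels markdown, json, and raw are this sketch's own names, and saving raw output as .json is an assumption.

```bash
# Hypothetical helper: build an output path under .anakin/ for a given format.
#   $1 = base name of the file (without extension)
#   $2 = format label: markdown (default), json, or raw
# Assumption: raw responses are JSON documents and are saved as .json.
output_path_for() {
  name="$1"
  format="${2:-markdown}"
  case "$format" in
    markdown) ext="md" ;;
    json|raw) ext="json" ;;
    *) echo "unknown format: $format" >&2; return 1 ;;
  esac
  echo ".anakin/${name}.${ext}"
}
```

For example, anakin scrape "https://example.com" --format json -o "$(output_path_for products json)" writes to .anakin/products.json.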

Examples

Example 1: Article extraction

CODEBLOCK11

Example 2: Product comparison

CODEBLOCK12

Example 3: Find and scrape

CODEBLOCK13

Example 4: Market research

CODEBLOCK14

Example 5: JavaScript-heavy site

CODEBLOCK15

Example 6: Geo-targeted content

CODEBLOCK16

Best Practices

1. Start simple: Try a basic scrape first, add flags only if needed
2. Be specific: Use clear, specific search queries and research topics
3. Quote URLs: Always wrap URLs in quotes
4. Save output: Always use the -o flag to save results to files
5. Check status: Run anakin status before starting work
6. Batch wisely: Group similar URLs together, max 10 per batch
7. Wait on rate limits: Don't retry immediately on 429 errors
8. Choose the right tool:
   - Single page → scrape
   - Multiple pages → scrape-batch
   - Don't have URLs → search first
   - Need deep analysis → research

Troubleshooting

Authentication issues

CODEBLOCK17

Empty or incomplete content

- Add the --browser flag for JavaScript-heavy sites
- Increase the timeout with --timeout
- Check whether the site requires a specific geo-location with --country

Rate limiting

- Wait before retrying (don't loop immediately)
- Consider spacing out requests for large batch operations
- Check your API plan limits at anakin.io/dashboard

Resources

- Anakin Website
- Anakin Dashboard - Get API keys and check usage
- anakin-cli on PyPI
- Support

