Anakin - Web Data Extraction
Convert websites into clean data at scale using the anakin-cli. Supports single URL scraping, batch scraping, AI-powered search, and autonomous deep research.
Installation & Authentication
Check status and authentication:
```bash
anakin status
```
Output when ready:
```
✓ Authenticated
Endpoint: https://api.anakin.io
Account: user@example.com
```
If not installed: `pip install anakin-cli`
Always refer to the installation rules in rules/install.md for more information if the user is not logged in.
If not authenticated, run:
```bash
anakin login --api-key "ak-your-key-here"
```
Get your API key from anakin.io/dashboard.
Organization
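Create a `.anakin/` folder in the working directory to store results, unless it already exists. Add `.anakin/` to the `.gitignore` file if it is not already listed. Always use `-o` to write output directly to a file, which avoids flooding the context:
```bash
mkdir -p .anakin
echo ".anakin/" >> .gitignore
anakin scrape "https://example.com" -o .anakin/output.md
```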
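The "if not already there" condition for `.gitignore` can be made explicit so that repeated runs do not append duplicate entries. The following is a minimal sketch, not part of anakin-cli; the function name `ensure_anakin_dir` is hypothetical.

```shell
# Prepare the results folder; safe to run any number of times.
ensure_anakin_dir() {
  mkdir -p .anakin
  touch .gitignore
  # If the file does not end with a newline, add one so the entry gets its own line.
  if [ -n "$(tail -c 1 .gitignore)" ]; then
    echo >> .gitignore
  fi
  # Append the entry only when no existing line matches it exactly.
  grep -qxF '.anakin/' .gitignore || echo '.anakin/' >> .gitignore
}
```

Calling `ensure_anakin_dir` at the start of a session leaves an existing `.gitignore` otherwise untouched.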
Capabilities
1. Scrape a Single URL
Extract content from a single web page in multiple formats.
When to use:
- Extracting content from a single web page
- Converting a webpage to clean markdown
- Extracting structured data from one URL
- Getting full raw API response with metadata
Basic usage:
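```bash
# Clean, readable text (default markdown format)
anakin scrape "https://example.com" -o output.md

# Structured data (JSON)
anakin scrape "https://example.com" --format json -o output.json

# Full API response with HTML and metadata
anakin scrape "https://example.com" --format raw -o output.json
```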
Advanced options:
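```bash
# JavaScript-heavy or single-page-app sites
anakin scrape "https://example.com" --browser -o output.md

# Geo-targeted scraping (country code)
anakin scrape "https://example.com" --country gb -o output.md

# Custom timeout for slow pages (seconds)
anakin scrape "https://example.com" --timeout 300 -o output.md
```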
2. Batch Scrape Multiple URLs
Scrape up to 10 URLs at once for efficient parallel processing.
When to use:
- Scraping multiple web pages simultaneously
- Comparing products across different sites
- Collecting multiple articles or pages
- Gathering data from several sources at once
Basic usage:
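```bash
# Batch scrape multiple URLs (up to 10)
anakin scrape-batch "https://example.com/page1" "https://example.com/page2" "https://example.com/page3" -o batch-results.json
```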
For large lists (>10 URLs):
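```bash
# First batch (URLs 1-10)
anakin scrape-batch "https://url1.com" ... "https://url10.com" -o batch-1.json

# Second batch (URLs 11-20)
anakin scrape-batch "https://url11.com" ... "https://url20.com" -o batch-2.json
```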
Output format: JSON file with combined results, each URL's status (success/failure), content, metadata, and any errors.
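For lists longer than 10 URLs, the splitting can be scripted. This is a minimal sketch, assuming a plain text file with one URL per line. The function name `scrape_in_batches`, the `ANAKIN_BIN` override, and the `batch-N.json` naming are illustrative assumptions; only the `scrape-batch` and `-o` usage comes from this guide.

```shell
# Scrape a URL list in groups of at most 10.
# Usage: scrape_in_batches <url_file> [output_dir]
# ANAKIN_BIN is a hypothetical override; set it to `echo` for a dry run.
scrape_in_batches() {
  url_file=$1
  out_dir=${2:-.anakin}
  bin=${ANAKIN_BIN:-anakin}
  n=0
  mkdir -p "$out_dir"
  set --                                  # positional parameters hold the current batch
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue             # skip blank lines
    set -- "$@" "$url"
    if [ "$#" -eq 10 ]; then
      n=$((n + 1))
      "$bin" scrape-batch "$@" -o "$out_dir/batch-$n.json" < /dev/null
      set --
    fi
  done < "$url_file"
  if [ "$#" -gt 0 ]; then                 # flush the final partial batch
    n=$((n + 1))
    "$bin" scrape-batch "$@" -o "$out_dir/batch-$n.json" < /dev/null
  fi
}
```

Running it with `ANAKIN_BIN=echo` first prints the commands that would be executed, which is a cheap way to check the grouping before spending API quota.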
3. AI-Powered Web Search
Run intelligent web searches to find pages, answer questions, and discover sources.
When to use:
- Finding pages on a specific topic
- Answering questions with web sources
- Discovering relevant sources for research
- Gathering links before scraping specific pages
- Quick factual lookups
Basic usage:
```bash
# AI-powered web search
anakin search "your search query" -o search-results.json
```
Follow-up workflow:
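```bash
# 1. Search for relevant pages
anakin search "machine learning tutorials" -o search-results.json

# 2. Scrape a specific result for its full content
anakin scrape "https://result-url-from-search.com" -o page.md
```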
Output format: JSON file with search results including titles, URLs, snippets, relevance scores, and metadata.
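The exact field names in the search output are not documented here, so one schema-agnostic way to gather links for a follow-up scrape is to pull out anything that looks like a URL. This is a rough sketch: `extract_urls` is a hypothetical helper, and the pattern may over-match (for example image or tracking links) or miss URLs whose slashes are escaped in the JSON.

```shell
# Print every distinct http(s) URL found in a results file, one per line.
extract_urls() {
  grep -oE 'https?://[^"[:space:]<>]+' "$1" | sort -u
}
```

For example, `extract_urls search-results.json` lists candidate links; review them, then pass the ones you want, quoted, to `anakin scrape`.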
4. Deep Agentic Research
Run comprehensive autonomous research that explores the web and returns detailed reports.
When to use:
- Comprehensive research on complex topics
- Market analysis requiring multiple sources
- Technical deep-dives across documentation and articles
- Comparison research (products, technologies, approaches)
- Questions requiring synthesis from many sources
Basic usage:
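```bash
# Deep agentic research (takes 1-5 minutes)
anakin research "your research topic or question" -o research-report.json

# Extended timeout for complex topics
anakin research "comprehensive analysis of quantum computing" --timeout 600 -o research-report.json
```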
⏱️ Important: Deep research takes 1-5 minutes and runs autonomously. Always inform the user about this duration before starting.
What it does:
- Autonomously searches for relevant sources
- Scrapes and analyzes multiple pages
- Synthesizes information across sources
- Generates comprehensive reports with citations
- Provides key insights and conclusions
Output format: JSON file with executive summary, detailed report by subtopics, key insights, citations with URLs, confidence scores, and related topics.
Decision Guide
Use anakin scrape when:
- You have a single specific URL to extract
- You need content in markdown, JSON, or raw format
- The page is static or JavaScript-heavy (use `--browser`)
Use anakin scrape-batch when:
- You have 2-10 URLs to scrape simultaneously
- You need efficient parallel processing
- You want combined results in one file
Use anakin search when:
- You need to find relevant URLs first
- You want quick factual lookups
- You need results in under 30 seconds
- You know what you're looking for
Use anakin research when:
- You need comprehensive analysis across 5+ sources
- The topic is complex and requires deep exploration
- You want a synthesized report with insights
- You can wait 1-5 minutes for autonomous research
- The question requires comparing multiple perspectives
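The guide above can be condensed into a small selector. This is only an illustrative sketch of the rules as stated; the function name `choose_anakin_command` and its two arguments (number of known URLs, and `yes`/`no` for whether a synthesized report is wanted) are assumptions made for the example.

```shell
# Suggest a subcommand from the number of known URLs and the need for a report.
# Usage: choose_anakin_command <url_count> [yes|no]
choose_anakin_command() {
  url_count=$1
  need_report=${2:-no}
  if [ "$need_report" = "yes" ]; then
    echo "research"          # synthesized, multi-source report (1-5 minutes)
  elif [ "$url_count" -eq 0 ]; then
    echo "search"            # no URLs yet: find them first
  elif [ "$url_count" -eq 1 ]; then
    echo "scrape"
  else
    echo "scrape-batch"      # split into groups of 10 when url_count > 10
  fi
}
```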
Guardrails
URL Handling
- Always quote URLs to prevent shell interpretation of `?`, `&`, and `#` characters
- Example: `anakin scrape "https://example.com?param=value"`, not `anakin scrape https://example.com?param=value`
Output Management
- Always use `-o <file>` to save output to a file rather than flooding the terminal
- Choose appropriate output filenames based on content type
Format Selection
- Default to markdown for readability unless the user explicitly asks for JSON or raw
- Use `--format json` for structured data processing
- Use `--format raw` for the full API response with HTML
Special Cases
- Use `--browser` only when a standard scrape returns empty or incomplete content
- For batch scraping: Maximum 10 URLs per command; split larger lists
- For research: Always warn about 1-5 minute duration before starting
Rate Limiting
- On HTTP 429 errors (rate limit), wait before retrying
- Do not loop immediately on rate limit errors
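One common way to "wait before retrying" is exponential backoff. The sketch below assumes the CLI exits with a non-zero status when a request is rate limited, which this guide does not confirm; it retries on any failure and cannot tell a 429 from a 401, so check the error message before wrapping a command in it. The name `retry_with_backoff` is hypothetical.

```shell
# Retry a command, doubling the pause after each failed attempt.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For instance, `retry_with_backoff 4 5 anakin scrape "https://example.com" -o .anakin/page.md` would pause 5, 10, and then 20 seconds between its four attempts.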
Authentication
- On HTTP 401 errors, re-run `anakin login` rather than retrying the same command
Error Handling
| Error | Solution |
|---|---|
| HTTP 401 (Unauthorized) | Re-run `anakin login --api-key <your-key>` |
| HTTP 429 (Rate Limited) | Wait before retrying; do not loop immediately |
| Empty content | Try adding the `--browser` flag for JavaScript-heavy sites |
| Timeout | Increase with `--timeout <seconds>` for slow pages |
| Batch partial failure | Check the output JSON for individual statuses, then retry failed URLs with `--browser` |
| Research fails | Fall back to `search` plus multiple `scrape` calls manually |
Output Formats
Markdown (default for scrape)
- Clean, readable text stripped of navigation and ads
- Best for human reading and summarization
- File extension: `.md`
JSON (structured)
- Structured data with title, content, metadata
- Best for processing and parsing
- File extension: `.json`
Raw (full response)
- Full API response including HTML, links, images, metadata
- Best for debugging or accessing all available data
- File extension: `.json`
Examples
Example 1: Article extraction
CODEBLOCK11
Example 2: Product comparison
CODEBLOCK12
Example 3: Find and scrape
CODEBLOCK13
Example 4: Market research
CODEBLOCK14
Example 5: JavaScript-heavy site
CODEBLOCK15
Example 6: Geo-targeted content
CODEBLOCK16
Best Practices
1. Start simple: Try a basic scrape first and add flags only if needed
2. Be specific: Use clear, specific search queries and research topics
3. Quote URLs: Always wrap URLs in quotes
4. Save output: Always use the `-o` flag to save results to files
5. Check status: Run `anakin status` before starting work
6. Batch wisely: Group similar URLs together, max 10 per batch
7. Wait on rate limits: Don't retry immediately on 429 errors
8. Choose the right tool:
   - Single page → `scrape`
   - Multiple pages → `scrape-batch`
   - Don't have URLs → `search` first
   - Need deep analysis → `research`
Troubleshooting
Authentication issues
CODEBLOCK17
Empty or incomplete content
- Add the `--browser` flag for JavaScript-heavy sites
- Increase the timeout with `--timeout <seconds>`
- Check whether the site requires a specific geo-location with `--country <code>`
Rate limiting
- Wait before retrying (don't loop immediately)
- Consider spacing out requests for large batch operations
- Check your API plan limits at anakin.io/dashboard
Resources
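Anakin - Web Data Extraction
Convert websites into clean data at scale using the anakin-cli. Supports single URL scraping, batch scraping, AI-powered search, and autonomous deep research.
Installation & Authentication
Check status and authentication:
```bash
anakin status
```
Output when ready:
```
✓ Authenticated
Endpoint: https://api.anakin.io
Account: user@example.com
```
If not installed: `pip install anakin-cli`
Always refer to the installation rules in rules/install.md for more information if the user is not logged in.
If not authenticated, run:
```bash
anakin login --api-key "ak-your-key-here"
```
Get your API key from anakin.io/dashboard.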
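Organization
Create a `.anakin/` folder in the working directory to store results, unless it already exists. Add `.anakin/` to the `.gitignore` file if it is not already listed. Always use `-o` to write output directly to a file, which avoids flooding the context:
```bash
mkdir -p .anakin
echo ".anakin/" >> .gitignore
anakin scrape "https://example.com" -o .anakin/output.md
```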
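Capabilities
1. Scrape a Single URL
Extract content from a single web page in multiple formats.
When to use:
- Extracting content from a single web page
- Converting a webpage to clean markdown
- Extracting structured data from one URL
- Getting full raw API response with metadata
Basic usage:
```bash
# Clean, readable text (default markdown format)
anakin scrape "https://example.com" -o output.md

# Structured data (JSON)
anakin scrape "https://example.com" --format json -o output.json

# Full API response with HTML and metadata
anakin scrape "https://example.com" --format raw -o output.json
```
Advanced options:
```bash
# JavaScript-heavy or single-page-app sites
anakin scrape "https://example.com" --browser -o output.md

# Geo-targeted scraping (country code)
anakin scrape "https://example.com" --country gb -o output.md

# Custom timeout for slow pages (seconds)
anakin scrape "https://example.com" --timeout 300 -o output.md
```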
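2. Batch Scrape Multiple URLs
Scrape up to 10 URLs at once for efficient parallel processing.
When to use:
- Scraping multiple web pages simultaneously
- Comparing products across different sites
- Collecting multiple articles or pages
- Gathering data from several sources at once
Basic usage:
```bash
# Batch scrape multiple URLs (up to 10)
anakin scrape-batch "https://example.com/page1" "https://example.com/page2" "https://example.com/page3" -o batch-results.json
```
For large lists (>10 URLs):
```bash
# First batch (URLs 1-10)
anakin scrape-batch "https://url1.com" ... "https://url10.com" -o batch-1.json

# Second batch (URLs 11-20)
anakin scrape-batch "https://url11.com" ... "https://url20.com" -o batch-2.json
```
Output format: JSON file with combined results, each URL's status (success/failure), content, metadata, and any errors.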
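3. AI-Powered Web Search
Run intelligent web searches to find pages, answer questions, and discover sources.
When to use:
- Finding pages on a specific topic
- Answering questions with web sources
- Discovering relevant sources for research
- Gathering links before scraping specific pages
- Quick factual lookups
Basic usage:
```bash
# AI-powered web search
anakin search "your search query" -o search-results.json
```
Follow-up workflow:
```bash
# 1. Search for relevant pages
anakin search "machine learning tutorials" -o search-results.json

# 2. Scrape a specific result for its full content
anakin scrape "https://result-url-from-search.com" -o page.md
```
Output format: JSON file with search results including titles, URLs, snippets, relevance scores, and metadata.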
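4. Deep Agentic Research
Run comprehensive autonomous research that explores the web and returns detailed reports.
When to use:
- Comprehensive research on complex topics
- Market analysis requiring multiple sources
- Technical deep-dives across documentation and articles
- Comparison research (products, technologies, approaches)
- Questions requiring synthesis from many sources
Basic usage:
```bash
# Deep agentic research (takes 1-5 minutes)
anakin research "your research topic or question" -o research-report.json

# Extended timeout for complex topics
anakin research "comprehensive analysis of quantum computing" --timeout 600 -o research-report.json
```
⏱️ Important: Deep research takes 1-5 minutes and runs autonomously. Always inform the user about this duration before starting.
What it does:
- Autonomously searches for relevant sources
- Scrapes and analyzes multiple pages
- Synthesizes information across sources
- Generates comprehensive reports with citations
- Provides key insights and conclusions
Output format: JSON file with executive summary, detailed report by subtopics, key insights, citations with URLs, confidence scores, and related topics.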
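Decision Guide
Use anakin scrape when:
- You have a single specific URL to extract
- You need content in markdown, JSON, or raw format
- The page is static or JavaScript-heavy (use `--browser`)
Use anakin scrape-batch when:
- You have 2-10 URLs to scrape simultaneously
- You need efficient parallel processing
- You want combined results in one file
Use anakin search when:
- You need to find relevant URLs first
- You want quick factual lookups
- You need results in under 30 seconds
- You know what you're looking for
Use anakin research when:
- You need comprehensive analysis across 5+ sources
- The topic is complex and requires deep exploration
- You want a synthesized report with insights
- You can wait 1-5 minutes for autonomous research
- The question requires comparing multiple perspectives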
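Guardrails
URL Handling
- Always quote URLs to prevent shell interpretation of `?`, `&`, and `#` characters
- Example: `anakin scrape "https://example.com?param=value"`, not `anakin scrape https://example.com?param=value`
Output Management
- Always use `-o` to save output to a file rather than flooding the terminal
- Choose appropriate output filenames based on content type
Format Selection
- Default to markdown for readability unless the user explicitly asks for JSON or raw
- Use `--format json` for structured data processing
- Use `--format raw` for the full API response with HTML
Special Cases
- Use `--browser` only when a standard scrape returns empty or incomplete content
- For batch scraping: Maximum 10 URLs per command; split larger lists
- For research: Always warn about 1-5 minute duration before starting
Rate Limiting
- On HTTP 429 errors (rate limit), wait before retrying
- Do not loop immediately on rate limit errors
Authentication
- On HTTP 401 errors, re-run `anakin login` rather than retrying the same command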
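Error Handling
| Error | Solution |
|---|---|
| HTTP 401 (Unauthorized) | Re-run `anakin login --api-key <your-key>` |
| HTTP 429 (Rate Limited) | Wait before retrying; do not loop immediately |
| Empty content | Try adding the `--browser` flag for JavaScript-heavy sites |
| Timeout | Increase with `--timeout <seconds>` for slow pages |
| Batch partial failure | Check the output JSON for individual statuses, then retry failed URLs with `--browser` |
| Research fails | Fall back to `search` plus multiple `scrape` calls manually |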
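Output Formats
Markdown (default for scrape)
- Clean, readable text stripped of navigation and ads
- Best for human reading and summarization
- File extension: `.md`
JSON (structured)
- Structured data with title, content, metadata
- Best for processing and parsing
- File extension: `.json`
Raw (full response)
- Full API response including HTML, links, images, metadata
- Best for debugging or accessing all available data
- File extension: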