Web Search Scraper API Skill
📖 Introduction
This skill provides users with a one-stop web page extraction service through the BrowserAct Web Search Scraper API template. It can directly extract structured markdown content from any given URL. By simply inputting the target URL, you can get clean and usable markdown data.
✨ Features
1. No hallucinations, so extraction stays stable and precise: the pre-set workflow avoids the errors of generative AI.
2. No CAPTCHA hurdles: there is no need to handle reCAPTCHA or other human-verification challenges.
3. No IP access restrictions or geofencing: there is no need to work around regional IP limits.
4. Faster execution: tasks run faster than with purely AI-driven browser automation.
5. High cost-effectiveness: compared with AI solutions that consume large numbers of tokens, it can significantly reduce the cost of data acquisition.
🔑 API Key Guidance Process
Before running, the Agent must check the BROWSERACT_API_KEY environment variable. If it is not set, take no other action first; ask the user for the key and wait until they provide it.
In that case, the Agent must tell the user:
"Since you have not configured the BrowserAct API Key, please go to the BrowserAct Console first to get your Key."
🛠️ Input Parameters Details
The Agent should set the following parameter according to the user's needs when calling the script:
1. target_url
   - Type: string
   - Description: The URL of the web page to extract content from. Any HTTP/HTTPS URL is supported.
   - Example: https://www.browseract.com
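Since target_url must be an HTTP/HTTPS URL, an Agent may want to validate the value before calling the script. The following is a minimal sketch using only the Python standard library; the helper name is_valid_target_url is hypothetical and is not part of the skill's script.

```python
from urllib.parse import urlparse


def is_valid_target_url(target_url):
    """Return True if target_url is an absolute HTTP/HTTPS URL with a host.

    Hypothetical helper for illustration; the skill's own script may
    validate its input differently.
    """
    if not isinstance(target_url, str):
        return False
    try:
        parsed = urlparse(target_url.strip())
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # such as an unterminated IPv6 host.
        return False
    # urlparse lowercases the scheme, so "HTTPS://..." is accepted too.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A value such as www.browseract.com (no scheme) is rejected by this check, so the Agent can ask the user for the full URL instead of starting a task that is likely to fail.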
🚀 Invocation Method (Recommended)
The Agent should run the following standalone script so that a single command returns the result:
    # Example invocation
    python -u ./scripts/web_search_scraper_api.py "target_url"
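The API key check and the invocation can be combined into one pre-flight step. The sketch below is illustrative only: it assumes the script lives at ./scripts/web_search_scraper_api.py and takes the URL as its single positional argument, and the function name build_command is hypothetical.

```python
import os

API_KEY_ENV = "BROWSERACT_API_KEY"
# Assumed script location; adjust if the skill is installed elsewhere.
SCRIPT_PATH = "./scripts/web_search_scraper_api.py"
MISSING_KEY_MESSAGE = (
    "Since you have not configured the BrowserAct API Key, "
    "please go to the BrowserAct Console first to get your Key."
)


def build_command(target_url, env=None):
    """Return the argv list for the scraper script.

    Raises RuntimeError carrying the user-facing message when the API key
    is missing, so the caller stops and asks the user instead of running.
    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    if not env.get(API_KEY_ENV, "").strip():
        raise RuntimeError(MISSING_KEY_MESSAGE)
    # -u keeps stdout unbuffered so status logs appear as they are produced.
    return ["python", "-u", SCRIPT_PATH, target_url]
```

Returning an argument list rather than a shell string avoids quoting problems when the URL contains characters such as & or ?.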
⏳ Execution Status Monitoring
Because the task involves automated browser operations, it may take a long time (several minutes). While running, the script continuously outputs timestamped status logs (e.g., [14:30:05] Task Status: running).
Notice for Agent:
- While waiting for the script to return results, keep watching the terminal output.
- As long as the terminal is still printing new status logs, the task is running normally. Do not mistake this for a deadlock or an unresponsive process.
- If the status stays unchanged for a long time, or the script stops producing output without returning a result, the retry mechanism can be triggered.
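One way to apply the monitoring rules above programmatically is to stream the script's output and treat a long silence as a stall. The sketch below is an assumption-laden illustration: the function name run_with_stall_detection and the 300-second default are hypothetical, since the skill does not define how long "a long time" is.

```python
import queue
import subprocess
import threading


def run_with_stall_detection(cmd, stall_timeout=300.0):
    """Run cmd, echo each output line as it arrives, and kill the process
    if no new line appears within stall_timeout seconds.

    Returns (lines, stalled): the captured output lines and whether the
    run was stopped because of a stall.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so error text is captured too
        text=True,
        bufsize=1,
    )
    line_queue = queue.Queue()

    def pump():
        # Reader thread: forward lines to the queue, then signal end of stream.
        for line in proc.stdout:
            line_queue.put(line.rstrip("\n"))
        line_queue.put(None)

    threading.Thread(target=pump, daemon=True).start()

    lines = []
    while True:
        try:
            item = line_queue.get(timeout=stall_timeout)
        except queue.Empty:
            # No output for stall_timeout seconds: treat as stalled.
            proc.kill()
            proc.wait()
            return lines, True
        if item is None:
            proc.wait()
            return lines, False
        print(item)
        lines.append(item)
```

Each new log line resets the timer, so a task that keeps reporting its status is never interrupted, however long it runs in total.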
📊 Data Output Description
Upon successful execution, the script parses the API response and prints the result. The result contains:
- content: The complete markdown content of the webpage.
⚠️ Error Handling & Retry Mechanism
If an error occurs while the script is running (such as a network fluctuation or a task failure), the Agent should follow this logic:
1. Check the output content:
   - If the output contains "Invalid authorization", the API Key is invalid or expired. Do not retry; guide the user to recheck and provide the correct API Key.
   - If the output does not contain "Invalid authorization" but the task fails (for example, the output starts with Error: or the returned result is empty), the Agent should automatically re-execute the script once.
2. Retry limit:
   - Automatic retry is limited to one attempt. If the second attempt also fails, stop retrying and report the specific error message to the user.
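The decision logic above can be sketched as a small wrapper. This is an illustration only: classify_output and run_with_retry are hypothetical names, and run_once stands for any callable that executes the script and returns its result text as a string.

```python
def classify_output(output):
    """Classify the script's result text as 'auth_error', 'failed', or 'ok'."""
    text = output.strip()
    if "Invalid authorization" in text:
        return "auth_error"  # invalid or expired API key: never retry
    if not text or text.startswith("Error:"):
        return "failed"  # empty result or explicit error: eligible for retry
    return "ok"


def run_with_retry(run_once, max_retries=1):
    """Call run_once() and retry at most max_retries times on a non-auth failure.

    Returns (status, output, attempts).
    """
    attempts = 0
    while True:
        output = run_once()
        attempts += 1
        status = classify_output(output)
        # Stop on success, on an auth error, or once the retry budget is spent.
        if status != "failed" or attempts > max_retries:
            return status, output, attempts
```

With the default max_retries=1 the script runs at most twice. A final status of 'failed' means the error message should be reported to the user, and 'auth_error' means the user should be asked to recheck the API Key.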
🌟 Typical Use Cases
1. Article Extraction: Scrape the main content of a news article link into markdown.
2. Blog Post Parsing: Download the readable text from a target blog post URL.
3. Webpage to Markdown: Convert any given website URL into clean markdown format.
4. Documentation Scraping: Fetch the contents of a tutorial or documentation page for offline reading.
5. Content Monitoring: Automatically extract the text of a specific webpage to check for updates.
6. Data Processing: Parse the HTML of an arbitrary HTTP/HTTPS URL to structure its content.