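[Tutorial] Build an AI-Powered Web Scraper with Python + Playwright to Extract Structured Data Automatically
This tutorial borrows ideas from the popular GitHub projects AgentQL and Skyvern. It shows how to combine Python and Playwright into a scraper that asks an LLM to interpret a page's structure and decide how to extract the data. Instead of hand-writing tedious XPath expressions, you let the model locate the elements.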
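I. Prerequisites
Before you start, make sure the following are installed and available:
- Python 3.9+
- The pip package manager
- An LLM API key (OpenAI, Claude, Kimi, and others all work, provided the endpoint is OpenAI-compatible, because the code below uses the OpenAI SDK; the model must also accept image input)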
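II. Core idea
A traditional scraper requires you to analyze the HTML by hand and write XPath or CSS selectors. An AI-assisted scraper works like this instead:
1. Playwright loads the target page and takes a screenshot
2. The page HTML and the screenshot are sent to the LLM
3. The LLM analyzes the page structure and returns an extraction strategy
4. The strategy is executed against the page, producing structured output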
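Advantages of this approach:
- It adapts to changes in page structure: rerunning the analysis regenerates the selectors, so there is no hand-maintained selector list
- It can interpret complex layouts (tables, cards, masonry/waterfall grids)
- It can be extended to multi-page pagination and form filling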
III. Step by step
Step 1: Install dependencies
    pip install playwright openai python-dotenv
    playwright install chromium
Step 2: Create the configuration file .env
    OPENAI_API_KEY=your_api_key_here
    # Or any other OpenAI-compatible endpoint
    OPENAI_BASE_URL=https://api.openai.com/v1
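A missing or placeholder key otherwise only shows up later as an authentication error from the API, so it can help to check the configuration at startup. The following is a minimal sketch that assumes the two variable names from the .env file above; `load_settings` and `DEFAULT_BASE_URL` are hypothetical names introduced for illustration, not part of any library.

```python
import os

# Assumed default, matching the value used in the .env example.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def load_settings(env=None):
    """Return (api_key, base_url), raising early if the key is missing.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    api_key = (env.get("OPENAI_API_KEY") or "").strip()
    # Treat the tutorial's placeholder value as "not configured".
    if not api_key or api_key == "your_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
    base_url = (env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL).strip().rstrip("/")
    return api_key, base_url
```

Call `load_settings()` after `load_dotenv()` and pass the two values to the `OpenAI(...)` constructor.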
Step 3: Write the core scraper code
Create ai_scraper.py:
    import os
    import json
    import base64

    from dotenv import load_dotenv
    from openai import OpenAI
    from playwright.sync_api import sync_playwright

    load_dotenv()


    class AIScraper:
        def __init__(self):
            self.client = OpenAI(
                api_key=os.getenv("OPENAI_API_KEY"),
                base_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
            )

        def capture_page(self, url):
            """Load the page with Playwright; return its HTML and a base64 JPEG screenshot."""
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page(viewport={"width": 1920, "height": 1080})
                page.goto(url, wait_until="networkidle")

                # Rendered HTML of the page
                html = page.content()

                # Screenshot of the viewport, base64-encoded
                screenshot = page.screenshot(type="jpeg", quality=80)
                screenshot_b64 = base64.b64encode(screenshot).decode()

                browser.close()
                return html, screenshot_b64

        def analyze_page(self, html, screenshot_b64, instruction):
            """Ask the LLM to analyze the page and return an extraction strategy."""
            # The field layout is spelled out so the reply matches what
            # _execute_strategy reads ("item"/"row" selectors, field objects).
            prompt = f"""
    You are a web data extraction expert. Analyze the page below and plan how to extract the requested data.
    Extraction request: {instruction}
    Page HTML (first 5000 characters):
    {html[:5000]}
    Return a JSON extraction strategy containing:
    1. "data_type": one of "list", "table", "detail"
    2. "selectors": an object; use the key "item" for the CSS selector of each list item, or "row" for each table row
    3. "fields": a list of objects with "name", "selector" (CSS, relative to the item) and "attribute" ("textContent", "href", ...)
    4. "next_page": optional object with a "selector" for the next-page link
    Return only the JSON, with no other explanation.
    """

            response = self.client.chat.completions.create(
                model="gpt-4o-mini",  # or any other available vision-capable model
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{screenshot_b64}"
                                },
                            },
                        ],
                    }
                ],
                response_format={"type": "json_object"},
            )

            return json.loads(response.choices[0].message.content)

        def extract_data(self, url, instruction):
            """Main flow: capture -> AI analysis -> extraction."""
            print(f"Loading page: {url}")
            html, screenshot = self.capture_page(url)

            print("Analyzing page structure...")
            strategy = self.analyze_page(html, screenshot, instruction)

            print(f"Extraction strategy: {json.dumps(strategy, ensure_ascii=False, indent=2)}")

            # Reload the page and extract data according to the strategy
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                page = browser.new_page()
                page.goto(url, wait_until="networkidle")

                data = self._execute_strategy(page, strategy)
                browser.close()
                return data

        def _execute_strategy(self, page, strategy):
            """Extract data from the page according to the strategy returned by the AI."""
            data_type = strategy.get("data_type", "list")
            # "or" guards against the model returning null for these keys
            selectors = strategy.get("selectors") or {}
            fields = strategy.get("fields") or []

            results = []

            if data_type == "list":
                items = page.query_selector_all(selectors.get("item", "body"))
                for item in items[:20]:  # cap the number of items
                    row = {}
                    for field in fields:
                        name = field["name"]
                        selector = field.get("selector", "")
                        attr = field.get("attribute", "textContent")

                        el = item.query_selector(selector) if selector else item
                        if el is None:
                            row[name] = None
                        elif attr == "textContent":
                            # text_content() can return None, so fall back to ""
                            row[name] = (el.text_content() or "").strip()
                        else:
                            row[name] = el.get_attribute(attr)
                    results.append(row)

            elif data_type == "table":
                rows = page.query_selector_all(selectors.get("row", "tr"))
                for row_el in rows[1:]:  # skip the header row
                    cells = row_el.query_selector_all("td")
                    row = {}
                    for i, field in enumerate(fields):
                        if i < len(cells):
                            row[field["name"]] = (cells[i].text_content() or "").strip()
                    results.append(row)

            # Note: "detail" pages are not handled here and yield an empty list.
            return results


    # Usage example
    if __name__ == "__main__":
        scraper = AIScraper()

        # Example: extract a news list
        url = "https://news.ycombinator.com"
        instruction = "Extract the title, link and score of every story on the front page"

        data = scraper.extract_data(url, instruction)
        print(json.dumps(data, ensure_ascii=False, indent=2))
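Model output is not guaranteed to match the shape that `_execute_strategy` reads: fields may come back as bare strings, `selectors` as a single string, and some OpenAI-compatible providers wrap the JSON in a Markdown code fence. The sketch below shows one way to tolerate that before executing a strategy. `parse_json_reply` and `normalize_strategy` are hypothetical helpers introduced for illustration, and the fallback choices (default to "list", default attribute "textContent") are assumptions rather than part of the original design.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(text):
    """Parse a model reply as JSON, tolerating a surrounding Markdown fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with optional "json" tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)


def normalize_strategy(raw):
    """Coerce a loosely shaped strategy into the shape the extractor expects."""
    if not isinstance(raw, dict):
        raise ValueError("strategy must be a JSON object")

    data_type = raw.get("data_type")
    if data_type not in ("list", "table", "detail"):
        data_type = "list"  # assumed fallback

    selectors = raw.get("selectors")
    if isinstance(selectors, str):
        selectors = {"item": selectors}
    elif not isinstance(selectors, dict):
        selectors = {}

    raw_fields = raw.get("fields")
    if not isinstance(raw_fields, list):
        raw_fields = []
    fields = []
    for field in raw_fields:
        if isinstance(field, str):
            field = {"name": field}
        if isinstance(field, dict) and field.get("name"):
            fields.append({
                "name": field["name"],
                "selector": field.get("selector") or "",
                "attribute": field.get("attribute") or "textContent",
            })

    next_page = raw.get("next_page")
    if not isinstance(next_page, dict):
        next_page = {}

    return {
        "data_type": data_type,
        "selectors": selectors,
        "fields": fields,
        "next_page": next_page,
    }
```

One way to use it is to have `analyze_page` return `normalize_strategy(parse_json_reply(response.choices[0].message.content))` instead of calling `json.loads` directly.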
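Step 4: Run a test
From the directory that contains .env and ai_scraper.py, run:
    python ai_scraper.py
The script prints the strategy returned by the model, followed by the extracted records as JSON.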
IV. Advanced: automatic pagination
To collect data across multiple pages, add a pagination method to the AIScraper class:
        def extract_with_pagination(self, start_url, instruction, max_pages=5):
            """Collect data across pages by following the next-page link."""
            from urllib.parse import urljoin  # or move this to the imports at the top of the file

            all_data = []
            current_url = start_url
            page_count = 0

            while current_url and page_count < max_pages:
                print(f"Scraping page {page_count + 1}...")

                html, screenshot = self.capture_page(current_url)
                strategy = self.analyze_page(html, screenshot, instruction)

                with sync_playwright() as p:
                    browser = p.chromium.launch(headless=True)
                    page = browser.new_page()
                    page.goto(current_url, wait_until="networkidle")

                    all_data.extend(self._execute_strategy(page, strategy))

                    # Look for the next-page link; "next_page" may be missing or null
                    next_page = strategy.get("next_page")
                    next_selector = next_page.get("selector") if isinstance(next_page, dict) else None
                    next_el = page.query_selector(next_selector) if next_selector else None
                    href = next_el.get_attribute("href") if next_el else None
                    # Resolve relative links against the page that was actually loaded
                    current_url = urljoin(page.url, href) if href else None

                    browser.close()

                page_count += 1

            return all_data
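Two details of the pagination loop are easy to get wrong: resolving a relative `href` against the right base URL, and avoiding a loop when a "next" link points back to a page already visited. The following is a small sketch of both using only the standard library. `resolve_next_url` is a hypothetical helper, and tracking visited URLs is an addition that the method above does not do.

```python
from urllib.parse import urldefrag, urljoin


def resolve_next_url(current_url, href, visited):
    """Return the absolute next-page URL, or None if missing or already seen.

    `visited` is a set of absolute URLs that this function updates in place.
    """
    if not href:
        return None
    # urljoin handles absolute, root-relative and query-only links alike;
    # urldefrag drops any "#fragment" so the same page is not counted twice.
    absolute, _fragment = urldefrag(urljoin(current_url, href))
    if absolute in visited:
        return None
    visited.add(absolute)
    return absolute
```

Inside the loop this would replace the direct `urljoin` call, with `visited` initialized to `{start_url}` before the loop starts.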
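V. FAQ
Q1: Is the LLM API expensive?
With gpt-4o-mini or a Chinese-market model (such as Kimi or Tongyi Qianwen), each analysis costs roughly ¥0.01 to ¥0.05, which is economical for small-scale scraping. The actual figure depends on how many tokens the HTML and screenshot consume and on your provider's pricing.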
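To sanity-check the cost for your own setup, you can multiply token counts by your provider's per-token prices. The sketch below does only that arithmetic. The token counts and prices in the usage note are placeholders chosen for illustration, not real pricing for any model, and `estimate_call_cost` and `estimate_run_cost` are hypothetical names.

```python
def estimate_call_cost(input_tokens, output_tokens,
                       input_price_per_million, output_price_per_million):
    """Cost of one LLM call, in the same currency as the prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


def estimate_run_cost(pages, input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Cost of a run that makes one analysis call per page."""
    return pages * estimate_call_cost(
        input_tokens, output_tokens,
        input_price_per_million, output_price_per_million,
    )
```

For example, with placeholder values of 4000 input tokens, 500 output tokens, and prices of 1.0 and 4.0 per million tokens, `estimate_call_cost(4000, 500, 1.0, 4.0)` gives 0.006 per call. Read the real token counts from `response.usage` and the real prices from your provider's pricing page.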
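Q2: What if the site has anti-scraping measures?
- Use the playwright-stealth plugin to mask automation fingerprints
- Leave a reasonable interval between requests, for example time.sleep(random.uniform(2, 5))
- Rotate proxy IPs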
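The last two points can be sketched with the standard library alone. This is an illustrative outline, not a complete anti-detection setup: `jittered_delay` and `proxy_cycle` are hypothetical helpers, and the proxy addresses in the usage note are placeholders. The `{"server": ...}` dictionary is the form that Playwright's `launch(proxy=...)` option accepts.

```python
import itertools
import random


def jittered_delay(min_s=2.0, max_s=5.0, rng=random):
    """Pick a random pause length in seconds between min_s and max_s."""
    if min_s > max_s:
        raise ValueError("min_s must not exceed max_s")
    return rng.uniform(min_s, max_s)


def proxy_cycle(proxy_urls):
    """Return an endless round-robin iterator of Playwright-style proxy dicts."""
    urls = list(proxy_urls)
    if not urls:
        raise ValueError("proxy_urls must not be empty")
    return ({"server": url} for url in itertools.cycle(urls))
```

In the scraper this could look like `proxies = proxy_cycle(["http://proxy-a.example:8080", "http://proxy-b.example:8080"])`, then `p.chromium.launch(headless=True, proxy=next(proxies))` for each page and `time.sleep(jittered_delay())` between pages.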
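Q3: What if the extracted results are inaccurate?
- Describe the fields in more detail in the prompt
- Increase the screenshot resolution so the LLM can see the layout clearly
- Split complex pages into chunks and process them separately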
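For the chunking point, note that `analyze_page` only sends the first 5000 characters of HTML, which on many pages is mostly `<head>` content and scripts. One possible approach is to strip script and style blocks first and then split the remainder into overlapping chunks. The sketch below is a rough illustration: the regex-based stripping is deliberately crude (a real HTML parser would be more robust), and `strip_noise` and `chunk_text` are hypothetical helpers.

```python
import re

# Matches <script>...</script> and <style>...</style> blocks, case-insensitively.
_NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def strip_noise(html):
    """Remove script/style blocks and collapse runs of whitespace."""
    without_blocks = _NOISE.sub("", html)
    return re.sub(r"\s+", " ", without_blocks).strip()


def chunk_text(text, size=5000, overlap=200):
    """Split text into chunks of at most `size` chars, each sharing `overlap` chars with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end
        start += size - overlap
    return chunks
```

Each chunk can then be sent through `analyze_page` in turn, keeping the strategy from the chunk that actually contains the target elements.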
Q4: Does it support JavaScript-rendered pages?
Yes. Playwright drives a full browser, so the page's JavaScript runs just as it would for a normal visitor.
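VI. Summary
In this tutorial you learned how to:
- Load a page and take a screenshot with Playwright
- Call an LLM to analyze the page structure
- Generate an extraction strategy automatically and execute it
- Collect data across pages with automatic pagination
Combining AI with a scraper can substantially reduce maintenance effort. When the target site is redesigned, you rerun the analysis step instead of updating selectors by hand.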
Inspiration: the popular GitHub projects AgentQL (1.4k+ stars) and Skyvern (22k+ stars), both AI-driven, next-generation web automation tools.
Related resources:
- AgentQL: https://github.com/tinyfish-io/agentql
- Skyvern: https://github.com/Skyvern-AI/skyvern
- Playwright documentation: https://playwright.dev/python/
If you have questions, feel free to discuss them in the replies below!