browser-scraper: Real-Browser Scraping

Scrape websites using a real Chrome browser with the user's Chrome profile — shares cookies, auth, and fingerprint to bypass bot detection (Cloudflare, Reddit, etc.). Use when scraping sites that block headless browsers or require login, or when asked to "open a browser and scrape", "take a screenshot of a page", "get data from a site that blocks bots", or "scrape with a specific Chrome profile".

Author: admin | Source: ClawHub

Browser Scraper

Scrapes web pages using Playwright with a real Chrome/Chromium binary and an existing user profile. Bypasses bot detection by sharing existing cookies, fingerprint, and session.

Profiles

The scraper supports multiple Chrome profiles:

- Default (no --profile flag): Uses the system's default Chrome profile
  - macOS: ~/Library/Application Support/Google/Chrome/Default
  - Linux: ~/.config/google-chrome/Default
  - Windows: %LOCALAPPDATA%\Google\Chrome\User Data\Default

- Named profile (--profile <name>): Uses profiles/<name>/ under the skill directory
  - Create a profile by launching Chrome with --profile-directory="Profile 1" or similar, then point the scraper at that folder
  - Useful for: isolating logins, avoiding conflicts with your main Chrome session, scraping without auth
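The profile rules above can be sketched as a small path resolver. This is only an illustration under stated assumptions: `resolveProfileDir` and its `opts` parameter are hypothetical names, not taken from scrape.mjs, and the skill directory is assumed to be the current working directory unless one is passed in.

```javascript
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: map an optional --profile name to a profile directory.
// `opts` only exists so the platform-specific branches can be exercised anywhere.
function resolveProfileDir(profileName, opts = {}) {
  const platform = opts.platform ?? process.platform;
  const home = opts.home ?? os.homedir();
  const skillDir = opts.skillDir ?? process.cwd();
  // Use the path flavour of the target platform so separators come out right.
  const p = platform === 'win32' ? path.win32 : path.posix;

  // Named profile: profiles/<name>/ under the skill directory.
  if (profileName) {
    return p.join(skillDir, 'profiles', profileName);
  }
  // Default profile: the per-platform locations listed above.
  if (platform === 'darwin') {
    return p.join(home, 'Library', 'Application Support', 'Google', 'Chrome', 'Default');
  }
  if (platform === 'win32') {
    const localAppData =
      opts.localAppData ?? process.env.LOCALAPPDATA ?? p.join(home, 'AppData', 'Local');
    return p.join(localAppData, 'Google', 'Chrome', 'User Data', 'Default');
  }
  return p.join(home, '.config', 'google-chrome', 'Default');
}
```

Calling `resolveProfileDir('work')` from the skill directory would point at profiles/work/, while calling it with no name falls through to the system default for the current platform.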

Script

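    # Default profile (system Chrome)
    node scripts/scrape.mjs <url> [css-selector]

    # Named profile (profiles/<name>/)
    node scripts/scrape.mjs <url> [css-selector] --profile <name>

    # Headless mode (faster, but more likely to be blocked)
    node scripts/scrape.mjs <url> --headless --profile <name>

    # Keep the browser open after scraping (for interactive use)
    node scripts/scrape.mjs <url> --profile <name> --keep-open

    # Extra wait for lazy-loaded content (default: 3000ms)
    node scripts/scrape.mjs <url> --profile <name> --wait 6000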

Run from the skill directory:

    cd ~/.openclaw-yekeen/workspace/skills/browser-scraper/
    node scripts/scrape.mjs https://www.reddit.com/
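For readers who want to see how the command-line flags fit together, here is a minimal sketch of an argument parser for them. It is not the actual scrape.mjs code: `parseArgs` and the shape of the returned object are assumptions made for illustration. Only the flag names and the 3000 ms default wait come from this page.

```javascript
// Hypothetical parser for the documented flags; the real script may differ.
function parseArgs(argv) {
  const opts = {
    url: null,
    selector: null,
    profile: null,
    headless: false,
    keepOpen: false,
    wait: 3000, // documented default for lazy-loaded content
  };
  const positional = [];

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--headless') {
      opts.headless = true;
    } else if (arg === '--keep-open') {
      opts.keepOpen = true;
    } else if (arg === '--profile') {
      opts.profile = argv[++i] ?? null;
    } else if (arg === '--wait') {
      const ms = Number(argv[++i]);
      if (!Number.isFinite(ms) || ms < 0) {
        throw new Error('--wait expects a non-negative number of milliseconds');
      }
      opts.wait = ms;
    } else {
      positional.push(arg);
    }
  }

  // First positional argument is the URL, the optional second one a CSS selector.
  opts.url = positional[0] ?? null;
  opts.selector = positional[1] ?? null;
  if (!opts.url) {
    throw new Error('usage: node scripts/scrape.mjs <url> [css-selector] [flags]');
  }
  return opts;
}
```

In a script this would typically be called as `parseArgs(process.argv.slice(2))`.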

Output

- JSON to stdout: matched elements or page preview
- Screenshot saved to /tmp/browser-scraper-last.png
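The output step might look roughly like the sketch below. The JSON field names (`url`, `title`, `matches`, `preview`, `screenshot`), the 500-character preview length, and the function name `collectOutput` are all assumptions, since this page does not document the exact output shape. `page` is expected to be a Playwright Page.

```javascript
// Screenshot location documented above.
const SCREENSHOT_PATH = '/tmp/browser-scraper-last.png';

// Hypothetical output step: returns matched elements when a selector is given,
// otherwise a short preview of the page text.
async function collectOutput(page, selector) {
  await page.screenshot({ path: SCREENSHOT_PATH });

  const result = {
    url: page.url(),
    title: await page.title(),
    screenshot: SCREENSHOT_PATH,
  };

  if (selector) {
    // Text content of every element matching the CSS selector.
    result.matches = await page.$$eval(selector, (els) =>
      els.map((el) => el.textContent.trim())
    );
  } else {
    // Assumed preview: the first 500 characters of the visible body text.
    result.preview = await page.$eval('body', (body) => body.innerText.slice(0, 500));
  }
  return result;
}
```

Printing the result with `console.log(JSON.stringify(result, null, 2))` would send the JSON to stdout.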

Key Design

- channel: 'chrome': launches real Chrome when available, falls back to system Chromium
- launchPersistentContext with the profile directory
- --disable-blink-features=AutomationControlled + navigator.webdriver patch
- headless: false by default to avoid SingletonLock conflicts
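The design points above could be combined roughly as follows. This is a sketch under stated assumptions, not the script's actual code: `buildLaunchOptions` and `launchWithProfile` are hypothetical names, and the fallback shown here retries with the Chromium build that Playwright manages, which may differ from how the script locates a system Chromium.

```javascript
// Builds the launch options named in the design list.
function buildLaunchOptions({ headless = false, useChrome = true } = {}) {
  const options = {
    headless,
    args: ['--disable-blink-features=AutomationControlled'],
  };
  // channel: 'chrome' asks Playwright to drive the installed Google Chrome.
  if (useChrome) options.channel = 'chrome';
  return options;
}

async function launchWithProfile(profileDir, { headless = false } = {}) {
  // Loaded lazily so the option builder works without Playwright installed.
  const { chromium } = require('playwright');

  let context;
  try {
    context = await chromium.launchPersistentContext(
      profileDir,
      buildLaunchOptions({ headless })
    );
  } catch (err) {
    // Assumed fallback: retry without the Chrome channel. A real script would
    // likely inspect `err` first, since a locked profile also fails here.
    context = await chromium.launchPersistentContext(
      profileDir,
      buildLaunchOptions({ headless, useChrome: false })
    );
  }

  // Runs before any page script in every page of this context.
  await context.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });
  return context;
}
```

Keeping the option builder separate from the launch call makes it easy to check which flags are sent to the browser without opening one.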

Requirements

- Playwright installed: npm install playwright
- Chrome or Chromium installed on the system
- On macOS/Linux: the channel: 'chrome' option requires Chrome (not Chromium) to be installed

Tips

- Chrome must not already be open with the target profile (SingletonLock error). Close Chrome first, or use a named profile to avoid conflicts.
- If you get a SingletonLock error with a named profile, delete the SingletonLock file in that profile directory and try again.
- Use --keep-open to leave the browser open for interactive use after scraping. Press Ctrl+C to close.
- For sites with lazy-loaded content: use the --wait <ms> flag or modify the script to increase waitForTimeout
- For Reddit: use selector shreddit-post and read attributes (post-title, author, score, permalink)
- To create a fresh isolated profile: run Chrome from the terminal with --profile-directory="Profile X" and log in, then point the scraper at that directory
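Two of the tips above lend themselves to short sketches: clearing a stale SingletonLock and reading Reddit post attributes. Both function names are hypothetical, and the Reddit helper assumes `page` is a Playwright Page that has already loaded a Reddit listing. Only the selector and attribute names come from the tip itself.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Removes a stale SingletonLock from a profile directory.
// Returns true if a lock was removed, false if there was none.
// Only safe when Chrome is not running with that profile.
function clearSingletonLock(profileDir) {
  const lockPath = path.join(profileDir, 'SingletonLock');
  try {
    // unlinkSync also removes a dangling symlink, which existsSync would miss.
    fs.unlinkSync(lockPath);
    return true;
  } catch (err) {
    if (err.code === 'ENOENT') return false;
    throw err;
  }
}

// Reads the documented attributes from every shreddit-post element.
async function readRedditPosts(page) {
  const attrs = ['post-title', 'author', 'score', 'permalink'];
  return page.$$eval(
    'shreddit-post',
    (posts, names) =>
      posts.map((post) =>
        Object.fromEntries(names.map((name) => [name, post.getAttribute(name)]))
      ),
    attrs
  );
}
```

The attribute values come back as strings (or null when an attribute is missing), so a numeric score would need converting with `Number(...)` before sorting or filtering.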

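Browser Scraper

Scrapes web pages using Playwright with a real Chrome/Chromium binary and an existing user profile. Bypasses bot detection by sharing existing cookies, fingerprint, and session.

Profiles

The scraper supports multiple Chrome profiles:

- Default (no --profile flag): Uses the system's default Chrome profile
  - macOS: ~/Library/Application Support/Google/Chrome/Default
  - Linux: ~/.config/google-chrome/Default
  - Windows: %LOCALAPPDATA%\Google\Chrome\User Data\Default

- Named profile (--profile <name>): Uses the profiles/<name>/ folder under the skill directory
  - Create a profile by launching Chrome with --profile-directory="Profile 1" or similar, then point the scraper at that folder
  - Useful for: isolating logins, avoiding conflicts with your main Chrome session, scraping without auth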
Script

    # Default profile (system Chrome)
    node scripts/scrape.mjs <url> [css-selector]

    # Named profile (profiles/<name>/)
    node scripts/scrape.mjs <url> [css-selector] --profile <name>

    # Headless mode (faster, but more likely to be blocked)
    node scripts/scrape.mjs <url> --headless --profile <name>

    # Keep the browser open after scraping (for interactive use)
    node scripts/scrape.mjs <url> --profile <name> --keep-open

    # Extra wait for lazy-loaded content (default: 3000ms)
    node scripts/scrape.mjs <url> --profile <name> --wait 6000

Run from the skill directory:

    cd ~/.openclaw-yekeen/workspace/skills/browser-scraper/
    node scripts/scrape.mjs https://www.reddit.com/

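Output

- JSON to stdout: matched elements or page preview
- Screenshot saved to /tmp/browser-scraper-last.png

Key Design

- channel: 'chrome': launches real Chrome when available, falls back to system Chromium
- launchPersistentContext with the profile directory
- --disable-blink-features=AutomationControlled + navigator.webdriver patch
- headless: false by default to avoid SingletonLock conflicts

Requirements

- Playwright installed: npm install playwright
- Chrome or Chromium installed on the system
- On macOS/Linux: the channel: 'chrome' option requires Chrome (not Chromium) to be installed

Tips

- Chrome must not already be open with the target profile (SingletonLock error). Close Chrome first, or use a named profile to avoid conflicts.
- If you get a SingletonLock error with a named profile, delete the SingletonLock file in that profile directory and try again.
- Use --keep-open to leave the browser open for interactive use after scraping. Press Ctrl+C to close.
- For sites with lazy-loaded content: use the --wait <ms> flag or modify the script to increase waitForTimeout
- For Reddit: use selector shreddit-post and read attributes (post-title, author, score, permalink)
- To create a fresh isolated profile: run Chrome from the terminal with --profile-directory="Profile X" and log in, then point the scraper at that directory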
