Browser Scraper
Scrapes web pages using Playwright with a real Chrome/Chromium binary and an existing user profile. Bypasses bot detection by sharing existing cookies, fingerprint, and session.
Profiles
The scraper supports multiple Chrome profiles:
- Default (no --profile flag): Uses the system's default Chrome profile
  - macOS: ~/Library/Application Support/Google/Chrome/Default
  - Linux: ~/.config/google-chrome/Default
  - Windows: %LOCALAPPDATA%\Google\Chrome\User Data\Default
- Named profile (--profile <name>): Uses profiles/<name>/ under the skill directory
  - Create a profile by launching Chrome with --profile-directory="Profile 1" or similar, then point the scraper at that folder
  - Useful for: isolating logins, avoiding conflicts with your main Chrome session, scraping without auth
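
The profile rules above can be sketched as a small lookup. This is only an illustration: the function name resolveProfileDir and its parameters (skillDir, home, localAppData) are assumptions, not names taken from scrape.mjs, and the real script may resolve paths differently.

```javascript
// Hypothetical helper: maps the --profile value to a profile directory.
// All inputs are passed in explicitly so the function has no side effects.
function resolveProfileDir({ profile, platform, home, localAppData, skillDir }) {
  if (profile) {
    // Named profile: profiles/<name>/ under the skill directory.
    return `${skillDir}/profiles/${profile}`;
  }
  switch (platform) {
    case 'darwin':
      return `${home}/Library/Application Support/Google/Chrome/Default`;
    case 'win32':
      return `${localAppData}\\Google\\Chrome\\User Data\\Default`;
    default:
      // Treated as Linux here; other platforms are not covered by this document.
      return `${home}/.config/google-chrome/Default`;
  }
}

// Example: resolve the default profile for the current machine.
console.log(
  resolveProfileDir({
    profile: undefined,
    platform: process.platform,
    home: process.env.HOME,
    localAppData: process.env.LOCALAPPDATA,
    skillDir: process.cwd(),
  })
);
```

Passing platform and home as arguments, rather than reading them inside the function, keeps the mapping easy to check on any machine.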
Script
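```bash
# Default profile (system Chrome)
node scripts/scrape.mjs <url> [css-selector]

# Named profile (profiles/<name>/)
node scripts/scrape.mjs <url> [css-selector] --profile <name>

# Headless mode (faster, but higher risk of being blocked)
node scripts/scrape.mjs <url> --headless --profile <name>

# Keep the browser open after scraping (for interactive use)
node scripts/scrape.mjs <url> --profile <name> --keep-open

# Extra wait for lazy-loaded content (default: 3000ms)
node scripts/scrape.mjs <url> --profile <name> --wait 6000
```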
Run from the skill directory:
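```bash
cd ~/.openclaw-yekeen/workspace/skills/browser-scraper/
node scripts/scrape.mjs https://www.reddit.com/
```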
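
For readers who want to see how the command-line flags might be interpreted, here is a minimal sketch of an argument parser. The function name parseArgs and the option object it returns are hypothetical; only the flag names (--profile, --headless, --keep-open, --wait) and the 3000 ms default come from this document.

```javascript
// Hypothetical parser for the documented flags; the real scrape.mjs may differ.
function parseArgs(argv) {
  const opts = {
    url: undefined,
    selector: undefined,
    profile: undefined,
    headless: false,
    keepOpen: false,
    wait: 3000, // documented default, in milliseconds
  };
  const positional = [];
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--headless') opts.headless = true;
    else if (arg === '--keep-open') opts.keepOpen = true;
    else if (arg === '--profile') opts.profile = argv[++i];
    else if (arg === '--wait') opts.wait = Number(argv[++i]);
    else positional.push(arg);
  }
  // First positional argument is the URL, the optional second is the CSS selector.
  [opts.url, opts.selector] = positional;
  return opts;
}

console.log(parseArgs(process.argv.slice(2)));
```

Flags that take a value consume the next argument, so they can appear before or after the URL and selector.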
Output
- JSON to stdout: matched elements or page preview
- Screenshot saved to /tmp/browser-scraper-last.png
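
The exact JSON shape is not documented here, so the following is only a guess at how the two output modes might be assembled. The function name buildOutput, the field names (count, elements, preview), and the 500-character preview length are all assumptions made for illustration.

```javascript
// Hypothetical output builder; field names are illustrative, not taken from scrape.mjs.
function buildOutput({ url, selector, matches, bodyText }) {
  if (selector) {
    // A selector was given: report the matched elements.
    return { url, selector, count: matches.length, elements: matches };
  }
  // No selector: fall back to a short preview of the page text.
  return { url, preview: bodyText.slice(0, 500) };
}

const result = buildOutput({
  url: 'https://example.com',
  selector: 'h1',
  matches: [{ text: 'Example Domain' }],
  bodyText: '',
});
console.log(JSON.stringify(result, null, 2));
```

Writing a single JSON document to stdout keeps the output easy to pipe into tools such as jq.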
Key Design
- channel: 'chrome': launches real Chrome when available, falls back to system Chromium
- launchPersistentContext with the profile directory
- --disable-blink-features=AutomationControlled + navigator.webdriver patch
- headless: false by default to avoid SingletonLock conflicts
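
The design points above can be combined into a short Playwright sketch. It assumes the playwright package is installed, and the helper names buildLaunchOptions and openPage are hypothetical. The script's actual launch code and its Chromium fallback are not shown in this document, so this covers only the Chrome path.

```javascript
// Hypothetical helper: collects the launch options listed under Key Design.
function buildLaunchOptions({ headless = false } = {}) {
  return {
    channel: 'chrome', // use the installed Chrome rather than bundled Chromium
    headless,          // headed by default, as described above
    args: ['--disable-blink-features=AutomationControlled'],
  };
}

// Hypothetical helper: opens a page in a persistent context for the given profile directory.
async function openPage(profileDir, url, opts) {
  // Imported lazily so the options helper can be used without Playwright installed.
  const { chromium } = await import('playwright');
  const context = await chromium.launchPersistentContext(
    profileDir,
    buildLaunchOptions(opts)
  );
  // Patch navigator.webdriver before any page script runs.
  await context.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });
  // A persistent context usually opens with one page; reuse it if present.
  const page = context.pages()[0] ?? (await context.newPage());
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  return { context, page };
}
```

A caller would use it roughly as `const { context, page } = await openPage(profileDir, url)` and call `context.close()` when finished.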
Requirements
- Playwright installed: npm install playwright
- Chrome or Chromium installed on the system
- On macOS/Linux: the channel: 'chrome' option requires Chrome (not Chromium) to be installed
Tips
- Chrome must not already be open with the target profile (SingletonLock error). Close Chrome first, or use a named profile to avoid conflicts.
- If you get a SingletonLock error with a named profile, delete the SingletonLock file in that profile directory and try again.
- Use --keep-open to leave the browser open for interactive use after scraping; press Ctrl+C to close.
- For sites with lazy-loaded content: use the --wait <ms> flag or modify the script to increase waitForTimeout
- For Reddit: use selector shreddit-post and read attributes (post-title, author, score, permalink)
- To create a fresh isolated profile: run Chrome from the terminal with --profile-directory="Profile X" and log in, then point the scraper at that directory
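
As an illustration of the Reddit tip, the sketch below reads the listed attributes from each shreddit-post element. The function name extractPosts and the returned field names are assumptions. The attribute names are the ones given in the tip, and Reddit may change its markup at any time.

```javascript
// Hypothetical extractor: reads the documented attributes from each shreddit-post element.
// `root` is anything with querySelectorAll, for example `document` inside the browser.
function extractPosts(root) {
  return Array.from(root.querySelectorAll('shreddit-post')).map((el) => ({
    title: el.getAttribute('post-title'),
    author: el.getAttribute('author'),
    // Attributes are strings; a missing score becomes 0 via Number(null).
    score: Number(el.getAttribute('score')),
    permalink: el.getAttribute('permalink'),
  }));
}

// With a Playwright page, the function source can be evaluated in the browser:
//   const posts = await page.evaluate(`(${extractPosts})(document)`);
```

Because the function depends only on querySelectorAll and getAttribute, it can be exercised with plain stub objects before being run against a live page.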
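Browser Scraper
Scrapes web pages using Playwright with a real Chrome/Chromium binary and an existing user profile. Bypasses bot detection by sharing existing cookies, fingerprint, and session.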
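Profiles
The scraper supports multiple Chrome profiles:
- Default (no --profile flag): Uses the system's default Chrome profile
  - macOS: ~/Library/Application Support/Google/Chrome/Default
  - Linux: ~/.config/google-chrome/Default
  - Windows: %LOCALAPPDATA%\Google\Chrome\User Data\Default
- Named profile (--profile <name>): Uses the profiles/<name>/ folder under the skill directory
  - Create a profile by launching Chrome with --profile-directory="Profile 1" or similar, then point the scraper at that folder
  - Useful for: isolating logins, avoiding conflicts with your main Chrome session, scraping without auth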
Script
```bash
# Default profile (system Chrome)
node scripts/scrape.mjs <url> [css-selector]

# Named profile (profiles/<name>/)
node scripts/scrape.mjs <url> [css-selector] --profile <name>

# Headless mode (faster, but higher risk of being blocked)
node scripts/scrape.mjs <url> --headless --profile <name>

# Keep the browser open after scraping (for interactive use)
node scripts/scrape.mjs <url> --profile <name> --keep-open

# Extra wait for lazy-loaded content (default: 3000ms)
node scripts/scrape.mjs <url> --profile <name> --wait 6000
```
Run from the skill directory:
```bash
cd ~/.openclaw-yekeen/workspace/skills/browser-scraper/
node scripts/scrape.mjs https://www.reddit.com/
```
Output
- JSON to stdout: matched elements or page preview
- Screenshot saved to /tmp/browser-scraper-last.png
Key Design
- channel: 'chrome': launches real Chrome when available, falls back to system Chromium
- launchPersistentContext with the profile directory
- --disable-blink-features=AutomationControlled + navigator.webdriver patch
- headless: false by default to avoid SingletonLock conflicts
Requirements
- Playwright installed: npm install playwright
- Chrome or Chromium installed on the system
- On macOS/Linux: the channel: 'chrome' option requires Chrome (not Chromium) to be installed
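Tips
- Chrome must not already be open with the target profile (SingletonLock error). Close Chrome first, or use a named profile to avoid conflicts.
- If you get a SingletonLock error with a named profile, delete the SingletonLock file in that profile directory and try again.
- Use --keep-open to leave the browser open for interactive use after scraping; press Ctrl+C to close.
- For sites with lazy-loaded content: use the --wait <ms> flag or modify the script to increase waitForTimeout
- For Reddit: use selector shreddit-post and read attributes (post-title, author, score, permalink)
- To create a fresh isolated profile: run Chrome from the terminal with --profile-directory="Profile X" and log in, then point the scraper at that directory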