Playwright Scraper Skill

A Playwright-based web scraping Skill for OpenClaw that can handle sites with anti-bot protection. Choose the approach that matches the target website's anti-bot level.

🎯 Use Case Matrix

| Target Website | Anti-Bot Level | Recommended Method | Script |
| --- | --- | --- | --- |
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | | | |

📦 Installation

    cd playwright-scraper-skill
    npm install
    npx playwright install chromium

🚀 Quick Start

1️⃣ Simple Sites (No Anti-Bot)

Use OpenClaw's built-in web_fetch tool:

    # Invoke directly in OpenClaw
    Hey, fetch me the content of https://example.com

2️⃣ Dynamic Sites (Requires JavaScript)

Use Playwright Simple:

    node scripts/playwright-simple.js https://example.com

Example output:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "...",
  "elapsedSeconds": "3.45"
}

3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)

Use Playwright Stealth:

    node scripts/playwright-stealth.js https://m.discuss.com.hk/#hot

Features:

- Hide automation markers (navigator.webdriver = false)
- Realistic User-Agent (iPhone, Android)
- Random delays to mimic human behavior
- Screenshot and HTML saving support

4️⃣ YouTube Video Transcripts

Use deep-scraper (install separately):

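    # Install the deep-scraper skill
    npx clawhub install deep-scraper

    # Use it
    cd skills/deep-scraper
    node assets/youtubehandler.js "https://www.youtube.com/watch?v=VIDEOID"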

📖 Script Descriptions

`scripts/playwright-simple.js`

- Use Case: Regular dynamic websites
- Speed: Fast (3-5 seconds)
- Anti-Bot: None
- Output: JSON (title, content, URL)

`scripts/playwright-stealth.js` ⭐

- Use Case: Sites with Cloudflare or anti-bot protection
- Speed: Medium (5-20 seconds)
- Anti-Bot: Medium-High (hides automation, realistic UA)
- Output: JSON + Screenshot + HTML file
- Verified: 100% success on Discuss.com.hk

🎓 Best Practices

1. Try web_fetch First

If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool—it's fastest.

2. Need JavaScript? Use Playwright Simple

If you need to wait for JavaScript rendering, use playwright-simple.js.

3. Getting Blocked? Use Stealth

If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.

4. Special Sites Need Specialized Skills

- YouTube → deep-scraper
- Reddit → reddit-scraper
- Twitter → bird skill
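
The escalation order above can be sketched as a small decision helper. This is a hypothetical illustration, not part of the skill: the function name `chooseMethod`, its option names, and the hostnames used to recognise the special sites are assumptions.

```javascript
// Hypothetical helper: maps a target to the method recommended above.
// The hostnames are assumptions used to recognise the special sites.
const SPECIAL_SITES = {
  'youtube.com': 'deep-scraper',
  'reddit.com': 'reddit-scraper',
  'twitter.com': 'bird',
};

function chooseMethod(url, { needsJavaScript = false, blocked = false } = {}) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  // 4. Special sites need specialized skills.
  for (const [domain, skill] of Object.entries(SPECIAL_SITES)) {
    if (host === domain || host.endsWith('.' + domain)) return skill;
  }
  // 3. Blocked (403 or Cloudflare challenge): use the stealth script.
  if (blocked) return 'scripts/playwright-stealth.js';
  // 2. JavaScript rendering needed: use the simple script.
  if (needsJavaScript) return 'scripts/playwright-simple.js';
  // 1. Otherwise web_fetch is the fastest option.
  return 'web_fetch';
}
```

For example, `chooseMethod('https://m.discuss.com.hk/#hot', { blocked: true })` returns `'scripts/playwright-stealth.js'`.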

🔧 Customization

All scripts support environment variables:

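    # Set the screenshot path
    SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL

    # Set the wait time (milliseconds)
    WAIT_TIME=10000 node scripts/playwright-simple.js URL

    # Enable headful mode (show the browser)
    HEADLESS=false node scripts/playwright-stealth.js URL

    # Save the HTML
    SAVE_HTML=true node scripts/playwright-stealth.js URL

    # Custom User-Agent
    USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL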
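
A script might turn these environment variables into a settings object along the following lines. This is a sketch under stated assumptions: the helper name `readConfig` is hypothetical, and the default values (a 5000 ms wait, `./screenshot.png`) are invented for the example rather than taken from the real scripts.

```javascript
// Hypothetical config reader. The defaults are assumptions,
// not the values used by the actual scripts.
function readConfig(env = process.env) {
  const waitTime = Number.parseInt(env.WAIT_TIME ?? '', 10);
  return {
    screenshotPath: env.SCREENSHOT_PATH || './screenshot.png',
    // Fall back to the default when WAIT_TIME is missing or not a number.
    waitTime: Number.isNaN(waitTime) ? 5000 : waitTime,
    // Headless unless HEADLESS is explicitly set to "false".
    headless: env.HEADLESS !== 'false',
    saveHtml: env.SAVE_HTML === 'true',
    // undefined lets the browser use its own default User-Agent.
    userAgent: env.USER_AGENT || undefined,
  };
}
```

Passing the environment in as a parameter keeps the helper easy to check in isolation, for instance `readConfig({ HEADLESS: 'false' }).headless` evaluates to `false`.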

📊 Performance Comparison

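| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
| --- | --- | --- | --- |
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| Playwright Stealth | ⏱️ Medium | ✅ Medium | 100% ✅ |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |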

🛡️ Anti-Bot Techniques Summary

Lessons learned from our testing:

✅ Effective Anti-Bot Measures

1. Hide navigator.webdriver: essential
2. Realistic User-Agent: use real devices (iPhone, Android)
3. Mimic human behavior: random delays, scrolling
4. Avoid framework signatures: Crawlee and Selenium are easily detected
5. Use addInitScript (Playwright): inject before page load
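
The measures above can be combined roughly as follows. This is a sketch, not the actual `scripts/playwright-stealth.js`: the User-Agent string, viewport size, delay ranges, and the names `randomDelay` and `stealthFetch` are assumptions chosen for illustration.

```javascript
// Sketch only: not the actual scripts/playwright-stealth.js.

// Example mobile User-Agent (an assumption; substitute a current real-device string).
const MOBILE_UA =
  'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 ' +
  '(KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1';

// Random integer delay in [minMs, maxMs], used to avoid fixed timing.
function randomDelay(minMs, maxMs) {
  return Math.floor(minMs + Math.random() * (maxMs - minMs + 1));
}

async function stealthFetch(url) {
  const { chromium } = require('playwright'); // loaded lazily
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext({
      userAgent: MOBILE_UA,
      viewport: { width: 390, height: 844 },
    });
    // Runs before any page script, so the page reads webdriver as false.
    await context.addInitScript(() => {
      Object.defineProperty(navigator, 'webdriver', { get: () => false });
    });
    const page = await context.newPage();
    const response = await page.goto(url, { waitUntil: 'domcontentloaded' });
    await page.waitForTimeout(randomDelay(1000, 3000));
    await page.mouse.wheel(0, 600); // scroll a little between pauses
    await page.waitForTimeout(randomDelay(500, 1500));
    return {
      status: response ? response.status() : null,
      html: await page.content(),
    };
  } finally {
    await browser.close();
  }
}
```

Registering the init script on the context rather than on a single page means every page opened from that context gets the same override.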

❌ Ineffective Anti-Bot Measures

1. Only changing the User-Agent: not enough
2. Using high-level frameworks (Crawlee): more easily detected
3. Docker isolation: doesn't help with Cloudflare

🔍 Troubleshooting

Issue: 403 Forbidden

Solution: Use playwright-stealth.js

Issue: Cloudflare Challenge Page

Solution:

1. Increase wait time (10-15 seconds)
2. Try headless: false (headful mode sometimes has a higher success rate)
3. Consider using proxy IPs

Issue: Blank Page

Solution:

1. Increase waitForTimeout
2. Use waitUntil: 'networkidle' or 'domcontentloaded'
3. Check if login is required
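
The three issues above could be told apart programmatically with a small triage helper like the one below. It is a hypothetical sketch: the function name `diagnose`, its input fields, and in particular the "Just a moment" title check (an assumption about how a Cloudflare challenge page presents itself) are not taken from the scripts.

```javascript
// Hypothetical triage helper based on the troubleshooting list above.
// Assumption: a Cloudflare challenge page has a title containing "Just a moment".
function diagnose({ status, title = '', textLength = 0 }) {
  if (/just a moment/i.test(title)) {
    return 'cloudflare-challenge: wait 10-15 s, try HEADLESS=false, consider a proxy';
  }
  if (status === 403) {
    return '403: use scripts/playwright-stealth.js';
  }
  if (status === 200 && textLength === 0) {
    return "blank-page: raise WAIT_TIME, try waitUntil 'networkidle', check login";
  }
  return 'ok';
}
```

The inputs would come from a finished page load, for example the response status, `await page.title()`, and the length of the extracted body text.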

📝 Memory & Experience

2026-02-07 Discuss.com.hk Test Conclusions

- ✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
- ❌ Crawlee (deep-scraper) failed (403)
- ❌ Chaser (Rust) failed (Cloudflare)
- ❌ Puppeteer standard failed (403)

Best Solution: Pure Playwright + anti-bot techniques (framework-independent)

🚧 Future Improvements

- [ ] Add proxy IP rotation
- [ ] Implement cookie management (maintain login state)
- [ ] Add CAPTCHA handling (2captcha / Anti-Captcha)
- [ ] Batch scraping (parallel URLs)
- [ ] Integration with OpenClaw's browser tool


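Playwright Scraper Skill

A Playwright-based web scraping Skill for OpenClaw that can handle sites with anti-bot protection. Choose the approach that matches the target website's anti-bot level.

🎯 Use Case Matrix

| Target Website | Anti-Bot Level | Recommended Method | Script |
| --- | --- | --- | --- |
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | | | |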

📦 Installation

    cd playwright-scraper-skill
    npm install
    npx playwright install chromium

🚀 Quick Start

1️⃣ Simple Sites (No Anti-Bot)

Use OpenClaw's built-in web_fetch tool:

    # Invoke directly in OpenClaw
    Hey, fetch me the content of https://example.com

2️⃣ Dynamic Sites (Requires JavaScript)

Use Playwright Simple:

    node scripts/playwright-simple.js https://example.com

Example output:

    {
      "url": "https://example.com",
      "title": "Example Domain",
      "content": "...",
      "elapsedSeconds": "3.45"
    }

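3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)

Use Playwright Stealth:

    node scripts/playwright-stealth.js https://m.discuss.com.hk/#hot

Features:

- Hide automation markers (navigator.webdriver = false)
- Realistic User-Agent (iPhone, Android)
- Random delays to mimic human behavior
- Screenshot and HTML saving support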

4️⃣ YouTube Video Transcripts

Use deep-scraper (install separately):

    # Install the deep-scraper skill
    npx clawhub install deep-scraper

    # Use it
    cd skills/deep-scraper
    node assets/youtubehandler.js "https://www.youtube.com/watch?v=VIDEOID"

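📖 Script Descriptions

`scripts/playwright-simple.js`

- Use Case: Regular dynamic websites
- Speed: Fast (3-5 seconds)
- Anti-Bot: None
- Output: JSON (title, content, URL)

`scripts/playwright-stealth.js` ⭐

- Use Case: Sites with Cloudflare or anti-bot protection
- Speed: Medium (5-20 seconds)
- Anti-Bot: Medium-High (hides automation, realistic UA)
- Output: JSON + Screenshot + HTML file
- Verified: 100% success on Discuss.com.hk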

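🎓 Best Practices

1. Try web_fetch First

If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool, which is the fastest option.

2. Need JavaScript? Use Playwright Simple

If you need to wait for JavaScript rendering, use playwright-simple.js.

3. Getting Blocked? Use Stealth

If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.

4. Special Sites Need Specialized Skills

- YouTube → deep-scraper
- Reddit → reddit-scraper
- Twitter → bird skill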

🔧 Customization

All scripts support environment variables:

    # Set the screenshot path
    SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL

    # Set the wait time (milliseconds)
    WAIT_TIME=10000 node scripts/playwright-simple.js URL

    # Enable headful mode (show the browser)
    HEADLESS=false node scripts/playwright-stealth.js URL

    # Save the HTML
    SAVE_HTML=true node scripts/playwright-stealth.js URL

    # Custom User-Agent
    USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL

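📊 Performance Comparison

| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
| --- | --- | --- | --- |
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| Playwright Stealth | ⏱️ Medium | ✅ Medium | 100% ✅ |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |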

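🛡️ Anti-Bot Techniques Summary

Lessons learned from our testing:

✅ Effective Anti-Bot Measures

1. Hide navigator.webdriver: essential
2. Realistic User-Agent: use real devices (iPhone, Android)
3. Mimic human behavior: random delays, scrolling
4. Avoid framework signatures: Crawlee and Selenium are easily detected
5. Use addInitScript (Playwright): inject before page load

❌ Ineffective Anti-Bot Measures

1. Only changing the User-Agent: not enough
2. Using high-level frameworks (Crawlee): more easily detected
3. Docker isolation: doesn't help with Cloudflare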

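🔍 Troubleshooting

Issue: 403 Forbidden

Solution: Use playwright-stealth.js

Issue: Cloudflare Challenge Page

Solution:

1. Increase wait time (10-15 seconds)
2. Try headless: false (headful mode sometimes has a higher success rate)
3. Consider using proxy IPs

Issue: Blank Page

Solution:

1. Increase waitForTimeout
2. Use waitUntil: 'networkidle' or 'domcontentloaded'
3. Check if login is required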

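📝 Memory & Experience

2026-02-07 Discuss.com.hk Test Conclusions

- ✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
- ❌ Crawlee (deep-scraper) failed (403)
- ❌ Chaser (Rust) failed (Cloudflare)
- ❌ Puppeteer standard failed (403)

Best Solution: Pure Playwright + anti-bot techniques (framework-independent)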

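🚧 Future Improvements

- [ ] Add proxy IP rotation
- [ ] Implement cookie management (maintain login state)
- [ ] Add CAPTCHA handling (2captcha / Anti-Captcha)
- [ ] Batch scraping (parallel URLs)
- [ ] Integration with OpenClaw's browser tool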
