Playwright Scraper Skill
A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.
🎯 Use Case Matrix
| Target Website | Anti-Bot Level | Recommended Method | Script |
|---|---|---|---|
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | Medium | Playwright Simple | scripts/playwright-simple.js |
| Cloudflare Protected | High | Playwright Stealth ⭐ | scripts/playwright-stealth.js |
| YouTube | Special | deep-scraper | Install separately |
| Reddit | Special | reddit-scraper | Install separately |
📦 Installation
```bash
cd playwright-scraper-skill
npm install
npx playwright install chromium
```
🚀 Quick Start
1️⃣ Simple Sites (No Anti-Bot)
Use OpenClaw's built-in web_fetch tool:
```text
# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com
```
2️⃣ Dynamic Sites (Requires JavaScript)
Use Playwright Simple:
```bash
node scripts/playwright-simple.js https://example.com
```
Example output:
```json
{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "...",
  "elapsedSeconds": "3.45"
}
```
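For orientation, a script that produces output of this shape might look like the sketch below. It is not the contents of `scripts/playwright-simple.js`, which are not shown here. The function names `buildResult` and `scrapeSimple`, the use of `document.body.innerText` as `content`, and the 3-second default wait are all assumptions made for illustration.

```javascript
// Minimal sketch of a "simple" Playwright scraper (hypothetical; not the skill's actual script).

// Shape the result like the example output; elapsedSeconds is a two-decimal string.
function buildResult(url, title, content, startedAtMs, finishedAtMs) {
  return {
    url,
    title,
    content,
    elapsedSeconds: ((finishedAtMs - startedAtMs) / 1000).toFixed(2),
  };
}

async function scrapeSimple(url, waitMs = 3000) {
  // Loaded lazily so buildResult can be used without Playwright installed.
  const { chromium } = require('playwright');
  const startedAt = Date.now();
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Give client-side JavaScript time to render (assumed default: 3 s).
    await page.waitForTimeout(waitMs);
    const title = await page.title();
    const content = await page.evaluate(() => document.body.innerText);
    return buildResult(url, title, content, startedAt, Date.now());
  } finally {
    // Always release the browser, even if navigation fails.
    await browser.close();
  }
}
```

Calling `scrapeSimple('https://example.com').then((r) => console.log(JSON.stringify(r, null, 2)))` would print an object with the four fields above.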
3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)
Use Playwright Stealth:
```bash
node scripts/playwright-stealth.js https://m.discuss.com.hk/#hot
```
Features:
- Hide automation markers (`navigator.webdriver = false`)
- Realistic User-Agent (iPhone, Android)
- Random delays to mimic human behavior
- Screenshot and HTML saving support
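The features listed here can be combined roughly as in the following sketch. It is an illustration under stated assumptions, not the source of `scripts/playwright-stealth.js`: the names `USER_AGENTS`, `randomDelayMs`, and `scrapeStealth`, the two example User-Agent strings, the viewport size, and the 2 to 5 second delay range are all hypothetical.

```javascript
// Sketch of the stealth features (hypothetical helper names and values).

// Example mobile User-Agent strings; illustrative only, replace with current ones.
const USER_AGENTS = [
  'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1',
  'Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
];

// Random integer delay in [minMs, maxMs]; rng is injectable so it can be tested.
function randomDelayMs(minMs, maxMs, rng = Math.random) {
  return Math.floor(minMs + rng() * (maxMs - minMs + 1));
}

async function scrapeStealth(url, { screenshotPath, rng = Math.random } = {}) {
  // Loaded lazily so the helpers above work without Playwright installed.
  const { chromium } = require('playwright');
  const browser = await chromium.launch({ headless: true });
  try {
    const context = await browser.newContext({
      userAgent: USER_AGENTS[Math.floor(rng() * USER_AGENTS.length)],
      viewport: { width: 390, height: 844 }, // assumed phone-sized viewport
    });
    // Runs before any page script, so the site reads webdriver as false.
    await context.addInitScript(() => {
      Object.defineProperty(navigator, 'webdriver', { get: () => false });
    });
    const page = await context.newPage();
    const response = await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Pause for a random, human-like interval before reading the page.
    await page.waitForTimeout(randomDelayMs(2000, 5000, rng));
    if (screenshotPath) {
      await page.screenshot({ path: screenshotPath, fullPage: true });
    }
    return {
      url,
      status: response ? response.status() : null,
      title: await page.title(),
      html: await page.content(),
    };
  } finally {
    await browser.close();
  }
}
```

Registering the override with `addInitScript` on the context, rather than calling `page.evaluate` after navigation, matters because detection scripts typically read `navigator.webdriver` as soon as the page starts loading.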
4️⃣ YouTube Video Transcripts
Use deep-scraper (install separately):
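```bash
# Install the deep-scraper skill
npx clawhub install deep-scraper

# Use it
cd skills/deep-scraper
node assets/youtube_handler.js https://www.youtube.com/watch?v=VIDEO_ID
```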
📖 Script Descriptions
scripts/playwright-simple.js
- Use Case: Regular dynamic websites
- Speed: Fast (3-5 seconds)
- Anti-Bot: None
- Output: JSON (title, content, URL)
scripts/playwright-stealth.js ⭐
- Use Case: Sites with Cloudflare or anti-bot protection
- Speed: Medium (5-20 seconds)
- Anti-Bot: Medium-High (hides automation, realistic UA)
- Output: JSON + Screenshot + HTML file
- Verified: 100% success on Discuss.com.hk (2026-02-07 test)
🎓 Best Practices
1. Try web_fetch First
If the site doesn't load content dynamically, use OpenClaw's `web_fetch` tool; it is the fastest option.
2. Need JavaScript? Use Playwright Simple
If you need to wait for JavaScript rendering, use `playwright-simple.js`.
3. Getting Blocked? Use Stealth
If you encounter a 403 response or a Cloudflare challenge, use `playwright-stealth.js`.
4. Special Sites Need Specialized Skills
- YouTube → deep-scraper
- Reddit → reddit-scraper
- Twitter → bird skill
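The four practices amount to an escalation order, which can be sketched as a small decision helper. The function name `chooseMethod`, the option names, and the hostnames in `SPECIAL_SITES` are assumptions for illustration; the skill itself does not ship such a helper.

```javascript
// Hypothetical decision helper mirroring the four best practices.

// Sites that need a specialized skill (hostnames are assumed for illustration).
const SPECIAL_SITES = new Map([
  ['youtube.com', 'deep-scraper'],
  ['reddit.com', 'reddit-scraper'],
  ['twitter.com', 'bird'],
]);

function chooseMethod(url, { needsJavaScript = false, blocked = false } = {}) {
  // Normalize common subdomains so www.youtube.com and m.youtube.com match.
  const host = new URL(url).hostname.replace(/^(www|m)\./, '');
  const special = SPECIAL_SITES.get(host);
  if (special) return special;
  // A 403 or Cloudflare challenge means the simpler methods were detected.
  if (blocked) return 'scripts/playwright-stealth.js';
  if (needsJavaScript) return 'scripts/playwright-simple.js';
  // Cheapest option first: the built-in fetch tool.
  return 'web_fetch';
}
```

For example, `chooseMethod('https://m.discuss.com.hk/', { blocked: true })` returns `'scripts/playwright-stealth.js'`, while a plain static page falls through to `'web_fetch'`.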
🔧 Customization
All scripts support environment variables:
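```bash
# Set the screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL

# Set the wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL

# Enable headful mode (show the browser)
HEADLESS=false node scripts/playwright-stealth.js URL

# Save the HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL

# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
```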
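Inside a script, environment variables such as `SCREENSHOT_PATH`, `WAIT_TIME`, `HEADLESS`, `SAVE_HTML`, and `USER_AGENT` could be read roughly as follows. This is a sketch: the function name `loadConfig`, the returned field names, and the default values (a 5000 ms wait and headless mode on) are assumptions, not the documented defaults of the skill's scripts.

```javascript
// Hypothetical config loader for the skill's environment variables.
function loadConfig(env = process.env) {
  const waitTime = Number.parseInt(env.WAIT_TIME ?? '', 10);
  return {
    screenshotPath: env.SCREENSHOT_PATH || null,
    // Fall back to an assumed 5000 ms when WAIT_TIME is unset or not a number.
    waitTimeMs: Number.isNaN(waitTime) ? 5000 : waitTime,
    // Headless unless explicitly disabled with HEADLESS=false.
    headless: env.HEADLESS !== 'false',
    saveHtml: env.SAVE_HTML === 'true',
    userAgent: env.USER_AGENT || null,
  };
}
```

Passing the environment as a parameter instead of reading `process.env` directly keeps the loader easy to exercise with a plain object.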
📊 Performance Comparison
| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
|---|---|---|---|
| web_fetch | ⚡ Fastest | ❌ None | 0% |
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
| Playwright Stealth | ⏱️ Medium | ✅ Medium | 100% ✅ |
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |
🛡️ Anti-Bot Techniques Summary
Lessons learned from our testing:
✅ Effective Anti-Bot Measures
1. Hide `navigator.webdriver`: essential
2. Realistic User-Agent: use real devices (iPhone, Android)
3. Mimic human behavior: random delays, scrolling
4. Avoid framework signatures: Crawlee and Selenium are easily detected
5. Use `addInitScript` (Playwright): inject before page load
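The "mimic human behavior" item can be illustrated with a randomized scroll routine. The sketch below is one possible approach, not the skill's implementation: `planScroll` and `humanScroll` are hypothetical names, and the step sizes (200 to 599 px) and pauses (300 to 1199 ms) are arbitrary assumptions.

```javascript
// Split a page of totalHeight pixels into randomized scroll steps with pauses.
// rng is injectable so the plan can be checked deterministically.
function planScroll(totalHeight, rng = Math.random) {
  const steps = [];
  let scrolled = 0;
  while (scrolled < totalHeight) {
    // Never scroll past the bottom of the page.
    const deltaY = Math.min(200 + Math.floor(rng() * 400), totalHeight - scrolled);
    steps.push({ deltaY, pauseMs: 300 + Math.floor(rng() * 900) });
    scrolled += deltaY;
  }
  return steps;
}

// Apply the plan to a Playwright page using real mouse-wheel events.
async function humanScroll(page, rng = Math.random) {
  const totalHeight = await page.evaluate(() => document.body.scrollHeight);
  for (const { deltaY, pauseMs } of planScroll(totalHeight, rng)) {
    await page.mouse.wheel(0, deltaY);
    await page.waitForTimeout(pauseMs);
  }
}
```

Separating the plan from the page interaction keeps the random logic testable without launching a browser.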
❌ Ineffective Anti-Bot Measures
1. Only changing the User-Agent: not enough
2. Using high-level frameworks (Crawlee): more easily detected
3. Docker isolation: doesn't help with Cloudflare
🔍 Troubleshooting
Issue: 403 Forbidden
Solution: Use `playwright-stealth.js`
Issue: Cloudflare Challenge Page
Solution:
1. Increase wait time (10-15 seconds)
2. Try `headless: false` (headful mode sometimes has a higher success rate)
3. Consider using proxy IPs
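The "increase wait time" step can be automated by re-checking the page after progressively longer pauses. The sketch below is a heuristic under stated assumptions: `looksLikeChallenge` and `waitOutChallenge` are hypothetical names, and the title patterns ("Just a moment" and "Attention Required") are commonly seen on Cloudflare interstitial and block pages but are not guaranteed to cover every variant.

```javascript
// Heuristic check for a Cloudflare challenge or block page (assumed markers).
function looksLikeChallenge({ status = 200, title = '' }) {
  if (/just a moment|attention required/i.test(title)) return true;
  // 403 and 503 are the status codes typically returned while blocked.
  return status === 403 || status === 503;
}

// Wait 10 s, then 15 s, re-checking the title each time.
// Returns true once the page no longer looks like a challenge.
async function waitOutChallenge(page, waitsMs = [10000, 15000]) {
  for (const waitMs of waitsMs) {
    if (!looksLikeChallenge({ title: await page.title() })) return true;
    await page.waitForTimeout(waitMs);
  }
  return !looksLikeChallenge({ title: await page.title() });
}
```

If `waitOutChallenge` still returns `false`, the remaining options are launching with `chromium.launch({ headless: false })` or routing through a proxy.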
Issue: Blank Page
Solution:
1. Increase `waitForTimeout`
2. Use `waitUntil: 'networkidle'` or `'domcontentloaded'`
3. Check if login is required
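One way to apply the `waitUntil` advice is to try the stricter `'networkidle'` condition first and fall back to `'domcontentloaded'` if it times out, since some pages never go network-idle. The helper name `gotoWithFallback` and the 30-second default timeout are assumptions for this sketch.

```javascript
// Navigate with 'networkidle', falling back to 'domcontentloaded' on timeout.
// `page` is a Playwright Page (or any object with a compatible goto method).
async function gotoWithFallback(page, url, timeoutMs = 30000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle', timeout: timeoutMs });
    return 'networkidle';
  } catch (err) {
    // Only retry on Playwright's TimeoutError; rethrow anything else.
    if (err.name !== 'TimeoutError') throw err;
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    return 'domcontentloaded';
  }
}
```

The return value records which condition succeeded, which helps when deciding whether an extra `page.waitForTimeout(...)` is still needed before reading the content.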
📝 Memory & Experience
2026-02-07 Discuss.com.hk Test Conclusions
- ✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
- ❌ Crawlee (deep-scraper) failed (403)
- ❌ Chaser (Rust) failed (Cloudflare)
- ❌ Puppeteer standard failed (403)
Best Solution: Pure Playwright + anti-bot techniques (framework-independent)
🚧 Future Improvements
- [ ] Add proxy IP rotation
- [ ] Implement cookie management (maintain login state)
- [ ] Add CAPTCHA handling (2captcha / Anti-Captcha)
- [ ] Batch scraping (parallel URLs)
- [ ] Integration with OpenClaw's `browser` tool