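Lightpanda Scraper: Fast Headless Browser for OSINT
Fast web scraping with Lightpanda, a headless browser written in Zig. A page takes roughly 0.5 s, compared with roughly 45 s for Chromium/Playwright. Well suited to OSINT recon, link extraction, and content scraping.
Prerequisites
Install the Lightpanda binary: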
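```bash
# Create a user-local bin directory and download the Linux x86_64 binary into it
mkdir -p ~/.local/bin
curl -L https://github.com/nicholasgasior/lightpanda-browser/releases/latest/download/lightpanda-linux-x86_64 -o ~/.local/bin/lightpanda
# Mark the downloaded binary as executable
chmod +x ~/.local/bin/lightpanda
```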
Quick Start
```bash
# Dump a page as Markdown
python3 {baseDir}/scripts/lp-scrape.py https://target.com
# Extract all links
python3 {baseDir}/scripts/lp-scrape.py https://target.com --links
# Get the raw HTML
python3 {baseDir}/scripts/lp-scrape.py https://target.com --html
```
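Options
- `--links`: Extract and categorize all links from the page
- `--html`: Dump raw HTML instead of Markdown
- `--frames`: Include iframe content
- `--js CODE`: Evaluate JavaScript on the page
- `--output FILE`: Save output to a file
- `--wait MODE`: Wait condition: `networkidle` (default), `load`, `domcontentloaded`
- `--strip TYPES`: Comma-separated resource types to strip: `js`, `css`, `images`
- `--proxy URL`: Use a proxy (e.g., `socks5://127.0.0.1:9050` for Tor)
- `--timeout SECS`: Request timeout in seconds (default: 30)
- `--serve --port PORT`: Start CDP server mode
- `--mcp`: Start as an MCP server (stdio)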
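The options above combine into a single command line. The following is a minimal sketch of assembling one from Python, assuming a hypothetical helper named `build_command` and a placeholder script path. It uses only flags listed above; how the script itself validates them is not shown in this document.

```python
from typing import List, Optional

# Hypothetical placeholder; substitute the real location of lp-scrape.py.
SCRIPT = "scripts/lp-scrape.py"


def build_command(
    url: str,
    links: bool = False,
    html: bool = False,
    wait: Optional[str] = None,
    strip: Optional[List[str]] = None,
    proxy: Optional[str] = None,
    timeout: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for lp-scrape.py from the documented flags."""
    cmd = ["python3", SCRIPT, url]
    if links:
        cmd.append("--links")
    if html:
        cmd.append("--html")
    if wait is not None:
        cmd += ["--wait", wait]
    if strip:
        # --strip takes one comma-separated value, e.g. "js,css"
        cmd += ["--strip", ",".join(strip)]
    if proxy is not None:
        cmd += ["--proxy", proxy]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


print(build_command("https://target.com", links=True, strip=["js", "css"], timeout=10))
```

Building the command as a list rather than a single string means it can be handed to `subprocess.run` without any shell quoting.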
Use Cases
OSINT Recon
```bash
# Quick page dump for analysis
python3 {baseDir}/scripts/lp-scrape.py https://target.com > recon.md
# Extract links that look like API endpoints
python3 {baseDir}/scripts/lp-scrape.py https://target.com --links | grep -i api
# Scrape through Tor
python3 {baseDir}/scripts/lp-scrape.py https://target.com --proxy socks5://127.0.0.1:9050
```
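The `grep -i api` step can also be done in Python when further filtering or de-duplication is needed. This is a rough sketch, assuming the `--links` output contains one URL per line (the actual output format is not specified here) and using a hypothetical helper named `filter_endpoints`.

```python
from typing import Iterable, List


def filter_endpoints(lines: Iterable[str], keyword: str = "api") -> List[str]:
    """Return unique lines containing `keyword`, case-insensitively, in first-seen order."""
    seen = set()
    matches = []
    needle = keyword.lower()
    for raw in lines:
        line = raw.strip()
        if not line or needle not in line.lower():
            continue
        if line not in seen:
            seen.add(line)
            matches.append(line)
    return matches


# Made-up sample input standing in for real --links output.
sample = [
    "https://target.com/about",
    "https://target.com/API/v1/users",
    "https://api.target.com/health",
    "https://target.com/API/v1/users",
]
print(filter_endpoints(sample))
```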
Bug Bounty Recon
```bash
# Quick content scrape across common subdomains
for sub in api admin dev staging; do
  python3 {baseDir}/scripts/lp-scrape.py "https://${sub}.target.com" --links 2>/dev/null
done
```
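The same loop can be driven from Python, which makes it easier to collect results per subdomain. This is a sketch under stated assumptions: `candidate_urls` and `scrape_links` are hypothetical helper names, the script path is a placeholder, and the script is assumed to print its result to stdout.

```python
import subprocess
from typing import List

SCRIPT = "scripts/lp-scrape.py"  # hypothetical placeholder path


def candidate_urls(domain: str, subs: List[str]) -> List[str]:
    """Build one HTTPS URL per subdomain label."""
    return [f"https://{sub}.{domain}" for sub in subs]


def scrape_links(url: str, timeout: int = 60) -> str:
    """Run lp-scrape.py --links for one URL; return stdout, or "" on failure."""
    try:
        result = subprocess.run(
            ["python3", SCRIPT, url, "--links"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return ""
    return result.stdout if result.returncode == 0 else ""


print(candidate_urls("target.com", ["api", "admin", "dev", "staging"]))
```

Calling `scrape_links(url)` for each returned URL reproduces the shell loop, with failures mapped to an empty string instead of being discarded via `2>/dev/null`.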
Content Extraction
```bash
# Save clean Markdown
python3 {baseDir}/scripts/lp-scrape.py https://article.com --output article.md
# Evaluate JavaScript (quote the expression so the shell passes it as one argument)
python3 {baseDir}/scripts/lp-scrape.py https://app.com --js "document.querySelectorAll('a').length"
```
CDP Server Mode
```bash
# Start a server for programmatic access
python3 {baseDir}/scripts/lp-scrape.py --serve --port 9222
# Then connect with any CDP client
```
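CDP clients talk to the server by exchanging JSON messages, each carrying an `id`, a `method`, and `params`. Here is a small sketch of building such messages with the standard library. `cdp_message` is a hypothetical helper, and whether Lightpanda implements a given method (here `Page.navigate` from the Chrome DevTools Protocol) is an assumption to check against its documentation.

```python
import itertools
import json
from typing import Any, Dict, Optional

_ids = itertools.count(1)  # each request gets a fresh integer id


def cdp_message(method: str, params: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one CDP request as a JSON string."""
    return json.dumps({"id": next(_ids), "method": method, "params": params or {}})


msg = cdp_message("Page.navigate", {"url": "https://target.com"})
print(msg)
```

A client would send this string over its WebSocket connection to the server and match the reply by its `id`.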
Speed Comparison

| Tool | Page Load | Memory | Binary Size |
|---|---|---|---|
| Lightpanda | ~0.5 s | ~50 MB | ~100 MB |
| Chromium/Playwright | ~45 s | ~500 MB | ~300 MB |
| curl/wget | ~0.3 s | ~5 MB | N/A |

Lightpanda gives you Playwright-like page rendering at near-curl speeds. The catch: no complex JS interactions (use Playwright for those).
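To see what the per-page figures mean for a larger crawl, the arithmetic can be worked through directly. This quick sketch uses the approximate load times from the table and assumes pages are fetched one after another with no parallelism.

```python
# Approximate per-page load times in seconds, taken from the table above.
PAGE_LOAD_SECONDS = {
    "Lightpanda": 0.5,
    "Chromium/Playwright": 45.0,
    "curl/wget": 0.3,
}


def crawl_minutes(tool: str, pages: int) -> float:
    """Estimated sequential crawl time in minutes for `pages` pages."""
    return PAGE_LOAD_SECONDS[tool] * pages / 60


for tool in PAGE_LOAD_SECONDS:
    print(f"{tool}: {crawl_minutes(tool, 1000):.1f} min for 1000 pages")
```

At these figures, 1000 pages take about 8 minutes with Lightpanda and about 12.5 hours with Chromium/Playwright.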
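Notes
- Lightpanda is in active development; some complex SPAs may not render perfectly
- For authenticated sessions or complex JS interactions, use Playwright instead
- The binary is ~100 MB of native code compiled from Zig and runs on Linux x86_64
- Supports HTTP/SOCKS5 proxies for Tor or VPN routing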
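A mistyped proxy URL is easy to miss, so it can help to validate the value before passing it to `--proxy`. This is a minimal sketch using `urllib.parse`. `check_proxy` is a hypothetical helper, and the accepted schemes (`http`, `socks5`) are inferred from the note above rather than from the script itself.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "socks5"}  # assumption based on the notes above


def check_proxy(proxy: str) -> bool:
    """Return True if `proxy` has an allowed scheme, a host, and a port."""
    try:
        parsed = urlparse(proxy)
        port = parsed.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return False
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.hostname) and port is not None


print(check_proxy("socks5://127.0.0.1:9050"))
print(check_proxy("ftp://127.0.0.1:21"))
```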