Firecrawl Local Skill
Self-hosted Firecrawl integration using the v1 REST API. The script tests connectivity first,
then runs scrape, crawl, or map, and polls asynchronous crawls automatically.
Setup (one-time)
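```bash
mkdir -p ~/.openclaw/skills/firecrawl-local
cp run.sh ~/.openclaw/skills/firecrawl-local/run.sh
chmod +x ~/.openclaw/skills/firecrawl-local/run.sh
```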
The script lives at scripts/run.sh in this skill folder — copy it into place as above.
Prerequisites: curl, jq installed. Firecrawl running at localhost:3002.
Optional env vars:
```bash
export FIRECRAWL_LOCAL_URL="http://localhost:3002" # default
export FIRECRAWL_API_KEY="fc-your-key" # only needed if auth enabled
```
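As a rough illustration of how these two variables might be consumed, the sketch below resolves the base URL with its default and attaches an auth header only when a key is set. The names `BASE_URL` and `fc_curl` are hypothetical and not taken from scripts/run.sh, and the `Authorization: Bearer` scheme is an assumption about how the local instance expects the key.

```bash
#!/usr/bin/env bash
# Hypothetical helper names; scripts/run.sh may be organized differently.

# Fall back to the documented default when FIRECRAWL_LOCAL_URL is unset.
BASE_URL="${FIRECRAWL_LOCAL_URL:-http://localhost:3002}"

# Wrap curl so the auth header is sent only when FIRECRAWL_API_KEY is set.
# Assumption: the key is passed as a Bearer token.
fc_curl() {
  if [ -n "${FIRECRAWL_API_KEY:-}" ]; then
    curl -fsS -H "Authorization: Bearer ${FIRECRAWL_API_KEY}" "$@"
  else
    curl -fsS "$@"
  fi
}
```

With this shape, a call such as `fc_curl "$BASE_URL/health"` behaves the same whether or not authentication is enabled.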
Commands
Default — scrape a single page (URL only, no subcommand needed)
```bash
firecrawl-local https://docs.example.com/api
```
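One plausible way to support the URL-only form is to inspect the first argument and treat anything that looks like a URL as an implicit scrape. The function `resolve_command` below is a hypothetical sketch of that dispatch, not the actual logic in scripts/run.sh.

```bash
#!/usr/bin/env bash
# Hypothetical dispatch sketch: decide which subcommand an invocation means.
resolve_command() {
  case "${1:-}" in
    http://*|https://*)
      # A bare URL means "scrape this page".
      echo "scrape"
      ;;
    scrape|map|crawl)
      echo "$1"
      ;;
    *)
      echo "usage: firecrawl-local [scrape|map|crawl] <url> [flags]" >&2
      return 1
      ;;
  esac
}
```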
Scrape — explicit, with format options
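```bash
firecrawl-local scrape https://docs.example.com/api
firecrawl-local scrape https://docs.example.com/api --formats markdown,html
```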
Map — discover all URLs on a site
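```bash
firecrawl-local map https://docs.example.com
firecrawl-local map https://docs.example.com --limit 200
```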
Crawl — bulk extract multiple pages (async, auto-polled)
```bash
firecrawl-local crawl https://docs.example.com
firecrawl-local crawl https://docs.example.com --limit 30 --max-depth 2
firecrawl-local crawl https://docs.example.com --include /docs --exclude /blog
```
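The automatic polling behind `crawl` could look roughly like the loop below: check the job status every 5 seconds, up to 60 attempts. The function names, the `POLL_INTERVAL` and `POLL_MAX_ATTEMPTS` variables, and the `GET /v1/crawl/<id>` status call with `completed` and `failed` values are assumptions for illustration; scripts/run.sh remains the reference.

```bash
#!/usr/bin/env bash
BASE_URL="${FIRECRAWL_LOCAL_URL:-http://localhost:3002}"

# Print the status field of a crawl job (assumed endpoint and field name).
crawl_status() {
  curl -fsS "${BASE_URL}/v1/crawl/$1" | jq -r '.status'
}

# Wait for a crawl job to finish.
# Returns 0 when completed, 1 when the job failed, 2 on timeout.
poll_crawl() {
  local job_id="$1"
  local interval="${POLL_INTERVAL:-5}"          # seconds between checks
  local max_attempts="${POLL_MAX_ATTEMPTS:-60}" # 60 x 5s = 5 minutes
  local attempt=1
  local status
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(crawl_status "$job_id")" || status=""
    case "$status" in
      completed) return 0 ;;
      failed)
        echo "Crawl ${job_id} failed" >&2
        return 1
        ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Crawl ${job_id} timed out after ${max_attempts} attempts" >&2
  return 2
}
```

Keeping the status lookup in its own function makes the loop easy to exercise without a running Firecrawl instance.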
Agent Instructions
When to use each command
| Goal | Command |
|---|---|
| Get content from one URL (quickest) | `firecrawl-local <url>` |
| Discover what pages exist | `map` |
| Get content from one URL with format control | `scrape` |
| Ingest an entire docs site | `crawl` |
| RAG pipeline ingestion | `map` → targeted `scrape` or `crawl` |
Optimal workflows
Documentation RAG pipeline:
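```
1. map https://docs.example.com → get the full URL list
2. scrape <specific key pages> → targeted extraction
3. Pass the markdown to an embedding pipeline
```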
Full site ingestion:
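```
1. crawl https://docs.example.com --limit 50 --max-depth 3
2. Results are auto-polled and returned as a JSON array of {url, markdown}
```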
Parameters
| Flag | Applies to | Description |
|---|---|---|
| `--limit N` | map, crawl | Max pages (default: 50 for crawl, 500 for map) |
| `--max-depth N` | crawl | How deep to follow links (default: 2) |
| `--include /path` | crawl | Only crawl URLs matching this path prefix |
| `--exclude /path` | crawl | Skip URLs matching this path prefix |
| `--formats list` | scrape | Comma-separated: `markdown`, `html`, `rawHtml`, `links` |
Reading the output
- scrape: Returns `{success, data: {markdown, html, metadata}}`
- map: Returns `{success, links: [...]}`
- crawl: Returns `{success, data: [{url, markdown, metadata}, ...]}` (after polling completes)
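Given the response shapes listed above, a few jq one-liners are usually enough to consume the output. The helper names below are hypothetical, and each one reads a JSON response on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the documented response shapes.

# Exit non-zero unless the response reports success: true.
require_success() {
  jq -e '.success == true' >/dev/null
}

# Print one URL per line from a map response.
map_links() {
  jq -r '.links[]'
}

# Print "url<TAB>markdown length" for each page in a crawl response.
crawl_summary() {
  jq -r '.data[] | "\(.url)\t\(.markdown | length)"'
}
```

A pipeline such as `firecrawl-local map https://docs.example.com | map_links | grep /docs` would then produce a list of candidate pages for targeted scraping.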
Failure signals and fixes
| Error | Cause | Fix |
|---|---|---|
| `Local Firecrawl unavailable` | Service not running | Start Firecrawl, check port 3002 |
| `success: false` | Bad URL or blocked | Check URL is reachable, try `--formats html` |
| Empty `markdown` field | JS-rendered page | Firecrawl handles most JS; check whether the site blocks bots |
| Crawl times out | Site is large | Reduce `--limit` or `--max-depth` |
Script reference
See scripts/run.sh for the full implementation. Key design decisions:
- Health check uses the `/health` endpoint with a 3s timeout
- Auth header is only sent when `FIRECRAWL_API_KEY` is set
- Crawl polling retries every 5s, up to 60 attempts (5 minutes)
- All parameters are passed via `jq` to prevent shell injection in JSON
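The health check described in the first bullet might be sketched as follows. The function name `check_health` is hypothetical; the `/health` path, the 3 second timeout, and the error message come from this document.

```bash
#!/usr/bin/env bash
BASE_URL="${FIRECRAWL_LOCAL_URL:-http://localhost:3002}"

# Return 0 if the local instance answers on /health within 3 seconds,
# otherwise print the documented error message and return 1.
check_health() {
  if curl -fsS --max-time 3 -o /dev/null "${BASE_URL}/health" 2>/dev/null; then
    return 0
  fi
  echo "Local Firecrawl unavailable at ${BASE_URL}" >&2
  return 1
}
```

Running this before any scrape, map, or crawl request lets the script fail fast with a clear message instead of surfacing a raw curl error.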
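Firecrawl Local Skill
Self-hosted Firecrawl integration using the v1 REST API. It tests connectivity first, then runs scrape, crawl, or map operations, and handles asynchronous crawl polling automatically.
Setup (one-time)
```bash
mkdir -p ~/.openclaw/skills/firecrawl-local
cp run.sh ~/.openclaw/skills/firecrawl-local/run.sh
chmod +x ~/.openclaw/skills/firecrawl-local/run.sh
```
The script lives at scripts/run.sh in this skill folder; copy it into place as shown above.
Prerequisites: curl and jq installed. Firecrawl running at localhost:3002.
Optional env vars:
```bash
export FIRECRAWL_LOCAL_URL=http://localhost:3002 # default
export FIRECRAWL_API_KEY=fc-your-key # only needed if auth is enabled
```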
Commands
Default: scrape a single page (URL only, no subcommand needed)
```bash
firecrawl-local https://docs.example.com/api
```
Scrape: explicit, with format options
```bash
firecrawl-local scrape https://docs.example.com/api
firecrawl-local scrape https://docs.example.com/api --formats markdown,html
```
Map: discover all URLs on a site
```bash
firecrawl-local map https://docs.example.com
firecrawl-local map https://docs.example.com --limit 200
```
Crawl: bulk extract multiple pages (async, auto-polled)
```bash
firecrawl-local crawl https://docs.example.com
firecrawl-local crawl https://docs.example.com --limit 30 --max-depth 2
firecrawl-local crawl https://docs.example.com --include /docs --exclude /blog
```
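Agent Instructions
When to use each command
| Goal | Command |
|---|---|
| Get content from one URL (quickest) | `firecrawl-local <url>` |
| Discover what pages exist | `map` |
| Get content from one URL with format control | `scrape` |
| Ingest an entire docs site | `crawl` |
| RAG pipeline ingestion | `map` → targeted `scrape` or `crawl` |
Optimal workflows
Documentation RAG pipeline:
```
1. map https://docs.example.com → get the full URL list
2. scrape <specific key pages> → targeted extraction
3. Pass the markdown to an embedding pipeline
```
Full site ingestion:
```
1. crawl https://docs.example.com --limit 50 --max-depth 3
2. Results are auto-polled and returned as a JSON array of {url, markdown}
```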
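Parameters
| Flag | Applies to | Description |
|---|---|---|
| `--limit N` | map, crawl | Max pages (default: 50 for crawl, 500 for map) |
| `--max-depth N` | crawl | How deep to follow links (default: 2) |
| `--include /path` | crawl | Only crawl URLs matching this path prefix |
| `--exclude /path` | crawl | Skip URLs matching this path prefix |
| `--formats list` | scrape | Comma-separated: `markdown`, `html`, `rawHtml`, `links` |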
Reading the output
- scrape: Returns `{success, data: {markdown, html, metadata}}`
- map: Returns `{success, links: [...]}`
- crawl: Returns `{success, data: [{url, markdown, metadata}, ...]}` (after polling completes)
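Failure signals and fixes
| Error | Cause | Fix |
|---|---|---|
| `Local Firecrawl unavailable` | Service not running | Start Firecrawl, check port 3002 |
| `success: false` | Bad URL or blocked | Check URL is reachable, try `--formats html` |
| Empty `markdown` field | JS-rendered page | Firecrawl handles most JS; check whether the site blocks bots |
| Crawl times out | Site is too large | Reduce `--limit` or `--max-depth` |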
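Script reference
See scripts/run.sh for the full implementation. Key design decisions:
- Health check uses the `/health` endpoint with a 3s timeout
- Auth header is only sent when `FIRECRAWL_API_KEY` is set
- Crawl polling retries every 5s, up to 60 attempts (5 minutes)
- All parameters are passed via `jq` to prevent shell injection in JSON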