Crawl4AI Docker Skill - Web Crawler & Scraper Service

Dockerized Web Crawling 网页爬虫服务 | REST API 网页爬取 | LLM 智能提取

基于 Docker 部署的 Crawl4AI 网页爬虫服务，提供完整的 REST API 接口，支持智能内容提取和 LLM 优化输出。

🚀 核心功能 | Core Features

- 🐳 Docker 部署 - 容器化服务，端口 11235
🔌 REST API - 完整的 HTTP 接口
🤖 LLM 智能提取 - 支持多种 LLM 提供商
📊 实时监控 - 内置监控面板和 API
⚡ 高性能 - 异步处理，支持并发请求

📋 快速开始 | Quick Start

前提条件 | Prerequisites

确保 Docker Compose 服务正在运行：

CODEBLOCK0

🔌 REST API 使用指南

基础网页抓取 | Basic Web Crawling

简单 Markdown 提取

CODEBLOCK1

带浏览器配置

CODEBLOCK2

LLM 智能提取 | LLM Smart Extraction

内容总结

CODEBLOCK3

结构化数据提取

CODEBLOCK4

高级功能 | Advanced Features

网页截图

CODEBLOCK5

PDF 生成

curl -X POST "http://localhost:11235/pdf" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

📊 API 端点参考 | API Endpoints Reference

核心端点 | Core Endpoints

端点	方法	用途
INLINECODE0	POST	网页抓取和内容提取
INLINECODE1

监控端点 | Monitoring Endpoints

端点	方法	用途
INLINECODE3	GET	系统健康状态
INLINECODE4

工具端点 | Utility Endpoints

端点	方法	用途
INLINECODE6	POST	网页截图
INLINECODE7

🎯 使用场景 | Use Cases

场景 1：文档网站爬取 | Documentation Site Crawling

CODEBLOCK7

场景 2：新闻文章提取 | News Article Extraction

CODEBLOCK8

场景 3：产品信息抓取 | Product Information Scraping

CODEBLOCK9

⚙️ 配置说明 | Configuration

LLM 提供商配置 | LLM Provider Configuration

创建 .llm.env 文件：

CODEBLOCK10

浏览器配置 | Browser Configuration

CODEBLOCK11

📈 响应格式 | Response Format

成功响应 | Success Response

CODEBLOCK12

错误响应 | Error Response

CODEBLOCK13

🔧 故障排除 | Troubleshooting

常见问题 | Common Issues

1. 服务未启动

CODEBLOCK14

2. LLM 提取失败

- 检查 .llm.env 配置
验证 API 密钥
测试不同的 LLM 提供商

3. 网络连接问题

CODEBLOCK15

监控和调试 | Monitoring & Debugging

CODEBLOCK16

🔗 相关链接 | Links

🎉 为什么选择 Docker 版本？

✅ 容器化部署 - 一键启动，环境隔离
✅ REST API - 标准 HTTP 接口，易于集成
✅ 实时监控 - 内置监控面板和 API
✅ 资源管理 - 自动浏览器池管理
✅ 生产就绪 - 企业级稳定性和性能

立即开始使用 Docker 化的 Crawl4AI 服务！ 🚀

Crawl4AI Docker 技能 - 网页爬虫与抓取服务

Docker化网页爬虫服务 | REST API 网页抓取 | LLM 智能提取

基于 Docker 部署的 Crawl4AI 网页爬虫服务，提供完整的 REST API 接口，支持智能内容提取和 LLM 优化输出。

🚀 核心功能

- 🐳 Docker 部署 - 容器化服务，端口 11235
🔌 REST API - 完整的 HTTP 接口
🤖 LLM 智能提取 - 支持多种 LLM 提供商
📊 实时监控 - 内置监控面板和 API
⚡ 高性能 - 异步处理，支持并发请求

📋 快速开始

前提条件

确保 Docker Compose 服务正在运行：

bash

检查服务状态

docker compose ps

健康检查

curl http://localhost:11235/health

访问监控面板

open http://localhost:11235/dashboard

🔌 REST API 使用指南

基础网页抓取

简单 Markdown 提取

bash curl -X POST http://localhost:11235/crawl \ -H Content-Type: application/json \ -d { urls: [https://example.com], extraction_strategy: markdown }

带浏览器配置

bash curl -X POST http://localhost:11235/crawl \ -H Content-Type: application/json \ -d { urls: [https://example.com], extraction_strategy: markdown, browser_config: { headless: true, viewport_width: 1280, viewport_height: 720 } }

LLM 智能提取

内容总结

bash curl -X POST http://localhost:11235/crawl \ -H Content-Type: application/json \ -d { urls: [https://example.com], extraction_strategy: { type: llm, provider: openrouter/free, instruction: 总结网页的主要内容, max_tokens: 1000 } }

结构化数据提取

bash curl -X POST http://localhost:11235/crawl \ -H Content-Type: application/json \ -d { urls: [https://example.com/products], extraction_strategy: { type: llm, provider: openrouter/free, instruction: 提取所有产品名称、价格和描述，返回 JSON 格式, max_tokens: 1500, temperature: 0.1 } }

高级功能

网页截图

bash curl -X POST http://localhost:11235/screenshot \ -H Content-Type: application/json \ -d { url: https://example.com, options: { full_page: true, quality: 80 } }

PDF 生成

bash curl -X POST http://localhost:11235/pdf \ -H Content-Type: application/json \ -d { url: https://example.com }

📊 API 端点参考

核心端点

端点	方法	用途
POST /crawl	POST	网页抓取和内容提取
GET /health

监控端点

端点	方法	用途
GET /monitor/health	GET	系统健康状态
GET /monitor/browsers

工具端点

端点	方法	用途
POST /screenshot	POST	网页截图
POST /pdf

🎯 使用场景

场景 1：文档网站爬取

bash
curl -X POST http://localhost:11235/crawl \
-H Content-Type: application/json \
-d {
urls: [https://docs.openclaw.ai/zh-CN],
extraction_strategy: markdown
}

场景 2：新闻文章提取

bash
curl -X POST http://localhost:11235/crawl \
-H Content-Type: application/json \
-d {
urls: [https://news-site.com/article],
extraction_strategy: {
type: llm,
provider: openrouter/free,
instruction: 提取文章标题、作者、发布时间和主要内容,
max_tokens: 1500
}
}

场景 3：产品信息抓取

bash
curl -X POST http://localhost:11235/crawl \
-H Content-Type: application/json \
-d {
urls: [https://ecommerce-site.com/products],
extraction_strategy: {
type: llm,
provider: openrouter/free,
instruction: 提取所有产品的名称、价格、描述和图片链接,
max_tokens: 2000
}
}

⚙️ 配置说明

LLM 提供商配置

创建 .llm.env 文件：

bash

OpenRouter 配置

OPENROUTERAPIKEY=your-api-key
LLM_PROVIDER=openrouter/free
LLMMAXTOKENS=2000
LLM_TEMPERATURE=0.7

或使用其他提供商

OPENAIAPIKEY=sk-your-key

OPENAIBASEURL=https://your-custom-api.com/v1

LLM_PROVIDER=openai/gpt-4o-mini

浏览器配置

json
{
browser_config: {
headless: true,
viewport_width: 1280,
viewport_height: 720,
user_agent: Mozilla/5.0...
}
}

📈 响应格式

成功响应

json
{
success: true,
results: [
{
url: https://example.com,
markdown: # 提取的 Markdown 内容...,
metadata: {
title: 网页标题,
description: 网页描述,
url: https://example.com
},
extracted_content: {
summary: LLM 提取的内容...
}
}
]
}

错误响应

json
{
success: false,
error: 错误信息,
code: ERROR_CODE
}

🔧 故障排除

常见问题

1. 服务未启动

bash

检查容器状态

docker compose ps

查看日志

docker compose logs crawl4ai

重启服务

docker compose restart crawl4ai

2. LLM 提取失败

- 检查 .llm.env 配置
验证 API 密钥
测试不同的 LLM 提供商

3. 网络连接问题

bash

测试网络连接

curl -I https://example.com

检查代理配置

env | grep -i proxy

监控和调试

bash

访问监控面板

open http://localhost:11235/dashboard

查看系统健康

curl http://localhost:11235/monitor/health

查看浏览器池状态

curl http://localhost:11235/monitor/browsers

🔗 相关链接

🎉 为什么选择 Docker 版本？

立即

crawl4ai-docker-skillDocker爬虫服务