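Skill name: Real Estate Spider
Description:

Real Estate Agency Website Crawler Skill

Introduction

This skill crawls listing data from major Chinese real estate agency websites, including: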

1. Anjuke (anjuke.com)
2. SouFun (soufun.com)
3. Beike Zhaofang (ke.com)
4. Lianjia (lianjia.com)

Prerequisites

- Python 3.x
- The agent-browser skill, installed
- The requests library (installable via pip)

Install dependencies

```bash
# Install the Python dependencies: requests, plus beautifulsoup4 and lxml
pip install requests beautifulsoup4 lxml
```

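Main features

1. Anti-crawling countermeasures

- Mimics a real browser fingerprint
- Adds random delays to avoid rate-based detection
- Manages cookies and sessions
- Supports proxy IPs (optional)
- Handles CAPTCHAs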
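
As a rough illustration of how the delay, cookie, and optional-proxy items in this list could fit together, here is a minimal sketch using only the Python standard library. The names `build_opener`, `polite_delay`, and `fetch` are hypothetical helpers for this example, not part of the skill's scripts, and the 2 to 5 second range simply mirrors the delay used in the crawler template.

```python
import random
import time
import urllib.request
from http.cookiejar import CookieJar

# Browser-like headers; the values are taken from the crawler template.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "zh-CN,zh;q=0.9",
}


def build_opener(proxy_url=None):
    """Return an opener that keeps cookies and optionally routes via a proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}
        handlers.append(urllib.request.ProxyHandler(proxies))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = list(DEFAULT_HEADERS.items())
    return opener


def polite_delay(min_s=2.0, max_s=5.0, sleep=time.sleep):
    """Sleep for a random interval and return the chosen duration."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def fetch(opener, url, timeout=10):
    """Wait a random interval, then fetch the page and decode it to text."""
    polite_delay()
    with opener.open(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Reusing one opener across requests is what keeps cookies between pages; creating a new opener per request would discard the session each time.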

2. Data extraction

- Listing price
- Floor area
- Location
- Layout (room configuration)
- Decoration status
- Year built
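
Extracted values usually arrive as display strings, so a normalization step is often useful before export. The sketch below assumes formats such as `350万`, `89.5㎡`, and `3室2厅1卫`; these formats and the helper names are assumptions for illustration and may not match what each site actually renders.

```python
import re

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = _NUMBER.search(text.replace(",", ""))
    return float(match.group()) if match else None


def parse_price_yuan(text):
    """Parse a total price such as '350万' into yuan (万 means 10,000)."""
    value = parse_number(text)
    if value is None:
        return None
    return value * 10_000 if "万" in text else value


def parse_layout(text):
    """Parse a layout such as '3室2厅1卫' into room counts."""
    labels = (("bedrooms", "室"), ("living_rooms", "厅"), ("bathrooms", "卫"))
    counts = {}
    for key, label in labels:
        match = re.search(r"(\d+)" + label, text)
        counts[key] = int(match.group(1)) if match else None
    return counts
```

Returning `None` for unparseable input (for example a price shown as "negotiable") keeps bad rows visible in the export instead of silently turning them into zeros.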

3. Export formats

- JSON
- CSV
- Excel
- Visual charts

Usage

Basic crawler scripts

```bash
# Crawl Anjuke data with the Python script
python3 ~/.openclaw/workspace/skills/real-estate-spider/scripts/anjuke_crawler.py

# Use the shell script together with agent-browser
bash ~/.openclaw/workspace/skills/real-estate-spider/scripts/bypass_anjuke.sh
```

Configuring the target sites

```python
# Example configuration file:
# ~/.openclaw/workspace/skills/real-estate-spider/config/realestateconfig.py

# Base URL and CSS selectors for each supported site.
CONFIG = {
    "anjuke": {
        "url": "https://www.anjuke.com",
        "data_selectors": {
            "price": ".property-price",
            "area": ".property-area",
            "location": ".property-location",
            "type": ".property-type",
        },
    },
    "ke": {
        "url": "https://ke.com",
        "data_selectors": {
            "price": ".price-text",
            "area": ".area-text",
            "location": ".location-text",
            "type": ".type-text",
        },
    },
    "lianjia": {
        "url": "https://www.lianjia.com",
        "data_selectors": {
            "price": ".total-price",
            "area": ".area-num",
            "location": ".location-text",
            "type": ".house-type",
        },
    },
    "soufun": {
        "url": "https://www.soufun.com",
        "data_selectors": {
            "price": ".price-num",
            "area": ".area-num",
            "location": ".location-text",
            "type": ".type-text",
        },
    },
}
```
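
Because the crawler indexes this dictionary directly, a misspelled site name or a missing selector only surfaces as a bare `KeyError` at run time. One possible safeguard, sketched here under the assumption that every site needs the same four selector keys, is a small lookup helper. `get_site_config` and `REQUIRED_SELECTORS` are hypothetical names, and the `CONFIG` below is a trimmed copy of the structure above so the example runs on its own.

```python
REQUIRED_SELECTORS = {"price", "area", "location", "type"}

# Trimmed copy of the configuration structure, for a self-contained example.
CONFIG = {
    "anjuke": {
        "url": "https://www.anjuke.com",
        "data_selectors": {
            "price": ".property-price",
            "area": ".property-area",
            "location": ".property-location",
            "type": ".property-type",
        },
    },
}


def get_site_config(name, config=CONFIG):
    """Return the entry for a site, raising a clear error if it is unusable."""
    try:
        site = config[name]
    except KeyError:
        known = ", ".join(sorted(config))
        raise ValueError(f"unknown site {name!r}; known sites: {known}") from None
    missing = REQUIRED_SELECTORS - set(site.get("data_selectors", {}))
    if missing:
        raise ValueError(f"site {name!r} is missing selectors: {sorted(missing)}")
    return site
```

Calling `get_site_config("anjuke")` returns the Anjuke entry, while an unknown name produces an error message that lists the configured sites.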

Generic crawler template

```python
# Generic crawler script template

import csv
import json
import random
import re
import time
from dataclasses import asdict, dataclass

import requests

# CONFIG is the dictionary from the configuration file above; this import
# assumes the config directory is on sys.path.
from realestateconfig import CONFIG


@dataclass
class PropertyData:
    title: str
    price: str
    area: str
    location: str
    house_type: str
    age: str
    orientation: str
    decoration: str


class RealEstateSpider:
    def __init__(self, website_name):
        self.website_name = website_name
        self.config = CONFIG[website_name]
        self.base_url = self.config["url"]
        self.selectors = self.config["data_selectors"]

    def crawl(self, city="北京", district=None):
        """Crawl property data for the given city and district."""
        # Build the URL
        url = self.build_url(city, district)

        # Send the request
        data = self.send_request(url)

        # Parse the data
        properties = self.parse_data(data)

        # Return the results
        return properties

    def build_url(self, city, district):
        """Build the target URL (district is not used in this template)."""
        if self.website_name == "anjuke":
            return f"{self.base_url}/fangyuan/{city}"
        elif self.website_name == "ke":
            return f"{self.base_url}/city/{city}"
        elif self.website_name == "lianjia":
            return f"{self.base_url}/ershoufang/{city}"
        elif self.website_name == "soufun":
            return f"{self.base_url}/esf/{city}"
        else:
            return self.base_url

    def send_request(self, url):
        """Send the request with browser-like headers and a random delay."""
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.9",
            # requests can only decode "br" responses if a brotli package is installed
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Cache-Control": "no-cache",
            "Upgrade-Insecure-Requests": "1",
        }

        # Random delay to avoid rate-based detection
        time.sleep(random.uniform(2, 5))

        # Simplified example; adjust for each site in practice
        response = requests.get(url, headers=headers, timeout=10)
        return response.text

    def parse_data(self, html_data):
        """Parse the HTML data."""
        # Implement the parsing logic for each site's HTML structure here.
        properties = []

        # Illustrative pattern: assumes listing fields are embedded as quoted
        # "key":"value" pairs; adapt it to the real page source.
        pattern = (
            r'"price":"([\d.]+)",.*?"avgprice":"([\d.]+)",.*?"areanum":"([\d.]+)",'
            r'.*?"houseage":"([\d年]+)",.*?"orient":"([^"]+)",'
            r'.*?"fitmentname":"([^"]+)",.*?"title":"([^"]+)"'
        )
        matches = re.findall(pattern, html_data)

        for match in matches:
            item = PropertyData(
                title=match[6],
                price=match[0],
                area=match[2],
                location="",    # adjust per site
                house_type="",  # adjust per site
                age=match[3],
                orientation=match[4],
                decoration=match[5],
            )
            properties.append(item)

        return properties

    def save_data(self, properties, format="json"):
        """Save the data as JSON or CSV."""
        if format == "json":
            with open(f"{self.website_name}_properties.json", "w", encoding="utf-8") as f:
                json.dump([asdict(prop) for prop in properties], f, ensure_ascii=False, indent=2)
        elif format == "csv":
            with open(f"{self.website_name}_properties.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(["title", "price", "area", "location", "house_type", "age", "orientation", "decoration"])
                for prop in properties:
                    writer.writerow([prop.title, prop.price, prop.area, prop.location, prop.house_type, prop.age, prop.orientation, prop.decoration])


if __name__ == "__main__":
    # Example: crawl Anjuke data for Nanjing
    spider = RealEstateSpider("anjuke")
    properties = spider.crawl(city="南京")
    spider.save_data(properties, format="json")
```
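
The template parses pages with a regular expression, while the configuration file defines CSS selectors for each site that the template never reads. A minimal sketch of selector-driven parsing follows, under two loud assumptions: every selector is a single class selector such as `.property-price`, and each listing contains exactly one element per field so values can be paired by position. `ClassTextCollector` and `extract_listings` are hypothetical names. It uses only the standard library's `html.parser`; BeautifulSoup's `select()` would be the more general choice for real CSS selectors.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying one of the target classes."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field  # e.g. {"property-price": "price"}
        self.values = {field: [] for field in class_to_field.values()}
        self._field = None   # field currently being captured, if any
        self._depth = 0      # nesting depth inside the captured element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.class_to_field:
                self._field = self.class_to_field[cls]
                self._depth = 1
                self._chunks = []
                break

    def handle_endtag(self, tag):
        if self._field is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            self.values[self._field].append("".join(self._chunks).strip())
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)


def extract_listings(html, selectors):
    """Return one dict per listing, pairing the per-field values by position."""
    class_to_field = {sel.lstrip("."): field for field, sel in selectors.items()}
    collector = ClassTextCollector(class_to_field)
    collector.feed(html)
    collector.close()
    fields = list(selectors)
    columns = (collector.values[field] for field in fields)
    return [dict(zip(fields, row)) for row in zip(*columns)]
```

Passing a site's `data_selectors` dictionary as `selectors` would let one parser serve all configured sites, provided the positional-pairing assumption holds for that site's markup.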

Browser automation with agent-browser

```bash
# Use agent-browser to get past JavaScript-based detection
agent-browser set viewport 1920 1080
agent-browser set headers '{
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/146.0.0.0 Safari/537.36",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
  "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
  "Accept-Encoding": "gzip, deflate, br",
  "Cache-Control": "no-cache",
  "Connection": "keep-alive",
  "Upgrade-Insecure-Requests": "1"
}'

# Visit the real estate site
agent-browser open https://www.anjuke.com
agent-browser wait 3000
agent-browser snapshot -i
```
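
When these commands are driven from Python rather than typed into a shell, building each invocation as an argument list avoids quoting problems with the JSON headers. The following is a sketch only: `build_commands` and `run_commands` are hypothetical helpers, and it assumes the `agent-browser` subcommands behave exactly as in the section above.

```python
import json
import shlex
import subprocess


def build_commands(url, headers, wait_ms=3000, viewport=(1920, 1080)):
    """Return the agent-browser invocations shown above as argv lists."""
    width, height = viewport
    return [
        ["agent-browser", "set", "viewport", str(width), str(height)],
        ["agent-browser", "set", "headers", json.dumps(headers)],
        ["agent-browser", "open", url],
        ["agent-browser", "wait", str(wait_ms)],
        ["agent-browser", "snapshot", "-i"],
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Leaving `dry_run=True` prints the shell-quoted commands for review without requiring agent-browser to be installed.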

