Restaurant Review Cross-Check
Cross-reference restaurant data from Xiaohongshu and Dianping to provide validated recommendations.
Quick Start
Query restaurants by location and cuisine type:
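    # Basic query
    crosscheck-restaurants 上海静安区 日式料理
    # With filters
    crosscheck-restaurants 北京朝阳区 火锅 --min-rating 4.5 --min-reviews 100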
Workflow
1. Data Collection
Query both platforms simultaneously:
Dianping:
- Fetch restaurants matching location + cuisine
- Extract: name, rating, review_count, price_range, address, tags
Xiaohongshu:
- Search notes/posts matching location + cuisine
- Extract: restaurant_name, engagement_metrics (likes/saves), sentiment_score
- Note: Xiaohongshu has no public API, so this data must be scraped
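The extracted fields could be held in two small record types, as in the sketch below. The class names, the `note_count` field, the 0-1 range for `sentiment_score`, and the unweighted `engagement` sum are assumptions for illustration; the actual scripts may structure this differently.

```python
from dataclasses import dataclass, field


@dataclass
class DianpingRestaurant:
    """One Dianping listing, using the field names from the Extract list."""
    name: str
    rating: float          # star rating on a 0.0-5.0 scale
    review_count: int
    price_range: str
    address: str
    tags: list[str] = field(default_factory=list)


@dataclass
class XhsRestaurantSummary:
    """Xiaohongshu notes aggregated per restaurant (hypothetical shape)."""
    restaurant_name: str
    likes: int
    saves: int
    sentiment_score: float  # assumed to lie in 0.0-1.0
    note_count: int = 0     # assumed field, used for the minimum-notes threshold

    @property
    def engagement(self) -> int:
        # Plain sum of likes and saves; a real metric might weight them.
        return self.likes + self.saves
```

Keeping the two platforms in separate types makes the later matching step explicit: a match is a pair of one record of each kind.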
2. Data Matching
Match restaurants across platforms using fuzzy matching:
- Restaurant name similarity (Levenshtein distance)
- Location proximity (address matching)
- Handle name variations (e.g., "银座寿司" vs "银座寿司静安店")
See scripts/match_restaurants.py for matching logic.
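A rough sketch of how name and address similarity might be blended into one match score follows. It is not the logic in scripts/match_restaurants.py, which is not shown here. It uses `difflib.SequenceMatcher` from the standard library as a dependency-free stand-in for a Levenshtein-based ratio, and the function names, the 0.7 name weight, and the 0.6 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def match_score(name_a, addr_a, name_b, addr_b, name_weight=0.7):
    """Blend name and address similarity into a single 0-1 score."""
    return (name_weight * similarity(name_a, name_b)
            + (1 - name_weight) * similarity(addr_a, addr_b))


def best_match(target, candidates, threshold=0.6):
    """Return the (name, address) candidate closest to target, or None."""
    scored = [(match_score(*target, *c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None
```

With this ratio, "银座寿司" and "银座寿司静安店" share four of eleven total characters pairwise, giving 8/11 (about 0.73), so an identical address is enough to push the pair over the threshold.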
3. Consistency Analysis
Calculate consistency score based on:
- Rating correlation (0-1): Correlation between platform ratings
- Engagement validation (0-1): Do high ratings correlate with high engagement?
- Sentiment alignment (0-1): Do user sentiments align across platforms?
Formula: consistency_score = (rating_corr * 0.5) + (engagement_val * 0.3) + (sentiment_align * 0.2)
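The weighted sum, with weights 0.5, 0.3, and 0.2 for rating correlation, engagement validation, and sentiment alignment, can be written as a small function. The range check is an addition for illustration, not something the document specifies.

```python
def consistency_score(rating_corr: float, engagement_val: float,
                      sentiment_align: float) -> float:
    """Weighted sum of the three 0-1 components (weights 0.5 / 0.3 / 0.2)."""
    for value in (rating_corr, engagement_val, sentiment_align):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each component must lie in [0, 1]")
    return rating_corr * 0.5 + engagement_val * 0.3 + sentiment_align * 0.2


score = consistency_score(0.8, 0.6, 0.9)
print(round(score, 2))  # 0.4 + 0.18 + 0.18 = 0.76
```

Because the weights sum to 1.0, the result stays in the 0-1 range whenever the inputs do.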
4. Recommendation Score
Calculate final recommendation score:
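    recommendation_score = (
        (dianping_rating * 0.4) +
        (xhs_engagement_normalized * 0.3) +
        (consistency_score * 0.3)
    ) * 10
Output: a score on a 0-10 scale, where a score above 8.0 is a high-confidence recommendation. This range holds only if all three inputs are on a 0-1 scale, so dianping_rating presumably needs to be rescaled first (for example rating / 5.0).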
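A worked sketch of the score follows, using weights of 0.4 for the Dianping rating, 0.3 for normalized Xiaohongshu engagement, and 0.3 for consistency. Two things here are assumptions rather than documented behavior: the Dianping rating is divided by 5.0 so that the result stays within 0-10, and engagement is min-max scaled by a hypothetical `normalize_engagement` helper.

```python
def normalize_engagement(value: float, lowest: float, highest: float) -> float:
    """Min-max scale a raw engagement count into 0-1, clamped at both ends."""
    if highest <= lowest:
        return 0.0
    return min(1.0, max(0.0, (value - lowest) / (highest - lowest)))


def recommendation_score(dianping_rating: float,
                         xhs_engagement_normalized: float,
                         consistency: float) -> float:
    """Return the 0-10 recommendation score."""
    rating_normalized = dianping_rating / 5.0  # assumption: 5-star scale
    return (rating_normalized * 0.4
            + xhs_engagement_normalized * 0.3
            + consistency * 0.3) * 10


engagement = normalize_engagement(700, lowest=0, highest=1000)  # 0.7
print(round(recommendation_score(4.5, engagement, 0.76), 2))    # 7.98
```

In this example a 4.5-star restaurant with solid engagement and a consistency of 0.76 lands just under the 8.0 high-confidence mark.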
Output Format
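    📍 [Location] [Cuisine Type] Restaurant Recommendations
    1. [Restaurant Name]
       🏆 Recommendation score: X.X/10
       ⭐ Dianping: X.X (Xk reviews)
       💬 Xiaohongshu: X.X⭐ (X notes)
       📍 Address: [address]
       💰 Average per person: ¥[price]
       ✅ Consistency: [High/Medium/Low] - [brief explanation]
       📊 Platform comparison:
          - Dianping tags: [tags]
          - Xiaohongshu keywords: [keywords]
       ⚠️ Note: [any discrepancies or warnings]
    [Continue listing the top 5-10 restaurants...]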
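One entry of the report could be rendered along these lines. The function name and dictionary keys are hypothetical, and only a subset of the fields is shown to keep the sketch short.

```python
def render_entry(rank: int, entry: dict) -> str:
    """Format one result as a numbered text block (subset of the fields)."""
    return "\n".join([
        f"{rank}. {entry['name']}",
        f"   🏆 Recommendation score: {entry['score']:.1f}/10",
        f"   ⭐ Dianping: {entry['rating']:.1f} ({entry['review_count']} reviews)",
        f"   📍 Address: {entry['address']}",
        f"   ✅ Consistency: {entry['consistency']}",
    ])
```

Scores and ratings are rounded to one decimal place to match the X.X placeholders in the layout.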
Thresholds
- Min rating: 4.0/5.0 (configurable)
- Min reviews: 50 on Dianping, 20 notes on Xiaohongshu (configurable)
- Max results: Top 10 restaurants by recommendation score
- High consistency: Score > 0.7
- Medium consistency: Score 0.5-0.7
- Low consistency: Score < 0.5 (flag for manual review)
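The thresholds above might be applied as in this sketch. The dictionary keys and function names are hypothetical, and treating exactly 0.5 and exactly 0.7 as medium is one reading of the stated ranges.

```python
def consistency_level(score: float) -> str:
    """Map a 0-1 consistency score to high / medium / low."""
    if score > 0.7:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"  # results at this level are flagged for manual review


def select_top(restaurants, min_rating=4.0, min_reviews=50, min_notes=20,
               max_results=10):
    """Drop entries below the minimums, then rank by recommendation score."""
    eligible = [
        r for r in restaurants
        if r["rating"] >= min_rating
        and r["review_count"] >= min_reviews
        and r["note_count"] >= min_notes
    ]
    eligible.sort(key=lambda r: r["recommendation_score"], reverse=True)
    return eligible[:max_results]
```

Filtering before ranking means a restaurant with too few reviews never appears, however high its score.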
API & Data Sources
Dianping
- Method: Web scraping (the Dianping API requires a business partnership)
- Base URL: https://www.dianping.com
- Rate limiting: at most 1 request every 2 seconds
- Anti-scraping: Use residential proxies, rotate user agents
See scripts/fetch_dianping.py for implementation.
Xiaohongshu
- Method: Web scraping (no public API)
- Base URL: https://www.xiaohongshu.com
- Rate limiting: at most 1 request every 3 seconds
- Authentication: Cookies required for full access
See scripts/fetch_xiaohongshu.py for implementation.
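The per-platform request spacing (2 seconds for Dianping, 3 seconds for Xiaohongshu) could be enforced with a small limiter like the one below. The class is a hypothetical illustration, not the code in the fetch scripts. The clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class MinIntervalLimiter:
    """Ensure at least `interval` seconds separate consecutive calls."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Sleep if the last call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


dianping_limiter = MinIntervalLimiter(2.0)      # 1 request every 2 seconds
xiaohongshu_limiter = MinIntervalLimiter(3.0)   # 1 request every 3 seconds
```

Calling `dianping_limiter.wait()` immediately before each request keeps the spacing correct even when the surrounding code takes a variable amount of time.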
Configuration
Edit scripts/config.py to set:
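    DEFAULT_THRESHOLDS = {
        "min_rating": 4.0,
        "min_dianping_reviews": 50,
        "min_xhs_notes": 20,
        "max_results": 10,
    }

    PROXY_CONFIG = {
        "use_proxy": True,
        # Placeholder endpoints; replace host and port with real values.
        "proxy_list": ["http://proxy1:port", "http://proxy2:port"],
    }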
Error Handling
- No matches found: Suggest broader search terms or nearby areas
- Platform timeout: Retry with exponential backoff, max 3 attempts
- Rate limiting detected: Pause for 60 seconds, rotate proxy
- Low confidence results: Flag results with consistency < 0.5 for manual review
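The timeout rule (exponential backoff, at most 3 attempts) might look like the sketch below. The function name, the 1-second base delay, and the use of the built-in `TimeoutError` are assumptions; code built on `requests` would catch `requests.exceptions.Timeout` instead.

```python
import time


def retry_with_backoff(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on TimeoutError wait 1s, 2s, ... and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a call that keeps timing out is tried three times with waits of 1 and 2 seconds in between, after which the exception propagates.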
Advanced Features
Sentiment Analysis
NLP is applied to Xiaohongshu posts to extract:
- Food quality mentions
- Service quality mentions
- Atmosphere mentions
- Price/value mentions
See references/sentiment_analysis.md for methodology.
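As a deliberately simple illustration of tagging the four aspects, the sketch below counts keyword occurrences per aspect. This is not the methodology in references/sentiment_analysis.md, which is not reproduced here. The keyword lists and function name are hypothetical, and plain keyword counting ignores negation and context.

```python
ASPECT_KEYWORDS = {  # hypothetical keyword lists, for illustration only
    "food": ["好吃", "新鲜", "味道"],
    "service": ["服务", "态度"],
    "atmosphere": ["环境", "氛围"],
    "value": ["性价比", "价格", "划算"],
}


def count_aspect_mentions(text: str) -> dict[str, int]:
    """Count keyword occurrences per aspect in one post."""
    return {
        aspect: sum(text.count(word) for word in words)
        for aspect, words in ASPECT_KEYWORDS.items()
    }
```

A post such as "味道很好吃,服务态度一般,性价比高" would register two food mentions, two service mentions, no atmosphere mentions, and one value mention.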
Fuzzy Matching
Handle restaurant name variations:
- Chain stores (e.g., "海底捞火锅" vs "海底捞静安店")
- Abbreviations (e.g., "鼎泰丰" vs "鼎泰丰上海店")
- Translation differences
Uses thefuzz library for similarity scoring.
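One way to handle branch names is to strip a branch suffix before comparing, as sketched here. The regular expression (a trailing two-character place name followed by "店") is a hypothetical simplification, and `difflib.SequenceMatcher` is used in place of thefuzz only to keep the example dependency-free.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rule: a trailing two-character place name plus "店" marks a branch.
BRANCH_SUFFIX = re.compile(r"[\u4e00-\u9fff]{2}店$")


def normalize_name(name: str) -> str:
    """Strip whitespace and a trailing branch suffix such as "静安店"."""
    return BRANCH_SUFFIX.sub("", name.strip())


def name_similarity(a: str, b: str) -> float:
    """0-1 similarity of two names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
```

After normalization "鼎泰丰上海店" reduces to "鼎泰丰" and matches exactly. The rule is crude: a name that itself consists of two characters plus "店" would be stripped to an empty string, so a production version would need a list of known district or city names.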
Dependencies
    pip install requests beautifulsoup4 pandas numpy thefuzz selenium lxml
See scripts/requirements.txt for complete list.
Troubleshooting
Issue: Xiaohongshu returns empty results
- Solution: Check whether the cookies have expired and re-authenticate
Issue: Dianping blocks requests
- Solution: Reduce the request rate and rotate proxies
Issue: Poor matching between platforms
- Solution: Adjust the similarity threshold in match_restaurants.py