E-commerce Market Analyzer
Automated workflow for scraping e-commerce websites, handling popups, extracting product data, and generating comprehensive market analysis reports.
Workflow Overview
This skill follows a 4-step workflow:
1. Setup & Scraping - Run the Playwright scraper to capture homepages
2. Visual Analysis - Analyze screenshots to identify product categories
3. Data Extraction - Parse HTML to extract specific products and prices
4. Report Generation - Create a comprehensive market analysis report
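    User provides website list
        ↓
    Step 1: Run scraper (handles popups automatically)
        ↓
    Step 2: Visually analyze screenshots
        ↓
    Step 3: Extract structured data from HTML
        ↓
    Step 4: Generate final report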
Step 1: Setup & Scraping
Quick Start
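When the user provides a list of e-commerce websites, immediately run the scraper:
    # Create the output directory
    mkdir -p screenshots_clean
    # Run the scraper
    uv run python scripts/scrape_websites.py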
Customizing the Website List
Edit scripts/scrape_websites.py and update the WEBSITES list:
    WEBSITES = [
        "amazon.de",
        "ebay.de",
        "otto.de",
        # Add more websites...
    ]
Key Features
The scraper automatically:
- Handles cookie consent popups (German, English, universal selectors)
- Handles region/language selection dialogs
- Captures full-page screenshots (1920x1080)
- Saves HTML source code
- Uses German locale settings (or customize for other markets)
- Waits for page stabilization
Important: The script uses popup patterns from references/popup_patterns.md. Consult this file if dealing with new popup types.
Expected Output
After running, you'll have:
- screenshots_clean/*.png - Full-page screenshots
- screenshots_clean/*.html - HTML source files
- Console output with success/failure summary
Success rate target: 85-95%
Common failures:
- Anti-bot protection (requires manual intervention)
- HTTP/2 protocol errors (some sites block automation)
- Timeout on slow-loading sites
Step 2: Visual Analysis
Read Screenshots
After scraping, read the screenshot files to visually identify:
- Product categories
- Featured products
- Promotional items
- Visual design patterns
Example approach:
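    from pathlib import Path
    screenshot_dir = Path("screenshots_clean")
    screenshots = list(screenshot_dir.glob("*.png"))
    # Use the Read tool to view screenshots
    for screenshot in screenshots[:5]:  # Start with 5 sites
        # View the image with the Read tool, then
        # note product categories and featured items
        print(screenshot.name)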
What to Look For
Product Categories:
- Clothing & Fashion (Bekleidung)
- Electronics (Elektronik)
- Home & Furniture (Möbel & Wohnen)
- Food & Groceries (Lebensmittel)
- Books & Media (Bücher)
- Beauty & Personal Care (Beauty & Pflege)
- Sports & Outdoor (Sport)
- Toys & Baby (Spielzeug & Baby)
Featured Products:
- Homepage banners
- Promotional sections
- "Deal of the day" items
- New arrivals
Take notes on recurring patterns across multiple sites - these indicate market trends.
Step 3: Data Extraction
Strategy Selection
Choose extraction strategy based on site structure. See references/html_parsing_patterns.md for complete patterns.
Quick decision tree:
1. Try JSON-LD schema extraction (best for structured data)
2. Fall back to data attribute extraction
3. Fall back to class-based extraction
4. Last resort: keyword matching
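The first step of the decision tree can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function name `extract_jsonld_products` is hypothetical, it assumes the page embeds schema.org `Product` objects at the top level of a JSON-LD script block, and it ignores nested `@graph` structures.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_jsonld_products(html):
    """Return (name, price) tuples for top-level schema.org Product entries."""
    products = []
    for raw in JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            if not isinstance(offers, dict):
                offers = {}
            products.append((item.get("name"), offers.get("price")))
    return products
```

If this returns an empty list, move on to the next strategy in the decision tree.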
Example: Extract from REWE.de
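    import re
    from pathlib import Path
    html_file = Path("screenshots_clean/rewe.de.html")
    content = html_file.read_text(encoding="utf-8")
    # REWE-specific patterns (verify against references/html_parsing_patterns.md)
    title_pattern = r'data-offer-title="([^"]+)"'
    price_pattern = r'_tag-price">([^<]+)'
    titles = re.findall(title_pattern, content)
    prices = re.findall(price_pattern, content)
    # Pair titles and prices by position; the lists may differ in length
    for i, title in enumerate(titles[:10]):
        price = prices[i] if i < len(prices) else "N/A"
        print(f"{title}: {price}€")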
Platform-Specific Parsing
Each e-commerce platform has unique HTML structure. Consult references/html_parsing_patterns.md for:
- Amazon.de patterns
- eBay.de patterns
- Otto.de patterns
- Zalando/AboutYou patterns
- REWE/Lidl supermarket patterns
- And more...
Price Normalization
Always normalize prices:
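    def normalize_price(price_str):
        """Convert German format (1.234,56€) to float."""
        price_str = price_str.replace("€", "").replace("EUR", "").strip()
        if "," in price_str and "." in price_str:
            # Dot is the thousands separator, comma the decimal separator
            price_str = price_str.replace(".", "").replace(",", ".")
        elif "," in price_str:
            price_str = price_str.replace(",", ".")
        try:
            return float(price_str)
        except ValueError:
            return None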
Handling Large Files
For HTML files >25k tokens:
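    # Use grep to search for specific patterns
    grep -o 'data-product-name="[^"]*"' amazon.de.html | head -20
    # Or extract specific sections
    grep -A 5 'product-title' ebay.de.html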
Extraction Best Practices
1. Try multiple patterns - Start with JSON-LD, fall back as needed
2. Validate extractions - Check for reasonable length (10-100 chars)
3. Remove duplicates - Use sets to track seen products
4. Limit results - Cap at 10-20 products per site
5. Handle encoding - Always use encoding='utf-8'
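The validation, de-duplication, and capping practices above could be combined in one pass. The sketch below is illustrative only: `clean_products` and its parameters are hypothetical names, and the case-insensitive duplicate check is an assumption rather than something the skill specifies.

```python
def clean_products(names, min_len=10, max_len=100, limit=20):
    """Filter extracted product names by length, drop duplicates, cap the count."""
    seen = set()
    cleaned = []
    for name in names:
        name = " ".join(name.split())  # Collapse runs of whitespace
        if not (min_len <= len(name) <= max_len):
            continue  # Too short or too long to be a plausible product name
        key = name.lower()  # Assumption: duplicates are compared case-insensitively
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(name)
        if len(cleaned) >= limit:
            break
    return cleaned
```

Usage: `clean_products(titles, limit=10)` keeps at most ten plausible, unique names per site.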
Step 4: Report Generation
Use the Report Template
Copy and customize assets/report_template.md:
    cp assets/report_template.md final_report.md
Report Structure
The template includes these sections:
1. Executive Summary - Key findings
2. Top Product Categories - Ranked list with percentages
3. Verified Product Prices - Extracted data with exact prices
4. Platform-Specific Analysis - Per-site breakdown
5. Market Trends - Growth trends and consumer behavior
6. Seasonal Characteristics - Current and predicted
7. Technical Implementation - Success metrics and limitations
8. Business Insights - Opportunities and recommendations
9. Data Sources - Success/failure breakdown
10. Conclusions - Actionable takeaways
Filling the Template
Replace placeholder tokens:
- {MARKET} → German, UK, US, etc.
- {NUM_SITES} → 23, 25, etc.
- {DATE} → 2026-03-19
- {SUCCESS_RATE} → 92
- {CATEGORY_1} → Clothing & Fashion
- {PERCENTAGE_1} → 28
- And so on...
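Token replacement can be scripted rather than done by hand. The helper below is a hypothetical sketch (`fill_template` is not part of the skill). It assumes tokens are written as `{NAME}` and leaves any token without a supplied value untouched so that gaps stay visible in the draft.

```python
def fill_template(template, values):
    """Replace each {TOKEN} with its value; unknown tokens are left as-is."""
    for token, value in values.items():
        template = template.replace("{" + token + "}", str(value))
    return template


# Example with two of the tokens listed above
report = fill_template(
    "Market: {MARKET}\nDate: {DATE}",
    {"MARKET": "German", "DATE": "2026-03-19"},
)
```

Plain `str.replace` is used instead of `str.format` so that stray braces elsewhere in the Markdown template do not raise errors.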
Data Quality Indicators
Include these metrics:
- Success rate: % of successfully scraped sites
- Popup handling: # of sites with popups handled
- Price accuracy: % of verified prices
- Screenshot quality: Resolution and file size
- HTML completeness: Average file size
Writing Tips
Be bilingual (for German market):
- Product names: German + Chinese/English translation
- Categories: "Bekleidung / Clothing"
- Maintain both languages throughout
Be specific:
- ❌ "Electronics are popular"
- ✅ "AirPods 4 (89,90€ on eBay), PlayStation 5, and Samsung smartphones are top electronics"
Include evidence:
- Reference screenshot file names
- Quote exact prices with sources
- Link specific platforms to products
Troubleshooting
Issue: Popup Not Closed
Solution: Check references/popup_patterns.md for the specific site. Add custom selector if needed:
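    # In scripts/scrape_websites.py, add to the popup_selectors list:
    popup_selectors = [
        # ... existing selectors ...
        'button:has-text("Neue Popup Text")',  # Add custom
    ]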
Issue: HTML Parsing Returns Empty
Diagnose:
1. Check that the HTML file exists and has content
2. Verify the pattern with grep: grep -o 'your-pattern' file.html
3. Try alternative patterns from references/html_parsing_patterns.md
4. Use keyword matching as a fallback
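The first three diagnostic steps can be run in one call. This is a rough sketch under the assumption that the HTML was saved as UTF-8; `diagnose_html` is a hypothetical helper, and the patterns passed to it are whatever candidates you want to compare.

```python
import re
from pathlib import Path


def diagnose_html(path, patterns):
    """Report whether the file exists, its length, and match counts per pattern."""
    path = Path(path)
    if not path.exists():
        return {"exists": False, "size": 0, "matches": {}}
    content = path.read_text(encoding="utf-8", errors="replace")
    return {
        "exists": True,
        "size": len(content),  # Length in characters, not bytes
        "matches": {p: len(re.findall(p, content)) for p in patterns},
    }
```

A pattern with zero matches in a large file suggests the selector is wrong. A very small file suggests the scrape itself failed or was blocked.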
Issue: Anti-Bot Detection
Symptoms: CAPTCHA, "Verify you are human", IP blocking
Solutions:
1. Add delays between requests (already in script)
2. Customize user agent string
3. Use browser fingerprinting evasion
4. For production: consider proxy rotation (not included)
Issue: Timeout Errors
Solution: Adjust timeout in script:
CODEBLOCK9
Or use a more relaxed loading strategy:
    await page.goto(url, wait_until="load", timeout=90000)
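The two remedies can also be chained so that a strict attempt falls back to a relaxed one. The sketch below is an assumption about how this might be wired up, not what the script does: `goto_with_fallback` and the `STRATEGIES` values are illustrative, and `page` is expected to be a Playwright async `Page`, or anything with a compatible `goto` coroutine.

```python
# (wait_until, timeout in ms) pairs tried in order; values are illustrative assumptions.
STRATEGIES = [("networkidle", 30000), ("load", 90000), ("domcontentloaded", 90000)]


async def goto_with_fallback(page, url, strategies=STRATEGIES):
    """Try page.goto with each strategy in turn; return the wait_until that worked."""
    last_error = None
    for wait_until, timeout in strategies:
        try:
            await page.goto(url, wait_until=wait_until, timeout=timeout)
            return wait_until
        except Exception as exc:  # Broad on purpose: Playwright raises its own TimeoutError
            last_error = exc
    raise last_error
```

Catching a narrower exception type is preferable once the failure modes of the target sites are known.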
Market-Specific Configuration
German Market (Default)
CODEBLOCK11
Popup patterns: See references/popup_patterns.md → German Market section
UK Market
CODEBLOCK12
Popup patterns: Use English/International selectors
US Market
CODEBLOCK13
Other Markets
Adjust locale and timezone_id accordingly. Update popup selectors in script based on language.
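One way to keep per-market settings in a single place is a small lookup table. Everything in this sketch is an assumption for illustration: the `MARKET_SETTINGS` keys, the chosen locale and timezone values, and the `context_options` helper are not taken from the script. `locale`, `timezone_id`, and `viewport` are keyword arguments accepted by Playwright's `browser.new_context`.

```python
# Hypothetical per-market settings; adjust the values to the market being analyzed.
MARKET_SETTINGS = {
    "de": {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
    "uk": {"locale": "en-GB", "timezone_id": "Europe/London"},
    "us": {"locale": "en-US", "timezone_id": "America/New_York"},
}


def context_options(market, width=1920, height=1080):
    """Build keyword arguments for browser.new_context for the given market."""
    settings = MARKET_SETTINGS[market]
    return {**settings, "viewport": {"width": width, "height": height}}


# Intended use inside the scraper (requires Playwright):
#     context = await browser.new_context(**context_options("de"))
```

Adding a market then means adding one dictionary entry plus the matching popup selectors.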
Advanced Usage
Parallel Scraping
For large website lists, modify the script to scrape sites concurrently:
CODEBLOCK14
Note: Be respectful of rate limits. Use delays.
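A bounded-concurrency pattern keeps parallel scraping polite. The following is a sketch under stated assumptions: `scrape_all` is a hypothetical wrapper, `scrape_one` stands in for whatever coroutine scrapes a single site, and the concurrency limit and delay are placeholder values.

```python
import asyncio


async def scrape_all(urls, scrape_one, max_concurrent=3, delay_s=1.0):
    """Run scrape_one(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            try:
                result = await scrape_one(url)
            except Exception as exc:
                result = exc  # Record the failure so one bad site does not stop the run
            await asyncio.sleep(delay_s)  # Pause before freeing the slot
            return url, result

    # gather preserves input order, so results line up with urls
    return await asyncio.gather(*(worker(u) for u in urls))
```

Failures come back as exception objects in the result list, which makes it easy to build the success/failure summary afterwards.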
Custom Analysis
Beyond the standard workflow, you can:
- Compare prices across platforms
- Track price changes over time (run periodically)
- Identify pricing patterns (premium vs discount)
- Analyze promotional strategies
- Monitor competitor activity
Exporting Data
Consider exporting to structured formats:
- CSV: For spreadsheet analysis
- JSON: For programmatic access
- Database: For long-term tracking
Example CSV export:
    import csv
    # products: list of dicts with 'platform', 'name', 'price', 'category' keys
    with open('products.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Platform', 'Product', 'Price', 'Category'])
        for product in products:
            writer.writerow([product['platform'], product['name'],
                             product['price'], product['category']])
Best Practices
Ethical Scraping
1. Respect robots.txt - Check before scraping
2. Rate limiting - Don't overwhelm servers (script includes delays)
3. Terms of Service - Review site ToS
4. Personal use - This skill is for market research, not commercial resale
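The robots.txt check can be done with Python's standard library. This sketch assumes the robots.txt content has already been downloaded as text; `is_allowed` is a hypothetical helper name, and the default user agent `*` is an assumption.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules that block a checkout path for every user agent
rules = "User-agent: *\nDisallow: /checkout/\n"
homepage_ok = is_allowed(rules, "https://shop.example/")
```

Sites that disallow the homepage are best dropped from the WEBSITES list before the scraper runs.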
Data Quality
1. Verify prices - Cross-check suspicious values
2. Update regularly - E-commerce changes fast
3. Document assumptions - Note any manual adjustments
4. Keep raw data - Save screenshots and HTML for reference
Report Quality
1. Be objective - Base conclusions on data
2. Show your work - Reference sources
3. Contextualize - Explain market-specific factors
4. Actionable - Provide specific recommendations
Resources Reference
scripts/scrape_websites.py
Main scraper with automatic popup handling. Uses Playwright to capture homepages.
Usage: uv run python scripts/scrape_websites.py
references/popup_patterns.md
Comprehensive collection of popup selectors for different markets and platforms.
When to read: When encountering new popup types or troubleshooting popup handling.
references/html_parsing_patterns.md
Platform-specific HTML parsing patterns and extraction strategies.
When to read: When extracting product data from HTML files. Contains patterns for Amazon, eBay, REWE, Otto, Zalando, and generic strategies.
assets/report_template.md
Structured template for the final market analysis report.
Usage: Copy and fill in with analysis results.
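E-commerce Market Analyzer
An automated workflow for scraping e-commerce websites, handling popups, extracting product data, and generating comprehensive market analysis reports.
Workflow Overview
This skill follows a 4-step workflow:
1. Setup & Scraping - Run the Playwright scraper to capture homepages
2. Visual Analysis - Analyze screenshots to identify product categories
3. Data Extraction - Parse HTML to extract specific products and prices
4. Report Generation - Create a comprehensive market analysis report
    User provides website list
        ↓
    Step 1: Run scraper (handles popups automatically)
        ↓
    Step 2: Visually analyze screenshots
        ↓
    Step 3: Extract structured data from HTML
        ↓
    Step 4: Generate final report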
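Step 1: Setup & Scraping
Quick Start
When the user provides a list of e-commerce websites, immediately run the scraper:
    # Create the output directory
    mkdir -p screenshots_clean
    # Run the scraper
    uv run python scripts/scrape_websites.py
Customizing the Website List
Edit scripts/scrape_websites.py and update the WEBSITES list:
    WEBSITES = [
        "amazon.de",
        "ebay.de",
        "otto.de",
        # Add more websites...
    ]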
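Key Features
The scraper automatically:
- Handles cookie consent popups (German, English, universal selectors)
- Handles region/language selection dialogs
- Captures full-page screenshots (1920x1080)
- Saves HTML source code
- Uses German locale settings (or customize for other markets)
- Waits for page stabilization
Important: The script uses popup patterns from references/popup_patterns.md. Consult this file when dealing with new popup types.
Expected Output
After running, you'll have:
- screenshots_clean/*.png - Full-page screenshots
- screenshots_clean/*.html - HTML source files
- Console output with success/failure summary
Success rate target: 85-95%
Common failures:
- Anti-bot protection (requires manual intervention)
- HTTP/2 protocol errors (some sites block automation)
- Timeout on slow-loading sites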
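Step 2: Visual Analysis
Read Screenshots
After scraping, read the screenshot files to visually identify:
- Product categories
- Featured products
- Promotional items
- Visual design patterns
Example approach:
    from pathlib import Path
    screenshot_dir = Path("screenshots_clean")
    screenshots = list(screenshot_dir.glob("*.png"))
    # Use the Read tool to view screenshots
    for screenshot in screenshots[:5]:  # Start with 5 sites
        # View the image with the Read tool, then
        # note product categories and featured items
        print(screenshot.name)
What to Look For
Product Categories:
- Clothing & Fashion (Bekleidung)
- Electronics (Elektronik)
- Home & Furniture (Möbel & Wohnen)
- Food & Groceries (Lebensmittel)
- Books & Media (Bücher)
- Beauty & Personal Care (Beauty & Pflege)
- Sports & Outdoor (Sport)
- Toys & Baby (Spielzeug & Baby)
Featured Products:
- Homepage banners
- Promotional sections
- "Deal of the day" items
- New arrivals
Take notes on recurring patterns across multiple sites - these indicate market trends.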
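Step 3: Data Extraction
Strategy Selection
Choose an extraction strategy based on site structure. See references/html_parsing_patterns.md for complete patterns.
Quick decision tree:
1. Try JSON-LD schema extraction (best for structured data)
2. Fall back to data attribute extraction
3. Fall back to class-based extraction
4. Last resort: keyword matching
Example: Extract from REWE.de
    import re
    from pathlib import Path
    html_file = Path("screenshots_clean/rewe.de.html")
    content = html_file.read_text(encoding="utf-8")
    # REWE-specific patterns (verify against references/html_parsing_patterns.md)
    title_pattern = r'data-offer-title="([^"]+)"'
    price_pattern = r'_tag-price">([^<]+)'
    titles = re.findall(title_pattern, content)
    prices = re.findall(price_pattern, content)
    # Pair titles and prices by position; the lists may differ in length
    for i, title in enumerate(titles[:10]):
        price = prices[i] if i < len(prices) else "N/A"
        print(f"{title}: {price}€")
Platform-Specific Parsing
Each e-commerce platform has a unique HTML structure. Consult references/html_parsing_patterns.md for:
- Amazon.de patterns
- eBay.de patterns
- Otto.de patterns
- Zalando/AboutYou patterns
- REWE/Lidl supermarket patterns
- And more...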
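Price Normalization
Always normalize prices:
    def normalize_price(price_str):
        """Convert German format (1.234,56€) to float."""
        price_str = price_str.replace("€", "").replace("EUR", "").strip()
        if "," in price_str and "." in price_str:
            # Dot is the thousands separator, comma the decimal separator
            price_str = price_str.replace(".", "").replace(",", ".")
        elif "," in price_str:
            price_str = price_str.replace(",", ".")
        try:
            return float(price_str)
        except ValueError:
            return None
Handling Large Files
For HTML files >25k tokens:
    # Use grep to search for specific patterns
    grep -o 'data-product-name="[^"]*"' amazon.de.html | head -20
    # Or extract specific sections
    grep -A 5 'product-title' ebay.de.html
Extraction Best Practices
1. Try multiple patterns - Start with JSON-LD, fall back as needed
2. Validate extractions - Check for reasonable length (10-100 chars)
3. Remove duplicates - Use sets to track seen products
4. Limit results - Cap at 10-20 products per site
5. Handle encoding - Always use encoding='utf-8'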
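Step 4: Report Generation
Use the Report Template
Copy and customize assets/report_template.md:
    cp assets/report_template.md final_report.md
Report Structure
The template includes these sections:
1. Executive Summary - Key findings
2. Top Product Categories - Ranked list with percentages
3. Verified Product Prices - Extracted data with exact prices
4. Platform-Specific Analysis - Per-site breakdown
5. Market Trends - Growth trends and consumer behavior
6. Seasonal Characteristics - Current and predicted
7. Technical Implementation - Success metrics and limitations
8. Business Insights - Opportunities and recommendations
9. Data Sources - Success/failure breakdown
10. Conclusions - Actionable takeaways
Filling the Template
Replace placeholder tokens:
- {MARKET} → German, UK, US, etc.
- {NUM_SITES} → 23, 25, etc.
- {DATE} → 2026-03-19
- {SUCCESS_RATE} → 92
- {CATEGORY_1} → Clothing & Fashion
- {PERCENTAGE_1} → 28
- And so on...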
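Data Quality Indicators
Include these metrics:
- Success rate: % of successfully scraped sites
- Popup handling: # of sites with popups handled
- Price accuracy: % of verified prices
- Screenshot quality: Resolution and file size
- HTML completeness: Average file size
Writing Tips
Be bilingual (for German market):
- Product names: German + Chinese/English translation
- Categories: "Bekleidung / Clothing"
- Maintain both languages throughout
Be specific:
- ❌ "Electronics are popular"
- ✅ "AirPods 4 (89,90€ on eBay), PlayStation 5, and Samsung smartphones are top electronics"
Include evidence:
- Reference screenshot file names
- Quote exact prices with sources
- Link specific platforms to products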
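Troubleshooting
Issue: Popup Not Closed
Solution: Check references/popup_patterns.md for the specific site. Add a custom selector if needed:
    # In scripts/scrape_websites.py, add to the popup_selectors list:
    popup_selectors = [
        # ... existing selectors ...
        'button:has-text("Neue Popup Text")',  # Add custom
    ]
Issue: HTML Parsing Returns Empty
Diagnose:
1. Check that the HTML file exists and has content
2. Verify the pattern with grep: grep -o 'your-pattern' file.html
3. Try alternative patterns from references/html_parsing_patterns.md
4. Use keyword matching as a fallback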