E-commerce Market Analyzer

Automated workflow for scraping e-commerce websites, handling popups, extracting product data, and generating comprehensive market analysis reports.

Workflow Overview

This skill follows a 4-step workflow:

1. Setup & Scraping - Run Playwright scraper to capture homepages
2. Visual Analysis - Analyze screenshots to identify product categories
3. Data Extraction - Parse HTML to extract specific products and prices
4. Report Generation - Create comprehensive market analysis report

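User provides website list
↓
Step 1: Run scraper (handles popups automatically)
↓
Step 2: Visually analyze screenshots
↓
Step 3: Extract structured data from HTML
↓
Step 4: Generate final report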

Step 1: Setup & Scraping

Quick Start

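When the user provides a list of e-commerce websites, immediately run the scraper:

# Create output directory
mkdir -p screenshots_clean

# Run the scraper
uv run python scripts/scrape_websites.py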

Customizing the Website List

Edit scripts/scrape_websites.py and update the WEBSITES list:

WEBSITES = [
    "amazon.de",
    "ebay.de",
    "otto.de",
    # Add more sites...
]

Key Features

The scraper automatically:

- Handles cookie consent popups (German, English, universal selectors)
- Handles region/language selection dialogs
- Captures full-page screenshots (1920x1080)
- Saves HTML source code
- Uses German locale settings (or customize for other markets)
- Waits for page stabilization

Important: The script uses popup patterns from references/popup_patterns.md. Consult this file if dealing with new popup types.

Expected Output

After running, you'll have:

- screenshots_clean/*.png - Full-page screenshots
- screenshots_clean/*.html - HTML source files
- Console output with success/failure summary

Success rate target: 85-95%

Common failures:

- Anti-bot protection (requires manual intervention)
- HTTP/2 protocol errors (some sites block automation)
- Timeout on slow-loading sites

Step 2: Visual Analysis

Read Screenshots

After scraping, read the screenshot files to visually identify:

- Product categories
- Featured products
- Promotional items
- Visual design patterns

Example approach:
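from pathlib import Path

screenshot_dir = Path("screenshots_clean")
screenshots = sorted(screenshot_dir.glob("*.png"))

# Use the Read tool to view screenshots
for screenshot in screenshots[:5]:  # Start with 5 sites
    # View the image with the Read tool, then note
    # product categories and featured items
    print(screenshot)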

What to Look For

Product Categories:

- Clothing & Fashion (Bekleidung)
- Electronics (Elektronik)
- Home & Furniture (Möbel & Wohnen)
- Food & Groceries (Lebensmittel)
- Books & Media (Bücher)
- Beauty & Personal Care (Beauty & Pflege)
- Sports & Outdoor (Sport)
- Toys & Baby (Spielzeug & Baby)

Featured Products:

- Homepage banners
- Promotional sections
- "Deal of the day" items
- New arrivals

Take notes on recurring patterns across multiple sites - these indicate market trends.
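One way to turn those notes into a ranked list is to count how many sites show each category. The following is a minimal sketch; the function name `recurring_categories`, the `min_sites` threshold, and the sample notes are illustrative assumptions, not observed data.

```python
from collections import Counter


def recurring_categories(site_notes, min_sites=2):
    """Return (category, site_count) pairs seen on at least min_sites sites."""
    counts = Counter()
    for categories in site_notes.values():
        counts.update(set(categories))  # count each category once per site
    # Sort by count (descending), then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, n) for name, n in ranked if n >= min_sites]


# Hypothetical notes taken while viewing screenshots.
notes = {
    "amazon.de": ["Elektronik", "Bücher"],
    "otto.de": ["Möbel & Wohnen", "Elektronik", "Bekleidung"],
    "zalando.de": ["Bekleidung"],
}
trends = recurring_categories(notes)
```

Categories that appear on only one site are dropped, so the result highlights what is shared across the market rather than what a single retailer promotes.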

Step 3: Data Extraction

Strategy Selection

Choose extraction strategy based on site structure. See references/html_parsing_patterns.md for complete patterns.

Quick decision tree:

1. Try JSON-LD schema extraction (best for structured data)
2. Fall back to data attribute extraction
3. Fall back to class-based extraction
4. Last resort: keyword matching
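The first two steps of this decision tree could be sketched as follows, using only the standard library. The function name `extract_products`, the `data-product-name` attribute, and the sample HTML are assumptions made for illustration; the site-specific patterns live in references/html_parsing_patterns.md.

```python
import json
import re

JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
# Hypothetical data attribute; real sites use their own attribute names.
DATA_ATTR_RE = re.compile(r'data-product-name="([^"]+)"')


def extract_products(html):
    """Return (strategy, products): JSON-LD first, then data attributes."""
    products = []
    for block in JSON_LD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                offers = item.get("offers") or {}
                price = offers.get("price") if isinstance(offers, dict) else None
                products.append({"name": item.get("name"), "price": price})
    if products:
        return "json-ld", products

    names = DATA_ATTR_RE.findall(html)
    if names:
        return "data-attribute", [{"name": n, "price": None} for n in names]

    return "none", []


sample = (
    '<script type="application/ld+json">'
    '{"@type": "Product", "name": "Bio Milch 1L", "offers": {"price": "1.29"}}'
    "</script>"
)
strategy, found = extract_products(sample)
```

Returning the strategy name alongside the results makes it easy to log which fallback level each site needed.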

Example: Extract from REWE.de

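import re
from pathlib import Path

html_file = Path("screenshots_clean/rewe.de.html")
content = html_file.read_text(encoding="utf-8")

# REWE-specific patterns
title_pattern = r'data-offer-title="([^"]+)"'
# Matches the text after any class name ending in "_tag-price"
price_pattern = r'_tag-price">([^<]+)'

titles = re.findall(title_pattern, content)
prices = re.findall(price_pattern, content)

for i, title in enumerate(titles[:10]):
    price = prices[i] if i < len(prices) else "N/A"
    print(f"{title}: {price}€")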

Platform-Specific Parsing

Each e-commerce platform has unique HTML structure. Consult references/html_parsing_patterns.md for:

- Amazon.de patterns
- eBay.de patterns
- Otto.de patterns
- Zalando/AboutYou patterns
- REWE/Lidl supermarket patterns
- And more...

Price Normalization

Always normalize prices:
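def normalize_price(price_str):
    """Convert German format (1.234,56€) to float."""
    price_str = price_str.replace("€", "").replace("EUR", "").strip()
    if "," in price_str and "." in price_str:
        # "1.234,56": drop thousands separator, use "." as decimal point
        price_str = price_str.replace(".", "").replace(",", ".")
    elif "," in price_str:
        price_str = price_str.replace(",", ".")
    try:
        return float(price_str)
    except ValueError:
        return None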

Handling Large Files

For HTML files >25k tokens:
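# Use grep to search for specific patterns
grep -o 'data-product-name="[^"]*"' amazon.de.html | head -20

# Or extract specific sections
grep -A 5 "product-title" ebay.de.html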

Extraction Best Practices

1. Try multiple patterns - Start with JSON-LD, fall back as needed
2. Validate extractions - Check for reasonable length (10-100 chars)
3. Remove duplicates - Use sets to track seen products
4. Limit results - Cap at 10-20 products per site
5. Handle encoding - Always use encoding='utf-8'
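Practices 2 to 4 can be combined in one small post-processing step. This is a minimal sketch under the limits stated above; the helper name `clean_products` and its parameters are hypothetical, and the sample names are illustrative.

```python
def clean_products(names, min_len=10, max_len=100, limit=20):
    """Validate, de-duplicate, and cap a list of extracted product names."""
    seen = set()
    cleaned = []
    for raw in names:
        name = " ".join(raw.split())  # collapse runs of whitespace
        if not (min_len <= len(name) <= max_len):
            continue  # too short or too long to be a product name
        key = name.casefold()
        if key in seen:
            continue  # already recorded this product
        seen.add(key)
        cleaned.append(name)
        if len(cleaned) >= limit:
            break
    return cleaned


raw_names = [
    "Apple AirPods 4",
    "  Apple  AirPods 4 ",  # duplicate once whitespace is normalized
    "TV",                   # rejected: shorter than min_len
    "Sony PlayStation 5 Konsole",
]
result = clean_products(raw_names)
```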

Step 4: Report Generation

Use the Report Template

Copy and customize assets/report_template.md:

cp assets/report_template.md final_report.md

Report Structure

The template includes these sections:

1. Executive Summary - Key findings
2. Top Product Categories - Ranked list with percentages
3. Verified Product Prices - Extracted data with exact prices
4. Platform-Specific Analysis - Per-site breakdown
5. Market Trends - Growth trends and consumer behavior
6. Seasonal Characteristics - Current and predicted
7. Technical Implementation - Success metrics and limitations
8. Business Insights - Opportunities and recommendations
9. Data Sources - Success/failure breakdown
10. Conclusions - Actionable takeaways

Filling the Template

Replace placeholder tokens:

- {MARKET} → German, UK, US, etc.
- {NUMSITES} → 23, 25, etc.
- {DATE} → 2026-03-19
- {SUCCESSRATE} → 92
- {CATEGORY1} → Clothing & Fashion
- {PERCENTAGE1} → 28
- And so on...
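Token replacement can also be scripted. The sketch below assumes placeholders are upper-case names in curly braces, as with {MARKET}; the function name `fill_template`, the `{TOP_CATEGORY}` token, and the one-line template are hypothetical. A regular expression is used instead of `str.format` so that any other braces in the report text are left alone.

```python
import re

TOKEN_RE = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")


def fill_template(template, values):
    """Replace {TOKEN} placeholders; return (text, names of unfilled tokens)."""
    missing = []

    def substitute(match):
        key = match.group(1)
        if key in values:
            return str(values[key])
        missing.append(key)
        return match.group(0)  # leave unknown tokens visible in the output

    return TOKEN_RE.sub(substitute, template), missing


# Hypothetical one-line template; the real file has many more tokens.
template = "{MARKET} E-commerce Report: top category {TOP_CATEGORY}"
text, missing = fill_template(template, {"MARKET": "German"})
```

Checking the returned `missing` list before publishing helps catch tokens that were never filled in.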

Data Quality Indicators

Include these metrics:

- Success rate: % of successfully scraped sites
- Popup handling: # of sites with popups handled
- Price accuracy: % of verified prices
- Screenshot quality: Resolution and file size
- HTML completeness: Average file size
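Some of these indicators can be derived from the output directory itself. The following is a minimal sketch that assumes each successfully scraped site leaves exactly one .html file behind; the function name `scrape_metrics` and the demo files are hypothetical.

```python
import tempfile
from pathlib import Path


def scrape_metrics(output_dir, total_sites):
    """Summarize scraper output, assuming one .html file per successful site."""
    out = Path(output_dir)
    html_sizes = [f.stat().st_size for f in out.glob("*.html")]
    success_rate = 100 * len(html_sizes) / total_sites if total_sites else 0.0
    return {
        "success_rate_pct": round(success_rate, 1),
        "avg_html_bytes": sum(html_sizes) // len(html_sizes) if html_sizes else 0,
        "screenshot_count": len(list(out.glob("*.png"))),
    }


# Demonstration on a throwaway directory: two fake results out of three sites.
with tempfile.TemporaryDirectory() as tmp:
    for site in ("amazon.de", "ebay.de"):
        Path(tmp, f"{site}.html").write_text("<html></html>", encoding="utf-8")
        Path(tmp, f"{site}.png").write_bytes(b"")
    metrics = scrape_metrics(tmp, total_sites=3)
```

Popup handling and price accuracy cannot be inferred from file counts and would still need to come from the scraper's console summary and manual verification.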

Writing Tips

Be bilingual (for German market):

- Product names: German + Chinese/English translation
- Categories: "Bekleidung / Clothing"
- Maintain both languages throughout

Be specific:

- ❌ "Electronics are popular"
- ✅ "AirPods 4 (89,90€ on eBay), PlayStation 5, and Samsung smartphones are top electronics"

Include evidence:

- Reference screenshot file names
- Quote exact prices with sources
- Link specific platforms to products

Troubleshooting

Issue: Popup Not Closed

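Solution: Check references/popup_patterns.md for the specific site. Add a custom selector if needed:

# In scripts/scrape_websites.py, add to the popup_selectors list:
popup_selectors = [
    # ... existing selectors ...
    'button:has-text("Neue Popup Text")',  # Add custom
]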

Issue: HTML Parsing Returns Empty

Diagnose:

1. Check if the HTML file exists and has content
2. Verify the pattern with grep: grep -o "your-pattern" file.html
3. Try alternative patterns from references/html_parsing_patterns.md
4. Use keyword matching as fallback

Issue: Anti-Bot Detection

Symptoms: CAPTCHA, "Verify you are human", IP blocking

Solutions:

1. Add delays between requests (already in script)
2. Customize user agent string
3. Use browser fingerprinting evasion
4. For production: consider proxy rotation (not included)

Issue: Timeout Errors

Solution: Adjust timeout in script:
CODEBLOCK9

Or use a more relaxed loading strategy:

await page.goto(url, wait_until="load", timeout=90000)

Market-Specific Configuration

German Market (Default)

CODEBLOCK11

Popup patterns: See references/popup_patterns.md → German Market section

UK Market

CODEBLOCK12

Popup patterns: Use English/International selectors

US Market

CODEBLOCK13

Other Markets

Adjust locale and timezone_id accordingly. Update the popup selectors in the script to match the site language.
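A per-market lookup table keeps these settings in one place. This is a sketch only: `MARKET_SETTINGS` and `context_options` are hypothetical names, the French entry is an illustrative addition, and it assumes the script creates its browser context with Playwright's `browser.new_context()`, which accepts `locale`, `timezone_id`, and `viewport` keyword arguments.

```python
MARKET_SETTINGS = {
    "de": {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
    "uk": {"locale": "en-GB", "timezone_id": "Europe/London"},
    "us": {"locale": "en-US", "timezone_id": "America/New_York"},
    "fr": {"locale": "fr-FR", "timezone_id": "Europe/Paris"},  # example extra market
}


def context_options(market, viewport=(1920, 1080)):
    """Build keyword arguments for a Playwright browser context."""
    try:
        settings = MARKET_SETTINGS[market.lower()]
    except KeyError:
        raise ValueError(f"Unknown market: {market!r}") from None
    width, height = viewport
    return {**settings, "viewport": {"width": width, "height": height}}


options = context_options("fr")
# With Playwright (assumed usage): context = await browser.new_context(**options)
```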

Advanced Usage

Parallel Scraping

For large website lists, modify the script to scrape sites concurrently:

CODEBLOCK14

Note: Be respectful of rate limits. Use delays.
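One way to combine concurrency with rate limiting is a semaphore that caps the number of sites in flight, plus a pause before each slot is released. The sketch below is not the script's implementation: `scrape_all`, `max_concurrent`, and `delay_seconds` are hypothetical names, and `fake_scrape` stands in for the real Playwright coroutine so the example runs without a browser.

```python
import asyncio


async def scrape_all(urls, scrape_one, max_concurrent=3, delay_seconds=1.0):
    """Run scrape_one(url) for every URL, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            try:
                result = await scrape_one(url)
            except Exception as exc:  # record the failure and keep going
                result = exc
            await asyncio.sleep(delay_seconds)  # pause before freeing the slot
            return url, result

    return dict(await asyncio.gather(*(worker(u) for u in urls)))


async def fake_scrape(url):
    """Stand-in for the real scraping coroutine."""
    await asyncio.sleep(0.01)
    if "broken" in url:
        raise TimeoutError(url)
    return f"<html>{url}</html>"


results = asyncio.run(
    scrape_all(["amazon.de", "ebay.de", "broken.example"], fake_scrape, delay_seconds=0)
)
```

Failures are stored as exception objects in the result dictionary, so one blocked site does not abort the whole run and the success rate can be computed afterwards.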

Custom Analysis

Beyond the standard workflow, you can:

- Compare prices across platforms
- Track price changes over time (run periodically)
- Identify pricing patterns (premium vs discount)
- Analyze promotional strategies
- Monitor competitor activity
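As an illustration of the first item, a cross-platform price comparison might look like the sketch below. It assumes product records are dictionaries with `platform`, `name`, and a numeric `price`; the function name `compare_prices` and the sample records are hypothetical, not scraped data.

```python
from collections import defaultdict


def compare_prices(products):
    """Group prices by product name and report the cheapest platform."""
    by_name = defaultdict(dict)
    for p in products:
        by_name[p["name"]][p["platform"]] = p["price"]

    comparison = {}
    for name, prices in by_name.items():
        if len(prices) < 2:
            continue  # found on one platform only, nothing to compare
        cheapest = min(prices, key=prices.get)
        comparison[name] = {
            "cheapest": cheapest,
            "min": prices[cheapest],
            "spread": round(max(prices.values()) - prices[cheapest], 2),
        }
    return comparison


# Illustrative records only.
products = [
    {"platform": "ebay.de", "name": "AirPods 4", "price": 89.90},
    {"platform": "amazon.de", "name": "AirPods 4", "price": 99.00},
    {"platform": "otto.de", "name": "PlayStation 5", "price": 449.00},
]
report = compare_prices(products)
```

In practice product names rarely match exactly across platforms, so some name normalization would be needed before grouping.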

Exporting Data

Consider exporting to structured formats:

- CSV: For spreadsheet analysis
- JSON: For programmatic access
- Database: For long-term tracking

Example CSV export:

import csv

with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Platform', 'Product', 'Price', 'Category'])
    for product in products:
        writer.writerow([product['platform'], product['name'],
                        product['price'], product['category']])

Best Practices

Ethical Scraping

1. Respect robots.txt - Check before scraping
2. Rate limiting - Don't overwhelm servers (script includes delays)
3. Terms of Service - Review site ToS
4. Personal use - This skill is for market research, not commercial resale
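The robots.txt check in item 1 can be automated with the standard library's `urllib.robotparser`. This is a minimal sketch; the helper name `is_allowed`, the `shop.example` host, and the sample rules are assumptions for illustration, and the real file should be fetched from each site's root.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative robots.txt content.
sample_rules = """\
User-agent: *
Disallow: /checkout/
"""

home_ok = is_allowed(sample_rules, "https://shop.example/")
checkout_ok = is_allowed(sample_rules, "https://shop.example/checkout/cart")
```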

Data Quality

1. Verify prices - Cross-check suspicious values
2. Update regularly - E-commerce changes fast
3. Document assumptions - Note any manual adjustments
4. Keep raw data - Save screenshots and HTML for reference

Report Quality

1. Be objective - Base conclusions on data
2. Show your work - Reference sources
3. Contextualize - Explain market-specific factors
4. Actionable - Provide specific recommendations

Resources Reference

scripts/scrape_websites.py

Main scraper with automatic popup handling. Uses Playwright to capture homepages.

Usage: uv run python scripts/scrape_websites.py

references/popup_patterns.md

Comprehensive collection of popup selectors for different markets and platforms.

When to read: When encountering new popup types or troubleshooting popup handling.

references/html_parsing_patterns.md

Platform-specific HTML parsing patterns and extraction strategies.

When to read: When extracting product data from HTML files. Contains patterns for Amazon, eBay, REWE, Otto, Zalando, and generic strategies.

assets/report_template.md

Structured template for the final market analysis report.

Usage: Copy and fill in with analysis results.
