Xiaohongshu Extract
Overview
Extract note metadata (title, desc, type, time, user, engagement, tags, video stream info) from an XHS share or discovery URL using the bundled script.
Quick Start
Run the extractor and print JSON to stdout:
```bash
python scripts/xiaohongshu_extract.py <url> --pretty
```
Write JSON to a file:
```bash
python scripts/xiaohongshu_extract.py <url> --output /tmp/xhs_note.json
```
Output only the flattened record:
```bash
python scripts/xiaohongshu_extract.py <url> --flat-only --pretty
```
Write only the flattened record to a file:
```bash
python scripts/xiaohongshu_extract.py <url> --flat-only --output /tmp/xhs_flat.json
```
Emit errors as JSON:
```bash
python scripts/xiaohongshu_extract.py <url> --error-json
```
Emit errors as JSON to a file:
```bash
python scripts/xiaohongshu_extract.py <url> --error-json --output /tmp/xhs_error.json
```
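The Quick Start variants differ only in their flags (--pretty, --flat-only, --error-json, --output), so a caller that drives the script from Python could assemble the argument list in one place. The following is a minimal sketch; the helper name `build_command` and its keyword arguments are hypothetical and not part of the bundled script.

```python
import sys

# Script path as given in this document, relative to the skill directory.
SCRIPT = "scripts/xiaohongshu_extract.py"


def build_command(url, pretty=False, flat_only=False, error_json=False, output=None):
    """Assemble the argument list for one extractor run (hypothetical helper)."""
    cmd = [sys.executable, SCRIPT, url]
    if flat_only:
        cmd.append("--flat-only")    # print only the flattened record
    if error_json:
        cmd.append("--error-json")   # emit errors as JSON
    if pretty:
        cmd.append("--pretty")       # pretty-print the JSON
    if output is not None:
        cmd += ["--output", output]  # write to a file instead of stdout
    return cmd
```

For example, `build_command(url, flat_only=True, output="/tmp/xhs_flat.json")` yields the same arguments as the "flattened record to a file" command. Returning a list rather than a string means the result can be passed to `subprocess.run` without shell quoting.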
Workflow
1. Run scripts/xiaohongshu_extract.py with the user-provided URL.
2. If the script fails to find window.__INITIAL_STATE__, ask the user for a direct discovery URL.
3. Use the JSON output to summarize note metadata or to feed downstream analysis.
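The workflow can be sketched as a small wrapper around the script. This is an illustration under stated assumptions, not the script's own behavior: it assumes a failed run exits with a non-zero status, and it cannot tell whether the cause was a missing window.__INITIAL_STATE__ or something else, so it falls back to the same hint in every failure case. The function name `extract_note` and the injectable `run` parameter are hypothetical.

```python
import json
import subprocess
import sys

SCRIPT = "scripts/xiaohongshu_extract.py"
HINT = "Extraction failed; please provide a direct discovery URL."


def extract_note(url, run=subprocess.run):
    """Run the extractor on url and return (note, hint).

    note is the parsed JSON object on success, otherwise None.
    hint is a message for the user on failure, otherwise None.
    """
    proc = run([sys.executable, SCRIPT, url], capture_output=True, text=True)
    if proc.returncode != 0:
        # Assumption: the script signals failure with a non-zero exit status.
        return None, HINT
    try:
        return json.loads(proc.stdout), None
    except json.JSONDecodeError:
        # Output that is not JSON is treated as a failed extraction too.
        return None, HINT
```

Passing `run` as a parameter keeps the wrapper testable without network access; in normal use the default `subprocess.run` is sufficient.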
Output Notes
The script returns a JSON object with:
- note_id, title, desc, type, time, ip_location
- user (nickname, userid, avatar)
- interact (liked/collected/comment/share counts, plus normalized *num values)
- tags
- video (videoid, duration, width, height, fps, size, streamurl)
- field_mapping (nested-to-flat field name map)
- flat (flattened record with normalized counts and ISO timestamp)
If the stream list is empty, video fields may be null or empty.
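Because video fields may be null or empty, code that summarizes a note should not assume they are present. The sketch below reads a few of the documented nested fields (title, type, user nickname, video duration) defensively; the function name `summarize` and the sample values in the usage note are hypothetical.

```python
def summarize(note):
    """Return a one-line summary of a nested note record."""
    # Either object may be missing or null, so fall back to an empty dict.
    user = note.get("user") or {}
    video = note.get("video") or {}
    parts = [
        f"title={note.get('title')!r}",
        f"type={note.get('type')!r}",
        f"author={user.get('nickname')!r}",
    ]
    duration = video.get("duration")
    if duration not in (None, ""):
        parts.append(f"duration={duration}")
    else:
        # Covers a null video object as well as a null or empty duration.
        parts.append("no video stream info")
    return ", ".join(parts)
```

Calling `summarize({"title": "A", "type": "video", "user": {"nickname": "n"}, "video": {"duration": 12}})` returns `"title='A', type='video', author='n', duration=12"`, while a record whose video is null ends with `no video stream info`.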
If --flat-only is set, only flat is printed. If --error-json is set, errors are emitted as JSON and may include final_url and status_code when available.
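When --error-json is set, a caller can turn the error payload into a readable message. The sketch below relies only on the two optional fields named above, final_url and status_code; the key `error` for the message text is an assumption, since the exact shape of the error object is not documented here, and `describe_error` is a hypothetical name.

```python
import json


def describe_error(raw):
    """Format an --error-json payload (a JSON string) as one line of text."""
    payload = json.loads(raw)
    # Assumption: the message lives under "error"; adjust to the real key.
    message = payload.get("error", "unknown error")
    details = []
    # Both fields are optional and appear only when available.
    if payload.get("final_url"):
        details.append(f"final_url={payload['final_url']}")
    if payload.get("status_code") is not None:
        details.append(f"status_code={payload['status_code']}")
    if not details:
        return message
    return f"{message} ({', '.join(details)})"
```

Reading the optional fields with `.get` means the function works whether or not the script was able to record them.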
Resources
scripts/
- scripts/xiaohongshu_extract.py extracts note metadata from XHS share/discovery URLs.
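Xiaohongshu Extraction Tool
Overview
Use the bundled script to extract note metadata (title, description, type, time, user, engagement data, tags, video stream info) from a Xiaohongshu share or discovery link.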
Quick Start
Run the extractor and print JSON to standard output:
```bash
python scripts/xiaohongshu_extract.py <url> --pretty
```
Write JSON to a file:
```bash
python scripts/xiaohongshu_extract.py <url> --output /tmp/xhs_note.json
```
Output only the flattened record:
```bash
python scripts/xiaohongshu_extract.py <url> --flat-only --pretty
```
Write only the flattened record to a file:
```bash
python scripts/xiaohongshu_extract.py <url> --flat-only --output /tmp/xhs_flat.json
```
Emit error information as JSON:
```bash
python scripts/xiaohongshu_extract.py <url> --error-json
```
Write error information as JSON to a file:
```bash
python scripts/xiaohongshu_extract.py <url> --error-json --output /tmp/xhs_error.json
```
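Workflow
1. Run scripts/xiaohongshu_extract.py with the URL provided by the user.
2. If the script cannot find window.__INITIAL_STATE__, ask the user for a direct discovery link.
3. Use the JSON output to summarize the note metadata or to feed downstream analysis.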
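Output Notes
The script returns a JSON object containing:
- note_id, title, desc, type, time, ip_location
- user (nickname, user ID, avatar)
- interact (like/collect/comment/share counts, plus normalized *num values)
- tags
- video (video ID, duration, width, height, frame rate, size, stream URL)
- field_mapping (nested-to-flat field name map)
- flat (flattened record with normalized counts and an ISO timestamp)
If the stream list is empty, the video fields may be null or empty.
If --flat-only is set, only flat is output. If --error-json is set, error information is output as JSON and includes final_url and status_code when available.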
Resources
scripts/
- scripts/xiaohongshu_extract.py extracts note metadata from Xiaohongshu share/discovery links.