Visible Text Extractor

Use this skill to turn a webpage article, URL, screenshot set, long image set, or local image collection into complete, readable, reusable text.

Core workflow

1. Extract visible body text from the main source.
2. Discover ordered images and GIF-like assets.
3. OCR image content when needed.
4. Preserve a raw/audit layer.
5. Run a human-first cleanup pass.
6. Classify image-like content by likely information type.
7. Reconstruct image content into human-readable supplements instead of raw OCR dumps.
8. Output polished markdown first; keep raw OCR as JSON or appendix data.
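
The split between a polished layer and a raw/audit layer can be sketched as follows. This is a minimal illustration, not the skill's actual data model: the class names `ImageBlock` and `Extraction`, the field names, and the `kind` labels are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ImageBlock:
    source: str            # image path or URL
    raw_ocr: str           # untouched OCR text, kept for audit
    kind: str = "unknown"  # e.g. "price", "rules", "poster" (illustrative labels)
    supplement: str = ""   # human-readable reconstruction of the image content


@dataclass
class Extraction:
    title: str
    body: list = field(default_factory=list)    # paragraphs in order
    images: list = field(default_factory=list)  # ImageBlock items in order


def to_markdown(doc):
    """Render the polished layer: body text plus cleaned image supplements."""
    lines = [f"# {doc.title}", ""]
    for paragraph in doc.body:
        lines += [paragraph, ""]
    for image in doc.images:
        if image.supplement:
            lines += [f"Image supplement ({image.kind}): {image.supplement}", ""]
    return "\n".join(lines).rstrip() + "\n"


def to_raw_json(doc):
    """Serialize everything, raw OCR included, for the audit layer."""
    return json.dumps(asdict(doc), ensure_ascii=False, indent=2)
```

The markdown output never contains `raw_ocr`, while the JSON output always does, so the reader-facing result stays clean and the audit trail stays complete.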

What this skill is good at

- General webpage article extraction
- WeChat / 公众号 article extraction with special handling
- News pages, blogs, tutorials, explainers, and image-heavy articles
- Screenshots and long-image OCR
- Image directory OCR in display order
- GIF frame extraction plus OCR when ffmpeg is available
- Rebuilding noisy OCR into a cleaner reading version
- Producing either reader-friendly clean output or full transcript-style output

Main script

- scripts/extractvisibletext.py

Supporting resources

- scripts/postprocess_ocr_text.py: clean OCR output, merge broken spacing, remove obvious garbage, and regroup into readable sections
- scripts/extractwithbrowser.js: browser-rendered fallback for JS-heavy pages
- scripts/extractgifframes.sh: GIF frame extraction via ffmpeg
- scripts/builddeliverabledocx.js: convert cleaned markdown into a Word document
- scripts/buildtranscriptdocx.js: convert transcript-style markdown into a Word document
- scripts/buildauthorizedcapture_docx.py: one-step pipeline that turns already-authorized browser pages, saved HTML, screenshots, and mixed inputs into a clean markdown + JSON + Word deliverable
- scripts/extractvisibletext_deliverable.py: one-step pipeline from source input to a clean markdown + JSON + Word deliverable
- scripts/extractvisibletexttranscriptdeliverable.py: one-step pipeline for transcript-style full extraction output
- scripts/extractvisibletextreadingorderdeliverable.py: one-step pipeline for reading-order transcript output
- scripts/build_wechat_interleaved_docx.py: reconstruct WeChat article reading order by interleaving extracted body blocks and image OCR text in original flow order
- scripts/ocrhighaccuracy.py: higher-accuracy OCR with preprocessing variants and segmented long-image handling
- references/output-schema.md: target output structure and cleanup rules
- references/deliverable-workflow.md: one-step deliverable workflow guidance
- references/troubleshooting.md: failure patterns, environment limits, and how to respond cleanly
- references/product-positioning.md: what mature deliverable quality means for this skill
- references/generalization-plan.md: how to evolve the skill across travel deals, rule pages, event posters, and tutorial long images
- references/universal-article-extractor-spec.md: generalized capability contract for article, mixed-media, and screenshot-heavy extraction

Required behavior

When raw OCR is noisy, do not stop at extraction.

- Keep the raw candidate layer for traceability.
- Prefer readability over raw OCR score when two candidates are close.
- Remove decorative fragments, isolated symbols, repeated garbage, and near-duplicate lines from the polished result.
- Keep uncertainty visible instead of pretending confidence.
- Never silently drop a major section when partial reconstruction is possible.
- Never present a raw OCR dump as the final answer if a cleaner reconstruction can be produced.
- Preserve article structure when available: title, subtitle, author/source/time, heading levels, paragraphs, lists, captions, table-like rows, and appended notes.
- Treat information-bearing images as first-class content rather than an appendix afterthought.
- For image-heavy pages, support transcript-style and reading-order outputs in addition to clean article outputs.
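
One way the cleanup rules above might look in code is sketched below. It is an assumption-laden illustration, not the logic of the skill's post-processing script: the function name `clean_ocr_lines`, the 0.9 similarity threshold, and the "no letters, digits, or CJK characters" test for decorative lines are all hypothetical choices.

```python
import re
from difflib import SequenceMatcher

# A line counts as informative only if it has a letter, digit, or CJK character.
INFORMATIVE = re.compile(r"[0-9A-Za-z\u4e00-\u9fff]")


def clean_ocr_lines(lines, similarity=0.9):
    """Return a polished copy of OCR lines; the caller keeps the raw list for audit."""
    kept = []
    for line in lines:
        text = " ".join(line.split())  # collapse broken spacing
        if not INFORMATIVE.search(text):
            continue  # decorative fragment or isolated symbols
        if any(SequenceMatcher(None, text, k).ratio() >= similarity for k in kept):
            continue  # near-duplicate of a line already kept
        kept.append(text)
    return kept
```

Because the function returns a new list and never mutates its input, the raw candidate layer stays available for traceability.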

WeChat / 公众号 handling

For mp.weixin.qq.com URLs:

- Try dedicated article extraction first when available.
- Fall back to static HTML parsing.
- Fall back again to browser rendering if needed.
- When the user cares about article readability, prefer reconstructing the final Word output in original reading order instead of appending all image OCR at the end.
- Use scripts/build_wechat_interleaved_docx.py when the task is specifically “keep original article order” for WeChat posts.
- If the page is blocked / validation-gated, report blocked: true clearly instead of pretending success.
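
The fallback order and the honest `blocked: true` report could be sketched like this. Everything here is hypothetical: the function names, the result dictionary shape, and especially the `BLOCK_MARKERS` phrases, which are example strings rather than a list taken from the skill.

```python
# Example phrases only; a real validation page may use different wording.
BLOCK_MARKERS = ("环境异常", "完成验证", "verify you are human")


def looks_blocked(text):
    """Heuristic check for validation or interstitial text."""
    lowered = text.lower()
    return any(marker.lower() in lowered for marker in BLOCK_MARKERS)


def extract_with_fallbacks(url, strategies):
    """Try (name, fetch) pairs in order; report blocked=True if none yields real text."""
    attempts = []
    for name, fetch in strategies:
        try:
            text = fetch(url)
        except Exception as exc:  # one failing strategy must not hide the others
            attempts.append(f"{name}: error ({exc})")
            continue
        if text and not looks_blocked(text):
            return {"blocked": False, "strategy": name, "text": text, "attempts": attempts}
        attempts.append(f"{name}: blocked or empty")
    return {"blocked": True, "strategy": None, "text": "", "attempts": attempts}
```

A caller would pass the dedicated extractor, the static HTML parser, and the browser renderer as the three strategies, in that order. The `attempts` list keeps the failure trail visible instead of hiding it.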

Typical commands

Extract URL to markdown:

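```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --url https://example.com/post \
  --format markdown \
  --output result.md
```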

Extract URL to JSON:

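```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --url https://example.com/post \
  --format json \
  --output result.json
```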

Extract WeChat article with fallbacks:

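```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --format markdown \
  --output wechat.md
```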

Extract local screenshot or long image:

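```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --image ./screenshot.png \
  --ocr-images \
  --format markdown \
  --output image-result.md
```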

Run OCR post-processing:

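```bash
python3 {baseDir}/scripts/postprocess_ocr_text.py \
  --input-json ./ocr-result.json \
  --title "Clean Result" \
  --body-text "Optional summary or body text" \
  --output-json ./clean.json \
  --output-markdown ./clean.md
```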

Run the one-step deliverable pipeline:

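```bash
python3 {baseDir}/scripts/extractvisibletext_deliverable.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/result
```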

This should emit:

- result.raw.json
- result.clean.json
- result.clean.md
- result.docx

Run the already-authorized capture pipeline when the page can be opened in a browser or exported/saved first:

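```bash
python3 {baseDir}/scripts/buildauthorizedcapture_docx.py \
  --url https://example.com/page \
  --browser-capture \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/captured
```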

Useful cases:

- browser can open the page but direct fetch is incomplete
- user provides a saved HTML page plus screenshots
- user wants one command that turns visible page content into a Word document
- user wants status visibility instead of silent long waits

Operational expectations for this pipeline:

- print stage logs so long OCR jobs do not look stuck
- fail loudly if expected outputs are not created
- detect obvious WeChat validation/interstitial text early
- optionally send the generated docx back to Feishu in one run
- when a source is blocked, stop pretending and switch to authorized-input workflows: saved HTML, screenshots, long images, copied text
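
The "print stage logs" and "fail loudly" expectations might be implemented along these lines. This is a sketch under assumptions: the helper names are hypothetical, and the four suffixes simply mirror the result.raw.json, result.clean.json, result.clean.md, and result.docx outputs named in this document.

```python
import sys
from pathlib import Path

# Suffixes mirror the deliverable outputs named in this document.
EXPECTED_SUFFIXES = (".raw.json", ".clean.json", ".clean.md", ".docx")


def log_stage(stage, message):
    """Print a progress line immediately so long OCR jobs do not look stuck."""
    print(f"[{stage}] {message}", file=sys.stderr, flush=True)


def missing_outputs(prefix):
    """List expected output files that do not exist for this output prefix."""
    candidates = [Path(f"{prefix}{suffix}") for suffix in EXPECTED_SUFFIXES]
    return [path for path in candidates if not path.is_file()]


def require_outputs(prefix):
    """Raise instead of returning quietly when a deliverable is missing."""
    missing = missing_outputs(prefix)
    if missing:
        names = ", ".join(str(path) for path in missing)
        raise FileNotFoundError(f"expected outputs not created: {names}")
    log_stage("verify", f"all {len(EXPECTED_SUFFIXES)} outputs present for {prefix}")
```

Calling `require_outputs("./deliverable/result")` as the last pipeline step turns a silently incomplete run into an explicit error.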

Practical optimization rule:

- do not keep hammering a blocked source in the same mode
- if browser/direct fetch returns validation text, pivot immediately to the best authorized artifact path
- prioritize delivery quality: visible content captured by the user is better than repeated blocked fetch attempts

Key options

- --url webpage URL
- --text-file local plain text / markdown input
- --html-file local saved HTML page
- --image PATH add one local image or GIF; repeat as needed
- --image-dir DIR OCR all supported images / GIFs in a directory
- --format markdown|json output format
- --output PATH output file path
- --ocr-images OCR discovered or provided images
- --dedupe deduplicate repeated merged lines
- --browser-fallback use browser-rendered fallback for incomplete pages
- --page-screenshot-ocr OCR the browser full-page screenshot as a last resort
- --gif-mode none|placeholder conservative GIF handling mode
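
For orientation, the option set could be mirrored with `argparse` roughly as below. This is a sketch, not the main script's actual parser: the option names come from this document, but the defaults, the `prog` name, and the choice of `store_true` flags are assumptions.

```python
import argparse


def build_parser():
    """Mirror the documented options; defaults here are illustrative guesses."""
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument("--url", help="webpage URL")
    parser.add_argument("--text-file", help="local plain text / markdown input")
    parser.add_argument("--html-file", help="local saved HTML page")
    # Repeatable: each --image adds one path to the list.
    parser.add_argument("--image", action="append", default=[], metavar="PATH")
    parser.add_argument("--image-dir", metavar="DIR")
    parser.add_argument("--format", choices=("markdown", "json"), default="markdown")
    parser.add_argument("--output", metavar="PATH")
    parser.add_argument("--ocr-images", action="store_true")
    parser.add_argument("--dedupe", action="store_true")
    parser.add_argument("--browser-fallback", action="store_true")
    parser.add_argument("--page-screenshot-ocr", action="store_true")
    parser.add_argument("--gif-mode", choices=("none", "placeholder"), default="placeholder")
    return parser
```

With this sketch, `build_parser().parse_args(["--image", "a.png", "--image", "b.gif", "--ocr-images"])` collects both image paths into one list, which matches the "repeat as needed" wording.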

Quality standard

Default target: produce something a human can read comfortably and share without cleanup.

Release-quality target for article deliverables:

- preserve the article's original reading order whenever the source structure allows it
- avoid dumping all image OCR at the end when images belong in the middle of the article
- prefer a comfortable reading experience over a mechanically grouped OCR appendix
- keep English-heavy charts, dashboards, and mixed Chinese-English figures readable enough that key labels, axes, legends, and result summaries survive extraction

The skill should increasingly treat extraction as a full article understanding and recovery problem, not only a body scrape plus OCR problem:

- recover visible article structure from normal webpages, WeChat posts, blogs, tutorials, and mixed-media articles
- infer whether an image is mainly a price/product page, rules page, poster/event page, course outline, scenery/introduction card, or table-like detail page
- pull out high-value facts first when the user wants a clean readable result
- preserve near-complete text when the user wants transcript completeness
- avoid raw OCR dumps as the main deliverable unless the user explicitly wants audit output

When the user explicitly wants completeness, the skill must support a fuller extraction mode:

- treat each discovered image as a first-class source
- prefer segmented OCR for tall or dense images
- preserve near-complete per-image text blocks before compressing into summaries
- keep summary and full-text layers separate instead of replacing one with the other
- support reading-order transcript output so text and image-derived content can be followed from start to finish
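
Segmented OCR for tall images usually starts by slicing the image into overlapping horizontal bands, so that a text line cut at one boundary appears whole in the neighbouring band. A minimal sketch of the box arithmetic follows; the function name and the default `max_height` and `overlap` values are assumptions, not settings taken from the high-accuracy OCR script.

```python
def segment_boxes(width, height, max_height=2000, overlap=120):
    """Return (left, upper, right, lower) boxes covering a tall image top to bottom."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if not 0 <= overlap < max_height:
        raise ValueError("overlap must be non-negative and smaller than max_height")
    boxes = []
    top = 0
    while True:
        bottom = min(top + max_height, height)
        boxes.append((0, top, width, bottom))
        if bottom >= height:
            break
        top = bottom - overlap  # step back so boundary lines appear in both bands
    return boxes
```

Each tuple follows the (left, upper, right, lower) convention that Pillow's `Image.crop` accepts, so a caller could crop and OCR each band in turn, then dedupe the lines repeated in the overlap.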

For clean article outputs, prefer a structure like:

1. Title
2. Metadata (author/source/time) when meaningful
3. Main sections in order
4. Integrated image-derived supplements where needed
5. Uncertainty notes only when necessary

For transcript outputs, prefer a structure like:

1. Title
2. Intro/body chunks in order
3. Image text blocks in order or reading order
4. Tail matter / credits / appended notes
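
A reading-order transcript can be assembled by giving every body chunk and every image text block a position in the original flow and rendering them in one pass. The sketch below assumes such positions are already known; the `Block` class, the `[Image text]` label, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    position: int  # index in the original article flow
    kind: str      # "text" or "image"
    content: str


def reading_order_markdown(title, blocks):
    """Interleave text and image-derived blocks by original position."""
    lines = [f"# {title}", ""]
    for block in sorted(blocks, key=lambda b: b.position):
        content = block.content.strip()
        if not content:
            continue  # skip empty OCR results rather than emit blank sections
        if block.kind == "image":
            lines.append(f"[Image text] {content}")
        else:
            lines.append(content)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Labelling image-derived lines keeps their provenance visible to the reader without pushing them into an appendix.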

Mature-skill rule:

- default users toward the clean markdown / docx outputs unless they ask for transcript completeness
- keep raw JSON for audit, not as the main deliverable
- degrade honestly when the source is blocked or image quality is poor
- do not optimize only for one article family; keep checking travel-deal posts, rule/scoring posts, event posters, news/blog/tutorial pages, and course-outline long images

Read these references when needed:

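- references/output-schema.md
- references/deliverable-workflow.md
- references/troubleshooting.md
- references/product-positioning.md
- references/generalization-plan.md
- references/universal-article-extractor-spec.md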

Environment notes

- OCR depends on the local ocr-local skill or a compatible Tesseract.js setup.
- Browser fallback depends on real browser availability plus playwright-core support.
- GIF frame extraction depends on ffmpeg.
- Some pages remain partially inaccessible due to login, anti-bot, or validation flows; mark those limits explicitly.
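
A pre-flight check along these lines could surface the limits above before a long run starts. It is only a sketch: the helper names are hypothetical, and treating `node` on PATH as a proxy for browser-fallback support is an assumption (it does not prove that playwright-core or a browser is installed).

```python
import shutil


def check_tools(names=("ffmpeg", "node")):
    """Map each tool name to whether an executable is found on PATH."""
    return {name: shutil.which(name) is not None for name in names}


def degrade_notes(status):
    """Turn missing tools into explicit notes for the final report."""
    notes = []
    if not status.get("ffmpeg", False):
        notes.append("GIF frame extraction skipped: ffmpeg not found on PATH")
    if not status.get("node", False):
        notes.append("Browser fallback unavailable: node not found on PATH")
    return notes
```

Appending the result of `degrade_notes(check_tools())` to the output keeps environment limits explicit rather than implied.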

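Visible Text Extractor

Use this skill to turn a webpage article, URL, screenshot set, long-image set, or local image collection into complete, readable, reusable text.

Core workflow

1. Extract visible body text from the main source.
2. Discover ordered images and GIF-like assets.
3. OCR image content when needed.
4. Preserve a raw/audit layer.
5. Run a human-first cleanup pass.
6. Classify image-like content by likely information type.
7. Reconstruct image content into human-readable supplements instead of raw OCR dumps.
8. Output polished Markdown first; keep raw OCR as JSON or appendix data.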

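What this skill is good at

- General webpage article extraction
- WeChat / Official Account (公众号) article extraction with special handling
- News pages, blogs, tutorials, explainers, and image-heavy articles
- Screenshot and long-image OCR
- Image directory OCR in display order
- GIF frame extraction plus OCR when ffmpeg is available
- Rebuilding noisy OCR results into a cleaner reading version
- Producing either reader-friendly clean output or full transcript-style output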

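Main script

- scripts/extractvisibletext.py

Supporting resources

- scripts/postprocess_ocr_text.py: clean OCR output, merge broken spacing, remove obvious garbage, and regroup into readable paragraphs
- scripts/extractwithbrowser.js: browser-rendered fallback for JS-heavy pages
- scripts/extractgifframes.sh: GIF frame extraction via ffmpeg
- scripts/builddeliverabledocx.js: convert cleaned Markdown into a Word document
- scripts/buildtranscriptdocx.js: convert transcript-style Markdown into a Word document
- scripts/buildauthorizedcapture_docx.py: one-step pipeline that turns already-authorized browser pages, saved HTML, screenshots, and mixed inputs into a clean Markdown + JSON + Word deliverable
- scripts/extractvisibletext_deliverable.py: one-step pipeline from source input to a clean Markdown + JSON + Word deliverable
- scripts/extractvisibletexttranscriptdeliverable.py: one-step pipeline for transcript-style full extraction output
- scripts/extractvisibletextreadingorderdeliverable.py: one-step pipeline for reading-order transcript output
- scripts/build_wechat_interleaved_docx.py: reconstruct WeChat article reading order by interleaving extracted body blocks and image OCR text in original flow order
- scripts/ocrhighaccuracy.py: higher-accuracy OCR with preprocessing variants and segmented long-image handling
- references/output-schema.md: target output structure and cleanup rules
- references/deliverable-workflow.md: one-step deliverable workflow guidance
- references/troubleshooting.md: failure patterns, environment limits, and how to respond gracefully
- references/product-positioning.md: what mature deliverable quality means for this skill
- references/generalization-plan.md: how to extend the skill to travel deals, rule pages, event posters, and tutorial long images
- references/universal-article-extractor-spec.md: general capability contract for article, mixed-media, and screenshot-heavy extraction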

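Required behavior

When raw OCR is noisy, do not stop at extraction.

- Keep the raw candidate layer for traceability.
- Prefer readability over raw OCR score when two candidates are close.
- Remove decorative fragments, isolated symbols, repeated garbage, and near-duplicate lines from the polished result.
- Keep uncertainty visible instead of pretending confidence.
- Never silently drop a major section when partial reconstruction is possible.
- Never present a raw OCR dump as the final answer if a cleaner reconstruction can be produced.
- Preserve article structure when available: title, subtitle, author/source/time, heading levels, paragraphs, lists, captions, table-like rows, and appended notes.
- Treat information-bearing images as first-class content rather than an appendix added afterwards.
- For image-heavy pages, support transcript-style and reading-order outputs in addition to clean article outputs.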

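WeChat / Official Account (公众号) handling

For mp.weixin.qq.com URLs:

- Try dedicated article extraction first when available.
- Fall back to static HTML parsing.
- Fall back again to browser rendering if needed.
- When the user cares about article readability, prefer reconstructing the final Word output in original reading order instead of appending all image OCR at the end.
- Use scripts/build_wechat_interleaved_docx.py when the task is specifically to keep the original article order of a WeChat post.
- If the page is blocked or validation-gated, report blocked: true clearly instead of pretending success.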

Typical commands

Extract URL to Markdown:

```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --url https://example.com/post \
  --format markdown \
  --output result.md
```

Extract URL to JSON:

```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --url https://example.com/post \
  --format json \
  --output result.json
```

Extract WeChat article with fallbacks:

```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --format markdown \
  --output wechat.md
```

Extract local screenshot or long image:

```bash
python3 {baseDir}/scripts/extractvisibletext.py \
  --image ./screenshot.png \
  --ocr-images \
  --format markdown \
  --output image-result.md
```

Run OCR post-processing:

```bash
python3 {baseDir}/scripts/postprocess_ocr_text.py \
  --input-json ./ocr-result.json \
  --title "Clean Result" \
  --body-text "Optional summary or body text" \
  --output-json ./clean.json \
  --output-markdown ./clean.md
```

Run the one-step deliverable pipeline:

```bash
python3 {baseDir}/scripts/extractvisibletext_deliverable.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/result
```

This should emit:

- result.raw.json
- result.clean.json
- result.clean.md
- result.docx

Run the already-authorized capture pipeline when the page can be opened in a browser or exported/saved first:

```bash
python3 {baseDir}/scripts/buildauthorizedcapture_docx.py \
  --url https://example.com/page \
  --browser-capture \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/captured
```

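Useful cases:

- browser can open the page but direct fetch is incomplete
- user provides a saved HTML page plus screenshots
- user wants one command that turns visible page content into a Word document
- user wants status visibility instead of silent long waits

Operational expectations for this pipeline:

- print stage logs so long OCR jobs do not look stuck
- fail loudly if expected outputs are not created
- detect obvious WeChat validation/interstitial text early
- optionally send the generated docx back to Feishu in one run
- when a source is blocked, stop pretending and switch to authorized-input workflows: saved HTML, screenshots, long images, copied text

Practical optimization rule:

- do not keep hammering a blocked source in the same mode
- if browser/direct fetch returns validation text, pivot immediately to the best authorized artifact path
- prioritize delivery quality: visible content captured by the user is better than repeated blocked fetch attempts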

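Key options

- --url webpage URL
- --text-file local plain text / Markdown input
- --html-file local saved HTML page
- --image PATH add one local image or GIF; repeat as needed
- --image-dir DIR OCR all supported images / GIFs in a directory
- --format markdown|json output format
- --output PATH output file path
- --ocr-images OCR discovered or provided images
- --dedupe deduplicate repeated merged lines
- --browser-fallback use browser-rendered fallback for incomplete pages
- --page-screenshot-ocr OCR the browser full-page screenshot as a last resort
- --gif-mode none|placeholder conservative GIF handling mode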

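Quality standard

Default target: produce something a human can read comfortably and share without cleanup.

Release-quality target for article deliverables:

- preserve the article's original reading order whenever the source structure allows it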
