Visible Text Extractor
Use this skill to turn a webpage article, URL, screenshot set, long image set, or local image collection into complete, readable, reusable text.
Core workflow
1. Extract visible body text from the main source.
2. Discover ordered images and GIF-like assets.
3. OCR image content when needed.
4. Preserve a raw/audit layer.
5. Run a human-first cleanup pass.
6. Classify image-like content by likely information type.
7. Reconstruct image content into human-readable supplements instead of raw OCR dumps.
8. Output polished markdown first; keep raw OCR as JSON or appendix data.
What this skill is good at
- General webpage article extraction
- WeChat / 公众号 article extraction with special handling
- News pages, blogs, tutorials, explainers, and image-heavy articles
- Screenshots and long-image OCR
- Image directory OCR in display order
- GIF frame extraction plus OCR when `ffmpeg` is available
- Rebuilding noisy OCR into a cleaner reading version
- Producing either reader-friendly clean output or full transcript-style output
Main script
- `scripts/extract_visible_text.py`
Supporting resources
- `scripts/postprocess_ocr_text.py`: clean OCR output, merge broken spacing, remove obvious garbage, and regroup into readable sections
- `scripts/extract_with_browser.js`: browser-rendered fallback for JS-heavy pages
- `scripts/extract_gif_frames.sh`: GIF frame extraction via `ffmpeg`
- `scripts/build_deliverable_docx.js`: convert cleaned markdown into a Word document
- `scripts/build_transcript_docx.js`: convert transcript-style markdown into a Word document
- `scripts/build_authorized_capture_docx.py`: one-step pipeline that turns already-authorized browser pages, saved HTML, screenshots, and mixed inputs into a clean markdown + JSON + Word deliverable
- `scripts/extract_visible_text_deliverable.py`: one-step pipeline from source input to a clean markdown + JSON + Word deliverable
- `scripts/extract_visible_text_transcript_deliverable.py`: one-step pipeline for transcript-style full extraction output
- `scripts/extract_visible_text_reading_order_deliverable.py`: one-step pipeline for reading-order transcript output
- `scripts/build_wechat_interleaved_docx.py`: reconstruct WeChat article reading order by interleaving extracted body blocks and image OCR text in original flow order
- `scripts/ocr_high_accuracy.py`: higher-accuracy OCR with preprocessing variants and segmented long-image handling
- `references/output-schema.md`: target output structure and cleanup rules
- `references/deliverable-workflow.md`: one-step deliverable workflow guidance
- `references/troubleshooting.md`: failure patterns, environment limits, and how to respond cleanly
- `references/product-positioning.md`: what mature deliverable quality means for this skill
- `references/generalization-plan.md`: how to evolve the skill across travel deals, rule pages, event posters, and tutorial long images
- `references/universal-article-extractor-spec.md`: generalized capability contract for article, mixed-media, and screenshot-heavy extraction
Required behavior
When raw OCR is noisy, do not stop at extraction.
- Keep the raw candidate layer for traceability.
- Prefer readability over raw OCR score when two candidates are close.
- Remove decorative fragments, isolated symbols, repeated garbage, and near-duplicate lines from the polished result.
- Keep uncertainty visible instead of pretending confidence.
- Never silently drop a major section when partial reconstruction is possible.
- Never present a raw OCR dump as the final answer if a cleaner reconstruction can be produced.
- Preserve article structure when available: title, subtitle, author/source/time, heading levels, paragraphs, lists, captions, table-like rows, and appended notes.
- Treat information-bearing images as first-class content rather than an appendix afterthought.
- For image-heavy pages, support transcript-style and reading-order outputs in addition to clean article outputs.
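The cleanup rules above can be illustrated with a minimal sketch. It is not the skill's actual post-processor: the function name, the 0.9 similarity threshold, and the "no letters or digits means decorative" heuristic are all assumptions chosen for illustration. It returns the dropped lines alongside the kept ones so the raw layer stays traceable.

```python
import difflib
import re
from typing import List, Tuple


def clean_ocr_lines(lines: List[str], similarity: float = 0.9) -> Tuple[List[str], List[str]]:
    """Split OCR lines into (kept, dropped); hypothetical helper, not the skill's API.

    Drops lines with no letters or digits (decorative fragments, isolated
    symbols) and lines that are near-duplicates of an already kept line.
    """
    kept: List[str] = []
    dropped: List[str] = []
    for line in lines:
        text = " ".join(line.split())  # collapse runs of whitespace
        # [^\W_] matches any Unicode letter or digit, including CJK characters.
        if not re.search(r"[^\W_]", text):
            dropped.append(line)
            continue
        is_near_duplicate = any(
            difflib.SequenceMatcher(None, text, previous).ratio() >= similarity
            for previous in kept
        )
        if is_near_duplicate:
            dropped.append(line)
        else:
            kept.append(text)
    return kept, dropped


kept, dropped = clean_ocr_lines(
    ["Price: 199 yuan", "★ ★ ★", "Price: 199 yuan.", "Departure:   Shanghai"]
)
print(kept)
print(dropped)
```

Keeping `dropped` separate rather than discarding it is what lets the polished output stay clean while the audit layer still shows everything the OCR produced.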
WeChat / 公众号 handling
For mp.weixin.qq.com URLs:
- Try dedicated article extraction first when available.
- Fall back to static HTML parsing.
- Fall back again to browser rendering if needed.
- When the user cares about article readability, prefer reconstructing the final Word output in original reading order instead of appending all image OCR at the end.
- Use `scripts/build_wechat_interleaved_docx.py` when the task is specifically "keep original article order" for WeChat posts.
- If the page is blocked or validation-gated, report `blocked: true` clearly instead of pretending success.
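The fallback order and the `blocked: true` report could be wired together roughly as follows. This is a sketch under stated assumptions: the extractor callables are placeholders supplied by the caller, the function names are hypothetical, and the marker phrases are only examples of validation-page wording, not a list the skill is known to use.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed example phrases from a validation/interstitial page; adjust as needed.
BLOCK_MARKERS = ("环境异常", "完成验证")


def looks_blocked(text: str) -> bool:
    """Return True when the text contains an assumed validation marker."""
    return any(marker in text for marker in BLOCK_MARKERS)


def extract_with_fallbacks(
    url: str, extractors: Sequence[Tuple[str, Callable[[str], str]]]
) -> Dict[str, object]:
    """Try each (stage name, extractor) in order and stop at the first usable text."""
    attempts: List[Dict[str, str]] = []
    for name, extractor in extractors:
        try:
            text = extractor(url)
        except Exception as exc:  # a failed stage should not abort the chain
            attempts.append({"stage": name, "status": f"error: {exc}"})
            continue
        if not text or not text.strip():
            attempts.append({"stage": name, "status": "empty"})
        elif looks_blocked(text):
            attempts.append({"stage": name, "status": "blocked"})
        else:
            attempts.append({"stage": name, "status": "ok"})
            return {"blocked": False, "stage": name, "text": text, "attempts": attempts}
    # No stage produced usable text: report blocked honestly if any stage saw a gate.
    blocked = any(a["status"] == "blocked" for a in attempts)
    return {"blocked": blocked, "stage": None, "text": "", "attempts": attempts}
```

A caller would pass the stages in the order listed above, for example `[("dedicated", f1), ("static_html", f2), ("browser", f3)]`, and surface the `attempts` list in the raw/audit output.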
Typical commands
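Extract URL to markdown:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --url https://example.com/post \
  --format markdown \
  --output result.md
```
Extract URL to JSON:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --url https://example.com/post \
  --format json \
  --output result.json
```
Extract WeChat article with fallbacks:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --format markdown \
  --output wechat.md
```
Extract local screenshot or long image:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --image ./screenshot.png \
  --ocr-images \
  --format markdown \
  --output image-result.md
```
Run OCR post-processing:
```bash
python3 {baseDir}/scripts/postprocess_ocr_text.py \
  --input-json ./ocr-result.json \
  --title "Clean Result" \
  --body-text "Optional summary or body text" \
  --output-json ./clean.json \
  --output-markdown ./clean.md
```
Run the one-step deliverable pipeline:
```bash
python3 {baseDir}/scripts/extract_visible_text_deliverable.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/result
```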
This should emit:
- `result.raw.json`
- `result.clean.json`
- `result.clean.md`
- `result.docx`
Run the already-authorized capture pipeline when the page can be opened in a browser or exported/saved first:
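```bash
python3 {baseDir}/scripts/build_authorized_capture_docx.py \
  --url https://example.com/page \
  --browser-capture \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/captured
```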
Useful cases:
- browser can open the page but direct fetch is incomplete
- user provides a saved HTML page plus screenshots
- user wants one command that turns visible page content into a Word document
- user wants status visibility instead of silent long waits
Operational expectations for this pipeline:
- print stage logs so long OCR jobs do not look stuck
- fail loudly if expected outputs are not created
- detect obvious WeChat validation/interstitial text early
- optionally send the generated docx back to Feishu in one run
- when a source is blocked, stop pretending and switch to authorized-input workflows: saved HTML, screenshots, long images, copied text
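Two of these expectations, stage logging and failing loudly on missing outputs, are easy to sketch. The helper names below are hypothetical; the four suffixes follow the `result.*` file names this document gives for `--output-prefix ./deliverable/result`, and treating an empty file as missing is an added assumption.

```python
import sys
import time
from pathlib import Path
from typing import List

# Suffixes appended to the --output-prefix value (from the documented result.* names).
EXPECTED_SUFFIXES = (".raw.json", ".clean.json", ".clean.md", ".docx")


def log_stage(stage: str, detail: str = "") -> None:
    """Print a timestamped stage line to stderr so long jobs show progress."""
    stamp = time.strftime("%H:%M:%S")
    print(f"[{stamp}] {stage} {detail}".rstrip(), file=sys.stderr, flush=True)


def expected_outputs(prefix: str) -> List[Path]:
    """List the files the pipeline is expected to create for a given prefix."""
    return [Path(prefix + suffix) for suffix in EXPECTED_SUFFIXES]


def verify_outputs(prefix: str) -> List[Path]:
    """Raise FileNotFoundError if any expected output is missing or empty."""
    outputs = expected_outputs(prefix)
    missing = [p for p in outputs if not p.is_file() or p.stat().st_size == 0]
    if missing:
        names = ", ".join(str(p) for p in missing)
        raise FileNotFoundError(f"expected outputs missing or empty: {names}")
    return outputs
```

Calling `log_stage("ocr", "image 3/12")` before each slow step and `verify_outputs(prefix)` at the very end gives the caller a visible trail and a non-zero exit when a deliverable was silently skipped.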
Practical optimization rule:
- do not keep hammering a blocked source in the same mode
- if browser/direct fetch returns validation text, pivot immediately to the best authorized artifact path
- prioritize delivery quality: visible content captured by the user is better than repeated blocked fetch attempts
Key options
- `--url` webpage URL
- `--text-file` local plain text / markdown input
- `--html-file` local saved HTML page
- `--image PATH` add one local image or GIF; repeat as needed
- `--image-dir DIR` OCR all supported images / GIFs in a directory
- `--format markdown|json` output format
- `--output PATH` output file path
- `--ocr-images` OCR discovered or provided images
- `--dedupe` deduplicate repeated merged lines
- `--browser-fallback` use browser-rendered fallback for incomplete pages
- `--page-screenshot-ocr` OCR the browser full-page screenshot as a last resort
- `--gif-mode none|placeholder` conservative GIF handling mode
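For orientation, here is how options like these might be declared with `argparse`. The flag names and the `markdown|json` and `none|placeholder` choices come from this document; the defaults, help strings, and the `build_parser` name are assumptions, and the real script may declare things differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; defaults here are illustrative guesses."""
    parser = argparse.ArgumentParser(description="Visible text extraction (sketch)")
    parser.add_argument("--url", help="webpage URL")
    parser.add_argument("--text-file", help="local plain text / markdown input")
    parser.add_argument("--html-file", help="local saved HTML page")
    # action="append" lets --image be repeated, one path per occurrence.
    parser.add_argument("--image", action="append", default=[], metavar="PATH",
                        help="add one local image or GIF; repeat as needed")
    parser.add_argument("--image-dir", metavar="DIR",
                        help="OCR all supported images / GIFs in a directory")
    parser.add_argument("--format", choices=["markdown", "json"], default="markdown",
                        help="output format")
    parser.add_argument("--output", metavar="PATH", help="output file path")
    parser.add_argument("--ocr-images", action="store_true",
                        help="OCR discovered or provided images")
    parser.add_argument("--dedupe", action="store_true",
                        help="deduplicate repeated merged lines")
    parser.add_argument("--browser-fallback", action="store_true",
                        help="use browser-rendered fallback for incomplete pages")
    parser.add_argument("--page-screenshot-ocr", action="store_true",
                        help="OCR the full-page screenshot as a last resort")
    parser.add_argument("--gif-mode", choices=["none", "placeholder"],
                        default="placeholder", help="conservative GIF handling mode")
    return parser


args = build_parser().parse_args(["--image", "a.png", "--image", "b.gif", "--ocr-images"])
print(args.image, args.ocr_images, args.format)
```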
Quality standard
Default target: produce something a human can read comfortably and share without cleanup.
Release-quality target for article deliverables:
- preserve the article's original reading order whenever the source structure allows it
- avoid dumping all image OCR at the end when images belong in the middle of the article
- prefer a comfortable reading experience over a mechanically grouped OCR appendix
- keep English-heavy charts, dashboards, and mixed Chinese-English figures readable enough that key labels, axes, legends, and result summaries survive extraction
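Preserving reading order amounts to merging body blocks and image-derived text by their position in the source instead of appending image OCR at the end. A minimal sketch, assuming each block already carries a numeric position (for example its DOM order); the function name and the tuple shapes are hypothetical:

```python
from typing import List, Tuple

Block = Tuple[int, str]  # (position in the source, text)


def interleave_by_position(
    body_blocks: List[Block], image_blocks: List[Block]
) -> List[Tuple[str, str]]:
    """Merge both sources into one reading-order list of (kind, text) pairs.

    When a body block and an image block share a position, the body block
    comes first.
    """
    tagged = [(pos, 0, "body", text) for pos, text in body_blocks]
    tagged += [(pos, 1, "image", text) for pos, text in image_blocks]
    tagged.sort(key=lambda item: (item[0], item[1]))
    return [(kind, text) for _, _, kind, text in tagged]


merged = interleave_by_position(
    [(0, "Intro paragraph"), (2, "Conclusion")],
    [(1, "Chart: monthly sales")],
)
for kind, text in merged:
    print(kind, text)
```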
The skill should increasingly treat extraction as a full article understanding and recovery problem, not only a body scrape plus OCR problem:
- recover visible article structure from normal webpages, WeChat posts, blogs, tutorials, and mixed-media articles
- infer whether an image is mainly a price/product page, rules page, poster/event page, course outline, scenery/introduction card, or table-like detail page
- pull out high-value facts first when the user wants a clean readable result
- preserve near-complete text when the user wants transcript completeness
- avoid raw OCR dumps as the main deliverable unless the user explicitly wants audit output
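Inferring an image's information type can start from something as simple as keyword counting over its OCR text. The sketch below is only that: the category names loosely follow the types listed above, while the keyword lists, the function name, and the `"unknown"` fallback are invented for illustration and would need tuning on real samples.

```python
from typing import Dict, Tuple

# Hypothetical keyword lists; not taken from the skill's implementation.
CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "price_product": ("price", "discount", "¥", "价格", "优惠"),
    "rules": ("rule", "terms", "规则", "须知"),
    "event_poster": ("event", "register", "活动", "报名"),
    "course_outline": ("lesson", "chapter", "课程", "大纲"),
}


def classify_image_text(text: str) -> str:
    """Return the category whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(classify_image_text("Early-bird price ¥199, discount ends Friday"))
```

Returning `"unknown"` instead of forcing a label keeps the uncertainty visible, in line with the required behavior earlier in this document.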
When the user explicitly wants completeness, the skill must support a fuller extraction mode:
- treat each discovered image as a first-class source
- prefer segmented OCR for tall or dense images
- preserve near-complete per-image text blocks before compressing into summaries
- keep summary and full-text layers separate instead of replacing one with the other
- support reading-order transcript output so text and image-derived content can be followed from start to finish
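Segmented OCR for tall images needs a way to cut the image into vertical slices that overlap slightly, so a text line falling on a cut still appears whole in one slice. A sketch of the span arithmetic only, with assumed default sizes (2000 px segments, 100 px overlap) and a hypothetical function name; the cropping and OCR calls are left out:

```python
from typing import List, Tuple


def segment_spans(height: int, max_segment: int = 2000, overlap: int = 100) -> List[Tuple[int, int]]:
    """Return (top, bottom) pixel spans covering an image of the given height."""
    if height <= 0:
        raise ValueError("height must be positive")
    if not 0 <= overlap < max_segment:
        raise ValueError("overlap must be in [0, max_segment)")
    if height <= max_segment:
        return [(0, height)]
    spans: List[Tuple[int, int]] = []
    step = max_segment - overlap  # each new slice starts `overlap` px before the last ended
    top = 0
    while True:
        bottom = min(top + max_segment, height)
        spans.append((top, bottom))
        if bottom == height:
            break
        top += step
    return spans


print(segment_spans(5000))
```

Because neighbouring slices share the overlap region, the per-slice OCR text will contain some repeated lines; a deduplication pass is needed before merging.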
For clean article outputs, prefer a structure like:
1. Title
2. Metadata (author/source/time) when meaningful
3. Main sections in order
4. Integrated image-derived supplements where needed
5. Uncertainty notes only when necessary
For transcript outputs, prefer a structure like:
1. Title
2. Intro/body chunks in order
3. Image text blocks in order or reading order
4. Tail matter / credits / appended notes
Mature-skill rule:
- default users toward the clean markdown / docx outputs unless they ask for transcript completeness
- keep raw JSON for audit, not as the main deliverable
- degrade honestly when the source is blocked or image quality is poor
- do not optimize only for one article family; keep checking travel-deal posts, rule/scoring posts, event posters, news/blog/tutorial pages, and course-outline long images
Read these references when needed:
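- `references/output-schema.md`
- `references/deliverable-workflow.md`
- `references/troubleshooting.md`
- `references/product-positioning.md`
- `references/generalization-plan.md`
- `references/universal-article-extractor-spec.md`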
Environment notes
- OCR depends on the local `ocr-local` skill or a compatible Tesseract.js setup.
- Browser fallback depends on real browser availability plus `playwright-core` support.
- GIF frame extraction depends on `ffmpeg`.
- Some pages remain partially inaccessible due to login, anti-bot, or validation flows; mark those limits explicitly.
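A quick preflight check can report which of these dependencies are present before a long job starts. The sketch below only looks for executables on `PATH`; treating `node` as the runtime behind Tesseract.js and `playwright-core` is an assumption about the setup, and the function names are hypothetical.

```python
import shutil
from typing import Dict, List


def check_environment() -> Dict[str, dict]:
    """Report whether each assumed command-line dependency is on PATH."""
    tools = {
        "ffmpeg": "GIF frame extraction",
        "node": "Tesseract.js OCR and playwright-core browser fallback",
    }
    return {
        tool: {"available": shutil.which(tool) is not None, "needed_for": purpose}
        for tool, purpose in tools.items()
    }


def unavailable_features(report: Dict[str, dict]) -> List[str]:
    """List the features that cannot run with the tools currently found."""
    return [info["needed_for"] for info in report.values() if not info["available"]]


report = check_environment()
for feature in unavailable_features(report):
    print(f"limited: {feature} is unavailable in this environment")
```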
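Visible Text Extractor
Use this skill to turn a webpage article, URL, screenshot set, long image set, or local image collection into complete, readable, reusable text.
Core workflow
1. Extract visible body text from the main source.
2. Discover ordered images and GIF-like assets.
3. OCR image content when needed.
4. Preserve a raw/audit layer.
5. Run a human-first cleanup pass.
6. Classify image-like content by likely information type.
7. Reconstruct image content into human-readable supplements instead of raw OCR dumps.
8. Output polished markdown first; keep raw OCR as JSON or appendix data.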
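What this skill is good at
- General webpage article extraction
- WeChat / 公众号 article extraction with special handling
- News pages, blogs, tutorials, explainers, and image-heavy articles
- Screenshots and long-image OCR
- Image directory OCR in display order
- GIF frame extraction plus OCR when `ffmpeg` is available
- Rebuilding noisy OCR into a cleaner reading version
- Producing either reader-friendly clean output or full transcript-style output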
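Main script
- `scripts/extract_visible_text.py`
Supporting resources
- `scripts/postprocess_ocr_text.py`: clean OCR output, merge broken spacing, remove obvious garbage, and regroup into readable sections
- `scripts/extract_with_browser.js`: browser-rendered fallback for JS-heavy pages
- `scripts/extract_gif_frames.sh`: GIF frame extraction via `ffmpeg`
- `scripts/build_deliverable_docx.js`: convert cleaned markdown into a Word document
- `scripts/build_transcript_docx.js`: convert transcript-style markdown into a Word document
- `scripts/build_authorized_capture_docx.py`: one-step pipeline that turns already-authorized browser pages, saved HTML, screenshots, and mixed inputs into a clean markdown + JSON + Word deliverable
- `scripts/extract_visible_text_deliverable.py`: one-step pipeline from source input to a clean markdown + JSON + Word deliverable
- `scripts/extract_visible_text_transcript_deliverable.py`: one-step pipeline for transcript-style full extraction output
- `scripts/extract_visible_text_reading_order_deliverable.py`: one-step pipeline for reading-order transcript output
- `scripts/build_wechat_interleaved_docx.py`: reconstruct WeChat article reading order by interleaving extracted body blocks and image OCR text in original flow order
- `scripts/ocr_high_accuracy.py`: higher-accuracy OCR with preprocessing variants and segmented long-image handling
- `references/output-schema.md`: target output structure and cleanup rules
- `references/deliverable-workflow.md`: one-step deliverable workflow guidance
- `references/troubleshooting.md`: failure patterns, environment limits, and how to respond cleanly
- `references/product-positioning.md`: what mature deliverable quality means for this skill
- `references/generalization-plan.md`: how to extend the skill to travel deals, rule pages, event posters, and tutorial long images
- `references/universal-article-extractor-spec.md`: generalized capability contract for article, mixed-media, and screenshot-heavy extraction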
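Required behavior
When raw OCR is noisy, do not stop at extraction.
- Keep the raw candidate layer for traceability.
- Prefer readability over raw OCR score when two candidates are close.
- Remove decorative fragments, isolated symbols, repeated garbage, and near-duplicate lines from the polished result.
- Keep uncertainty visible instead of pretending confidence.
- Never silently drop a major section when partial reconstruction is possible.
- Never present a raw OCR dump as the final answer if a cleaner reconstruction can be produced.
- Preserve article structure when available: title, subtitle, author/source/time, heading levels, paragraphs, lists, captions, table-like rows, and appended notes.
- Treat information-bearing images as first-class content rather than an appendix afterthought.
- For image-heavy pages, support transcript-style and reading-order outputs in addition to clean article outputs.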
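WeChat / 公众号 handling
For mp.weixin.qq.com URLs:
- Try dedicated article extraction first when available.
- Fall back to static HTML parsing.
- Fall back again to browser rendering if needed.
- When the user cares about article readability, prefer reconstructing the final Word output in original reading order instead of appending all image OCR at the end.
- Use `scripts/build_wechat_interleaved_docx.py` when the task is specifically to keep the original article order of a WeChat post.
- If the page is blocked or validation-gated, report `blocked: true` clearly instead of pretending success.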
Typical commands
Extract URL to markdown:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --url https://example.com/post \
  --format markdown \
  --output result.md
```
Extract URL to JSON:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --url https://example.com/post \
  --format json \
  --output result.json
```
Extract WeChat article with fallbacks:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --format markdown \
  --output wechat.md
```
Extract local screenshot or long image:
```bash
python3 {baseDir}/scripts/extract_visible_text.py \
  --image ./screenshot.png \
  --ocr-images \
  --format markdown \
  --output image-result.md
```
Run OCR post-processing:
```bash
python3 {baseDir}/scripts/postprocess_ocr_text.py \
  --input-json ./ocr-result.json \
  --title "Clean Result" \
  --body-text "Optional summary or body text" \
  --output-json ./clean.json \
  --output-markdown ./clean.md
```
Run the one-step deliverable pipeline:
```bash
python3 {baseDir}/scripts/extract_visible_text_deliverable.py \
  --url https://mp.weixin.qq.com/s/xxxx \
  --browser-fallback \
  --page-screenshot-ocr \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/result
```
This should emit:
- `result.raw.json`
- `result.clean.json`
- `result.clean.md`
- `result.docx`
Run the already-authorized capture pipeline when the page can be opened in a browser or exported/saved first:
```bash
python3 {baseDir}/scripts/build_authorized_capture_docx.py \
  --url https://example.com/page \
  --browser-capture \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/captured
```
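Useful cases:
- browser can open the page but direct fetch is incomplete
- user provides a saved HTML page plus screenshots
- user wants one command that turns visible page content into a Word document
- user wants status visibility instead of silent long waits
Operational expectations for this pipeline:
- print stage logs so long OCR jobs do not look stuck
- fail loudly if expected outputs are not created
- detect obvious WeChat validation/interstitial text early
- optionally send the generated docx back to Feishu in one run
- when a source is blocked, stop pretending and switch to authorized-input workflows: saved HTML, screenshots, long images, copied text
Practical optimization rule:
- do not keep hammering a blocked source in the same mode
- if browser/direct fetch returns validation text, pivot immediately to the best authorized artifact path
- prioritize delivery quality: visible content captured by the user is better than repeated blocked fetch attempts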
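Key options
- `--url` webpage URL
- `--text-file` local plain text / markdown input
- `--html-file` local saved HTML page
- `--image PATH` add one local image or GIF; repeat as needed
- `--image-dir DIR` OCR all supported images / GIFs in a directory
- `--format markdown|json` output format
- `--output PATH` output file path
- `--ocr-images` OCR discovered or provided images
- `--dedupe` deduplicate repeated merged lines
- `--browser-fallback` use browser-rendered fallback for incomplete pages
- `--page-screenshot-ocr` OCR the browser full-page screenshot as a last resort
- `--gif-mode none|placeholder` conservative GIF handling mode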
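Quality standard
Default target: produce something a human can read comfortably and share without cleanup.
Release-quality target for article deliverables: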