Office Document Assistant
Read, extract, summarize, and compare common office documents:
- PDF
- Word (.docx, .doc)
- Excel (.xlsx, .xls)
- PowerPoint (.pptx, .ppt)
Use this skill when the user wants the contents of a document explained, summarized, searched, or extracted into a simpler structure.
When to Use
Use this skill when the user:
- uploads a .pdf / .doc / .docx / .xls / .xlsx / .ppt / .pptx file
- asks to summarize a document
- asks to extract dates, amounts, contacts, conclusions, specifications, risks, or action items
- asks for page-by-page / slide-by-slide structure
- asks what a spreadsheet or slide deck is saying
- asks to compare two or more documents after extracting their text
When Not to Use
Do not position this skill as a high-fidelity layout or visual analysis system.
It is not ideal for:
- precise preservation of original layout, formatting, or pagination
- detailed chart / diagram / image interpretation
- password-protected or encrypted files
- OCR-heavy image understanding beyond basic text recovery
- advanced spreadsheet analytics or formula auditing
- tracked-changes / redline reconstruction in Office documents
Core Workflow
1. Confirm the document path.
2. Run the bundled script:
   python3 {skill_dir}/scripts/extract_office_text.py <file> --json
3. Inspect the JSON fields:
   - type
   - extraction
   - warning
   - truncated
   - text
4. Separate clearly in your response:
   - directly extracted content
   - your summary / inference based on that content
5. If extraction is empty or weak:
   - for PDF, check OCR availability first
   - for legacy Office formats, check conversion tools
6. If the user asks for a summary, default to:
   - a one-sentence overview
   - 3–8 key points
   - extra sections only when clearly present (dates, people, risks, data, conclusions, contacts)
7. If the user asks for extraction, prefer structured fields over long prose.
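Steps 2 to 4 can be sketched in Python as follows. This is a minimal sketch, not the bundled script's own code: the field names come from the list in step 3, but their types (a boolean truncated, an optional warning string), the 20-character "weak" threshold, and the names ExtractionResult and parse_extraction are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionResult:
    """Hypothetical container for the JSON fields listed in step 3."""
    type: str
    extraction: str
    warning: Optional[str]
    truncated: bool
    text: str

    @property
    def is_weak(self) -> bool:
        # Assumption: "empty or weak" means almost no non-whitespace text.
        return len(self.text.strip()) < 20


def parse_extraction(raw: str) -> ExtractionResult:
    """Parse the script's --json output, tolerating missing keys."""
    data = json.loads(raw)
    return ExtractionResult(
        type=str(data.get("type", "")),
        extraction=str(data.get("extraction", "")),
        warning=data.get("warning") or None,
        truncated=bool(data.get("truncated", False)),
        text=data.get("text") or "",
    )


if __name__ == "__main__":
    # Made-up sample payload; the real script's values may differ.
    sample = json.dumps({
        "type": "pdf",
        "extraction": "example-method",
        "warning": None,
        "truncated": False,
        "text": "Quarterly report for 2023 ...",
    })
    result = parse_extraction(sample)
    print(result.type, result.truncated, result.is_weak)
```

Keeping the raw text field separate from any summary built on it makes it easier to honor step 4, where extracted content and inference are reported separately.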
Supported Formats and Strategy
PDF
- First extract embedded text with pypdf.
- If the extracted text is too short, fall back to OCR.
- OCR prefers chi_sim+eng, then chi_sim, then eng.
- The OCR pipeline requires both pdftoppm and tesseract.
- If an official first-class PDF tool is exposed in the environment and the task is high-value or multi-PDF, you may prefer that tool; otherwise use this skill's script.
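The fallback decision and the OCR language preference described above can be sketched like this. The 50-character threshold and the helper names needs_ocr and pick_ocr_lang are assumptions; the bundled script may use a different cutoff and different logic. The language codes are Tesseract's own, and a combined spec such as chi_sim+eng is usable only when every part is installed.

```python
from typing import Iterable, Optional

# Assumption: the real "too short" threshold is not documented here.
MIN_TEXT_CHARS = 50

# Preference order stated in the PDF strategy.
OCR_LANG_PREFERENCE = ("chi_sim+eng", "chi_sim", "eng")


def needs_ocr(embedded_text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when the PDF text layer is too short to be useful."""
    return len(embedded_text.strip()) < min_chars


def pick_ocr_lang(installed: Iterable[str]) -> Optional[str]:
    """Pick the first preferred language spec whose parts are all installed.

    `installed` holds Tesseract language codes, for example the names
    printed by `tesseract --list-langs`.
    """
    available = set(installed)
    for spec in OCR_LANG_PREFERENCE:
        if all(part in available for part in spec.split("+")):
            return spec
    return None


if __name__ == "__main__":
    print(needs_ocr(""))                      # True
    print(pick_ocr_lang({"eng", "osd"}))      # eng
    print(pick_ocr_lang({"chi_sim", "eng"}))  # chi_sim+eng
```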
Word
- .docx: extract paragraphs and tables directly.
- .doc: try antiword, then catdoc, then LibreOffice conversion to .docx.
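The .doc fallback order can be sketched as an ordered list of candidate commands. This is an illustration under assumptions: the helper names are hypothetical, the LibreOffice executable is assumed to be called soffice (some systems expose it as libreoffice), and the sketch only checks which tool is installed, whereas a real fallback would also move on when a tool runs but fails.

```python
import shutil
from typing import Callable, List, Optional


def doc_fallback_commands(doc_path: str, out_dir: str) -> List[List[str]]:
    """Candidate commands for a legacy .doc file, in the order given above."""
    return [
        ["antiword", doc_path],  # prints plain text to stdout
        ["catdoc", doc_path],    # prints plain text to stdout
        # LibreOffice writes a converted .docx into out_dir instead.
        ["soffice", "--headless", "--convert-to", "docx",
         "--outdir", out_dir, doc_path],
    ]


def first_available(
    commands: List[List[str]],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the first command whose executable is found on PATH."""
    for argv in commands:
        if which(argv[0]):
            return argv
    return None


if __name__ == "__main__":
    candidates = doc_fallback_commands("report.doc", "/tmp/converted")
    print(first_available(candidates))
```

The which parameter is injectable so the selection logic can be exercised without any of the tools installed.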
Excel
- Extract sheet names and the first rows of each sheet.
- Best for quickly understanding workbook structure and core fields.
- When explaining, focus on what each sheet represents, key columns, important figures, and obvious anomalies.
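A workbook preview along these lines might look like the sketch below. It assumes openpyxl is installed and uses a default of 30 rows per sheet purely as an example; preview_rows and preview_workbook are hypothetical names, not functions from the bundled script. The openpyxl import is done lazily so the row-trimming helper can be used and tested on plain Python data.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Sequence


def preview_rows(rows: Iterable[Sequence[Any]], row_limit: int = 30) -> List[List[str]]:
    """Keep the first `row_limit` non-empty rows, rendering cells as strings."""
    non_empty = (
        row for row in rows
        if any(cell is not None and str(cell).strip() for cell in row)
    )
    return [
        ["" if cell is None else str(cell) for cell in row]
        for row in islice(non_empty, row_limit)
    ]


def preview_workbook(path: str, row_limit: int = 30) -> Dict[str, List[List[str]]]:
    """Map each sheet name to a preview of its first rows (needs openpyxl)."""
    from openpyxl import load_workbook  # third-party; imported lazily

    # data_only=True returns cached formula results instead of formulas.
    workbook = load_workbook(path, read_only=True, data_only=True)
    try:
        return {
            sheet.title: preview_rows(sheet.iter_rows(values_only=True), row_limit)
            for sheet in workbook.worksheets
        }
    finally:
        workbook.close()


if __name__ == "__main__":
    print(preview_rows([("Name", "Amount"), (None, None), ("Alice", 120)]))
```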
PowerPoint
- Extract slide text from shapes.
- Extract speaker notes when present.
- Summaries should usually be slide-by-slide or theme-based, not a giant raw dump.
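Slide text and speaker notes can be read with python-pptx roughly as shown below. This is a sketch under assumptions: extract_slides and format_slides are hypothetical names, and text inside tables or grouped shapes is not covered, so the bundled script may extract more than this does.

```python
from typing import List, Tuple

# (slide number, texts found in shapes, speaker notes)
SlideText = Tuple[int, List[str], str]


def extract_slides(path: str) -> List[SlideText]:
    """Collect shape text and notes per slide (needs python-pptx)."""
    from pptx import Presentation  # third-party; imported lazily

    slides: List[SlideText] = []
    for number, slide in enumerate(Presentation(path).slides, start=1):
        texts = [
            shape.text_frame.text.strip()
            for shape in slide.shapes
            if shape.has_text_frame and shape.text_frame.text.strip()
        ]
        notes = ""
        if slide.has_notes_slide:
            frame = slide.notes_slide.notes_text_frame
            notes = frame.text.strip() if frame is not None else ""
        slides.append((number, texts, notes))
    return slides


def format_slides(slides: List[SlideText]) -> str:
    """Render a compact slide-by-slide outline instead of a raw dump."""
    lines: List[str] = []
    for number, texts, notes in slides:
        body = " | ".join(texts) or "(no text)"
        lines.append(f"Slide {number}: {body}")
        if notes:
            lines.append(f"  Notes: {notes}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_slides([(1, ["Roadmap", "Q3 goals"], "Open with the timeline")]))
```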
Tools and Dependencies
The lists below separate what is required from what is optional.
Required runtime
- python3
Required Python packages
- pypdf: embedded text extraction from PDFs
- python-docx: .docx extraction
- openpyxl: .xlsx extraction
- python-pptx: .pptx extraction
Optional but strongly recommended system tools
- poppler-utils: provides pdftoppm for PDF → image conversion before OCR
- tesseract-ocr: OCR engine
- tesseract-ocr-chi-sim: Simplified Chinese OCR language pack
- libreoffice: conversion fallback for legacy .doc, .xls, .ppt
- antiword: direct .doc extraction fallback
- catdoc: additional .doc extraction fallback
What each tool is used for
- pypdf: try text-layer extraction from PDFs first
- pdftoppm: rasterize PDF pages when OCR is needed
- tesseract: recover text from scanned/image PDFs
- python-docx: read paragraphs and tables from .docx
- openpyxl: read sheets and rows from .xlsx
- python-pptx: read slide text and notes from .pptx
- libreoffice: convert older Office formats into newer parseable formats
- antiword / catdoc: lightweight extraction options for .doc
Minimum useful setup
If only modern documents matter, the minimum practical setup is:
- python3
- Python packages: pypdf, python-docx, openpyxl, python-pptx
Recommended full setup
For the most robust behavior across real-world files, install:
- python3
- Python packages: pypdf, python-docx, openpyxl, python-pptx
- system tools: poppler-utils, tesseract-ocr, tesseract-ocr-chi-sim, libreoffice, antiword, catdoc
Dependency check
Use the bundled checker to quickly see what is missing in the current environment:
    python3 {skill_dir}/scripts/check_deps.py
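For environments where the bundled checker is not at hand, a rough equivalent can be sketched with the standard library. This is not the checker's actual code: find_missing is a hypothetical name, and the executable names in SYSTEM_TOOLS are assumptions, since system package names and binary names vary by distribution. Note that two Python packages are imported under names that differ from their install names.

```python
import importlib.util
import shutil
from typing import Callable, Dict, List

# Install name -> import name (python-docx imports as docx, python-pptx as pptx).
PYTHON_PACKAGES = {
    "pypdf": "pypdf",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "python-pptx": "pptx",
}

# Assumed executable names for the optional system tools.
SYSTEM_TOOLS = ["pdftoppm", "tesseract", "libreoffice", "antiword", "catdoc"]


def _module_installed(name: str) -> bool:
    return importlib.util.find_spec(name) is not None


def _tool_installed(name: str) -> bool:
    return shutil.which(name) is not None


def find_missing(
    has_module: Callable[[str], bool] = _module_installed,
    has_tool: Callable[[str], bool] = _tool_installed,
) -> Dict[str, List[str]]:
    """Report which Python packages and system tools cannot be found."""
    return {
        "python_packages": [
            pkg for pkg, module in PYTHON_PACKAGES.items() if not has_module(module)
        ],
        "system_tools": [tool for tool in SYSTEM_TOOLS if not has_tool(tool)],
    }


if __name__ == "__main__":
    for group, names in find_missing().items():
        print(f"{group}: {', '.join(names) if names else 'all present'}")
```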
Common Commands
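    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pdf --json
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.docx --json
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.xlsx --json
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pptx --json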
Useful flags:
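    # Limit the number of PDF pages scanned/extracted
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pdf --page-limit 10 --json

    # Limit rows per sheet when probing a spreadsheet
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.xlsx --row-limit 30 --json

    # Limit the size of the output text
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pdf --max-chars 30000 --json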
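When the script is driven from Python rather than typed by hand, the command line can be assembled as a list. The sketch below uses only the three limit flags this skill documents (--page-limit, --row-limit, --max-chars) plus --json; build_extract_command is a hypothetical helper, and the skill directory used in the example is a placeholder.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    skill_dir: str,
    file_path: str,
    page_limit: Optional[int] = None,
    row_limit: Optional[int] = None,
    max_chars: Optional[int] = None,
) -> List[str]:
    """Build the argv list for extract_office_text.py with optional limits."""
    argv = ["python3", f"{skill_dir}/scripts/extract_office_text.py", file_path]
    limits = (
        ("--page-limit", page_limit),
        ("--row-limit", row_limit),
        ("--max-chars", max_chars),
    )
    for flag, value in limits:
        if value is not None:
            argv += [flag, str(value)]
    argv.append("--json")
    return argv


if __name__ == "__main__":
    # "/skills/office" is a placeholder for the real skill directory.
    command = build_extract_command("/skills/office", "/path/to/file.pdf", page_limit=10)
    print(shlex.join(command))
```

Passing the list to subprocess.run with capture_output=True and text=True avoids shell quoting problems with paths that contain spaces.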
Output Style
Default to a compact answer:
- one-sentence summary
- 3–8 key points
- then expand only if the user asks for:
  - detailed summary
  - page-by-page / slide-by-slide notes
  - field extraction
  - document comparison
Failure Handling
- If PDF text is empty, suspect scanned pages or missing OCR tools.
- If Chinese OCR is weak, check whether tesseract-ocr-chi-sim is installed.
- If .doc / .xls / .ppt extraction fails, check libreoffice, antiword, and catdoc.
- If tables look messy, explain that this is text-first extraction rather than full layout reconstruction.
- If a file is encrypted or unreadable, say so plainly and stop guessing.
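Part of this checklist can be turned into a small triage helper. The sketch below is one possible reading of the rules above, not behavior of the bundled script: failure_hints is a hypothetical name, the tool names are assumed executable names, and the hint wording is illustrative.

```python
from pathlib import Path
from typing import Iterable, List

OCR_TOOLS = {"pdftoppm", "tesseract"}
LEGACY_TOOLS = {"libreoffice", "antiword", "catdoc"}
LEGACY_EXTENSIONS = {".doc", ".xls", ".ppt"}


def failure_hints(file_path: str, text: str, missing_tools: Iterable[str]) -> List[str]:
    """Suggest what to check next when extraction came back empty."""
    if text.strip():
        return []  # extraction produced text; nothing to diagnose
    extension = Path(file_path).suffix.lower()
    missing = set(missing_tools)
    hints: List[str] = []
    if extension == ".pdf":
        hints.append("PDF text is empty: suspect scanned pages.")
        absent = sorted(missing & OCR_TOOLS)
        if absent:
            hints.append("Missing OCR tools: " + ", ".join(absent))
    elif extension in LEGACY_EXTENSIONS:
        absent = sorted(missing & LEGACY_TOOLS)
        if absent:
            hints.append("Missing legacy-format tools: " + ", ".join(absent))
        else:
            hints.append("Legacy tools are present; the file may be encrypted or unreadable.")
    else:
        hints.append("The file may be encrypted or unreadable; say so rather than guessing.")
    return hints


if __name__ == "__main__":
    for hint in failure_hints("scan.pdf", "", ["tesseract"]):
        print(hint)
```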
References
Read these only when needed:
- references/capabilities.md: capability boundaries and what each format can and can't do well
- references/troubleshooting.md: dependency checks and common failure modes