Office Document Assistant

Read, extract, summarize, and compare common office documents:

- PDF
- Word (.docx, .doc)
- Excel (.xlsx, .xls)
- PowerPoint (.pptx, .ppt)

Use this skill when the user wants the contents of a document explained, summarized, searched, or extracted into a simpler structure.

When to Use

Use this skill when the user:

- uploads a .pdf / .doc / .docx / .xls / .xlsx / .ppt / .pptx file
- asks to summarize a document
- asks to extract dates, amounts, contacts, conclusions, specifications, risks, or action items
- asks for page-by-page / slide-by-slide structure
- asks what a spreadsheet or slide deck is saying
- asks to compare two or more documents after extracting their text

When Not to Use

Do not position this skill as a high-fidelity layout or visual analysis system.

It is not ideal for:

- precise preservation of original layout, formatting, or pagination
- detailed chart / diagram / image interpretation
- password-protected or encrypted files
- OCR-heavy image understanding beyond basic text recovery
- advanced spreadsheet analytics or formula auditing
- tracked-changes / redline reconstruction in Office documents

Core Workflow

1. Confirm the document path.
2. Run the bundled script:

   - python3 {skill_dir}/scripts/extract_office_text.py <file> --json

3. Inspect the JSON fields:

   - type
   - extraction
   - warning
   - truncated
   - text

4. Separate clearly in your response:

   - directly extracted content
   - your summary / inference based on that content

5. If extraction is empty or weak:

   - for PDF, check OCR availability first
   - for legacy Office formats, check conversion tools

6. If the user asks for a summary, default to:

   - one-sentence overview
   - 3–8 key points
   - extra sections only when clearly present (dates, people, risks, data, conclusions, contacts)

7. If the user asks for extraction, prefer structured fields over long prose.
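
The inspection and separation steps above could be sketched as follows. This is a minimal illustration, not the bundled script's code: the field names come from the workflow, but their types (a string-or-null warning, a boolean truncated flag), the sample values, and the helper name split_payload are assumptions.

```python
import json


def split_payload(raw_json: str) -> dict:
    """Separate extracted text from caveats in the extractor's JSON output.

    Assumes 'warning' is a string or null and 'truncated' is a boolean.
    """
    payload = json.loads(raw_json)
    text = (payload.get("text") or "").strip()

    caveats = []
    if payload.get("warning"):
        caveats.append(f"warning: {payload['warning']}")
    if payload.get("truncated"):
        caveats.append("output was truncated; later content may be missing")
    if not text:
        caveats.append("no text extracted; check OCR or conversion tools")

    return {
        "type": payload.get("type"),
        "extraction": payload.get("extraction"),
        "text": text,        # directly extracted content
        "caveats": caveats,  # things to flag before summarizing
    }


# Hypothetical payload; real values depend on the bundled script.
sample = json.dumps({
    "type": "pdf",
    "extraction": "ocr",
    "warning": None,
    "truncated": True,
    "text": "Invoice total: 1,200",
})
result = split_payload(sample)
print(result["text"])
print(result["caveats"])
```

Keeping the extracted text and the caveats in separate keys makes it harder to blur quoted content with inference when composing the reply.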

Supported Formats and Strategy

PDF

- First extract embedded text with pypdf.
- If extracted text is too short, fall back to OCR.
- OCR prefers chi_sim+eng, then chi_sim, then eng.
- OCR pipeline requires both pdftoppm and tesseract.
- If an official first-class PDF tool is exposed in the environment and the task is high-value or multi-PDF, you may prefer that tool; otherwise use this skill's script.
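
The fallback decision described above might look roughly like this. It is a sketch under stated assumptions: the 50-character threshold for "too short" is a hypothetical value, and the function names are illustrative rather than taken from the bundled script. Only the language preference order and the two required tools come from the list above.

```python
import shutil
from typing import Callable, Optional, Set

# Preference order stated above.
OCR_LANG_PREFERENCE = ["chi_sim+eng", "chi_sim", "eng"]

# Hypothetical cutoff; the bundled script may use a different one.
MIN_TEXT_CHARS = 50


def needs_ocr(embedded_text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when the text layer is too short to be useful."""
    return len(embedded_text.strip()) < min_chars


def pick_ocr_lang(installed: Set[str]) -> Optional[str]:
    """Pick the first preferred language spec whose packs are all installed."""
    for candidate in OCR_LANG_PREFERENCE:
        if all(part in installed for part in candidate.split("+")):
            return candidate
    return None


def ocr_tools_available(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """The OCR pipeline needs both pdftoppm and tesseract on PATH."""
    return all(which(tool) is not None for tool in ("pdftoppm", "tesseract"))


print(needs_ocr("   "))                  # blank text layer, so OCR is needed
print(pick_ocr_lang({"eng", "chi_sim"}))
print(pick_ocr_lang({"eng"}))
```

Passing the lookup function as a parameter keeps the check testable on machines where the OCR tools are not installed.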

Word

- .docx: extract paragraphs and tables directly.
- .doc: try antiword, then catdoc, then LibreOffice conversion to .docx.
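
A try-in-order chain like the .doc strategy above can be sketched as below. This is an illustration only: first_successful and cli_extractor are hypothetical helpers, the sketch assumes antiword and catdoc print extracted text to stdout when given a file path, and the final LibreOffice conversion step is left out.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], str]


def cli_extractor(tool: str) -> Extractor:
    """Build an extractor that runs '<tool> <path>' and returns its stdout."""
    def run(path: str) -> str:
        done = subprocess.run(
            [tool, path], capture_output=True, text=True, check=True
        )
        return done.stdout
    return run


def first_successful(
    path: str, chain: List[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str]:
    """Try each extractor in order; return (tool name, text) for the first
    one that yields non-empty text, or (None, '') if all fail."""
    for name, extract in chain:
        try:
            text = extract(path)
        except (OSError, subprocess.CalledProcessError):
            continue  # tool missing or it rejected the file; try the next one
        if text.strip():
            return name, text
    return None, ""


# Order from the strategy above; conversion to .docx would follow if both fail.
DOC_CHAIN = [("antiword", cli_extractor("antiword")),
             ("catdoc", cli_extractor("catdoc"))]


# Demonstration with stand-in extractors so no external tool is required.
def broken(path: str) -> str:
    raise OSError("tool not installed")


def working(path: str) -> str:
    return "Quarterly report"


print(first_successful("report.doc", [("antiword", broken), ("catdoc", working)]))
```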

Excel

- Extract sheet names and the first rows of each sheet.
- Best for quickly understanding workbook structure and core fields.
- When explaining, focus on what each sheet represents, key columns, important figures, and obvious anomalies.

PowerPoint

- Extract slide text from shapes.
- Extract speaker notes when present.
- Summaries should usually be slide-by-slide or theme-based, not a giant raw dump.

Tools and Dependencies

Document clearly what is required versus optional.

Required runtime

- python3

Required Python packages

- pypdf: embedded text extraction from PDFs
- python-docx: .docx extraction
- openpyxl: .xlsx extraction
- python-pptx: .pptx extraction

Optional but strongly recommended system tools

- poppler-utils: provides pdftoppm for PDF → image conversion before OCR
- tesseract-ocr: OCR engine
- tesseract-ocr-chi-sim: Simplified Chinese OCR language pack
- libreoffice: conversion fallback for legacy .doc, .xls, .ppt
- antiword: direct .doc extraction fallback
- catdoc: additional .doc extraction fallback

What each tool is used for

- pypdf: try text-layer extraction from PDFs first
- pdftoppm: rasterize PDF pages when OCR is needed
- tesseract: recover text from scanned/image PDFs
- python-docx: read paragraphs and tables from .docx
- openpyxl: read sheets and rows from .xlsx
- python-pptx: read slide text and notes from .pptx
- libreoffice: convert older Office formats into newer parseable formats
- antiword / catdoc: lightweight extraction options for .doc
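
A quick availability probe for the packages and tools listed above could be sketched as follows. This is not the bundled checker, just a minimal stand-in built on the standard library. The import names (docx for python-docx, pptx for python-pptx) are the modules those packages install; the executable name for LibreOffice is an assumption, since some installations expose it as soffice instead.

```python
import importlib.util
import shutil

# pip package name -> importable module name
PYTHON_PACKAGES = {
    "pypdf": "pypdf",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "python-pptx": "pptx",
}

# Executables expected on PATH ("libreoffice" may be "soffice" on some systems).
SYSTEM_TOOLS = ["pdftoppm", "tesseract", "libreoffice", "antiword", "catdoc"]


def check_dependencies() -> dict:
    """Report which required packages and optional tools are present."""
    packages = {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in PYTHON_PACKAGES.items()
    }
    tools = {tool: shutil.which(tool) is not None for tool in SYSTEM_TOOLS}
    return {"packages": packages, "tools": tools}


report = check_dependencies()
for group, items in report.items():
    for name, present in items.items():
        print(f"{group:8} {name:12} {'ok' if present else 'MISSING'}")
```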

Minimum useful setup

If only modern documents matter, the minimum practical setup is:

- python3
- Python packages: pypdf, python-docx, openpyxl, python-pptx

Recommended full setup

For the most robust behavior across real-world files, install:

- python3
- Python packages: pypdf, python-docx, openpyxl, python-pptx
- system tools: poppler-utils, tesseract-ocr, tesseract-ocr-chi-sim, libreoffice, antiword, catdoc

Dependency check

Use the bundled checker to quickly see what is missing in the current environment:

    python3 {skill_dir}/scripts/check_deps.py

Common Commands

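    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pdf --json
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.docx --json
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.xlsx --json
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pptx --json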

Useful flags:

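    # Limit the number of PDF pages scanned/extracted
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pdf --page-limit 10 --json

    # Limit rows per sheet when probing a spreadsheet
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.xlsx --row-limit 30 --json

    # Limit the size of the output text
    python3 {skill_dir}/scripts/extract_office_text.py /path/to/file.pdf --max-chars 30000 --json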
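
When the extractor is called from Python rather than a shell, the argument list can be assembled programmatically. The sketch below assumes the --page-limit, --row-limit, and --max-chars flags of the bundled script; build_extract_command is a hypothetical helper, and the script path shown stands in for wherever {skill_dir} resolves.

```python
from typing import List, Optional


def build_extract_command(
    script: str,
    file_path: str,
    page_limit: Optional[int] = None,
    row_limit: Optional[int] = None,
    max_chars: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for the extractor, adding only the limits given."""
    cmd = ["python3", script, file_path]
    if page_limit is not None:
        cmd += ["--page-limit", str(page_limit)]  # cap PDF pages scanned
    if row_limit is not None:
        cmd += ["--row-limit", str(row_limit)]    # cap rows read per sheet
    if max_chars is not None:
        cmd += ["--max-chars", str(max_chars)]    # cap output text size
    cmd.append("--json")
    return cmd


cmd = build_extract_command(
    "scripts/extract_office_text.py", "/path/to/file.pdf", page_limit=10
)
print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems with file paths that contain spaces; the list can be passed directly to subprocess.run.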

Output Style

Default to a compact answer:

- one-sentence summary
- 3–8 key points

Then expand only if the user asks for:

- detailed summary
- page-by-page / slide-by-slide notes
- field extraction
- document comparison

Failure Handling

- If PDF text is empty, suspect scanned pages or missing OCR tools.
- If Chinese OCR is weak, check whether tesseract-ocr-chi-sim is installed.
- If .doc / .xls / .ppt extraction fails, check libreoffice, antiword, and catdoc.
- If tables look messy, explain that this is text-first extraction rather than full layout reconstruction.
- If a file is encrypted or unreadable, say so plainly and stop guessing.

References

Read these only when needed:

- references/capabilities.md: capability boundaries and what each format can and cannot do well
- references/troubleshooting.md: dependency checks and common failure modes
