Office Documents Skill
This skill provides comprehensive tools and workflows for working with Microsoft Word (.docx) and WPS Office documents. It covers creation, editing, conversion, analysis, and troubleshooting of professional documents.
Quick Start
Basic Operations
Read document content:
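```python
# Use python-docx to work with .docx files
from docx import Document

doc = Document("document.docx")
text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
```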
Create new document:
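```python
from docx import Document

doc = Document()
doc.add_heading("Document Title", 0)  # level 0 is the title style
doc.add_paragraph("This is a new paragraph.")
doc.save("new_document.docx")
```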
Common Tasks
1. Text extraction - See TEXTEXTRACTION.md
2. Format conversion - See CONVERSION.md
3. Document analysis - See ANALYSIS.md
4. Troubleshooting - See TROUBLESHOOTING.md
Core Tools and Libraries
Python Libraries
For .docx files:
- python-docx - Primary library for reading and writing .docx files
- docx2txt - Simple text extraction
- docxcompose - Advanced document composition
- docx-mailmerge - Mail merge functionality
For WPS files:
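- pywps - WPS file manipulation (when available); note that the pywps package on PyPI implements the OGC Web Processing Service and does not read WPS Office files
- Converting to .docx first is recommended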
For format conversion:
- pandoc - Universal document converter
- libreoffice - Office suite that can be used for conversion
- unoconv - Universal office converter
Command Line Tools
Document conversion:
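```bash
# Convert .docx to PDF
libreoffice --headless --convert-to pdf document.docx

# Convert .docx to plain text (-t plain avoids pandoc treating .txt as Markdown)
pandoc document.docx -t plain -o document.txt

# Batch-convert WPS files to .docx
for file in *.wps; do libreoffice --headless --convert-to docx "$file"; done
```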
Document analysis:
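```bash
# Extract metadata
exiftool document.docx

# Check file integrity (file only identifies the file type)
file document.docx
```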
Workflows
1. Document Creation Workflow
When creating new documents:
1. Choose template - Start from a template or create from scratch
2. Add structure - Headings, paragraphs, lists
3. Apply formatting - Styles, fonts, spacing
4. Add elements - Tables, images, hyperlinks
5. Finalize - Page setup, headers/footers, save
See CREATION.md for detailed patterns.
2. Document Editing Workflow
When modifying existing documents:
1. Back up the original - Always create a backup first
2. Analyze structure - Understand the document layout
3. Make changes - Edit content, update formatting
4. Preserve formatting - Maintain the original styles
5. Validate - Check for corruption, save a new version
See EDITING.md for detailed patterns.
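The backup and validation steps above can be sketched with the standard library alone. This is a minimal sketch rather than the skill's own implementation: make_timestamped_backup and looks_like_valid_docx are hypothetical helper names, and the check assumes the main document part is stored at word/document.xml, as it is in files written by Word and python-docx.

```python
import shutil
import zipfile
from datetime import datetime
from pathlib import Path


def make_timestamped_backup(path, stamp=None):
    """Copy the file next to the original, adding a timestamp to the name."""
    src = Path(path)
    stamp = stamp or datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = src.with_name(f"{src.stem}.{stamp}.bak{src.suffix}")
    shutil.copy2(src, backup)  # copy2 also preserves file metadata
    return backup


def looks_like_valid_docx(path):
    """Cheap structural check: a readable ZIP that contains the main part."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the first member with a bad CRC, or None
            if archive.testzip() is not None:
                return False
            # Assumption: the main part lives at word/document.xml
            return "word/document.xml" in archive.namelist()
    except (zipfile.BadZipFile, OSError):
        return False
```

A passing check only shows that the archive is intact; it does not prove that Word or WPS will render the file as expected, so opening the result in the target application is still worthwhile.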
3. Conversion Workflow
When converting between formats:
1. Identify source format - .docx, .wps, .doc, .rtf, etc.
2. Choose conversion tool - Based on format and requirements
3. Convert - With appropriate options
4. Verify - Check content preservation
5. Clean up - Remove temporary files
See CONVERSION.md for detailed patterns.
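The "choose conversion tool" step can be illustrated by a small function that picks a tool from the file extensions and builds the command without running it. This is a sketch under stated assumptions: build_convert_command and the PANDOC_WRITERS table are hypothetical, and the rule (pandoc for .docx to text-like formats, LibreOffice for everything else) is one reasonable policy, not the one prescribed by this skill.

```python
from pathlib import Path

# Assumption: pandoc is used only for .docx input and these text-like outputs
PANDOC_WRITERS = {".txt": "plain", ".md": "markdown", ".html": "html"}


def build_convert_command(source, target_ext, outdir="."):
    """Return the argument list for converting source to target_ext."""
    src = Path(source)
    ext = "." + target_ext.lower().lstrip(".")  # accept "pdf" or ".pdf"
    if src.suffix.lower() == ".docx" and ext in PANDOC_WRITERS:
        out = Path(outdir) / (src.stem + ext)
        return ["pandoc", str(src), "-t", PANDOC_WRITERS[ext], "-o", str(out)]
    # Fall back to LibreOffice for .wps, .doc, .rtf and for PDF output
    return [
        "libreoffice", "--headless",
        "--convert-to", ext.lstrip("."),
        "--outdir", str(outdir),
        str(src),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True). Building the command separately from running it keeps the tool choice easy to test and to log.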
Common Issues and Solutions
1. Corrupted Documents
Symptoms: Won't open, error messages, missing content
Solutions:
- Try opening the file in a different application
- Use recovery mode in Word/WPS
- Extract content with python-docx, ignoring errors
- Convert to a different format and back
See TROUBLESHOOTING.md for detailed recovery procedures.
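When python-docx itself refuses to open a file, the text can sometimes still be salvaged, because a .docx is a ZIP archive whose body is XML. The sketch below is an illustration of that idea using only the standard library; salvage_text is a hypothetical helper, and it assumes the body is stored at word/document.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements such as w:p and w:t
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def salvage_text(path):
    """Return paragraph texts from word/document.xml, or [] if unreadable."""
    try:
        with zipfile.ZipFile(path) as archive:
            xml_bytes = archive.read("word/document.xml")
        root = ET.fromstring(xml_bytes)
    except (zipfile.BadZipFile, KeyError, ET.ParseError, OSError):
        return []
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join every text run (w:t) that belongs to this paragraph
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

This recovers plain text only: styles, images, and tables are not reconstructed, and a file whose ZIP directory is damaged will still come back empty.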
2. Formatting Issues
Symptoms: Wrong fonts, broken layout, missing styles
Solutions:
- Check style definitions
- Verify font availability
- Use a template-based approach
- Simplify complex formatting
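To verify font availability, it helps to know which fonts a document actually references. The following sketch lists the font names found in the body and the style definitions by reading the XML directly. It is an illustration, not part of the skill's scripts: fonts_referenced is a hypothetical helper, it assumes the conventional part names word/document.xml and word/styles.xml, and it ignores theme fonts.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Attributes of w:rFonts that carry explicit font names
FONT_ATTRS = ("ascii", "hAnsi", "eastAsia", "cs")


def fonts_referenced(path):
    """Return the sorted set of font names named in the body and styles."""
    fonts = set()
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        for part in ("word/document.xml", "word/styles.xml"):
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            for rfonts in root.iter(f"{{{W_NS}}}rFonts"):
                for attr in FONT_ATTRS:
                    name = rfonts.get(f"{{{W_NS}}}{attr}")
                    if name:
                        fonts.add(name)
    return sorted(fonts)
```

Comparing this list with the fonts installed on the target machine shows which substitutions Word or WPS is likely to make.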
3. Compatibility Problems
Symptoms: Different appearance in Word vs WPS, missing features
Solutions:
- Stick to common features
- Test in both applications
- Use standard formats
- Provide alternative versions
Advanced Features
Document Automation
Batch processing:
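```python
import os

def process_documents(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith(".docx"):
            doc_path = os.path.join(folder_path, filename)
            # process_single_document is a user-supplied function (not defined here)
            process_single_document(doc_path)
```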
Template-based generation:
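```python
from docx import Document

def generate_from_template(template_path, data):
    doc = Document(template_path)
    # Replace "{{ key }}" placeholders with data
    for paragraph in doc.paragraphs:
        for key, value in data.items():
            placeholder = f"{{{{ {key} }}}}"
            if placeholder in paragraph.text:
                # Assigning paragraph.text discards run-level formatting
                paragraph.text = paragraph.text.replace(placeholder, str(value))
    return doc
```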
Document Analysis
Extract statistics:
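```python
from docx import Document

def analyze_document(doc_path):
    doc = Document(doc_path)
    stats = {
        "paragraphs": len(doc.paragraphs),
        "tables": len(doc.tables),
        "images": len(doc.inline_shapes),
        "sections": len(doc.sections),
        "styles": len(doc.styles),
    }
    return stats
```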
Check formatting consistency:
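```python
def check_formatting(doc):
    issues = []
    for i, para in enumerate(doc.paragraphs):
        if para.style.name == "Normal" and para.text.strip():
            # More than one run usually means mixed inline formatting
            if len(para.runs) > 1:
                issues.append(f"Paragraph {i}: multiple runs in Normal style")
    return issues
```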
Best Practices
1. Always Backup
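```python
import shutil

def backup_document(filepath):
    backup_path = filepath + ".backup"
    shutil.copy2(filepath, backup_path)  # copy2 also preserves file metadata
    return backup_path
```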
2. Use Version Control
- Save incremental versions
- Use descriptive filenames
- Document the changes made
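Incremental versions are easier to keep consistent when the next filename is computed rather than typed. The helper below is a minimal sketch of one naming scheme (a _vN suffix before the extension); next_version_path is a hypothetical name and the scheme itself is an assumption, not a convention defined by this skill.

```python
import re
from pathlib import Path


def next_version_path(path):
    """Return the next unused 'name_vN.ext' path beside the given file."""
    src = Path(path)
    # Continue from an existing suffix, so report_v2.docx leads to report_v3.docx
    match = re.fullmatch(r"(.*)_v(\d+)", src.stem)
    if match:
        base, version = match.group(1), int(match.group(2)) + 1
    else:
        base, version = src.stem, 1
    while True:
        candidate = src.with_name(f"{base}_v{version}{src.suffix}")
        if not candidate.exists():
            return candidate
        version += 1
```

Saving each edit to next_version_path(original) leaves every earlier version untouched, which pairs naturally with keeping a short change note per version.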
3. Test Thoroughly
- Test in the target application
- Verify that all content is preserved
- Check formatting integrity
4. Handle Errors Gracefully
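```python
from docx import Document

def open_document(filepath):
    try:
        return Document(filepath)
    except Exception as e:
        print(f"Error opening {filepath}: {e}")
        # Try an alternative method; extract_text_fallback is a
        # user-supplied helper that is not defined in this skill
        return extract_text_fallback(filepath)
```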
Reference Files
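For detailed information on specific topics, consult the reference files cited throughout this skill: TEXTEXTRACTION.md, CONVERSION.md, ANALYSIS.md, TROUBLESHOOTING.md, CREATION.md, and EDITING.md.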
Scripts
Available scripts in the scripts/ directory:
- extract_text.py - Extract text from .docx files
- convert_format.py - Convert between document formats
- batch_process.py - Process multiple documents
- document_stats.py - Generate document statistics
- repair_document.py - Attempt to repair corrupted documents
Run scripts with appropriate parameters:
```bash
python scripts/extract_text.py input.docx output.txt
```
Getting Help
If you encounter issues not covered in this skill:
1. Check the relevant reference file
2. Search for the specific error message
3. Try alternative approaches
4. Consider converting to a simpler format
Remember: When in doubt, create a backup and work on a copy.
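Office Documents Skill
This skill provides a comprehensive set of tools and workflows for working with Microsoft Word (.docx) and WPS Office documents. It covers creating, editing, converting, analyzing, and troubleshooting professional documents.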
Quick Start
Basic Operations
Read document content:
```python
# Use python-docx to work with .docx files
from docx import Document

doc = Document("document.docx")
text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
```
Create a new document:
```python
from docx import Document

doc = Document()
doc.add_heading("Document Title", 0)  # level 0 is the title style
doc.add_paragraph("This is a new paragraph.")
doc.save("new_document.docx")
```
Common Tasks
1. Text extraction - See TEXTEXTRACTION.md
2. Format conversion - See CONVERSION.md
3. Document analysis - See ANALYSIS.md
4. Troubleshooting - See TROUBLESHOOTING.md
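Core Tools and Libraries
Python Libraries
For .docx files:
- python-docx - Primary library for reading and writing .docx files
- docx2txt - Simple text extraction
- docxcompose - Advanced document composition
- docx-mailmerge - Mail merge functionality
For WPS files:
- pywps - WPS file manipulation (when available); note that the pywps package on PyPI implements the OGC Web Processing Service and does not read WPS Office files
- Converting to .docx first is recommended
For format conversion:
- pandoc - Universal document converter
- libreoffice - Office suite that can be used for conversion
- unoconv - Universal office document converter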
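Command Line Tools
Document conversion:
```bash
# Convert .docx to PDF
libreoffice --headless --convert-to pdf document.docx

# Convert .docx to plain text (-t plain avoids pandoc treating .txt as Markdown)
pandoc document.docx -t plain -o document.txt

# Batch-convert WPS files to .docx
for file in *.wps; do libreoffice --headless --convert-to docx "$file"; done
```
Document analysis:
```bash
# Extract metadata
exiftool document.docx

# Check file integrity (file only identifies the file type)
file document.docx
```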
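Workflows
1. Document Creation Workflow
When creating new documents:
1. Choose template - Start from a template or create from scratch
2. Add structure - Headings, paragraphs, lists
3. Apply formatting - Styles, fonts, spacing
4. Add elements - Tables, images, hyperlinks
5. Finalize - Page setup, headers/footers, save
See CREATION.md for detailed patterns.
2. Document Editing Workflow
When modifying existing documents:
1. Back up the original - Always create a backup first
2. Analyze structure - Understand the document layout
3. Make changes - Edit content, update formatting
4. Preserve formatting - Maintain the original styles
5. Validate - Check for corruption, save a new version
See EDITING.md for detailed patterns.
3. Conversion Workflow
When converting between formats:
1. Identify source format - .docx, .wps, .doc, .rtf, etc.
2. Choose conversion tool - Based on format and requirements
3. Convert - With appropriate options
4. Verify - Check that content is preserved
5. Clean up - Remove temporary files
See CONVERSION.md for detailed patterns.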
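Common Issues and Solutions
1. Corrupted Documents
Symptoms: Won't open, error messages, missing content
Solutions:
- Try opening the file in a different application
- Use recovery mode in Word/WPS
- Extract content with python-docx, ignoring errors
- Convert to a different format and back
See TROUBLESHOOTING.md for detailed recovery procedures.
2. Formatting Issues
Symptoms: Wrong fonts, broken layout, missing styles
Solutions:
- Check style definitions
- Verify font availability
- Use a template-based approach
- Simplify complex formatting
3. Compatibility Problems
Symptoms: Different appearance in Word vs WPS, missing features
Solutions:
- Stick to common features
- Test in both applications
- Use standard formats
- Provide alternative versions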
Advanced Features
Document Automation
Batch processing:
```python
import os

def process_documents(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith(".docx"):
            doc_path = os.path.join(folder_path, filename)
            # process_single_document is a user-supplied function (not defined here)
            process_single_document(doc_path)
```
Template-based generation:
```python
from docx import Document

def generate_from_template(template_path, data):
    doc = Document(template_path)
    # Replace "{{ key }}" placeholders with data
    for paragraph in doc.paragraphs:
        for key, value in data.items():
            placeholder = f"{{{{ {key} }}}}"
            if placeholder in paragraph.text:
                # Assigning paragraph.text discards run-level formatting
                paragraph.text = paragraph.text.replace(placeholder, str(value))
    return doc
```
Document Analysis
Extract statistics:
```python
from docx import Document

def analyze_document(doc_path):
    doc = Document(doc_path)
    stats = {
        "paragraphs": len(doc.paragraphs),
        "tables": len(doc.tables),
        "images": len(doc.inline_shapes),
        "sections": len(doc.sections),
        "styles": len(doc.styles),
    }
    return stats
```
Check formatting consistency:
```python
def check_formatting(doc):
    issues = []
    for i, para in enumerate(doc.paragraphs):
        if para.style.name == "Normal" and para.text.strip():
            # More than one run usually means mixed inline formatting
            if len(para.runs) > 1:
                issues.append(f"Paragraph {i}: multiple runs in Normal style")
    return issues
```
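Best Practices
1. Always Back Up
```python
import shutil

def backup_document(filepath):
    backup_path = filepath + ".backup"
    shutil.copy2(filepath, backup_path)  # copy2 also preserves file metadata
    return backup_path
```
2. Use Version Control
3. Test Thoroughly
- Test in the target application
- Verify that all content is preserved
- Check formatting integrity
4. Handle Errors Gracefully
```python
from docx import Document

def open_document(filepath):
    try:
        return Document(filepath)
    except Exception as e:
        print(f"Error opening {filepath}: {e}")
        # Try an alternative method; extract_text_fallback is a
        # user-supplied helper that is not defined in this skill
        return extract_text_fallback(filepath)
```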
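Reference Files
For detailed information on specific topics, consult the reference files cited throughout this skill: TEXTEXTRACTION.md, CONVERSION.md, ANALYSIS.md, TROUBLESHOOTING.md, CREATION.md, and EDITING.md.
Scripts
Available scripts in the scripts/ directory:
- extract_text.py - Extract text from .docx files
- convert_format.py - Convert between document formats
- batch_process.py - Process multiple documents in a batch
- document_stats.py - Generate document statistics
- repair_document.py - Attempt to repair corrupted documents
Run scripts with appropriate parameters:
```bash
python scripts/extract_text.py input.docx output.txt
```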
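Getting Help
If you encounter issues not covered in this skill:
1. Check the relevant reference file
2. Search for the specific error message
3. Try alternative approaches
4. Consider converting to a simpler format
Remember: When in doubt, create a backup and work on a copy.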