DOCX creation, editing, and analysis
Overview
A .docx file is a ZIP archive containing XML files.
Quick Reference
| Task | Approach |
|---|---|
| Read/analyze content | pandoc or unpack for raw XML |
| Create new document | Use docx-js - see Creating New Documents below |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
Converting .doc to .docx
Legacy .doc files must be converted before editing:
CODEBLOCK0
Reading Content
CODEBLOCK1
Converting to Images
CODEBLOCK2
Accepting Tracked Changes
To produce a clean document with all tracked changes accepted (requires LibreOffice):
CODEBLOCK3
Creating New Documents
Generate .docx files with JavaScript, then validate. Install: npm install -g docx
Setup
CODEBLOCK4
Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
CODEBLOCK5
Page Size
CODEBLOCK6
Common page sizes (DXA units, 1440 DXA = 1 inch):
| Paper | Width | Height | Content Width (1" margins) |
|---|---|---|---|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
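The content widths in the table are plain subtraction: page width minus the left and right margins. A minimal sketch of that arithmetic, assuming illustrative helper names (`inchesToDxa`, `contentWidthDxa`) that are not part of docx-js:

```javascript
// Hypothetical helpers (not part of docx-js): DXA arithmetic for page layout.
const DXA_PER_INCH = 1440;

// Convert inches to DXA, rounded to a whole unit.
function inchesToDxa(inches) {
  return Math.round(inches * DXA_PER_INCH);
}

// Usable width between the left and right margins, in DXA (defaults: 1 inch each).
function contentWidthDxa(pageWidth, marginLeft = 1440, marginRight = 1440) {
  return pageWidth - marginLeft - marginRight;
}

console.log(contentWidthDxa(inchesToDxa(8.5))); // US Letter: 9360
console.log(contentWidthDxa(11906));            // A4: 9026
```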
Landscape orientation: docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
CODEBLOCK7
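In landscape the long edge becomes the horizontal dimension, so content width is computed from the value passed as height. A small sketch of that calculation, assuming a hypothetical helper (`usableWidthDxa`) that is not part of docx-js:

```javascript
// Hypothetical helper (not part of docx-js): content width for either orientation.
// Pass portrait dimensions (short edge as width), as recommended above.
function usableWidthDxa({ width, height, landscape = false,
                          marginLeft = 1440, marginRight = 1440 }) {
  // In landscape, the long edge is the one that runs horizontally.
  const pageWidth = landscape ? Math.max(width, height) : width;
  return pageWidth - marginLeft - marginRight;
}

console.log(usableWidthDxa({ width: 12240, height: 15840 }));                  // 9360
console.log(usableWidthDxa({ width: 12240, height: 15840, landscape: true })); // 12960
```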
Styles (Override Built-in Headings)
Use Arial as the default font (universally supported). Keep titles black for readability.
CODEBLOCK8
Lists (NEVER use unicode bullets)
CODEBLOCK9
Tables
CRITICAL: Tables need dual widths - set both columnWidths on the table AND width on each cell. Without both, tables render incorrectly on some platforms.
CODEBLOCK10
Table width calculation:
Always use WidthType.DXA — WidthType.PERCENTAGE breaks in Google Docs.
CODEBLOCK11
Width rules:
- Always use WidthType.DXA, never WidthType.PERCENTAGE (incompatible with Google Docs)
- Table width must equal the sum of columnWidths
- Cell width must match the corresponding columnWidths entry
- Cell margins are internal padding - they reduce content area, not add to cell width
- For full-width tables: use content width (page width minus left and right margins)
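One way to keep the table width and the column widths in exact agreement is to derive the columns from the table width. The following is a sketch under that assumption; `splitColumnWidths` is a hypothetical helper, not a docx-js function:

```javascript
// Hypothetical helper (not part of docx-js): split a table width into columns.
// Widths are whole DXA values; the last column absorbs the rounding remainder,
// so the sum always equals the table width exactly.
function splitColumnWidths(tableWidth, weights) {
  const totalWeight = weights.reduce((a, b) => a + b, 0);
  const widths = weights.map(w => Math.floor((tableWidth * w) / totalWeight));
  const used = widths.reduce((a, b) => a + b, 0);
  widths[widths.length - 1] += tableWidth - used;
  return widths;
}

const columnWidths = splitColumnWidths(9360, [3, 1]);
console.log(columnWidths); // [ 7020, 2340 ]
```

The same array can then feed both columnWidths on the table and the width of each cell, which keeps the two in sync.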
Images
CODEBLOCK12
Page Breaks
CODEBLOCK13
Table of Contents
CODEBLOCK14
Headers/Footers
CODEBLOCK15
Critical Rules for docx-js
- Set page size explicitly - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- Landscape: pass portrait dimensions - docx-js swaps width/height internally; pass short edge as width, long edge as height, and set orientation: PageOrientation.LANDSCAPE
- Never use \n - use separate Paragraph elements
- Never use unicode bullets - use LevelFormat.BULLET with numbering config
- PageBreak must be in Paragraph - standalone creates invalid XML
- ImageRun requires type - always specify png/jpg/etc
- Always set table width with DXA - never use WidthType.PERCENTAGE (breaks in Google Docs)
- Tables need dual widths - columnWidths array AND cell width, both must match
- Table width = sum of columnWidths - for DXA, ensure they add up exactly
- Always add cell margins - use margins: { top: 80, bottom: 80, left: 120, right: 120 } for readable padding
- Use ShadingType.CLEAR - never SOLID for table shading
- TOC requires HeadingLevel only - no custom styles on heading paragraphs
- Override built-in styles - use exact IDs: "Heading1", "Heading2", etc.
- Include outlineLevel - required for TOC (0 for H1, 1 for H2, etc.)
Editing Existing Documents
Follow all 3 steps in order.
Step 1: Unpack
python scripts/office/unpack.py document.docx unpacked/
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (&#x201C; etc.) so they survive editing. Use --merge-runs false to skip run merging.
Step 2: Edit XML
Edit files in unpacked/word/. See XML Reference below for patterns.
Use "Claude" as the author for tracked changes and comments, unless the user explicitly requests use of a different name.
Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
CRITICAL: Use smart quotes for new content. When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
| Entity | Character |
|---|---|
| &#x2018; | ‘ (left single) |
| &#x2019; | ’ (right single / apostrophe) |
| &#x201C; | “ (left double) |
| &#x201D; | ” (right double) |
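To make the mapping concrete, here is a minimal sketch that pre-escapes a string for use inside <w:t> or as comment text. The helper name `escapeForWordXml` is hypothetical, and the numeric entity form is an assumption; the edits themselves are still made directly in the XML.

```javascript
// Hypothetical helper: pre-escape text before placing it inside <w:t>.
const SMART_QUOTE_ENTITIES = {
  "\u2018": "&#x2018;", // left single
  "\u2019": "&#x2019;", // right single / apostrophe
  "\u201C": "&#x201C;", // left double
  "\u201D": "&#x201D;", // right double
};

function escapeForWordXml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so the entities added below are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/[\u2018\u2019\u201C\u201D]/g, ch => SMART_QUOTE_ENTITIES[ch]);
}

console.log(escapeForWordXml("Here\u2019s a quote: \u201CHello\u201D"));
// Here&#x2019;s a quote: &#x201C;Hello&#x201D;
```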
Adding comments: Use comment.py to handle boilerplate across multiple XML files (text must be pre-escaped XML):
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author" # custom author name
Then add markers to document.xml (see Comments in XML Reference).
Step 3: Pack
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Validates with auto-repair, condenses XML, and creates DOCX. Use --validate false to skip.
Auto-repair will fix:
- durableId >= 0x7FFFFFFF (regenerates valid ID)
- Missing xml:space="preserve" on <w:t> with whitespace
Auto-repair won't fix:
- Malformed XML, invalid element nesting, missing relationships, schema violations
Common Pitfalls
- Replace entire <w:r> elements: When adding tracked changes, replace the whole <w:r>...</w:r> block with <w:del>...<w:ins>... as siblings. Don't inject tracked change tags inside a run.
- Preserve <w:rPr> formatting: Copy the original run's <w:rPr> block into your tracked change runs to maintain bold, font size, etc.
XML Reference
Schema Compliance
- Element order in <w:pPr>: <w:pStyle>, <w:numPr>, <w:spacing>, <w:ind>, <w:jc>, <w:rPr> last
- Whitespace: Add xml:space="preserve" to <w:t> with leading/trailing spaces
- RSIDs: Must be 8-digit hex (e.g., 00AB1234)
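These three rules can be expressed as small checks. The sketch below is illustrative only: the function names are hypothetical, and the order list covers just the <w:pPr> children named above, not the full schema.

```javascript
// Hypothetical checks for the rules above; only the listed <w:pPr> children are covered.
const PPR_ORDER = ["w:pStyle", "w:numPr", "w:spacing", "w:ind", "w:jc", "w:rPr"];

// True if the known children appear in the listed order (unknown names are ignored).
function isPPrOrderValid(childNames) {
  const ranks = childNames.map(n => PPR_ORDER.indexOf(n)).filter(r => r !== -1);
  return ranks.every((r, i) => i === 0 || ranks[i - 1] <= r);
}

// RSIDs must be exactly 8 hexadecimal digits.
function isValidRsid(value) {
  return /^[0-9A-Fa-f]{8}$/.test(value);
}

// <w:t> needs xml:space="preserve" when the text starts or ends with whitespace.
function needsSpacePreserve(text) {
  return /^\s|\s$/.test(text);
}

console.log(isPPrOrderValid(["w:pStyle", "w:jc", "w:rPr"])); // true
console.log(isPPrOrderValid(["w:rPr", "w:jc"]));             // false
console.log(isValidRsid("00AB1234"));                        // true
```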
Tracked Changes
Insertion:
CODEBLOCK20
Deletion:
CODEBLOCK21
Inside <w:del>: Use <w:delText> instead of <w:t>, and <w:delInstrText> instead of <w:instrText>.
Minimal edits - only mark what changes:
CODEBLOCK22
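The rules above (sibling <w:del>/<w:ins> elements, <w:delText> inside deletions, copied <w:rPr>) can be combined into one string template. This is a sketch, not a tool shipped with these scripts: `trackedReplacement` is a hypothetical helper, its inputs are assumed to be XML-escaped already, and the ids are assumed to be unique within the document.

```javascript
// Hypothetical helper: build a tracked replacement as sibling <w:del>/<w:ins> elements.
// oldText, newText and rPr must already be XML-escaped; ids must be unique in the document.
function trackedReplacement({ oldText, newText, rPr = "", delId, insId,
                              author = "Claude", date = "2025-01-01T00:00:00Z" }) {
  const attrs = id => `w:id="${id}" w:author="${author}" w:date="${date}"`;
  // Keep leading/trailing whitespace from being dropped.
  const space = t => (/^\s|\s$/.test(t) ? ' xml:space="preserve"' : "");
  return (
    `<w:del ${attrs(delId)}><w:r>${rPr}` +
    `<w:delText${space(oldText)}>${oldText}</w:delText></w:r></w:del>` +
    `<w:ins ${attrs(insId)}><w:r>${rPr}` +
    `<w:t${space(newText)}>${newText}</w:t></w:r></w:ins>`
  );
}

console.log(trackedReplacement({
  oldText: "30", newText: "60", rPr: "<w:rPr><w:b/></w:rPr>", delId: 1, insId: 2,
}));
```

The returned string replaces the whole original <w:r>...</w:r> block; unchanged text before and after stays in its own untouched runs.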
Deleting entire paragraphs/list items - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add <w:del/> inside <w:pPr><w:rPr>:
<w:p>
  <w:pPr>
    <w:numPr>...</w:numPr>  <!-- list numbering if present -->
    <w:rPr>
      <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
    </w:rPr>
  </w:pPr>
  <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
    <w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
  </w:del>
</w:p>
Without the <w:del/> in <w:pPr><w:rPr>, accepting changes leaves an empty paragraph/list item.
Rejecting another author's insertion - nest deletion inside their insertion:
CODEBLOCK24
Restoring another author's deletion - add insertion after (don't modify their deletion):
CODEBLOCK25
Comments
After running comment.py (see Step 2), add markers to document.xml. For replies, use --parent flag and nest markers inside the parent's.
CRITICAL: <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.
CODEBLOCK26
Images
1. Add image file to word/media/
2. Add relationship to word/_rels/document.xml.rels:
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
3. Add content type to [Content_Types].xml:
<Default Extension="png" ContentType="image/png"/>
4. Reference in document.xml:
<w:drawing>
  <wp:inline>
    <wp:extent cx="914400" cy="914400"/>  <!-- EMUs: 914400 = 1 inch -->
    <a:graphic>
      <a:graphicData uri=".../picture">
        <pic:pic>
          <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
        </pic:pic>
      </a:graphicData>
    </a:graphic>
  </wp:inline>
</w:drawing>
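The cx and cy values on <wp:extent> are in EMUs, so image dimensions need converting first. A minimal sketch with hypothetical helper names; the 96 DPI default for pixel input is an assumption about the source image, not a DOCX rule:

```javascript
// Hypothetical helpers: convert image dimensions to EMUs for <wp:extent>.
const EMU_PER_INCH = 914400;

function inchesToEmu(inches) {
  return Math.round(inches * EMU_PER_INCH);
}

// Assumes the image's DPI is known; 96 is a common screen default.
function pixelsToEmu(pixels, dpi = 96) {
  return Math.round((pixels / dpi) * EMU_PER_INCH);
}

// Build the extent element for a 2 x 1.5 inch image.
const extent = `<wp:extent cx="${inchesToEmu(2)}" cy="${inchesToEmu(1.5)}"/>`;
console.log(extent); // <wp:extent cx="1828800" cy="1371600"/>
```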
Dependencies
- pandoc: Text extraction
- docx: npm install -g docx (new documents)
- LibreOffice: PDF conversion (auto-configured for sandboxed environments via scripts/office/soffice.py)
- Poppler: pdftoppm for images
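DOCX creation, editing, and analysis
Overview
A .docx file is a ZIP archive containing XML files.
Quick Reference
| Task | Approach |
|---|---|
| Read/analyze content | pandoc or unpack for raw XML |
| Create new document | Use docx-js - see Creating New Documents below |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
Converting .doc to .docx
Legacy .doc files must be converted before editing:
```bash
python scripts/office/soffice.py --headless --convert-to docx document.doc
```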
Reading Content
```bash
# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md
# Raw XML access
python scripts/office/unpack.py document.docx unpacked/
```
Converting to Images
```bash
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
```
Accepting Tracked Changes
To produce a clean document with all tracked changes accepted (requires LibreOffice):
```bash
python scripts/accept_changes.py input.docx output.docx
```
Creating New Documents
Generate .docx files with JavaScript, then validate. Install: npm install -g docx
Setup
```javascript
const fs = require("fs"); // needed for writeFileSync below
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
        Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
        TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
        VerticalAlign, PageNumber, PageBreak } = require("docx");
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
```
Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
```bash
python scripts/office/validate.py doc.docx
```
Page Size
```javascript
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
  properties: {
    page: {
      size: {
        width: 12240,   // 8.5 inches in DXA
        height: 15840   // 11 inches in DXA
      },
      margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
    }
  },
  children: [/* content */]
}]
```
Common page sizes (DXA units, 1440 DXA = 1 inch):
| Paper | Width | Height | Content Width (1" margins) |
|---|---|---|---|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
Landscape orientation: docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
```javascript
size: {
  width: 12240,   // Pass the short edge as width
  height: 15840,  // Pass the long edge as height
  orientation: PageOrientation.LANDSCAPE  // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)
```
Styles (Override Built-in Headings)
Use Arial as the default font (universally supported). Keep titles black for readability.
```javascript
const doc = new Document({
  styles: {
    default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
    paragraphStyles: [
      // IMPORTANT: Use exact IDs to override built-in styles
      { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 32, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
      { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 28, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
    ]
  }]
});
```
Lists (NEVER use unicode bullets)
```javascript
// ❌ WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] })       // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] })  // BAD
// ✅ CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
  numbering: {
    config: [
      { reference: "bullets",
        levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
      { reference: "numbers",
        levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ numbering: { reference: "bullets", level: 0 },
        children: [new TextRun("Bullet item")] }),
      new Paragraph({ numbering: { reference: "numbers", level: 0 },
        children: [new TextRun("Numbered item")] }),
    ]
  }]
});
// ⚠️ Each reference creates independent numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)
```
Tables
CRITICAL: Tables need dual widths - set both columnWidths on the table AND width on each cell. Without both, tables render incorrectly on some platforms.
```javascript
// CRITICAL: Always set table width for consistent rendering
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
new Table({
  width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
  columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
  rows: [
    new TableRow({
      children: [
        new TableCell({
          borders,
          width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
          shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR, not SOLID
          margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
          children: [new Paragraph({ children: [new TextRun("Cell")] })]
        })
      ]
    })
  ]
})
```
Table width calculation:
Always use WidthType.DXA - WidthType.PERCENTAGE breaks in Google Docs.
```javascript
// Table width = sum of columnWidths = content width
// US Letter with 1 inch margins: 12240 - 2880 = 9360 DXA
width: { size: 9360, type: WidthType.DXA },
columnWidths: [7000, 2360] // Must sum to table width
```
Width rules:
- Always use WidthType.DXA, never WidthType.PERCENTAGE (incompatible with Google Docs)
- Table width must equal the sum of columnWidths