Docling Document Content Extraction

Extract and parse content from web pages, PDFs, documents (docx, pptx), and images using the docling CLI with GPU acceleration. Use INSTEAD of web_fetch for extracting content from specific URLs when you need clean, structured text. Use Brave (web_search) for searching/discovering pages. Use docling when you HAVE a URL and need its content parsed.

Author: admin | Source: ClawHub

Docling - Document & Web Content Extraction

CLI tool for parsing documents and web pages into clean, structured text. Uses GPU acceleration for OCR and ML models.

Prerequisites

- docling CLI must be installed (e.g., via pipx install docling)
- For GPU support: NVIDIA GPU with CUDA drivers

When to Use

- Extract content from a URL → Use docling (not web_fetch)
- Search for information → Use web_search (Brave)
- Parse PDFs, DOCX, PPTX → Use docling
- OCR on images → Use docling

Quick Commands

Web Page → Markdown (default)

docling "<URL>" --from html --to md

Output: creates a .md file in the current directory (use --output to write to a different directory)

Web Page → Plain Text

docling "<URL>" --from html --to text --output /tmp/docling_out

PDF with OCR

docling /path/to/file.pdf --ocr --device cuda --output /tmp/docling_out
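
When docling is driven from a script rather than typed by hand, the quick commands above can be assembled as argument lists. The following is a minimal sketch that uses only the flags shown in this section (--from, --to, --ocr, --device, --output); the function name build_docling_command and its parameters are hypothetical and not part of docling.

```python
from typing import List, Optional


def build_docling_command(
    source: str,
    from_format: Optional[str] = None,
    to_format: Optional[str] = None,
    output_dir: Optional[str] = None,
    ocr: bool = False,
    device: Optional[str] = None,
) -> List[str]:
    """Assemble a docling argv list (hypothetical helper, not a docling API)."""
    argv = ["docling", source]
    if from_format:
        argv += ["--from", from_format]
    if to_format:
        argv += ["--to", to_format]
    if ocr:
        argv.append("--ocr")
    if device:
        argv += ["--device", device]
    if output_dir:
        argv += ["--output", output_dir]
    return argv
```

Passing the resulting list to subprocess.run avoids shell quoting problems with URLs that contain characters such as & or ?.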

Key Options

Option	Values	Description
--from	html, pdf, docx, pptx, image, md, csv, xlsx	Input format
--to
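
A caller often needs to pick the --from value before invoking docling. The sketch below guesses it from a file extension or URL. The format names come from the table above, but the extension mapping, the fallback to html for web URLs, and the name guess_from_format are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Format names are the --from values listed above; which extensions map to
# which format is an assumption of this sketch.
EXTENSION_TO_FORMAT = {
    ".html": "html", ".htm": "html",
    ".pdf": "pdf", ".docx": "docx", ".pptx": "pptx",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".md": "md", ".csv": "csv", ".xlsx": "xlsx",
}


def guess_from_format(source: str) -> Optional[str]:
    """Guess a --from value for a local path or an http(s) URL."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in EXTENSION_TO_FORMAT:
        return EXTENSION_TO_FORMAT[suffix]
    # Assumption: a web URL without a known extension serves an HTML page.
    return "html" if is_url else None
```

Returning None for unknown local files lets the caller omit --from rather than pass a wrong value.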

Security Notes

⚠️ Avoid these flags unless you trust the source:

- --enable-remote-services - can send data to remote endpoints
- --allow-external-plugins - loads third-party code
- Custom --headers with untrusted values - can redirect requests
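
If docling commands are built by an automated agent, the flags above can be checked before anything runs. This is a minimal sketch of such a pre-flight check; the name find_risky_flags is hypothetical, and it flags every use of --headers for review because it cannot tell trusted values from untrusted ones.

```python
from typing import List

# Flags named in the security notes above.
RISKY_FLAGS = ("--enable-remote-services", "--allow-external-plugins", "--headers")


def find_risky_flags(argv: List[str]) -> List[str]:
    """Return the risky flags present in argv, in order of first appearance."""
    found: List[str] = []
    for arg in argv:
        # Handle both "--flag value" and "--flag=value" spellings.
        name = arg.split("=", 1)[0]
        if name in RISKY_FLAGS and name not in found:
            found.append(name)
    return found
```

An empty result means none of the three flags were requested; a non-empty result is a prompt to confirm that the source is trusted.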

Workflow

1. For web content extraction: Use docling "<URL>" --from html --to text --output /tmp/docling_out
2. Read the output file from the specified output directory
3. Clean up the output directory after reading
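
The three workflow steps can be sketched as a single function: run docling, read whatever it wrote, then remove the directory. This sketch swaps the fixed /tmp/docling_out path for a fresh temporary directory so that parallel runs do not collide. The output file names are not specified in this document, so the code reads every file it finds; extract_text and its runner parameter are hypothetical names.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def extract_text(url: str, runner=subprocess.run) -> str:
    """Run docling on a URL, return the extracted text, and clean up."""
    out_dir = Path(tempfile.mkdtemp(prefix="docling_out_"))
    try:
        # Step 1: extract the page as plain text into out_dir.
        argv = ["docling", url, "--from", "html", "--to", "text",
                "--output", str(out_dir)]
        runner(argv, check=True)
        # Step 2: read the output. Assumption: docling writes one or more
        # files into out_dir; their names are not relied upon here.
        parts = [p.read_text(encoding="utf-8")
                 for p in sorted(out_dir.iterdir()) if p.is_file()]
        return "\n".join(parts)
    finally:
        # Step 3: remove the output directory even if a step above failed.
        shutil.rmtree(out_dir, ignore_errors=True)
```

The runner parameter exists only so the function can be exercised without docling installed; in normal use the default subprocess.run is enough.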

GPU Support

Docling supports GPU acceleration via CUDA (NVIDIA). Verify CUDA is available:
python -c "import torch; print(torch.cuda.is_available())"
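
The same check can choose the --device value at run time. This is a minimal sketch, assuming that cpu is an accepted --device value (this document only shows cuda) and that a missing PyTorch install should fall back to CPU; pick_device and ocr_command are hypothetical names.

```python
from typing import List


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable CUDA device, else "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch, run on the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def ocr_command(pdf_path: str, output_dir: str) -> List[str]:
    """Build the PDF OCR command with the device chosen at run time."""
    return ["docling", pdf_path, "--ocr", "--device", pick_device(),
            "--output", output_dir]
```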

Full CLI Reference

See references/cli-reference.md for the complete option list.
