Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion
Scope: Taiwan civil court judgments only (民事判決). This skill ingests Taiwan civil cases (HTML or PDF files) into Qdrant. All parsing, chunking, and embedding logic lives in scripts/ingest.py — your job is to run the script, not to reimplement the pipeline.
Quick Start (follow these steps in order)
Step 1 — Activate venv
```bash
source {baseDir}/.venv/bin/activate
```
Step 2 — Identify the run folder
The user will provide an absolute path to a run folder.
Example: `/path/to/output/judicialyuan/20260305_142030`
Verify it exists and has HTML or PDF files:
```bash
ls "<RUN_FOLDER>/archive/" | grep -E '\.(html|pdf)$' | head -5
```
If no archive/*.html or archive/*.pdf files → stop and tell the user the folder has no ingestible data.
Step 3 — Run ingestion
Use absolute paths throughout — no cd needed:
```bash
python3 {baseDir}/scripts/ingest.py \
  --run-folder "<RUN_FOLDER>"
```
The script handles everything: pre-flight checks, collection auto-creation (creates civil_case_doc / civil_case_chunk if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, manifest + report writing.
Re-running the same command on the same folder is always safe — deterministic IDs mean upsert = overwrite. No special --resume flag needed; just run the same command again.
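To illustrate why a re-run overwrites rather than duplicates, here is a minimal sketch of deterministic point IDs. It is not the code in scripts/ingest.py: the function name `point_id_for` and the choice of taking the first 128 bits of the SHA-256 digest are assumptions made for illustration only.

```python
import hashlib
import uuid


def point_id_for(canonical_text: str) -> str:
    """Derive a stable UUID string from the SHA-256 of the canonical text."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    # Hypothetical scheme: use the first 128 bits of the digest as the UUID value.
    return str(uuid.UUID(hex=digest[:32]))


first_run = point_id_for("主文\n原告之訴駁回。")
second_run = point_id_for("主文\n原告之訴駁回。")
# Same canonical text gives the same ID, so an upsert replaces the earlier point.
print(first_run == second_run)
```

Any scheme with this property would make an upsert of unchanged input idempotent, which is the behaviour the skill relies on.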
Step 4 — Check the result
Successful output looks like:
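```
OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
manifest=<RUN_FOLDER>/ingest_manifest.jsonl
report=<RUN_FOLDER>/ingest_report.md
```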
Read the report (human-readable stats summary):
```bash
cat "<RUN_FOLDER>/ingest_report.md"
```
If there are errors, check the manifest (machine-readable, one JSON line per file) for per-file diagnosis:
```bash
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```
Step 5 — Report to user
Tell the user:
- How many docs were ingested (`doc_points`)
- How many chunks were created (`chunk_points`)
- Whether any were skipped or errored
- Where the report file is
Done. Do not proceed to additional steps unless the user asks.
DO NOT rules (critical)
- DO NOT write your own HTML parsing, chunking, or embedding code. `ingest.py` handles all of this.
- DO NOT modify parsing/chunking logic casually. Only change heading detection or chunk fallback when the user explicitly asks to improve PDF/OCR robustness, and validate on a small sample before re-running a large batch.
- DO NOT call Qdrant or Ollama APIs directly. The script does this.
- DO NOT use `verify=False` or skip SSL verification for any HTTP request.
- DO NOT modify or delete files under `archive/`. Raw HTML is the immutable source of truth.
- DO NOT change chunking defaults (`--max-chars`, `--overlap-chars`) unless the user explicitly asks.
Hard constraints
- Raw HTML/PDF is the source of truth; never overwrite it.
- Deterministic: same input → same canonical text → same SHA-256 → same Qdrant point IDs. Safe to re-run.
- Traceability: every Qdrant point carries `doc_url` + `local_path`.
- Batched upserts (≤ 64 points/batch) to avoid the Qdrant 32MB payload limit.
- `parser_version` in every point's metadata. Current: `v3.5-sentence-boundary`.
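The batching constraint above can be pictured with a small helper. This is only a sketch under assumptions: `batched` and `MAX_BATCH` are hypothetical names, and the real script may batch differently as long as it stays at or below 64 points per upsert.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

MAX_BATCH = 64  # documented ceiling: at most 64 points per upsert


def batched(points: Sequence[T], size: int = MAX_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` points."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(points), size):
        yield points[start:start + size]


# 150 points are sent as three upserts of 64, 64, and 22 points.
batch_sizes = [len(batch) for batch in batched(list(range(150)))]
print(batch_sizes)
```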
Troubleshooting
PREFLIGHT_FAILED: Qdrant not reachable
Qdrant is down or unreachable at the default/configured URL.
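```bash
# Check whether Qdrant is running
curl -s http://localhost:6333/collections | head -1
# If it is not running, start it (or ask the user)
```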
PREFLIGHT_FAILED: Ollama not reachable
```bash
# Check Ollama
curl -s http://localhost:11434/api/tags | head -5
```
PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest
```bash
ollama pull bge-m3:latest
```
Then re-run Step 3.
PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found
The run folder exists but has no archived detail pages. Check:
- Is this the correct run folder?
Output shows skipped > 0 or errored > 0
Check ingest_manifest.jsonl for per-file details:
```bash
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```
| Manifest status | Meaning | Action |
|---|---|---|
| `ok` | Doc + all chunks ingested | None |
| `partial` | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely |
| `skipped` | Doc-level embedding failed; nothing upserted for this doc | Check Ollama; re-run safely |
| `error` | HTML read/parse failed | Check if the HTML file is corrupted |
Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.
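When the manifest is large, a status tally can be easier to read than grep output. The sketch below assumes only what the text states, namely that each line is a JSON object with a `status` field; the helper name `count_statuses` and the `file` field in the sample data are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_statuses(lines: Iterable[str]) -> Counter:
    """Tally the `status` field across manifest lines, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get("status", "unknown")] += 1
    return counts


# Illustrative manifest lines; the "file" key is an assumption.
sample = [
    '{"file": "a.html", "status": "ok"}',
    '{"file": "b.html", "status": "partial"}',
    "",
    '{"file": "c.pdf", "status": "ok"}',
]
print(count_statuses(sample))
```

For a real run, the lines would come from opening `ingest_manifest.jsonl` in the run folder and passing the file object to the function.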
Override service endpoints
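```bash
# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
  python3 scripts/ingest.py --run-folder ...
# Via CLI flags (take precedence over environment variables)
python3 scripts/ingest.py --run-folder ... \
  --ollama http://localhost:11434 --qdrant http://localhost:6333
```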
Default endpoints:
| Service | Default | Env override |
|---|---|---|
| Ollama | `http://localhost:11434` | `$OLLAMA_URL` |
| Qdrant | `http://localhost:6333` | `$QDRANT_URL` |
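The resolution order for endpoints (CLI flag first, then environment variable, then the built-in default) could be implemented roughly as follows. This is a sketch, not the script's actual argument handling; `resolve_endpoints` is a hypothetical helper, and the environment is passed in as a dict to keep the example self-contained.

```python
import argparse
from typing import Mapping, Sequence, Tuple

DEFAULT_OLLAMA = "http://localhost:11434"
DEFAULT_QDRANT = "http://localhost:6333"


def resolve_endpoints(argv: Sequence[str], env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (ollama_url, qdrant_url): CLI flag > environment variable > default."""
    parser = argparse.ArgumentParser()
    # The environment value, when present, becomes the default that a flag can override.
    parser.add_argument("--ollama", default=env.get("OLLAMA_URL", DEFAULT_OLLAMA))
    parser.add_argument("--qdrant", default=env.get("QDRANT_URL", DEFAULT_QDRANT))
    args = parser.parse_args(list(argv))
    return args.ollama, args.qdrant


# The flag wins over the environment variable for Ollama; Qdrant falls back to its default.
print(resolve_endpoints(["--ollama", "http://gpu-box:11434"],
                        {"OLLAMA_URL": "http://other:11434"}))
```

In a real entry point, `os.environ` and `sys.argv[1:]` would be the natural inputs.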
Test with a small batch first
```bash
python3 scripts/ingest.py --run-folder ... --limit 5
```
Input folder structure (expected)
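```
<RUN_FOLDER>/
  archive/
    fjuddetail001.html    ← HTML input
    fjuddetail002.html
    fjuddetail003.pdf     ← PDF input (also supported)
    fintdetail001.html    (if system=both)
  results_fjud.jsonl      (optional)
  results_fint.jsonl      (optional)
```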
The script discovers all archive/*.html and archive/*.pdf files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.
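As a rough picture of the discovery step, the sketch below lists `archive/*.html` and `archive/*.pdf` in filename order and applies an optional limit. The function name `discover_inputs` is hypothetical and the script's own implementation may differ in details such as case handling.

```python
from pathlib import Path


def discover_inputs(run_folder: str, limit: int = 0) -> list[Path]:
    """Return archive/*.html and archive/*.pdf under run_folder, sorted by filename."""
    archive = Path(run_folder) / "archive"
    files = sorted(
        (p for p in archive.iterdir() if p.is_file() and p.suffix in {".html", ".pdf"}),
        key=lambda p: p.name,
    )
    # limit=0 means "no limit", mirroring the documented --limit default.
    return files[:limit] if limit > 0 else files
```

Because the order is lexicographic by filename, a limited test run always picks the same first N files.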
v1 limitation: The system metadata field is currently hardcoded to FJUD. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as FJUD. This does not affect chunking or embeddings — only the system metadata field on the resulting Qdrant points.
CLI reference
```bash
python3 scripts/ingest.py --run-folder <RUN_FOLDER> [options]
```
| Flag | Default | Description |
|---|---|---|
| `--run-folder` | (required) | Path to an input folder |
| `--ollama` | `$OLLAMA_URL` or `http://localhost:11434` | Ollama endpoint |
| `--qdrant` | `$QDRANT_URL` or `http://localhost:6333` | Qdrant endpoint |
| `--embed-model` | `bge-m3:latest` | Ollama embedding model |
| `--vector-size` | `1024` | Vector dimension |
| `--max-chars` | `900` | Max chars per chunk (500–1000) |
| `--overlap-chars` | `150` | Overlap between chunks (10–20% of max-chars) |
| `--limit` | `0` (no limit) | Process only the first N files sorted by filename (lexicographic order); for testing |
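To make the interaction between `--max-chars` and `--overlap-chars` concrete, here is a plain sliding-window sketch using the documented defaults. It is deliberately simpler than the real pipeline: the parser version name suggests the script cuts at sentence boundaries, which this illustration does not attempt, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, max_chars: int = 900, overlap_chars: int = 150) -> list[str]:
    """Split text into windows of max_chars that share overlap_chars with the previous one."""
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap_chars < max_chars")
    step = max_chars - overlap_chars
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


sample_text = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample_text)
# Three windows starting at offsets 0, 750, and 1500.
print([len(p) for p in pieces])
```

With the defaults, each window advances by 750 characters, so the last 150 characters of one chunk reappear at the start of the next.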
Outputs
- Qdrant collections: `civil_case_doc` (1 point/doc), `civil_case_chunk` (many points/doc). Auto-created if they don't exist.
- `ingest_report.md`: human-readable summary (doc/chunk counts, error counts). Read this first after ingestion.
- `ingest_manifest.jsonl`: machine-readable, one JSON line per doc with status (`ok` / `partial` / `skipped` / `error`). Read this to diagnose specific file failures (grep for non-ok statuses).

Both files overlap on aggregate counts; the manifest adds per-file detail.
Roadmap
- v1 (current): doc + section-aware chunks
- v2: candidate issue extraction (爭點抽取)
- v3: issue-level index (`civil_case_issue` collection)
Internal details
For metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see references/internals.md.
Lessons learned / operational gotchas
- Qdrant rejects non-UUID/non-integer point IDs (`400 Bad Request`). The script uses deterministic UUIDs; do not change the ID generation logic.
- Qdrant rejects payloads > 32MB. The script batches at 64 points; do not increase the batch size.
- Re-running on the same folder is safe: deterministic IDs mean upsert = overwrite.
- Section headings in Taiwan judgments are not formatted consistently (e.g. 「理 由」 with a fullwidth space, or compatibility characters such as 「⽂」). The parser now normalizes headings first; if it still cannot split out sections, it falls back to chunking `full`, so a document is never left with only doc-level points.
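One plausible way to normalize such headings is Unicode NFKC folding followed by whitespace removal. This is a sketch of the idea only; the actual rules are described in references/internals.md, and `normalize_heading` is a hypothetical name.

```python
import re
import unicodedata


def normalize_heading(line: str) -> str:
    """Fold compatibility characters and drop all whitespace from a candidate heading."""
    # NFKC maps U+3000 (ideographic space) to a plain space and the
    # Kangxi radical U+2F42 to the ordinary character 文.
    folded = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", "", folded)


print(normalize_heading("理\u3000由") == "理由")
print(normalize_heading("主\u2f42") == "主文")
```

After this step, a heading matcher only needs to recognise the canonical spellings.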
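Taiwan Civil Judgment → Vector Database (Qdrant) Ingestion
Scope: Taiwan civil court judgments only (民事判決). This skill ingests Taiwan civil cases (HTML or PDF files) into Qdrant. All parsing, chunking, and embedding logic lives in scripts/ingest.py; your job is to run the script, not to reimplement the pipeline.
Quick Start (follow these steps in order)
Step 1: Activate the virtual environment
```bash
source {baseDir}/.venv/bin/activate
```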
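Step 2: Identify the run folder
The user will provide an absolute path to a run folder.
Example: `/path/to/output/judicialyuan/20260305_142030`
Verify that the folder exists and contains HTML or PDF files:
```bash
ls "<RUN_FOLDER>/archive/" | grep -E '\.(html|pdf)$' | head -5
```
If there are no archive/*.html or archive/*.pdf files, stop and tell the user the folder has no ingestible data.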
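Step 3: Run ingestion
Use absolute paths throughout; no cd is needed:
```bash
python3 {baseDir}/scripts/ingest.py \
  --run-folder "<RUN_FOLDER>"
```
The script handles everything: pre-flight checks, collection auto-creation (creates `civil_case_doc` / `civil_case_chunk` if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, and manifest + report writing.
Re-running the same command on the same folder is always safe: deterministic IDs mean upsert = overwrite. No special --resume flag is needed; just run the same command again.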
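Step 4: Check the result
Successful output looks like:
```
OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
manifest=<RUN_FOLDER>/ingest_manifest.jsonl
report=<RUN_FOLDER>/ingest_report.md
```
Read the report (human-readable stats summary):
```bash
cat "<RUN_FOLDER>/ingest_report.md"
```
If there are errors, check the manifest (machine-readable, one JSON line per file) for per-file diagnosis:
```bash
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```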
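Step 5: Report to the user
Tell the user:
- How many docs were ingested (`doc_points`)
- How many chunks were created (`chunk_points`)
- Whether any files were skipped or errored
- Where the report file is
Done. Do not proceed to additional steps unless the user asks.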
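DO NOT rules (critical)
- DO NOT write your own HTML parsing, chunking, or embedding code. `ingest.py` handles all of this.
- DO NOT modify parsing/chunking logic casually. Only change heading detection or chunk fallback when the user explicitly asks to improve PDF/OCR robustness, and validate on a small sample before re-running a large batch.
- DO NOT call Qdrant or Ollama APIs directly. The script does this.
- DO NOT use `verify=False` or skip SSL verification for any HTTP request.
- DO NOT modify or delete files under `archive/`. Raw HTML is the immutable source of truth.
- DO NOT change chunking defaults (`--max-chars`, `--overlap-chars`) unless the user explicitly asks.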
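Hard constraints
- Raw HTML/PDF is the source of truth; never overwrite it.
- Deterministic: same input → same canonical text → same SHA-256 → same Qdrant point IDs. Safe to re-run.
- Traceability: every Qdrant point carries `doc_url` + `local_path`.
- Batched upserts (≤ 64 points/batch) to avoid the Qdrant 32MB payload limit.
- `parser_version` is present in every point's metadata. Current: `v3.5-sentence-boundary`.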
Troubleshooting
PREFLIGHT_FAILED: Qdrant not reachable
Qdrant is down or unreachable at the default/configured URL.
```bash
# Check whether Qdrant is running
curl -s http://localhost:6333/collections | head -1
# If it is not running, start it (or ask the user)
```
PREFLIGHT_FAILED: Ollama not reachable
```bash
# Check Ollama
curl -s http://localhost:11434/api/tags | head -5
```
PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest
```bash
ollama pull bge-m3:latest
```
Then re-run Step 3.
PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found
The run folder exists but has no archived detail pages. Check:
- Is this the correct run folder?
Output shows skipped > 0 or errored > 0
Check ingest_manifest.jsonl for per-file details:
```bash
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```
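| Manifest status | Meaning | Action |
|---|---|---|
| `ok` | Doc + all chunks ingested | None |
| `partial` | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely |
| `skipped` | Doc-level embedding failed; nothing upserted for this doc | Check Ollama; can re-run safely |
| `error` | HTML read/parse failed | Check if the HTML file is corrupted |

Re-running is always safe: use the exact same command. No special flags are needed; deterministic IDs → upsert/overwrite.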
Override service endpoints
```bash
# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
  python3 scripts/ingest.py --run-folder ...
# Via CLI flags (take precedence over environment variables)
python3 scripts/ingest.py --run-folder ... \
  --ollama http://localhost:11434 --qdrant http://localhost:6333
```
Default endpoints:
| Service | Default | Env override |
|---|---|---|
| Ollama | `http://localhost:11434` | `$OLLAMA_URL` |
| Qdrant | `http://localhost:6333` | `$QDRANT_URL` |
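Test with a small batch first
```bash
python3 scripts/ingest.py --run-folder ... --limit 5
```
Input folder structure (expected)
```
<RUN_FOLDER>/
  archive/
    fjuddetail001.html    ← HTML input
    fjuddetail002.html
    fjuddetail003.pdf     ← PDF input (also supported)
    fintdetail001.html    (if system=both)
  results_fjud.jsonl      (optional)
  results_fint.jsonl      (optional)
```
The script discovers all archive/*.html and archive/*.pdf files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.
v1 limitation: The system metadata field is currently hardcoded to FJUD. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as FJUD. This does not affect chunking or embeddings; it only affects the system metadata field on the resulting Qdrant points.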
CLI reference
```bash
python3 scripts/ingest.py --run-folder <RUN_FOLDER> [options]
```
| Flag | Default | Description |
|---|---|---|
| `--run-folder` | (required) | Path to an input folder |
| `--ollama` | `$OLLAMA_URL` or `http://localhost:11434` | Ollama endpoint |
| `--qdrant` | `$QDRANT_URL` or `http://localhost:6333` | Qdrant endpoint |
| `--embed-model` | `bge-m3:latest` | Ollama embedding model |
| `--vector-size` | `1024` | Vector