Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion

Scope: Taiwan civil court judgments only (民事判決). This skill ingests Taiwan civil cases (HTML or PDF files) into Qdrant. All parsing, chunking, and embedding logic lives in scripts/ingest.py — your job is to run the script, not to reimplement the pipeline.

Quick Start (follow these steps in order)

Step 1 — Activate venv

```bash
source {baseDir}/.venv/bin/activate
```

Step 2 — Identify the run folder

The user will provide an absolute path to a run folder.

Example: `/path/to/output/judicialyuan/20260305_142030`

Verify it exists and has HTML or PDF files:
```bash
ls "<RUN_FOLDER>/archive/" | grep -E '\.(html|pdf)$' | head -5
```

If no archive/*.html or archive/*.pdf files → stop and tell the user the folder has no ingestible data.
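The same check can be sketched in Python. This is only an illustration of the rule above, not code taken from `scripts/ingest.py`; the function name `list_ingestible` is hypothetical.

```python
from pathlib import Path


def list_ingestible(run_folder: str) -> list:
    """Return archive/*.html and archive/*.pdf under run_folder, sorted by filename."""
    archive = Path(run_folder) / "archive"
    if not archive.is_dir():
        return []
    # Keep only regular files with the two ingestible extensions.
    files = [
        p for p in archive.iterdir()
        if p.is_file() and p.suffix in {".html", ".pdf"}
    ]
    return sorted(files, key=lambda p: p.name)
```

An empty list corresponds to the "no ingestible data" case: stop and tell the user.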

Step 3 — Run ingestion

Use absolute paths throughout — no cd needed:

```bash
python3 {baseDir}/scripts/ingest.py \
  --run-folder "<RUN_FOLDER>"
```

The script handles everything: pre-flight checks, collection auto-creation (creates civil_case_doc / civil_case_chunk if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, manifest + report writing.

Re-running the same command on the same folder is always safe — deterministic IDs mean upsert = overwrite. No special --resume flag needed; just run the same command again.
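A small sketch can illustrate why re-running is safe. It assumes, hypothetically, that a point ID is a UUIDv5 derived from the SHA-256 of the canonical text. The name `point_id` and the exact derivation are illustrative and are not taken from `scripts/ingest.py`.

```python
import hashlib
import uuid
from typing import Optional


def point_id(canonical_text: str, chunk_index: Optional[int] = None) -> str:
    """Derive a stable UUID string from canonical text and an optional chunk index."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    # Assumption: chunk points are distinguished by appending their index.
    name = digest if chunk_index is None else f"{digest}:{chunk_index}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))
```

Because the ID depends only on the content, a second run produces the same IDs, so the upsert overwrites existing points instead of creating duplicates.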

Step 4 — Check the result

Successful output looks like:
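```
OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
manifest=<RUN_FOLDER>/ingest_manifest.jsonl
report=<RUN_FOLDER>/ingest_report.md
```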

Read the report (human-readable stats summary):
```bash
cat "<RUN_FOLDER>/ingest_report.md"
```

If there are errors, check the manifest (machine-readable, one JSON line per file) for per-file diagnosis:
```bash
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```
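For a structured view of the manifest, a minimal sketch follows. It assumes only that each line is a JSON object with a `status` field, as the grep pattern implies. The function name `summarize_manifest` is hypothetical, and no other manifest fields are assumed.

```python
import json
from collections import Counter


def summarize_manifest(lines):
    """Count statuses and collect every non-ok record from manifest JSONL lines."""
    counts = Counter()
    problems = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        status = record.get("status", "unknown")
        counts[status] += 1
        if status != "ok":
            problems.append(record)
    return counts, problems
```

Pass it an open file handle for `ingest_manifest.jsonl`. The second return value holds the records that need per-file diagnosis.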

Step 5 — Report to user

Tell the user:

- How many docs were ingested (`doc_points`)
- How many chunks were created (`chunk_points`)
- Whether any were skipped or errored
- Where the report file is
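The numbers for this report can be read from the script's summary line. The sketch below assumes the line has the `OK files=… doc_points=… chunk_points=…` shape shown in Step 4. The helper name `parse_summary` is hypothetical.

```python
import re


def parse_summary(line: str) -> dict:
    """Parse 'OK files=42 processed=42 ...' into {'files': 42, 'processed': 42, ...}."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}
```

Apply it to the first output line only. The `manifest=` and `report=` lines carry paths rather than counts.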

Done. Do not proceed to additional steps unless the user asks.

DO NOT rules (critical)

- DO NOT write your own HTML parsing, chunking, or embedding code. `ingest.py` handles all of this.
- DO NOT modify parsing/chunking logic casually. Only change heading detection or chunk fallback when the user explicitly asks to improve PDF/OCR robustness, and validate on a small sample before re-running a large batch.
- DO NOT call Qdrant or Ollama APIs directly. The script does this.
- DO NOT use `verify=False` or otherwise skip SSL verification for any HTTP request.
- DO NOT modify or delete files under `archive/`. The raw HTML/PDF files are the immutable source of truth.
- DO NOT change chunking defaults (`--max-chars`, `--overlap-chars`) unless the user explicitly asks.

Hard constraints

- Raw HTML/PDF is the source of truth; never overwrite it.
- Deterministic: same input → same canonical text → same SHA-256 → same Qdrant point IDs. Safe to re-run.
- Traceability: every Qdrant point carries `doc_url` + `local_path`.
- Batched upserts (≤ 64 points/batch) to stay under Qdrant's 32MB payload limit.
- `parser_version` is recorded in every point's metadata. Current: `v3.5-sentence-boundary`.
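The batching constraint can be pictured with a short sketch. It shows only the slicing technique under the 64-point ceiling stated above. The names `batches` and `MAX_BATCH` are illustrative, and the script's own upsert code is not reproduced here.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 64  # ceiling taken from the constraint above


def batches(points: Sequence[T], size: int = MAX_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` points."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(points), size):
        yield points[start:start + size]
```

With 150 points this yields batches of 64, 64, and 22, so no single request carries more than 64 points.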

Troubleshooting

`PREFLIGHT_FAILED: Qdrant not reachable`

Qdrant is down or unreachable at the default/configured URL.

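```bash
# Check whether Qdrant is running
curl -s http://localhost:6333/collections | head -1

# If it is not running, start it (or ask the user)
```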

`PREFLIGHT_FAILED: Ollama not reachable`

```bash
# Check Ollama
curl -s http://localhost:11434/api/tags | head -5
```

`PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest`

```bash
ollama pull bge-m3:latest
```

Then re-run Step 3.
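To confirm that the model is now listed, the body returned by Ollama's `/api/tags` endpoint can be inspected. The sketch below assumes the response is a JSON object with a `models` array whose entries carry a `name` field. The helper name `model_available` is hypothetical, and the script's own pre-flight check may differ.

```python
import json


def model_available(tags_response: str, model: str = "bge-m3:latest") -> bool:
    """Return True if `model` appears in the JSON body returned by /api/tags."""
    payload = json.loads(tags_response)
    models = payload.get("models") or []
    return any(entry.get("name") == model for entry in models)
```

Feed it the output of the `curl` check from the Ollama section. A `False` result means the pull has not completed or a different tag was pulled.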

`PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found`

The run folder exists but has no archived detail pages. Check:

- Is this the correct run folder?

Output shows `skipped > 0` or `errored > 0`

Check ingest_manifest.jsonl for per-file details:

```bash
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
```

| Manifest status | Meaning | Action |
|---|---|---|
| `ok` | Doc + all chunks ingested | None |
| `partial` | | |

Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.

Override service endpoints

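```bash
# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
  python3 scripts/ingest.py --run-folder ...

# Via CLI flags (these take precedence over environment variables)
python3 scripts/ingest.py --run-folder ... \
  --ollama http://localhost:11434 --qdrant http://localhost:6333
```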

Default endpoints:

| Service | Default | Env override |
|---|---|---|
| Ollama | `http://localhost:11434` | `$OLLAMA_URL` |
| Qdrant | `http://localhost:6333` | `$QDRANT_URL` |
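The resolution order (CLI flag, then environment variable, then default) can be sketched as follows. This is an illustration of the precedence rule only. The function name `resolve_endpoint` is hypothetical and the script's real argument handling is not shown here.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint(
    cli_value: Optional[str],
    env_name: str,
    default: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the endpoint: CLI flag first, then environment variable, then default."""
    if cli_value:
        return cli_value
    return env.get(env_name) or default
```

For example, `resolve_endpoint(None, "QDRANT_URL", "http://localhost:6333")` returns the environment value when `QDRANT_URL` is set and the default otherwise.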

Test with a small batch first

```bash
python3 scripts/ingest.py --run-folder ... --limit 5
```

Input folder structure (expected)

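```
<RUN_FOLDER>/
  archive/
    fjuddetail001.html   ← HTML input
    fjuddetail002.html
    fjuddetail003.pdf    ← PDF input (also supported)
    fintdetail001.html   (if system=both)
  results_fjud.jsonl     (optional)
  results_fint.jsonl     (optional)
```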

The script discovers all archive/*.html and archive/*.pdf files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.

v1 limitation: The system metadata field is currently hardcoded to FJUD. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as FJUD. This does not affect chunking or embeddings — only the system metadata field on the resulting Qdrant points.
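Before ingesting, a mixed folder can be spotted by looking at file names. The sketch below assumes, hypothetically, that archived file names start with `fjud` or `fint`. The helper name `count_by_system` is not part of the script.

```python
from collections import Counter
from pathlib import Path


def count_by_system(filenames) -> Counter:
    """Count files by their assumed source system, judged by filename prefix."""
    counts = Counter()
    for name in filenames:
        base = Path(name).name.lower()
        if base.startswith("fjud"):
            counts["FJUD"] += 1
        elif base.startswith("fint"):
            counts["FINT"] += 1
        else:
            counts["unknown"] += 1
    return counts
```

A non-zero `FINT` count means the resulting points for those files will carry `system=FJUD` under the v1 limitation, which is worth mentioning to the user.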

CLI reference

```
python3 scripts/ingest.py --run-folder <RUN_FOLDER> [options]
```

| Flag | Default | Description |
|---|---|---|
| `--run-folder` | (required) | Path to an input folder |
| `--ollama` | | |

Outputs

- Qdrant collections: `civil_case_doc` (1 point/doc), `civil_case_chunk` (many points/doc). Auto-created if they don't exist.
- `ingest_report.md`: human-readable summary (doc/chunk counts, error counts). Read this first after ingestion.
- `ingest_manifest.jsonl`: machine-readable, one JSON line per doc with status (`ok` / `partial` / `skipped` / `error`). Read this to diagnose specific file failures (grep for non-ok statuses). Both files overlap on aggregate counts; the manifest adds per-file detail.

Roadmap

- v1 (current): doc + section-aware chunks
- v2: candidate issue extraction (爭點抽取)
- v3: issue-level index (`civil_case_issue` collection)

Internal details

For metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see references/internals.md.

Lessons learned / operational gotchas

- Qdrant rejects non-UUID/non-integer point IDs (400 Bad Request). The script uses deterministic UUIDs; do not change the ID generation logic.
- Qdrant rejects payloads > 32MB. The script batches at 64 points; do not increase the batch size.
- Re-running on the same folder is safe: deterministic IDs mean upsert = overwrite.
- Section headings in Taiwan judgments are formatted inconsistently (e.g. 「理　由」 with a fullwidth space, or compatibility characters such as 「⽂」). The parser now normalizes headings first; if it still cannot split out sections, it falls back to chunking the full text, so a document is never left with only doc-level points.
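To show what heading normalization can look like, here is a minimal sketch using Unicode NFKC folding followed by whitespace removal. It is an assumption about the general technique, not the parser's actual rules, which are documented in references/internals.md. The function name `normalize_heading` is hypothetical.

```python
import re
import unicodedata


def normalize_heading(line: str) -> str:
    """Fold compatibility characters with NFKC, then drop all whitespace."""
    folded = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", "", folded)
```

With this sketch, `normalize_heading("理\u3000由")` returns `"理由"`, and the Kangxi-radical form `"\u2f42"` folds to the ordinary character `"文"`.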
