Master Data Intelligent Matching System
Overview
A production-ready skill for intelligent entity resolution across business domains. It combines exact-match and vector-semantic retrieval, OCR field mapping with confidence coloring, and human-in-the-loop verification with active learning.
Usage
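```javascript
import mdm from './index.js';

// 1. Get the supported domains
mdm.getSupportedDomains(); // ['procurement', 'finance', 'sales', 'hr']

// 2. Build the OCR-to-schema mapping with confidence colors
const mapping = mdm.buildOcrSchemaMapping(ocrFields, 'procurement');

// 3. Run the full matching pipeline
const result = mdm.runMatchingPipeline(ocrEntity, 'procurement', dbRecords);

// 4. Format the result as a summary
console.log(mdm.formatMatchingSummary(result));
```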
Key Features
Business Domain Isolation
Four isolated schemas:
- procurement: vendor records (vendor_name, vendor_code, tax_id, contact, etc.)
- finance: company records (company_name, registration_number, fiscal_year_end, etc.)
- sales: customer records (customer_name, customer_code, industry, credit_limit, etc.)
- hr: employee records (employee_name, employee_id, id_number, department, etc.)
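To illustrate what isolation between domains can look like, here is a minimal sketch in which each domain has its own field list and a lookup that rejects unknown domains. `DOMAIN_SCHEMAS` and `lookupSchema` are hypothetical names, not part of the skill's API. The field lists contain only names mentioned in this README, and their snake_case spelling is an assumption based on the example records.

```javascript
// Hypothetical registry: one frozen field list per business domain.
// Only fields named in this README are listed; the real schemas are longer.
const DOMAIN_SCHEMAS = Object.freeze({
  procurement: ['vendor_name', 'vendor_code', 'tax_id', 'contact_person'],
  finance: ['company_name', 'registration_number', 'fiscal_year_end'],
  sales: ['customer_name', 'customer_code', 'industry', 'credit_limit'],
  hr: ['employee_name', 'employee_id', 'id_number', 'department'],
});

// Return the field list for one domain, refusing anything outside the registry.
function lookupSchema(domain) {
  if (!Object.prototype.hasOwnProperty.call(DOMAIN_SCHEMAS, domain)) {
    throw new Error(`Unsupported domain: ${domain}`);
  }
  return DOMAIN_SCHEMAS[domain];
}

console.log(lookupSchema('hr'));
```

Keeping the lookup strict means a record from one domain is never matched against another domain's fields by accident.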
OCR Field to Schema Visual Line Mapping
buildOcrSchemaMapping(ocrFields, domain) maps raw OCR field names to schema fields with confidence colors:
| Color | Score | Meaning |
|---|---|---|
| 🟢 green | ≥ 0.92 | High confidence mapping |
| 🟡 yellow | 0.70–0.92 | Medium confidence mapping |
| 🔴 red | < 0.70 | Low confidence / unmapped |
| 🔵 blue | db-only | Database field, no OCR data |
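The color bands can be read as a simple threshold check. The sketch below is illustrative only. `confidenceColor` is a hypothetical helper, and two details are assumptions: a score of exactly 0.92 counts as green, and a database field without OCR data is represented by a `null` score.

```javascript
const GREEN_MIN = 0.92;  // from the table: score >= 0.92 is green
const YELLOW_MIN = 0.70; // from the table: 0.70 up to (but excluding) 0.92 is yellow

// Hypothetical helper mapping a confidence score to a color name.
function confidenceColor(score) {
  // Assumption: a schema field with no OCR counterpart carries no score.
  if (score === null || score === undefined) return 'blue';
  if (score >= GREEN_MIN) return 'green';
  if (score >= YELLOW_MIN) return 'yellow';
  return 'red';
}

console.log(confidenceColor(0.95)); // green
```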
Dual-Path Entity Retrieval
dualPathEntityRetrieval(entity, domain, dbRecords) runs two parallel paths:
1. Exact Match (threshold 0.92): ALL critical fields must match exactly
2. Vector Semantic (threshold 0.70): weighted similarity across all fields

Results include needsHumanReview: true when confidence is below 0.92 or no match is found.
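As a rough illustration of the two paths, the sketch below checks critical fields for exact equality and computes a weighted similarity over the remaining fields. It is not the skill's implementation. A character-bigram Dice coefficient stands in for real vector embeddings, and `sketchDualPath`, its parameters, the confidence of 1 for an exact hit, and the result shape are all assumptions.

```javascript
// Thresholds quoted in this README.
const EXACT_THRESHOLD = 0.92;
const SEMANTIC_THRESHOLD = 0.70;

// Stand-in for an embedding model: the set of lowercase character bigrams.
function bigrams(value) {
  const text = String(value).toLowerCase();
  const grams = new Set();
  for (let i = 0; i < text.length - 1; i++) grams.add(text.slice(i, i + 2));
  return grams;
}

// Dice coefficient over bigram sets, in the range [0, 1].
function similarity(a, b) {
  const ga = bigrams(a);
  const gb = bigrams(b);
  if (ga.size === 0 || gb.size === 0) {
    return String(a).toLowerCase() === String(b).toLowerCase() ? 1 : 0;
  }
  let shared = 0;
  for (const gram of ga) if (gb.has(gram)) shared += 1;
  return (2 * shared) / (ga.size + gb.size);
}

// Hypothetical sketch: pick the best candidate and flag it for review if needed.
function sketchDualPath(entity, dbRecords, criticalFields, weights) {
  let best = null;
  for (const record of dbRecords) {
    // Path 1: every critical field must be present and identical.
    const exact = criticalFields.every(
      (f) => entity[f] !== undefined && entity[f] === record[f]
    );
    // Path 2: weighted average similarity over fields present on both sides.
    let weighted = 0;
    let totalWeight = 0;
    for (const [field, weight] of Object.entries(weights)) {
      if (entity[field] === undefined || record[field] === undefined) continue;
      weighted += weight * similarity(entity[field], record[field]);
      totalWeight += weight;
    }
    const semantic = totalWeight > 0 ? weighted / totalWeight : 0;
    const confidence = exact ? 1 : semantic; // assumption: exact hit scores 1
    if (best === null || confidence > best.confidence) {
      best = { record, confidence, path: exact ? 'exact' : 'semantic' };
    }
  }
  const matched = best !== null && best.confidence >= SEMANTIC_THRESHOLD;
  return {
    match: matched ? best : null,
    needsHumanReview: !matched || best.confidence < EXACT_THRESHOLD,
  };
}

const demo = sketchDualPath(
  { vendor_name: 'Acme Corporation Ltd', tax_id: '91110000123456789X' },
  [{ id: 'rec_001', vendor_name: 'Acme Corporation Ltd', tax_id: '91110000123456789X' }],
  ['vendor_name', 'tax_id'],
  { vendor_name: 2, tax_id: 3 }
);
console.log(demo.match.path, demo.needsHumanReview); // exact false
```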
Field Value Verification
verifyFieldValues(ocrEntity, dbRecord, domain) returns 4-state verification per field:
| State | Meaning |
|---|---|
| INLINECODE4 | OCR and DB values agree |
| INLINECODE5 | Values differ (requires human resolution) |
| new_info | Field only in OCR (new information) |
| db_only | Field only in DB (not in OCR document) |
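The four states can be sketched as a per-field comparison. In the sketch below, `new_info` and `db_only` come from the table above, while `match` and `mismatch` are placeholder names assumed for the first two states. `sketchVerifyFields` and its strict-equality comparison are likewise assumptions, not the behavior of verifyFieldValues.

```javascript
// Hypothetical sketch: classify each schema field into one of four states.
function sketchVerifyFields(ocrEntity, dbRecord, schemaFields) {
  const states = {};
  for (const field of schemaFields) {
    const hasOcr = ocrEntity[field] !== undefined && ocrEntity[field] !== '';
    const hasDb = dbRecord[field] !== undefined && dbRecord[field] !== '';
    if (hasOcr && hasDb) {
      // 'match' and 'mismatch' are assumed state names.
      states[field] = ocrEntity[field] === dbRecord[field] ? 'match' : 'mismatch';
    } else if (hasOcr) {
      states[field] = 'new_info';
    } else if (hasDb) {
      states[field] = 'db_only';
    }
    // Assumption: fields absent on both sides are left out of the result.
  }
  return states;
}

const fieldStates = sketchVerifyFields(
  { vendor_name: 'Acme Corporation Ltd', email: 'john.smith@acme.com', fax: '010-000' },
  { vendor_name: 'Acme Corporation Ltd', email: 'j.smith@acme.com', phone: '+86-10-12345678' },
  ['vendor_name', 'email', 'fax', 'phone', 'address']
);
console.log(fieldStates);
```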
Human-in-the-Loop
Every pipeline result generates a hitlRequest with:
- Mismatched fields highlighted
- New info fields listed
- Available review actions: confirm_match, reject_match, create_new, update_fields
Use processHumanDecision(decision, state) to process human feedback and generate learning payloads.
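The review loop can be pictured as two small steps: collect the fields a reviewer should look at, then validate the chosen action and emit a payload for learning. Everything in this sketch is hypothetical, including the function names, the `mismatch` state name, the `processed` status value, and the payload shape. The action names are written in snake_case, following `confirm_match` in the example.

```javascript
const REVIEW_ACTIONS = ['confirm_match', 'reject_match', 'create_new', 'update_fields'];

// Hypothetical sketch: build a review request from per-field states.
function sketchReviewRequest(fieldStates) {
  const fieldsIn = (state) =>
    Object.keys(fieldStates).filter((f) => fieldStates[f] === state);
  return {
    mismatchedFields: fieldsIn('mismatch'), // assumed state name
    newInfoFields: fieldsIn('new_info'),
    availableActions: REVIEW_ACTIONS,
  };
}

// Hypothetical sketch: validate a reviewer's decision and derive a learning payload.
function sketchProcessDecision(decision, request) {
  if (!REVIEW_ACTIONS.includes(decision.action)) {
    throw new Error(`Unknown review action: ${decision.action}`);
  }
  return {
    status: 'processed', // assumed status value
    learningPayload: {
      action: decision.action,
      // Fields the reviewer had to inspect, kept for later error-rate tracking.
      reviewedFields: request.mismatchedFields,
    },
  };
}

const request = sketchReviewRequest({ vendor_name: 'match', email: 'mismatch', fax: 'new_info' });
const outcome = sketchProcessDecision({ action: 'confirm_match' }, request);
console.log(outcome.learningPayload);
```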
Active Learning
updateActiveLearning(payloads, stats) tracks:
- Per-domain confirmation/rejection/new-record rates
- Per-field error rates
- Auto-adjusts thresholds when field error rate > 30%
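The per-field part of this bookkeeping can be sketched as counters plus a threshold nudge. The README does not say how thresholds are adjusted, so this sketch assumes the match threshold is raised by a small step whenever any field's error rate exceeds 30%. `sketchUpdateLearning`, the `fieldOutcomes` payload shape, the step size, and the cap are hypothetical, and per-domain rates are left out for brevity.

```javascript
const FIELD_ERROR_LIMIT = 0.30; // from this README: adjust when error rate > 30%
const THRESHOLD_STEP = 0.01;    // assumption: size of each adjustment
const THRESHOLD_CAP = 0.99;     // assumption: never require a perfect score

// Hypothetical sketch: fold review outcomes into running per-field stats.
function sketchUpdateLearning(payloads, stats) {
  const fieldStats = { ...(stats.fieldStats || {}) };
  for (const payload of payloads) {
    // fieldOutcomes maps a field name to true when the reviewer found it wrong.
    for (const [field, wasWrong] of Object.entries(payload.fieldOutcomes || {})) {
      const prev = fieldStats[field] || { reviews: 0, errors: 0 };
      fieldStats[field] = {
        reviews: prev.reviews + 1,
        errors: prev.errors + (wasWrong ? 1 : 0),
      };
    }
  }
  const noisyFields = Object.keys(fieldStats).filter(
    (f) => fieldStats[f].errors / fieldStats[f].reviews > FIELD_ERROR_LIMIT
  );
  const current = stats.threshold === undefined ? 0.92 : stats.threshold;
  // Assumption: a noisy field makes automatic acceptance stricter.
  const threshold = noisyFields.length > 0
    ? Math.min(THRESHOLD_CAP, current + THRESHOLD_STEP)
    : current;
  return { fieldStats, noisyFields, threshold };
}

const updated = sketchUpdateLearning(
  [
    { fieldOutcomes: { email: true, tax_id: false } },
    { fieldOutcomes: { email: false, tax_id: false } },
  ],
  {}
);
console.log(updated.noisyFields, updated.threshold);
```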
Example
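```javascript
import mdm from './index.js';

// Sample OCR entity from a vendor invoice
const ocrVendor = {
  vendor_name: 'Acme Corporation Ltd',
  vendor_code: 'V-5001',
  tax_id: '91110000123456789X',
  contact_person: 'John Smith',
  email: 'john.smith@acme.com',
};

// Existing database records
const dbRecords = [
  {
    id: 'rec_001',
    vendor_name: 'Acme Corporation Ltd',
    vendor_code: 'V-5001',
    tax_id: '91110000123456789X',
    contact_person: 'John Smith',
    email: 'j.smith@acme.com', // slight email mismatch
    phone: '+86-10-12345678',
    address: '北京市朝阳区',
    bank_account: '6222021234567890',
  },
];

// Run the pipeline
const result = mdm.runMatchingPipeline(ocrVendor, 'procurement', dbRecords);
console.log(mdm.formatMatchingSummary(result));

// Process the human decision
const decision = { action: 'confirm_match', notes: 'Email mismatch is acceptable' };
const { status, learningPayload } = mdm.processHumanDecision(decision, {
  domain: 'procurement',
  ocrEntity: ocrVendor,
  matchResult: result.matchResult,
});

// Update active learning
const newStats = mdm.updateActiveLearning([learningPayload], {});
```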
API Reference
| Function | Description |
|---|---|
| getSupportedDomains() | List all supported business domains |
| getDomainSchema(domain) | Get field schema for a domain |
| buildOcrSchemaMapping(ocr, dom) | Map OCR fields to schema with confidence |
| dualPathEntityRetrieval(...) | Run exact + semantic matching |
| verifyFieldValues(...) | 4-state field verification |
| runMatchingPipeline(...) | Full orchestration pipeline |
| generateHitlReviewRequest(...) | Build human review request payload |
| processHumanDecision(...) | Handle human feedback |
| updateActiveLearning(...) | Update learning stats from decisions |
| formatMatchingSummary(...) | Human-readable result summary |
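Master Data Intelligent Matching System
Overview
A production-grade skill for intelligent entity resolution across business domains. It combines exact-match and vector-semantic retrieval, OCR field mapping with confidence color coding, and human-in-the-loop verification with active learning.
Usage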
```javascript
import mdm from './index.js';

// 1. Get the supported domains
mdm.getSupportedDomains(); // ['procurement', 'finance', 'sales', 'hr']

// 2. Build the OCR-to-schema mapping with confidence colors
const mapping = mdm.buildOcrSchemaMapping(ocrFields, 'procurement');

// 3. Run the full matching pipeline
const result = mdm.runMatchingPipeline(ocrEntity, 'procurement', dbRecords);

// 4. Format the result as a summary
console.log(mdm.formatMatchingSummary(result));
```
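Key Features
Business Domain Isolation
Four isolated schemas:
- Procurement: vendor records (vendor name, vendor code, tax ID, contact, etc.)
- Finance: company records (company name, registration number, fiscal year end, etc.)
- Sales: customer records (customer name, customer code, industry, credit limit, etc.)
- HR: employee records (employee name, employee ID, ID number, department, etc.)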
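OCR Field to Schema Visual Line Mapping
buildOcrSchemaMapping(ocrFields, domain) maps raw OCR field names to schema fields with confidence colors:
| Color | Score | Meaning |
|---|---|---|
| 🟢 green | ≥ 0.92 | High confidence mapping |
| 🟡 yellow | 0.70–0.92 | Medium confidence mapping |
| 🔴 red | < 0.70 | Low confidence / unmapped |
| 🔵 blue | db-only | Database field, no OCR data |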
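Dual-Path Entity Retrieval
dualPathEntityRetrieval(entity, domain, dbRecords) runs two parallel paths:
1. Exact match (threshold 0.92): all critical fields must match exactly
2. Vector semantic match (threshold 0.70): weighted similarity across all fields

Results include needsHumanReview: true when confidence is below 0.92 or no match is found.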
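Field Value Verification
verifyFieldValues(ocrEntity, dbRecord, domain) returns a 4-state verification per field:
| State | Meaning |
|---|---|
| Match | OCR and DB values agree |
| Mismatch | Values differ (requires human resolution) |
| New info | Field present only in OCR (new information) |
| DB only | Field present only in DB (not in the OCR document) |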
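Human-in-the-Loop
Every pipeline result generates a hitlRequest containing:
- Mismatched fields, highlighted
- New info fields, listed
- Available review actions: confirm match, reject match, create new record, update fields

Use processHumanDecision(decision, state) to process human feedback and generate learning payloads.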
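Active Learning
updateActiveLearning(payloads, stats) tracks:
- Per-domain confirmation/rejection/new-record rates
- Per-field error rates
- Auto-adjusts thresholds when a field's error rate exceeds 30%

Example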
```javascript
import mdm from './index.js';

// Sample OCR entity from a vendor invoice
const ocrVendor = {
  vendor_name: 'Acme Corporation Ltd',
  vendor_code: 'V-5001',
  tax_id: '91110000123456789X',
  contact_person: 'John Smith',
  email: 'john.smith@acme.com',
};

// Existing database records
const dbRecords = [
  {
    id: 'rec_001',
    vendor_name: 'Acme Corporation Ltd',
    vendor_code: 'V-5001',
    tax_id: '91110000123456789X',
    contact_person: 'John Smith',
    email: 'j.smith@acme.com', // slight email mismatch
    phone: '+86-10-12345678',
    address: '北京市朝阳区',
    bank_account: '6222021234567890',
  },
];

// Run the pipeline
const result = mdm.runMatchingPipeline(ocrVendor, 'procurement', dbRecords);
console.log(mdm.formatMatchingSummary(result));

// Process the human decision
const decision = { action: 'confirm_match', notes: 'Email mismatch is acceptable' };
const { status, learningPayload } = mdm.processHumanDecision(decision, {
  domain: 'procurement',
  ocrEntity: ocrVendor,
  matchResult: result.matchResult,
});

// Update active learning
const newStats = mdm.updateActiveLearning([learningPayload], {});
```
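API Reference
| Function | Description |
|---|---|
| getSupportedDomains() | List all supported business domains |
| getDomainSchema(domain) | Get the field schema for a domain |
| buildOcrSchemaMapping(ocr, dom) | Map OCR fields to schema with confidence |
| dualPathEntityRetrieval(...) | Run exact + semantic matching |
| verifyFieldValues(...) | 4-state field verification |
| runMatchingPipeline(...) | Full orchestration pipeline |
| generateHitlReviewRequest(...) | Build human review request payload |
| processHumanDecision(...) | Handle human feedback |
| updateActiveLearning(...) | Update learning stats from decisions |
| formatMatchingSummary(...) | Human-readable result summary |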