Hybrid Smart Fill Skill

This skill enables intelligent template filling using a hybrid retrieval algorithm that combines BM25 lexical ranking with TF-IDF vector similarity. It automatically matches template fields with knowledge base data and fills Word documents (.docx) and Excel spreadsheets (.xlsx) with high precision.

When to Use This Skill

Use this skill when:

1. Batch Template Filling: Users need to fill multiple Word or Excel templates with data from a knowledge base
2. High Precision Required: Simple keyword matching is insufficient; semantic understanding is needed for accurate field matching
3. Knowledge Base Available: A structured knowledge base (JSON format) containing fields and values is available
4. Complex Field Names: Template fields require semantic matching (e.g., "法人代表" matches "法定代表人")
5. Placeholder Replacement: Templates contain placeholders like "XX基金" that need to be replaced with actual company names

Common trigger phrases:

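- "填充模板" (fill template), "批量填充" (batch fill), "智能填充" (smart fill)
- "使用知识库" (use the knowledge base), "匹配字段" (match fields)
- "向量检索" (vector retrieval), "语义检索" (semantic retrieval), "BM25", "TF-IDF"
- "自动填写Word/Excel模板" (automatically fill Word/Excel templates)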

Core Concepts

Hybrid Retrieval System

This skill uses a hybrid retrieval approach combining two algorithms:

1. BM25 (Best Matching 25): Statistical ranking function based on term frequency and document frequency

- Accounts for document length normalization
- Penalizes overly common terms
- Score per query term: IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × doc_length / avgdl)), summed over the query terms, where k1 and b are tuning parameters and avgdl is the average document length

2. TF-IDF (Term Frequency-Inverse Document Frequency): Vector similarity search

- Converts text to vector space
- Calculates cosine similarity between query and documents
- Tolerates partial term overlap, so matching goes beyond exact keywords

3. Hybrid Score: Weighted fusion of both results

- Formula: final_score = 0.5 × BM25_score + 0.5 × TF-IDF_score
- Balances precision (BM25) and looser similarity matching (TF-IDF)
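The fusion described above can be sketched in plain Python. This is a minimal illustration, not the code in scripts/vector_kb.py: the function names, the k1 and b defaults, the smoothed IDF variants, and the min-max normalization applied before mixing are all assumptions made so that the two scores share a 0 to 1 range.

```python
import math
from collections import Counter


def tokenize(text):
    # Character-level tokens (assumed, mirroring the simple tokenization
    # described under Limitations); whitespace is dropped.
    return [ch for ch in text if not ch.isspace()]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # k1 and b are conventional defaults, not values taken from the skill.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def tfidf_cosine_scores(query, docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))

    def vectorize(tokens):
        # Smoothed IDF so that unseen query terms do not divide by zero.
        counts = Counter(tokens)
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vectorize(tokenize(query))
    return [cosine(q, vectorize(toks)) for toks in tokenized]


def min_max(values):
    # Rescale to [0, 1]; a constant list maps to all zeros.
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_scores(query, docs, bm25_weight=0.5, tfidf_weight=0.5):
    bm25 = min_max(bm25_scores(query, docs))
    tfidf = min_max(tfidf_cosine_scores(query, docs))
    return [bm25_weight * a + tfidf_weight * c for a, c in zip(bm25, tfidf)]


keys = ["法定代表人", "联系电话", "注册资本"]
scores = hybrid_scores("法人代表", keys)
best_key = keys[scores.index(max(scores))]
print(best_key)
```

Because raw BM25 scores are unbounded while cosine similarity lies between 0 and 1, some rescaling step is needed before a 0.5/0.5 mix is meaningful; min-max is only one possible choice.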

Matching Strategy

The system uses a multi-level matching strategy:

1. Exact Match: Field name exactly matches knowledge base key
2. Containment Match: Field name contains or is contained in knowledge base key
3. Keyword Match: Multi-keyword combination matching
4. Special Handling: Auto-replacement of placeholders (e.g., "XX基金" → "国寿安保基金")
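The levels above can be read as a cascade in which the first level that succeeds wins. The sketch below illustrates that reading only; match_field, replace_placeholder, the keywords argument, and the sample values are hypothetical and are not taken from the bundled scripts.

```python
def match_field(field, kb, keywords=None):
    """Return (level, value) for the first matching level, else (None, None)."""
    # Level 1: the field name is itself a knowledge base key.
    if field in kb:
        return "exact", kb[field]
    # Level 2: one name contains the other.
    for key, value in kb.items():
        if field in key or key in field:
            return "containment", value
    # Level 3: every supplied keyword occurs in the key.
    if keywords:
        for key, value in kb.items():
            if all(word in key for word in keywords):
                return "keyword", value
    return None, None


def replace_placeholder(text, placeholder="XX基金", company="国寿安保基金"):
    # Special handling: swap the generic placeholder for the real company name.
    return text.replace(placeholder, company)


# Made-up sample data for illustration only.
kb = {"法定代表人": "张三", "公司注册地址": "北京市", "2024年总资产": "100"}

print(match_field("法定代表人", kb))
print(match_field("注册地址", kb))
print(match_field("资产总额", kb, keywords=["总", "资产"]))
print(replace_placeholder("XX基金管理有限公司"))
```

Fields that fall through every level would then be candidates for the hybrid retrieval scores rather than being left blank.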

How to Use This Skill

Step 1: Prepare Knowledge Base

Ensure the knowledge base is a JSON file with the following structure:

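    {
      "filename.xlsx": {
        "filename": "filename.xlsx",
        "type": "xlsx",
        "content": "=== Sheet: SheetName ===\nA1[Header1] | A2[Value1] | ..."
      },
      "filename.docx": {
        "filename": "filename.docx",
        "type": "docx",
        "content": {
          "paragraphs": ["Text content..."],
          "tables": [...]
        }
      }
    }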

Supported formats in JSON:

- xlsx: Text-based Excel format with A1[Value] | B2[Value] pattern
- docx: Dictionary or list format containing paragraphs and table data
- doc: Plain text format
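Before running the filler it can help to check that a knowledge base has the expected shape. The following is a small sketch under the assumption that each top-level entry is an object with filename, type, and content keys; check_knowledge_base and the sample entries are hypothetical and not part of the bundled scripts.

```python
import json

SUPPORTED_TYPES = {"xlsx", "docx", "doc"}
REQUIRED_KEYS = ("filename", "type", "content")


def check_knowledge_base(kb):
    """Return a list of readable problems; an empty list means the shape looks fine."""
    problems = []
    for name, entry in kb.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"{name}: missing '{key}'")
        kind = entry.get("type")
        if kind is not None and kind not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported type '{kind}'")
        # xlsx content is expected to be the text rendering, not nested JSON.
        if kind == "xlsx" and "content" in entry and not isinstance(entry["content"], str):
            problems.append(f"{name}: xlsx content should be a string")
    return problems


# Made-up sample: one valid entry and one deliberately broken entry.
raw = '''
{
  "report.xlsx": {"filename": "report.xlsx", "type": "xlsx",
                  "content": "=== Sheet: Sheet1 ===\\nA1[Header1] | A2[Value1]"},
  "notes.pdf": {"filename": "notes.pdf", "type": "pdf"}
}
'''
problems = check_knowledge_base(json.loads(raw))
for problem in problems:
    print(problem)
```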

Step 2: Run the Smart Filler

Execute the main filling script:

    python scripts/smart_filler.py

The script will:

1. Load and parse the knowledge base JSON
2. Extract structured data (89+ typical fields)
3. Build hybrid retrieval index
4. Process all template files in the template directory
5. Fill matched fields and replace placeholders
6. Save filled files to output directory

Step 3: Review Results

The system generates:

- Filled templates in the output directory (marked with the "已填写" ("filled") suffix)
- A fill log showing all field matches and replacements
- Statistics: total fields filled, success rate, and "XX基金" replacement count

Bundled Scripts

scripts/vector_kb.py

Purpose: Core hybrid retrieval engine implementation

Key Classes:

- BM25Retriever: BM25 ranking algorithm implementation
- TFIDFRetriever: TF-IDF vector search implementation
- HybridRetriever: Fusion of both retrieval methods
- VectorKnowledgeBase: Knowledge base management and indexing

Usage Example:
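
    from vector_kb import VectorKnowledgeBase

    # Initialize and load the knowledge base
    kb = VectorKnowledgeBase()
    kb.load_knowledge_base("knowledge_base.json").build_index()

    # Search for a value
    results = kb.search("法人代表", top_k=5)
    for result in results:
        print(f"Score: {result['score']}, Value: {result['document']}")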

scripts/smart_filler.py

Purpose: Main template filling orchestration

Key Classes:

- TextExcelParser: Parses text-based Excel content
- SmartFillSystem: Orchestrates the entire filling process

Usage Example:
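
    from smart_filler import SmartFillSystem

    # Configure paths
    system = SmartFillSystem(
        kb_path="knowledge_base.json",
        template_dir="templates/",
        output_dir="filled/",
    )

    # Initialize and process
    system.load_kb()
    system.process_all()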

Configuration:

- kb_path: Path to knowledge base JSON file
- template_dir: Directory containing template files
- output_dir: Directory for filled output files

Reference Documentation

Knowledge Base Format Requirements

Excel Content Format (text-based):

    === Sheet: SheetName ===
    A1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]
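A parser for this text layout might look like the sketch below. It is an illustration only, not the TextExcelParser implementation: parse_text_excel and header_value_pairs are hypothetical names, and pairing row 1 headers with row 2 values is an assumption drawn from the example layout.

```python
import re

CELL = re.compile(r"([A-Z]+)(\d+)\[([^\]]*)\]")
SHEET = re.compile(r"^=== Sheet: (.+?)(?: ===)?$")


def parse_text_excel(content):
    """Map sheet name -> {cell reference -> value}."""
    sheets = {}
    current = None
    for line in content.splitlines():
        line = line.strip()
        header = SHEET.match(line)
        if header:
            current = sheets.setdefault(header.group(1), {})
            continue
        if current is None:
            continue  # ignore anything before the first sheet marker
        for col, row, value in CELL.findall(line):
            current[f"{col}{row}"] = value
    return sheets


def header_value_pairs(cells):
    # Assumes headers sit in row 1 and their values directly below in row 2.
    pairs = {}
    for ref, header in cells.items():
        m = re.fullmatch(r"([A-Z]+)1", ref)
        if m and f"{m.group(1)}2" in cells:
            pairs[header] = cells[f"{m.group(1)}2"]
    return pairs


content = "=== Sheet: SheetName ===\nA1[Header1] | A2[Value1] | B1[Header2] | B2[Value2]"
sheets = parse_text_excel(content)
pairs = header_value_pairs(sheets["SheetName"])
print(pairs)
```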

Document Content Format (field extraction):

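- Use regex patterns to extract: 字段名[：:\s]*值 (the field name, an optional colon or whitespace, then the value)
- Supported fields: 法人代表 (legal representative), 联系电话 (contact phone), 地址 (address), 注册资本 (registered capital), 统一社会信用代码 (unified social credit code), etc.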
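As a rough sketch of this kind of extraction, the snippet below searches for each field name followed by an optional separator and captures the rest of the line. extract_fields and the sample text are hypothetical; the skill's actual patterns may differ.

```python
import re

FIELDS = ["法人代表", "联系电话", "地址", "注册资本", "统一社会信用代码"]


def extract_fields(text, fields=FIELDS):
    found = {}
    for name in fields:
        # Field name, optional full-width or ASCII colon or whitespace, then the rest of the line.
        m = re.search(re.escape(name) + r"[：:\s]*([^\n\r]+)", text)
        if m:
            found[name] = m.group(1).strip()
    return found


# Made-up sample text for illustration only.
sample = "法人代表：张三\n联系电话: 010-12345678\n注册资本 1000万元"
extracted = extract_fields(sample)
print(extracted)
```

Because the separator class includes whitespace, a field with an empty value would capture the following line instead, so real input may need a stricter separator.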

Year-based Data:

- Automatic organization by year (e.g., "2024年总资产")
- Cleaned headers (year removed) for better matching
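One way to picture the year handling is the sketch below, which groups values by a four-digit year found in the header and stores them under the header with the year stripped. organize_by_year and the sample records are hypothetical, and headers without a year are simply skipped here.

```python
import re
from collections import defaultdict

YEAR = re.compile(r"(\d{4})年")


def organize_by_year(records):
    """Group {header: value} into {year: {cleaned_header: value}}."""
    by_year = defaultdict(dict)
    for header, value in records.items():
        m = YEAR.search(header)
        if m:
            # Drop the "2024年" part so the header matches year-free template fields.
            cleaned = (header[:m.start()] + header[m.end():]).strip()
            by_year[m.group(1)][cleaned] = value
    return dict(by_year)


# Made-up sample data for illustration only.
records = {"2024年总资产": "100", "2023年总资产": "90", "联系电话": "010-12345678"}
grouped = organize_by_year(records)
print(grouped)
```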

Performance Characteristics

Based on real-world testing:

Metric	Value
Knowledge Base Fields	89+
Files Processed

Common Issues and Solutions

Issue: Low Match Rate

Cause: Knowledge base content format incompatible

Solution: Ensure Excel content uses A1[Value] format; check JSON structure

Issue: Wrong Value Filled

Cause: Field name ambiguity

Solution: Adjust hybrid retrieval weights; use more specific field names in templates

Issue: Encoding Errors

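Cause: The knowledge base file is not UTF-8 encoded, or the console's output encoding cannot represent Chinese characters

Solution: Save the knowledge base JSON as UTF-8; call sys.stdout.reconfigure(encoding='utf-8') at the start of scripts so that printing Chinese text does not raise encoding errors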
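A defensive loading routine could look like the following sketch. read_kb_json is a hypothetical helper, and choosing the "utf-8-sig" codec (which reads plain UTF-8 and also drops a leading byte order mark) is an assumption about what typical input files need.

```python
import json
import os
import sys
import tempfile


def read_kb_json(path):
    # "utf-8-sig" decodes ordinary UTF-8 and strips a byte order mark if one is present.
    with open(path, encoding="utf-8-sig") as fh:
        return json.load(fh)


# Make printing Chinese text safe on consoles whose default encoding is not UTF-8.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

# Round trip through a temporary file to show the loader in use (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kb.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"法人代表": "张三"}, fh, ensure_ascii=False)
    kb = read_kb_json(path)

print(kb["法人代表"])
```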

Advanced Usage

Custom Retrieval Weights

Modify the hybrid retrieval weight balance in HybridRetriever:

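    # Default: BM25 0.5, TF-IDF 0.5
    # To emphasize semantic matching:
    self.bm25_weight = 0.3
    self.tfidf_weight = 0.7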

Custom Field Extraction

Extend TextExcelParser._extract_from_text() to support additional patterns:

    patterns = {
        "new_field": r"新字段[：:\s]*([^\n\r]+)",
        # Add more patterns...
    }

Batch Processing

Process multiple knowledge bases:

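    from smart_filler import SmartFillSystem

    kb_files = ["kb1.json", "kb2.json", "kb3.json"]
    for kb_file in kb_files:
        # Each knowledge base gets its own output directory
        system = SmartFillSystem(kb_file, "templates/", f"filled_{kb_file}/")
        system.load_kb()
        system.process_all()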

Limitations

1. No Machine Learning Embeddings: Uses TF-IDF (not BERT/Transformer embeddings) for lightweight deployment
2. Chinese Tokenization: Simple character-based tokenization (not jieba)
3. Excel Format: Requires text-based format; binary Excel files need pre-processing
4. Context Awareness: Limited cell-to-cell context understanding

Future Enhancements

Potential improvements for future versions:

1. Deep Learning Embeddings: Integrate sentence-transformers for true semantic vectors
2. Cross-Modal Fusion: Combine table structure information with text matching
3. Adaptive Weighting: Learn optimal BM25/TF-IDF weights from user feedback
4. Domain Adaptation: Build domain-specific vocabularies for finance, legal, etc.

References

For deeper understanding:

- BM25 Algorithm: Robertson, S. E., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond.
- TF-IDF: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval.
- Hybrid Retrieval: Combining multiple evidence sources in search systems
