CA File Processor
This skill processes the four most common file formats used by Indian CA firms and extracts structured information from them for analysis, summarisation, and answering queries.
Supported formats
- PDF: GST returns, ITR acknowledgements, audit reports, scanned invoices (both text-layer and scanned, the latter via OCR)
- Excel (.xlsx / .xls): trial balance, P&L, balance sheets, payroll registers, GST workings
- CSV: bank statement exports (HDFC, ICICI, SBI), GSTR-2B downloads, Tally exports
- Images (.jpg / .png): WhatsApp invoice photos, scanned Form 16, cheque images
How to use
When a file is attached or uploaded, run the router script on its path:

    python3 scripts/skill_router.py <file_path>

The router auto-detects the file type and calls the correct processor. It returns a structured JSON dict.
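The router's internals are not shown here, so the following is only a minimal sketch of extension-based dispatch. The `PROCESSORS` mapping, the `route` function, and the keys of the returned dict are all hypothetical names chosen for illustration.

```python
import json
from pathlib import Path

# Hypothetical mapping from file extension to processor name; the real
# router's detection logic is not documented in this file.
PROCESSORS = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".csv": "csv",
    ".jpg": "image",
    ".png": "image",
}


def route(file_path: str) -> dict:
    """Pick a processor from the file extension and return a JSON-ready dict."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    processor = PROCESSORS.get(suffix)
    if processor is None:
        return {"file": file_path, "status": "unsupported", "processor": None}
    return {"file": file_path, "status": "ok", "processor": processor}


print(json.dumps(route("gstr3b_march.PDF")))
```

A real router would likely also inspect file contents (for example, to tell a text-layer PDF from a scanned one), which an extension check alone cannot do.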
What to do with the output
Once the script returns output, use it to:
- Answer the user's question about the document
- Extract specific fields they asked for (GSTIN, totals, dates)
- Summarise the document in plain language
- Flag anomalies or missing information
- Compare figures across multiple documents
Field extraction — what gets detected automatically
For invoices and PDFs:
- GSTIN (supplier and recipient)
- Invoice number and date
- Total amount / grand total
- PAN
- Email and phone
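Several of the fields listed above have fixed formats, so they can be picked out of extracted text with regular expressions. The sketch below is one possible approach, not the skill's actual extraction code. `extract_fields` is a hypothetical helper and the sample identifiers are made up. The patterns check shape only and do not validate the GSTIN check character.

```python
import re

# GSTIN: 2-digit state code + 10-character PAN + entity code + "Z" + check character.
GSTIN_RE = re.compile(r"\b\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b")
# PAN: 5 letters, 4 digits, 1 letter. The \b anchors stop it matching inside a GSTIN.
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")
# Deliberately loose email pattern, good enough for a first pass.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_fields(text: str) -> dict:
    """Return every GSTIN, PAN, and email address found in the text."""
    return {
        "gstins": GSTIN_RE.findall(text),
        "pans": PAN_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
    }


# Illustrative text only; these identifiers are invented for the example.
sample = (
    "Supplier GSTIN: 27AAPFU0939F1ZV  Recipient GSTIN: 29AAGCB7383J1Z4\n"
    "PAN: AAPFU0939F  Contact: accounts@example.com"
)
print(extract_fields(sample))
```

Pattern matching alone cannot tell the supplier's GSTIN from the recipient's. That needs nearby labels or document layout.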
For bank statements (CSV):
- Total debits and credits
- Date range of transactions
- Detected bank format
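The totals and date range above could be computed along these lines. This sketch uses only the standard library so it runs anywhere, although the skill itself lists pandas. The column names (`Date`, `Debit`, `Credit`), the date format, and the `summarise_statement` helper are assumptions; real HDFC, ICICI, and SBI exports each use their own headers.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Hypothetical export layout, used only for this example.
SAMPLE_CSV = """Date,Narration,Debit,Credit
01/04/2024,NEFT FROM CLIENT A,,50000.00
03/04/2024,GST PAYMENT,18000.00,
15/04/2024,OFFICE RENT,25000.00,
"""


def summarise_statement(csv_text: str) -> dict:
    """Sum debits and credits and find the first and last transaction dates."""
    total_debits = Decimal("0")
    total_credits = Decimal("0")
    dates = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        dates.append(datetime.strptime(row["Date"], "%d/%m/%Y").date())
        # Empty cells count as zero; Decimal avoids float rounding on money.
        total_debits += Decimal(row["Debit"] or "0")
        total_credits += Decimal(row["Credit"] or "0")
    return {
        "total_debits": str(total_debits),
        "total_credits": str(total_credits),
        "date_from": min(dates).isoformat() if dates else None,
        "date_to": max(dates).isoformat() if dates else None,
    }


print(summarise_statement(SAMPLE_CSV))
```

Amounts with thousands separators (for example `1,18,000.00`) would need the commas stripped before conversion.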
For Excel files:
- Document type (trial balance / P&L / balance sheet / payroll / GST workings / ledger)
- Sheet names and row counts
- Preview of header rows
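One plausible way to guess the document type is to score header text against keyword lists. The skill's real detection rules are not documented here, so the keyword lists and the `detect_document_type` function below are hypothetical.

```python
# Hypothetical keyword lists per document type; a real classifier would
# need tuning against actual client workbooks.
DOC_TYPE_KEYWORDS = {
    "trial balance": ["trial balance", "opening balance", "closing balance"],
    "P&L": ["profit and loss", "profit & loss", "gross profit"],
    "balance sheet": ["balance sheet", "liabilities", "shareholders"],
    "payroll": ["basic salary", "net pay", "employee"],
    "GST workings": ["cgst", "sgst", "igst", "taxable value"],
    "ledger": ["ledger", "voucher", "particulars"],
}


def detect_document_type(header_cells) -> str:
    """Return the type whose keywords appear most often in the header cells."""
    text = " ".join(str(c).lower() for c in header_cells if c is not None)
    scores = {
        doc_type: sum(keyword in text for keyword in keywords)
        for doc_type, keywords in DOC_TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(detect_document_type(["Invoice No", "Taxable Value", "CGST", "SGST", "IGST"]))
```

The header cells would come from the first few rows of each sheet, for instance those read with openpyxl or pandas.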
OCR notes
- Text-layer PDFs are read directly (fast, accurate)
- Scanned PDFs and images go through Tesseract OCR (English + Hindi)
- Confidence is rated high / medium / low in the output
- Always flag low-confidence results to the user and ask for confirmation on numeric fields
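The rating and flagging rules above can be sketched as follows. Tesseract reports a per-word confidence from 0 to 100 (with -1 for non-text boxes), which pytesseract exposes through `image_to_data`. The thresholds of 85 and 60 and both function names are assumptions made for illustration; the skill's actual cut-offs are not stated.

```python
import re


def rate_confidence(word_confidences) -> str:
    """Map per-word OCR confidences (0-100, -1 = no text) to high/medium/low."""
    valid = [c for c in word_confidences if c >= 0]
    if not valid:
        return "low"  # nothing readable counts as low confidence
    mean = sum(valid) / len(valid)
    if mean >= 85:  # assumed threshold
        return "high"
    if mean >= 60:  # assumed threshold
        return "medium"
    return "low"


def fields_needing_confirmation(fields: dict, rating: str) -> list:
    """When the rating is low, list the fields whose values contain digits."""
    if rating != "low":
        return []
    return [name for name, value in fields.items() if re.search(r"\d", str(value))]


rating = rate_confidence([40, 55, -1])
print(rating, fields_needing_confirmation(
    {"total": "1,18,000.00", "supplier": "Acme Traders"}, rating))
```

Averaging over the whole page is a coarse measure. Rating each extracted field by the confidence of its own words would catch a single misread digit more reliably.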
Trust statement
This skill runs entirely locally on your server. No data is sent to any external service. All processing happens via open-source Python libraries (PyMuPDF, pytesseract, openpyxl, pandas).