Tool Calling (Deep Workflow)
Tool calling is contract design between a probabilistic planner (the model) and deterministic systems. Failures are usually schema, permissions, or ambiguity—not the LLM “being dumb.”
When to Offer This Workflow
Trigger conditions:
- Designing OpenAI/Anthropic-style functions, MCP tools, or internal JSON tool protocols
- Debugging wrong arguments, hallucinated parameters, or unsafe side effects
- Building agents with many tools—selection and routing problems
Initial offer:
Use six stages: (1) define tool surface, (2) schema & validation, (3) authz & safety, (4) execution semantics, (5) errors & observability, (6) evaluation & regression. Confirm side-effect class (read-only vs write).
Stage 1: Define Tool Surface
Goal: Minimize tools; maximize clarity per tool.
Principles
- One action per tool when possible; avoid mega-tools with mode flags unless necessary
- Use descriptive names: search_orders, not do_stuff
- Prefer idempotent operations where writes exist; separate read vs write clearly
Anti-patterns
- Exposing raw SQL or shell to the model
- Too many overlapping tools → routing errors
Exit condition: A tool list with a table of purpose, inputs, outputs, and side effects for each tool.
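One way to keep the Stage 1 table from drifting is to hold it as data next to the code. The sketch below is only an illustration: ToolSpec, SideEffect, cancel_order, and the field values are hypothetical names chosen for this example, not part of any SDK.

```python
from dataclasses import dataclass
from enum import Enum


class SideEffect(Enum):
    READ_ONLY = "read-only"
    WRITE = "write"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    purpose: str
    inputs: tuple[str, ...]
    outputs: str
    side_effect: SideEffect


# Hypothetical tool surface: one action per tool, reads and writes separated.
TOOLS = (
    ToolSpec("search_orders", "Find orders by customer and status",
             ("customer_id", "status"), "list of order summaries",
             SideEffect.READ_ONLY),
    ToolSpec("cancel_order", "Cancel a single order",
             ("order_id", "idempotency_key"), "cancellation receipt",
             SideEffect.WRITE),
)


def find_duplicate_names(tools):
    """Return tool names that appear more than once (a routing hazard)."""
    seen, dupes = set(), set()
    for tool in tools:
        (dupes if tool.name in seen else seen).add(tool.name)
    return sorted(dupes)


def write_tools(tools):
    """Names of tools with side effects; these need authz and idempotency review."""
    return [t.name for t in tools if t.side_effect is SideEffect.WRITE]
```

Calling write_tools(TOOLS) gives the subset that Stages 3 and 4 should review first.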
Stage 2: Schema & Validation
Goal: Arguments are typed, constrained, and machine-validated before execution.
Practices
- JSON Schema: enums, min/max, patterns, required fields
- Normalize dates, IDs, currencies server-side—never trust model formatting alone
- Default behaviors explicit in description + schema
Descriptions
- Tool and parameter docstrings are what the model sees: use precise language and include examples of valid args
Exit condition: Validator rejects invalid args with actionable errors back to model or orchestrator.
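The validation step can be sketched with the standard library alone. The schema below is written in JSON Schema style, but the validator is a hand-rolled subset (required, enum, pattern, minimum, maximum, unknown fields), not a full JSON Schema implementation; a real system would likely use a dedicated validation library. The field names and the cus_ ID format are assumptions for illustration.

```python
import re

# Hypothetical schema for a search_orders tool, written in JSON Schema style.
SEARCH_ORDERS_SCHEMA = {
    "type": "object",
    "required": ["customer_id"],
    "additionalProperties": False,
    "properties": {
        "customer_id": {"type": "string", "pattern": r"^cus_[0-9]+$"},
        "status": {"type": "string", "enum": ["open", "shipped", "cancelled"]},
        "limit": {"type": "integer", "minimum": 1, "maximum": 100},
    },
}

_TYPES = {"string": str, "integer": int}


def validate_args(args, schema):
    """Return a list of actionable error strings; an empty list means valid."""
    errors = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field '{name}'")
    for name, value in args.items():
        rule = props.get(name)
        if rule is None:
            # Hallucinated parameters land here.
            errors.append(f"unknown field '{name}'; allowed: {sorted(props)}")
            continue
        expected = _TYPES[rule["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(f"'{name}' must be of type {rule['type']}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"'{name}' must be one of {rule['enum']}")
        if "pattern" in rule and not re.search(rule["pattern"], value):
            errors.append(f"'{name}' must match pattern {rule['pattern']}")
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"'{name}' must be >= {rule['minimum']}")
        if "maximum" in rule and value > rule["maximum"]:
            errors.append(f"'{name}' must be <= {rule['maximum']}")
    return errors
```

Returning every problem at once, rather than failing on the first, gives the model or orchestrator enough detail to repair the call in a single retry.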
Stage 3: Authorization & Safety
Goal: Every tool call runs as some principal with least privilege.
Patterns
- User-scoped credentials carried from session; tool implementation re-checks ownership (e.g., order_id belongs to user)
- Admin tools behind explicit allowlists and human approval when needed
- Rate limits per user + global circuit breakers
Data exfiltration
- Tools that read sensitive data need output filtering and logging policies
Exit condition: Threat brief: “What if model is tricked into calling tool X?” answered.
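The ownership re-check and the admin allowlist could look roughly like the following. Everything here is a stand-in: Principal, ORDER_OWNERS, ADMIN_TOOLS, and refund_order are hypothetical, and the in-memory dict takes the place of a real order store. Letting admins bypass the ownership check is a simplification of this sketch, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    user_id: str
    roles: frozenset = frozenset()


class PermissionDenied(Exception):
    pass


# Hypothetical in-memory data standing in for a real order store.
ORDER_OWNERS = {"ord_1": "alice", "ord_2": "bob"}
ADMIN_TOOLS = frozenset({"refund_order"})


def authorize(principal, tool_name, args):
    """Raise PermissionDenied unless this principal may make this call."""
    is_admin = "admin" in principal.roles
    if tool_name in ADMIN_TOOLS and not is_admin:
        raise PermissionDenied(f"{tool_name} requires the admin role")
    order_id = args.get("order_id")
    if order_id is None or is_admin:
        return
    if ORDER_OWNERS.get(order_id) != principal.user_id:
        # Same answer for "missing" and "not yours" avoids leaking existence.
        raise PermissionDenied("order not found for this user")
```

The check runs inside the tool implementation on every call, so a model that is tricked into passing another user's order_id is still denied.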
Stage 4: Execution Semantics
Goal: Clear transactionality, retries, and idempotency.
Design
- Idempotency keys for writes; dedupe window
- Timeouts and cancellation propagation
- Ordering: parallel safe vs must be serial
Long operations
- Async jobs with a poll tool vs blocking calls: prefer non-blocking for UX and cost
Exit condition: Semantics documented for retry behavior (at-least-once delivery common).
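Under at-least-once delivery, a retried write can arrive twice. A minimal sketch of idempotency keys with a dedupe window follows; IdempotentExecutor is a hypothetical name, the 300-second default is an arbitrary assumption, and the in-memory dict is single-process and not thread-safe, so a production version would need shared storage and locking.

```python
import time


class IdempotentExecutor:
    """Replays the stored result when a key repeats inside the dedupe window."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._results = {}  # key -> (stored_at, result)

    def run(self, key, operation):
        now = self._clock()
        # Drop entries older than the window so the cache stays bounded.
        self._results = {k: v for k, v in self._results.items()
                         if now - v[0] < self._window}
        if key in self._results:
            return self._results[key][1]
        result = operation()
        self._results[key] = (now, result)
        return result
```

A retry that reuses the same key inside the window gets the first result back without running the write again; once the window has passed, the same key executes again, which is why the window length belongs in the documented semantics.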
Stage 5: Errors & Observability
Goal: Model (or orchestrator) can recover from failures without leaking internals.
Error messages
- Structured error codes: ORDER_NOT_FOUND, PERMISSION_DENIED
- Hints for the model on how to fix the call, without exposing stack traces to end users
Observability
- Trace IDs across tool calls; audit log for write tools (who/when/args hash)
Exit condition: Dashboards/alerts on tool error rate, latency, denials.
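A structured error envelope might be wired up along these lines. ToolError, call_tool, get_order, the INTERNAL_ERROR code, and the shape of the result dict are assumptions made for this sketch rather than any vendor's format.

```python
import logging
import uuid

log = logging.getLogger("tools")


class ToolError(Exception):
    """An expected failure with a stable code and a hint for the model."""

    def __init__(self, code, hint):
        super().__init__(code)
        self.code = code
        self.hint = hint


def get_order(order_id):
    # Hypothetical tool body; only ord_1 exists in this example.
    if order_id != "ord_1":
        raise ToolError("ORDER_NOT_FOUND",
                        "Call search_orders to find a valid order_id.")
    return {"order_id": order_id, "status": "open"}


def call_tool(fn, args):
    """Run a tool and return a structured result that is safe to show the model."""
    trace_id = uuid.uuid4().hex
    try:
        return {"ok": True, "trace_id": trace_id, "result": fn(**args)}
    except ToolError as err:
        return {"ok": False, "trace_id": trace_id,
                "code": err.code, "hint": err.hint}
    except Exception:
        # Full stack trace goes to the log only; the caller sees a generic code.
        log.exception("tool failed trace_id=%s", trace_id)
        return {"ok": False, "trace_id": trace_id, "code": "INTERNAL_ERROR",
                "hint": "Retry later or report the trace_id."}
```

The trace_id appears in both the log line and the returned envelope, so an operator can connect a model-visible failure to the internal stack trace without exposing it.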
Stage 6: Evaluation & Regression
Goal: Tool changes are tested like APIs.
Harness
- Golden conversations with expected tool calls (args normalized)
- Adversarial prompts attempting privilege escalation
- Version tools; deprecate with compatibility window
Exit condition: CI or manual eval suite before deploying new tools/schemas.
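A golden-conversation check can be as small as comparing normalized tool calls. In this sketch the call format (a dict with name and args) and the normalization rules (lowercased tool name, whitespace-trimmed string values, key order ignored) are assumptions; a real harness would pick rules that match its own tools.

```python
def normalize_call(call):
    """Canonical form: lowercase tool name, trimmed string values, sorted keys."""
    args = {k: v.strip() if isinstance(v, str) else v
            for k, v in call["args"].items()}
    return (call["name"].lower(), tuple(sorted(args.items())))


def diff_tool_calls(expected, actual):
    """Return human-readable mismatches between two ordered call lists."""
    problems = []
    if len(expected) != len(actual):
        problems.append(f"expected {len(expected)} calls, got {len(actual)}")
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if normalize_call(exp) != normalize_call(act):
            problems.append(f"call {i}: expected {exp}, got {act}")
    return problems
```

An empty list means the recorded run matches the golden conversation; anything else can fail the CI job with the mismatch text as the message.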
Final Review Checklist
- [ ] Minimal orthogonal tool set
- [ ] Strict schema validation on server
- [ ] AuthZ enforced per call; sensitive reads controlled
- [ ] Idempotency and timeouts defined for writes
- [ ] Structured errors + observability + eval harness
Tips for Effective Guidance
- Treat tool descriptions as API docs the model reads; iterate on wording like UX copy.
- Recommend two-step patterns for dangerous ops: propose → confirm (human or policy).
- When using MCP, same discipline—server must validate everything.
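The propose → confirm pattern can be sketched as a gate that holds dangerous calls until a separate confirmation arrives. ConfirmationGate and its methods are hypothetical; who is allowed to call confirm (a human reviewer or a policy engine) is left out and would need its own authorization.

```python
import secrets


class ConfirmationGate:
    """Holds proposed dangerous calls until a human or policy confirms them."""

    def __init__(self):
        self._pending = {}  # token -> (tool_name, args)

    def propose(self, tool_name, args):
        """Record the intended call and return a token; nothing executes yet."""
        token = secrets.token_urlsafe(16)
        self._pending[token] = (tool_name, dict(args))
        return token

    def confirm(self, token, execute):
        """Run the stored call exactly as proposed, then retire the token."""
        if token not in self._pending:
            raise KeyError("unknown or already used confirmation token")
        # pop() makes each token single-use, so a replayed confirm fails.
        tool_name, args = self._pending.pop(token)
        return execute(tool_name, args)
```

Because confirm executes the stored arguments rather than fresh ones, the model cannot change what was approved between the two steps.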
Handling Deviations
- Read-only RAG: fewer semantic risks, but still validate query args and guard against injection into search backends.
- Local tools (filesystem): sandbox, path allowlists, size limits.
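For local filesystem tools, a path allowlist plus a size limit might be sketched as below. This is an application-level check only, not an OS sandbox; resolve_allowed, read_file_tool, and the 1 MB MAX_BYTES value are assumptions for illustration. It relies on Path.is_relative_to, available from Python 3.9.

```python
from pathlib import Path

MAX_BYTES = 1_000_000  # assumed size limit for this sketch


def resolve_allowed(requested, allowed_roots):
    """Return the resolved path if it sits inside an allowed root, else raise."""
    # resolve() collapses ".." segments and follows symlinks before the check.
    path = Path(requested).resolve()
    for root in allowed_roots:
        if path.is_relative_to(Path(root).resolve()):
            return path
    raise PermissionError(f"path outside allowed roots: {requested}")


def read_file_tool(requested, allowed_roots):
    """Read a text file only if it is allowlisted and under the size limit."""
    path = resolve_allowed(requested, allowed_roots)
    if path.stat().st_size > MAX_BYTES:
        raise ValueError("file exceeds size limit")
    return path.read_text(encoding="utf-8")
```

Resolving before comparing is the important step: a request such as allowed_dir/../secrets resolves outside the root and is rejected.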