AI Guardrails (Deep Workflow)
Guardrails turn product and legal policy into enforced behavior (blocking, rewriting, logging, and human review) while keeping false positives and latency within acceptable limits.
When to Offer This Workflow
Trigger conditions:
- Launching consumer-facing LLM features
- Jailbreak attempts, policy violations, or PII leakage risks
- Region-specific compliance (minors, regulated advice)
Initial offer:
Use six stages: (1) policy scope, (2) threat model, (3) controls stack, (4) implementation patterns, (5) monitoring & review, (6) iteration & appeals. Confirm the latency budget and jurisdictions.
Stage 1: Policy Scope
Goal: Define prohibited categories (hate, sexual content, violence, self-harm, malware instructions, etc.) and required disclaimers for sensitive domains (medical, legal).
Exit condition: Policy document owned by legal/product; escalation path for gray areas.
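One way to keep the policy document enforceable is to maintain a machine-readable copy alongside it. The following is a minimal sketch, not a prescribed schema: the category names come from the list above, while `PolicyCategory`, the `owner` values, the disclaimer wording, and both helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicyCategory:
    name: str
    owner: str                        # team accountable for the category (hypothetical values)
    prohibited: bool                  # True: content in this category is disallowed
    disclaimer: Optional[str] = None  # text required on sensitive-domain answers


# Illustrative entries only; the real list comes from the legal/product policy document.
POLICY = {
    c.name: c
    for c in [
        PolicyCategory("hate", owner="legal", prohibited=True),
        PolicyCategory("self_harm", owner="legal", prohibited=True),
        PolicyCategory("malware_instructions", owner="security", prohibited=True),
        PolicyCategory("medical", owner="product", prohibited=False,
                       disclaimer="This is general information, not medical advice."),
        PolicyCategory("legal", owner="product", prohibited=False,
                       disclaimer="This is general information, not legal advice."),
    ]
}


def required_disclaimer(category: str) -> Optional[str]:
    """Return the disclaimer for a sensitive domain, or None if none is required."""
    entry = POLICY.get(category)
    return entry.disclaimer if entry else None


def needs_escalation(category: str) -> bool:
    """Gray areas: a category missing from the policy goes to the escalation path."""
    return category not in POLICY
```

Keeping owners in the same structure makes the exit condition checkable: every category has a named owner, and anything outside the table is routed to escalation rather than guessed at.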
Stage 2: Threat Model
Goal: Identify adversaries (prompt injection, data exfiltration, tool abuse) and assets (user data, system prompts, connectors).
Stage 3: Controls Stack
Goal: Layer defenses: input screening, model safety APIs, output classifiers, tool sandboxing, allowlists for tools and URLs.
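The layering idea can be sketched as a set of independent checks, each of which can fail on its own. This is an illustrative sketch under stated assumptions: the tool names, the host `docs.example.com`, the regex, and every function name are hypothetical, and the regex is only a stand-in for a real input classifier or model safety API.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical allowlists; real values depend on the product's tools and connectors.
ALLOWED_TOOLS = {"search_docs", "get_weather"}
ALLOWED_HOSTS = {"docs.example.com"}

# Toy input screen; a real deployment would call a classifier or safety API here.
INJECTION_PATTERN = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def screen_input(text: str) -> bool:
    """Return True if the text passes the toy input screen."""
    return INJECTION_PATTERN.search(text) is None


def tool_allowed(tool_name: str) -> bool:
    """Return True if the tool is on the allowlist."""
    return tool_name in ALLOWED_TOOLS


def url_allowed(url: str) -> bool:
    """Allow only https URLs whose exact hostname is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS


def check_tool_call(user_text: str, tool_name: str,
                    url: Optional[str] = None) -> List[str]:
    """Run every layer and return the names of the layers that failed."""
    failures = []
    if not screen_input(user_text):
        failures.append("input_screen")
    if not tool_allowed(tool_name):
        failures.append("tool_allowlist")
    if url is not None and not url_allowed(url):
        failures.append("url_allowlist")
    return failures
```

Running all layers instead of stopping at the first failure is a design choice: it records which controls fired, which helps when assigning responsibilities per layer.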
Stage 4: Implementation Patterns
Goal: Structured refusal messages; telemetry on every block; distinguish block vs rewrite vs warn; avoid silent failures.
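A sketch of how these patterns might fit together: one decision type that makes block, rewrite, and warn explicit, a structured refusal message, and a telemetry record for every intervention. The thresholds, field names, the `guardrails` logger name, and the refusal wording are assumptions for illustration, not recommended values.

```python
import json
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional

logger = logging.getLogger("guardrails")  # hypothetical logger name


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"        # deliver the output with a notice attached
    REWRITE = "rewrite"  # deliver a modified output
    BLOCK = "block"      # deliver a refusal instead of the output


@dataclass
class Decision:
    action: Action
    category: Optional[str]
    policy_version: str
    user_message: str  # what the user actually sees


def decide(output: str, category: Optional[str], score: float,
           rewritten: Optional[str] = None,
           policy_version: str = "v1") -> Decision:
    """Map a classifier score to an action. Thresholds are placeholders."""
    if category is None or score < 0.5:
        decision = Decision(Action.ALLOW, category, policy_version, output)
    elif score < 0.8:
        decision = Decision(Action.WARN, category, policy_version,
                            "Note: this topic may be sensitive.\n" + output)
    elif rewritten is not None:
        # A safe rewrite (for example, a redacted version) is available.
        decision = Decision(Action.REWRITE, category, policy_version, rewritten)
    else:
        decision = Decision(Action.BLOCK, category, policy_version,
                            f"I can't help with that request (policy: {category}).")
    # Log every intervention so that none is silent; content itself is not logged.
    if decision.action is not Action.ALLOW:
        logger.info(json.dumps({
            "action": decision.action.value,
            "category": decision.category,
            "policy_version": decision.policy_version,
        }))
    return decision
```

For example, `decide("...", "malware_instructions", 0.95)` returns a `BLOCK` decision whose `user_message` names the policy category, so the user is never left with an unexplained empty response.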
Stage 5: Monitoring & Review
Goal: Sample borderline cases for human review; dashboards on block rates by category; abuse spike alerts.
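The three monitoring pieces above can be sketched with plain Python over logged events. This is a rough sketch: the event shape `(category, blocked, score)`, the borderline band of 0.4 to 0.6, and the "three times the baseline mean" spike rule are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Event = Tuple[str, bool, float]  # (category, blocked, classifier score)


def block_rate_by_category(events: Sequence[Event]) -> Dict[str, float]:
    """Fraction of events blocked, per category (the dashboard metric)."""
    total, blocked = Counter(), Counter()
    for category, was_blocked, _score in events:
        total[category] += 1
        if was_blocked:
            blocked[category] += 1
    return {c: blocked[c] / total[c] for c in total}


def sample_borderline(events: Sequence[Event], low: float = 0.4, high: float = 0.6,
                      k: int = 10, seed: Optional[int] = None) -> List[Event]:
    """Randomly pick up to k events whose score falls in the borderline band."""
    borderline = [e for e in events if low <= e[2] <= high]
    rng = random.Random(seed)
    return rng.sample(borderline, min(k, len(borderline)))


def is_spike(current_blocks: int, baseline_blocks: Sequence[int],
             factor: float = 3.0) -> bool:
    """Flag a spike when the current count exceeds factor times the baseline mean."""
    if not baseline_blocks:
        return False  # no baseline yet, so nothing to compare against
    mean = sum(baseline_blocks) / len(baseline_blocks)
    # Floor the mean at 1 so a near-zero baseline does not alert on tiny counts.
    return current_blocks > factor * max(mean, 1.0)
```

Sampling from the borderline band, not from all traffic, concentrates reviewer time on the cases where the classifier is least certain.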
Stage 6: Iteration & Appeals
Goal: User appeals path where appropriate; version policy changes; measure false positives by locale and use case.
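Measuring false positives by locale and use case can be approximated from reviewed or appealed blocks. In this sketch, `ReviewedBlock` and its fields are hypothetical, and the metric is the share of reviewed blocks that were overturned, which is a proxy for the false-positive rate and not a measurement over all traffic.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class ReviewedBlock(NamedTuple):
    locale: str
    use_case: str
    policy_version: str
    overturned: bool  # True if review or appeal found the block was wrong


Key = Tuple[str, str, str]


def overturn_rates(reviews: Iterable[ReviewedBlock]) -> Dict[Key, float]:
    """Share of reviewed blocks overturned, per (locale, use_case, policy_version)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [overturned, total]
    for r in reviews:
        key = (r.locale, r.use_case, r.policy_version)
        counts[key][0] += int(r.overturned)
        counts[key][1] += 1
    return {key: overturned / total for key, (overturned, total) in counts.items()}
```

Including the policy version in the key lets a policy change be compared against the previous version within the same locale and use case.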
Final Review Checklist
- [ ] Policy categories and owners defined
- [ ] Threat model aligned with product
- [ ] Layered controls with clear responsibilities
- [ ] Telemetry and review for edge cases
- [ ] Appeals and iteration process where applicable
Tips for Effective Guidance
- Defense in depth: no single classifier is sufficient.
- Pair this workflow with content moderation for user-generated content (UGC) and with tool-calling controls for agent safety.
Handling Deviations
- Enterprise internal bots: emphasize data-leak prevention and connector scope over public “safety” categories alone.
Skill name: guard