AI Guardrails (Deep Workflow)

Guardrails turn product and legal policy into enforced behavior: blocking, rewriting, logging, and human review—with attention to false positives and latency.

When to Offer This Workflow

Trigger conditions:

- Launching consumer-facing LLM features
- Jailbreak attempts, policy violations, or PII leakage risks
- Region-specific compliance (minors, regulated advice)

Initial offer:

Use six stages: (1) policy scope, (2) threat model, (3) controls stack, (4) implementation patterns, (5) monitoring & review, (6) iteration & appeals. Confirm the latency budget and jurisdictions before starting.

Stage 1: Policy Scope

Goal: Define prohibited categories (hate, sexual content, violence, self-harm, malware instructions, etc.) and required disclaimers for sensitive domains (medical, legal).

Exit condition: Policy document owned by legal/product; escalation path for gray areas.
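One way to make the exit condition checkable is to keep the policy as data. The sketch below is only an illustration: the `PolicyCategory` fields, the owner and queue names, and the disclaimer wording are all hypothetical placeholders, not part of any real policy.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    DISCLAIM = "disclaim"


@dataclass(frozen=True)
class PolicyCategory:
    name: str
    action: Action
    owner: str            # team accountable for this category
    escalation: str       # where gray-area cases are routed
    disclaimer: str = ""  # required text for sensitive domains


# Hypothetical entries; owners, queues, and wording are placeholders.
POLICY = [
    PolicyCategory("hate", Action.BLOCK, "legal", "policy-review"),
    PolicyCategory("self_harm", Action.BLOCK, "legal", "policy-review"),
    PolicyCategory("malware_instructions", Action.BLOCK, "product", "security-review"),
    PolicyCategory("medical", Action.DISCLAIM, "legal", "policy-review",
                   disclaimer="This is general information, not medical advice."),
]


def validate(policy):
    """Return a list of problems that would fail the Stage 1 exit condition."""
    problems = []
    for c in policy:
        if not c.owner:
            problems.append(f"{c.name}: no owner")
        if not c.escalation:
            problems.append(f"{c.name}: no escalation path")
        if c.action is Action.DISCLAIM and not c.disclaimer:
            problems.append(f"{c.name}: disclaimer text missing")
    return problems
```

An empty list from `validate(POLICY)` suggests every category has an owner and an escalation path; it says nothing about whether the categories themselves are the right ones.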

Stage 2: Threat Model

Goal: Identify attack vectors (prompt injection, data exfiltration, tool abuse) and the assets they target (user data, system prompts, connectors).
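The threat model can be kept as a small register that maps each threat to the assets it touches and the controls meant to cover it. This is a minimal sketch; the `Threat` structure and the specific mappings are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Threat:
    name: str
    assets: list[str]
    controls: list[str] = field(default_factory=list)


# Illustrative mappings only; a real register comes out of the Stage 2 review.
THREATS = [
    Threat("prompt_injection", ["system prompts", "connectors"], ["input screening"]),
    Threat("data_exfiltration", ["user data"], ["output classifier", "URL allowlist"]),
    Threat("tool_abuse", ["connectors"]),  # no control mapped yet
]


def uncovered(threats):
    """Names of threats that have no control assigned."""
    return [t.name for t in threats if not t.controls]


def threats_by_asset(threats):
    """Index threat names by the asset they target."""
    index = {}
    for t in threats:
        for asset in t.assets:
            index.setdefault(asset, []).append(t.name)
    return index
```

Here `uncovered(THREATS)` surfaces `tool_abuse` as a gap to resolve before moving on to the controls stack.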

Stage 3: Controls Stack

Goal: Layer defenses: input screening, model safety APIs, output classifiers, tool sandboxing, allowlists for tools and URLs.
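The layering can be sketched as a chain of independent checks, each able to object on its own. Everything here is an assumption for illustration: the keyword-based `screen_input` stands in for a real classifier or model safety API, and the host and tool names are hypothetical.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.example.com"}      # hypothetical URL allowlist
ALLOWED_TOOLS = {"search", "calculator"}  # hypothetical tool names


def screen_input(text):
    """Placeholder input screen; a real system would call a classifier here."""
    if "ignore previous instructions" in text.lower():
        return "input_screen: possible prompt injection"
    return None


def url_allowed(url):
    """Allow only http(s) URLs whose exact host is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS


def check_tool_call(tool, url=None):
    """Reject tools or URLs that are not explicitly allowlisted."""
    if tool not in ALLOWED_TOOLS:
        return f"tool_allowlist: {tool} not permitted"
    if url is not None and not url_allowed(url):
        return f"url_allowlist: {url} not permitted"
    return None


def run_layers(user_text, tool_calls):
    """Run each layer in order and return the first objection, or None."""
    reason = screen_input(user_text)
    if reason:
        return reason
    for tool, url in tool_calls:
        reason = check_tool_call(tool, url)
        if reason:
            return reason
    return None
```

Matching on the parsed hostname rather than on a substring of the URL is a deliberate choice: a substring check would accept `docs.example.com.evil.net`.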

Stage 4: Implementation Patterns

Goal: Structured refusal messages; telemetry on every block; distinguish block vs rewrite vs warn; avoid silent failures.
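A minimal sketch of these patterns follows: one decision type that separates allow, warn, rewrite, and block, one telemetry event per decision, and a wrapper that blocks visibly if the check itself crashes. The field names, the `guardrails` logger name, and the choice to fail closed are assumptions, not requirements from the policy.

```python
import json
import logging
from dataclasses import dataclass, asdict
from enum import Enum

logger = logging.getLogger("guardrails")


class Action(str, Enum):
    ALLOW = "allow"
    WARN = "warn"
    REWRITE = "rewrite"
    BLOCK = "block"


@dataclass
class Decision:
    action: Action
    category: str
    policy_version: str
    message: str  # user-facing text; empty when nothing needs to be shown


def emit_telemetry(decision, request_id):
    """Log one structured event per decision so no block goes unrecorded."""
    event = {"request_id": request_id, **asdict(decision)}
    logger.info(json.dumps(event))
    return event


def guarded(check, text, request_id, policy_version="v1"):
    """Run a check; if the check itself fails, block loudly instead of passing silently."""
    try:
        decision = check(text)
    except Exception:
        logger.exception("guardrail check failed")
        decision = Decision(Action.BLOCK, "guardrail_error", policy_version,
                            "We could not process this request. Please try again.")
    emit_telemetry(decision, request_id)
    return decision
```

Whether a crashed check should block or allow depends on the product; the point of the wrapper is that either outcome is logged and attributable.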

Stage 5: Monitoring & Review

Goal: Sample borderline cases for human review; dashboards on block rates by category; abuse spike alerts.
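The three monitoring pieces can be sketched over a list of telemetry events. The event shape (`action`, `category`, `score`) and the thresholds below are assumptions chosen for illustration and would need tuning against real traffic.

```python
import random
from collections import Counter


def block_rates(events):
    """Share of all requests blocked, per category."""
    total = len(events)
    if not total:
        return {}
    blocked = Counter(e["category"] for e in events if e["action"] == "block")
    return {cat: n / total for cat, n in blocked.items()}


def spike_alerts(current, baseline, ratio=2.0, min_rate=0.01):
    """Flag categories whose block rate is at least `ratio` times the baseline."""
    alerts = []
    for cat, rate in current.items():
        base = baseline.get(cat, 0.0)
        if rate >= min_rate and rate >= ratio * base:
            alerts.append(cat)
    return sorted(alerts)


def sample_borderline(events, low=0.4, high=0.6, k=50, seed=0):
    """Pick up to k events whose classifier score sits near the decision threshold."""
    borderline = [e for e in events if low <= e.get("score", 0.0) <= high]
    rng = random.Random(seed)
    return rng.sample(borderline, min(k, len(borderline)))
```

The `min_rate` floor keeps very rare categories from triggering alerts on a handful of events; a category absent from the baseline alerts as soon as it crosses that floor.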

Stage 6: Iteration & Appeals

Goal: User appeals path where appropriate; version policy changes; measure false positives by locale and use case.
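One rough proxy for false positives is the share of reviewed blocks that a human reviewer or an appeal overturned, grouped by policy version, locale, and use case. The sketch below assumes a hypothetical review record with `policy_version`, `locale`, `use_case`, and `overturned` fields; it measures overturns among reviewed blocks only, not a true false-positive rate over all traffic.

```python
from collections import defaultdict


def overturn_rates(reviews):
    """Per (policy_version, locale, use_case), the fraction of reviewed blocks overturned."""
    counts = defaultdict(lambda: [0, 0])  # key -> [overturned, reviewed]
    for r in reviews:
        key = (r["policy_version"], r["locale"], r["use_case"])
        counts[key][1] += 1
        if r["overturned"]:
            counts[key][0] += 1
    return {key: overturned / reviewed
            for key, (overturned, reviewed) in counts.items()}
```

Keying on the policy version makes it possible to compare the same locale and use case before and after a policy change.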

Final Review Checklist

- [ ] Policy categories and owners defined
- [ ] Threat model aligned with product
- [ ] Layered controls with clear responsibilities
- [ ] Telemetry and review for edge cases
- [ ] Appeals and iteration process where applicable

Tips for Effective Guidance

- Use defense in depth: no single classifier is sufficient.
- Pair this workflow with content moderation for user-generated content (UGC) and with tool-calling controls for agent safety.

Handling Deviations

- Enterprise internal bots: emphasize data-leak prevention and connector scope over public “safety” categories alone.

Skill name: guard

