Add a tone-guardrail and policy-enforcement layer to a support agent
Produces a guardrail layer that intercepts a support agent's draft reply, rewrites it on-brand, and blocks policy violations (refunds, promises, unsafe content) before it ever reaches the customer.
You are a senior AI safety engineer who builds guardrail layers that sit between a model's draft and the customer, enforcing tone and policy on every message.
Design a guardrail layer that takes a draft support reply, checks it, and either passes, rewrites, or blocks it — before it is sent.
Context:
- Brand voice rules: [THE CONCRETE RULES — e.g. 'NO EXCLAMATIONS, NO APOLOGY-LOOPS, NO BLAMING THE CUSTOMER']
- Hard policy blocks (must NEVER reach customer): [LIST — e.g. 'UNAUTHORIZED REFUND PROMISES, SLA COMMITMENTS, LEGAL/MEDICAL ADVICE, EXPLETIVES, DISCLOSING INTERNAL SYSTEMS']
- Soft rewrites (allowed but must be rephrased): [e.g. 'OVER-APOLOGIZING, ROBOTIC AS-AI PHRASING, JARGON THE CUSTOMER WILL NOT KNOW']
- Sensitive triggers requiring human review: [THREATS TO SAFETY, CHURN/LEGAL LANGUAGE, VULNERABLE CUSTOMERS]
- Output of upstream agent: [A DRAFT REPLY IN NATURAL LANGUAGE]
Produce:
1. The guardrail's role and operating model, written in second person ('You are…'). It is a gatekeeper: it receives a draft and the customer's original message, returns a decision, and never chats.
2. Decision types — PASS (send as-is), REWRITE (return the corrected version), BLOCK (do not send, route to human with a reason), ESCALATE (passes tone but flags the ticket for human review). Define each precisely.
3. The check sequence — run in order: (a) safety & policy hard-block scan, (b) sensitive-trigger escalation scan, (c) tone & brand-voice check, (d) factual-discipline check (no invented policy or numbers), (e) final pass. Explain why the order matters — safety before tone.
4. Rewrite rules — when it rewrites, it preserves the agent's intent and information but fixes only the violation. It returns the rewritten text and a one-line note on what changed.
5. Output schema — the fixed structure it returns: decision, final_text (if pass/rewrite), reason, route_to (if block/escalate), confidence. No prose beyond the schema.
6. Auditability — every decision logs the trigger, the rule, and the action, so a human can review blocks and rewrites later.
Rules:
- The guardrail never contacts the customer and never auto-resolves a sensitive trigger — those route to a human.
- It does not invent policy. If the draft makes a claim it cannot verify against the rules, BLOCK with 'unverified claim'.
- Tone fixes must preserve meaning; it must not soften a correct no into a misleading maybe.
- Safety and policy checks always win over tone. Never pass a hard-block for the sake of politeness.
Output: the guardrail system prompt, decision definitions, the ordered check sequence, rewrite rules, output schema, and audit-logging spec.
Success signal: the output is good only if the guardrail returns a fixed decision schema, runs safety/policy checks before tone, blocks every unverified policy claim, and routes all sensitive triggers to a human rather than auto-resolving.
Use case
Use when you already have a support agent generating replies and need a safety net that enforces tone and policy on every message.
When to use this
Between reply generation and customer delivery. It is a filter, not the agent itself.
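As a rough illustration of where such a filter sits, the sketch below wires a few of the prompt's checks into an ordered pipeline in Python and returns the fixed decision schema together with an audit record. Everything named here is hypothetical and not part of the original prompt: the regex rules, the `human_review` queue name, the placeholder confidence values, and the assumption that the first check to fire decides the outcome. Step (d), the factual-discipline check, is left out because it needs a real policy source to verify against.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Decision:
    decision: str                      # "PASS" | "REWRITE" | "BLOCK" | "ESCALATE"
    reason: str
    confidence: float                  # placeholder values in this sketch
    final_text: Optional[str] = None   # set only for PASS / REWRITE
    route_to: Optional[str] = None     # set only for BLOCK / ESCALATE


# Hypothetical stand-ins for the bracketed rules in the prompt's Context section.
HARD_BLOCK = re.compile(r"full refund|we guarantee", re.I)
SENSITIVE = re.compile(r"\b(lawyer|sue|unsafe)\b", re.I)

Hit = Tuple[str, Decision]  # (text that triggered the rule, resulting decision)


def hard_block_scan(draft: str, customer_message: str) -> Optional[Hit]:
    m = HARD_BLOCK.search(draft)
    if m is None:
        return None
    return m.group(0), Decision("BLOCK", "unauthorized promise", 0.9, route_to="human_review")


def sensitive_trigger_scan(draft: str, customer_message: str) -> Optional[Hit]:
    m = SENSITIVE.search(customer_message)
    if m is None:
        return None
    return m.group(0), Decision("ESCALATE", "sensitive trigger", 0.9, route_to="human_review")


def tone_check(draft: str, customer_message: str) -> Optional[Hit]:
    if "!" not in draft:
        return None
    fixed = re.sub(r"!+", ".", draft)  # change only the violation, keep the content
    return "!", Decision("REWRITE", "removed exclamation marks", 0.8, final_text=fixed)


# Order matters: safety and policy run before tone.
CHECKS: List[Tuple[str, Callable[[str, str], Optional[Hit]]]] = [
    ("hard_block", hard_block_scan),
    ("sensitive_trigger", sensitive_trigger_scan),
    ("brand_voice", tone_check),
]


def review(draft: str, customer_message: str, audit_log: list) -> Decision:
    """Run the checks in order; the first one that fires decides the outcome."""
    for rule, check in CHECKS:
        hit = check(draft, customer_message)
        if hit is not None:
            trigger, decision = hit
            # Audit record: the trigger, the rule, and the action.
            audit_log.append({"trigger": trigger, "rule": rule, "action": decision.decision})
            return decision
    audit_log.append({"trigger": None, "rule": "final_pass", "action": "PASS"})
    return Decision("PASS", "no rule triggered", 0.8, final_text=draft)


audit_log: list = []
result = review("Great news! We guarantee a full refund!", "Where is my order?", audit_log)
print(result.decision, audit_log[-1])
```

The sample draft breaks both a tone rule and a hard policy rule. Because the hard-block scan runs first, the draft is blocked and routed to a human instead of being tone-fixed and sent. In a real deployment the regexes would likely be replaced by the model-based guardrail this prompt produces; the ordering and the schema are the parts this sketch is meant to show.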
Follow-up prompts
- Build the policy-rules document this layer enforces, versioned and reviewable.
- Write a test suite of 30 risky replies to validate that the guardrail catches them.
- Design the human-review queue for messages the guardrail blocks or rewrites.
Source: promptfork seed
License: CC-BY-4.0
Published: 6/22/2026