Before you let an AI agent act on its own, verify nine guardrails. The first four map every action the agent can take to a risk rating, so the irreversible, sensitive, and costly ones always pause for a person. The next five confirm the environment that contains the agent when a filter is fooled: a unique agent identity, least-privilege permissions, a sandbox, limited network egress, and action-level audit logs. The action map tells you what is safe to automate. The environment is what actually holds the line, because the content filters most checklists lean on are the layer that fails exactly when it matters. Run all nine as yes-or-no questions. If any answer is no or "we are not sure," the agent is not ready to run unsupervised.
This is the pre-launch audit we run before we let an agent we built act on its own inside another company, written so a non-technical owner can run it too, on their own agent or on a vendor's. If you would rather we do this for you, see how we run responsible AI governance and risk. Everything below is yours to use.
How should I use this checklist?
Treat it like a pre-flight check, not a philosophy. Take the agent you are about to deploy (or the one a vendor is selling you) and answer each of the nine points with a hard yes or no. A "mostly" is a no. The goal is to surface the gaps before the agent touches anything real.
The nine split into two groups that do different jobs:
- Points 1 to 4 (the action map). These decide what the agent is allowed to do without asking, and what must stop for a human. This is judgment about risk, and it is the part most guides cover.
- Points 5 to 9 (the environment). These cap what the agent can reach and do at all, regardless of what it was told to do. This is the part most checklists skip, and it is the part that contains an agent when a filter is fooled.
Both groups matter, but they fail differently. The action map is about deciding correctly. The environment is about surviving a wrong decision. You need both, because no filter catches everything: Anthropic's production auto-mode classifier, one of the best there is, still misses around 17% of overeager agent actions even when tuned to almost never block a legitimate command. A 17% miss rate is fine with another layer behind it and reckless when it is the only thing between the agent and your money.
Points 1 to 4: have you mapped every action to a risk rating?
You cannot gate what you have not listed. Start by writing down every action the agent can take, then rate each one. OpenAI's agent guide gives the cleanest rating method: score each action low, medium, or high on four factors.
| Factor | Ask | High-risk signal |
|---|---|---|
| Write access | Does it only read, or does it change something? | It writes, sends, or deletes |
| Reversibility | Can the result be undone? | It cannot be taken back |
| Account permissions | What access does it need to do this? | Admin, financial, or customer-data scope |
| Financial impact | What does it cost if it goes wrong? | Real money, or lost trust |
Guardrail 1: Have you listed every action and rated it low, medium, or high? If there is an action on the list nobody rated, that is the one that hurts you. The rating sorts almost everything into three buckets:
| Risk | Examples | Rule |
|---|---|---|
| Low (read-only, reversible) | Look up an order, summarize a ticket, draft a reply | Let it run. Log it. Review after the fact. |
| Medium (writes, but recoverable) | Update a record, post an internal note, create a draft | Allow within tight limits. Alert a human. |
| High (irreversible, sensitive, costly) | Refund or payment, delete records, grant access, send an external message | Require human approval before it runs. Always. |
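The inventory and rating step can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the action names are hypothetical, and taking the highest of the four factor scores as the overall rating is our assumption (a worst-case rule), not something the OpenAI guide prescribes.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = {"low": 0, "medium": 1, "high": 2}

# The rule attached to each risk bucket, as in the table above.
RULES = {
    "low": "run, log, review after the fact",
    "medium": "allow within tight limits, alert a human",
    "high": "require human approval before it runs",
}


@dataclass(frozen=True)
class ActionRating:
    write_access: str
    reversibility: str
    account_permissions: str
    financial_impact: str

    def overall(self) -> str:
        # Assumption: the overall rating is the worst of the four factors.
        factors = [
            self.write_access,
            self.reversibility,
            self.account_permissions,
            self.financial_impact,
        ]
        return max(factors, key=lambda level: LEVELS[level])


# Hypothetical inventory. None marks an action nobody has rated yet.
INVENTORY: dict[str, Optional[ActionRating]] = {
    "lookup_order": ActionRating("low", "low", "low", "low"),
    "update_record": ActionRating("medium", "medium", "low", "low"),
    "issue_refund": ActionRating("high", "high", "medium", "high"),
    "export_report": None,
}


def unrated(inventory: dict[str, Optional[ActionRating]]) -> list[str]:
    """Return the actions that are listed but not rated."""
    return [name for name, rating in inventory.items() if rating is None]


def rule_for(name: str, inventory: dict[str, Optional[ActionRating]] = INVENTORY) -> str:
    """Look up the handling rule for an action; unrated actions stay blocked."""
    rating = inventory[name]
    if rating is None:
        return "blocked until rated"
    return RULES[rating.overall()]
```

Guardrail 1 passes only when `unrated(INVENTORY)` comes back empty. Blocking unrated actions by default is a design choice in this sketch, so that a forgotten action fails closed rather than open.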
Guardrail 2: Does every high-risk action pause for explicit human approval before it runs? This is the single rule that prevents the worst outcomes. OpenAI's guide gives these as examples of actions that warrant human sign-off: canceling orders, authorizing large refunds, and making payments. Add deletions, access grants, and any message that leaves the building. A refund agent can read every order it likes, but it must never move money above a small threshold without a person clicking approve.
Guardrail 3: Is the approval gate rare enough that people still read it? The trap is making the agent ask about everything. Do that, and you have built a worse system, not a safer one. Anthropic measured that users approve roughly 93% of permission prompts, so a gate that fires forty times a day is theater: the human is nominally in the loop but has stopped looking. Verify three things: the agent prompts only for genuinely risky actions, the prompt states the consequence in plain language ("Refund $1,000 to account X"), and silence defaults to "no," never "yes."
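A minimal sketch of such a gate, assuming a hypothetical `ask_human` callback that returns `True`, `False`, or `None` when nobody answers in time. The function names and the prompt format are illustrative, not taken from any particular framework.

```python
from typing import Callable, Optional


def describe(action: str, amount: float, target: str) -> str:
    """State the consequence in plain language for the approver."""
    return f"{action.capitalize()} ${amount:,.2f} to account {target}"


def gate(risk: str, prompt: str, ask_human: Callable[[str], Optional[bool]]) -> bool:
    """Return True only if the action may run.

    Only high-risk actions prompt a human, which keeps prompts rare.
    Anything other than an explicit yes (including silence) is a no.
    """
    if risk != "high":
        return True
    answer = ask_human(prompt)  # None means no one responded in time
    return answer is True
```

For example, `gate("high", describe("refund", 1000, "X"), ask_human)` shows the approver "Refund $1,000.00 to account X" and runs the refund only on an explicit yes. Medium-risk actions pass this gate; the alerting and limits they need would live elsewhere.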
Guardrail 4: Does the agent stop and ask for help when it keeps failing? A confused agent looping on retries is its own kind of risk. Set a failure threshold, so that after a set number of failed attempts the agent halts and escalates instead of thrashing. OpenAI names two primary triggers for human intervention: exceeding failure thresholds, and high-risk actions. You have just covered both.
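A failure threshold can be as simple as a counter around the step that keeps failing. The sketch below assumes a hypothetical `step` callable and a default of three attempts; the right number depends on the task.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class EscalationRequired(Exception):
    """Raised when the agent should halt and hand off to a human."""


def run_with_threshold(step: Callable[[], T], max_failures: int = 3) -> T:
    """Retry a step, but halt and escalate once it has failed max_failures times."""
    failures = 0
    while True:
        try:
            return step()
        except Exception as exc:
            failures += 1
            if failures >= max_failures:
                raise EscalationRequired(
                    f"halted after {failures} failed attempts: {exc}"
                ) from exc
```

The caller catches `EscalationRequired` and routes the task to a person instead of letting the agent try a fourth time.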
Points 5 to 9: have you verified the environment that contains it?
Here is the part the generic checklists skip, and it is the part that does the heavy lifting. Points 1 to 4 assume the agent decides correctly. Points 5 to 9 assume it sometimes will not, and cap the damage when that happens.
The reason this group matters is the most under-covered fact in agent safety: content filters fail precisely when the harmful instruction looks legitimate. Anthropic ran a test where an employee was phished, so the malicious instruction arrived from the trusted user the agent was built to serve. Across 25 retries the agent completed credential theft 24 times, because, in their words, when the user types the instruction there is nothing anomalous for a classifier to catch. The only thing that reliably stopped it was environmental: blocking the agent's network egress so the stolen data had nowhere to go. Filters guess at intent. The environment removes capability. Capability is what you can actually control.
Guardrail 5: Does the agent have its own identity, not a shared admin key? Every agent should run under a unique identity, never a shared credential and never a human's admin login. A shared key means you cannot tell which agent did what, and a compromise spreads everywhere that key reaches. A unique identity is also what makes the audit log in guardrail 9 meaningful.
Guardrail 6: Is that identity scoped to least privilege? The agent gets the narrowest set of permissions its job actually needs, and nothing more. A support agent that issues refunds should not also be able to export your customer database or change payroll. If it is tricked, the damage is bounded by what it was granted, not by whether a filter caught the trick. Use the least-powerful access that works: read-only beats read-write, and read-write-without-delete beats full access.
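Guardrails 5 and 6 together amount to one identity per agent with an explicit, narrow list of scopes. The sketch below uses made-up scope strings and an invented agent ID; in a real deployment the check would be enforced by your identity provider or API gateway, not by the agent's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str          # unique per agent, never a shared or human login
    scopes: frozenset      # the only things this agent may do


def is_allowed(identity: AgentIdentity, scope: str) -> bool:
    """Deny by default: a scope must be granted explicitly."""
    return scope in identity.scopes


# Hypothetical support agent: it can read orders and create refunds, nothing else.
support_agent = AgentIdentity(
    agent_id="support-refund-agent-01",
    scopes=frozenset({"orders:read", "refunds:create"}),
)
```

Here `is_allowed(support_agent, "customers:export")` is false no matter what the agent was told to do, which is the point: the bound comes from the grant, not from a filter.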
Guardrail 7: Does the agent run in a sandbox? The agent should operate inside an isolated environment built on established, battle-tested isolation (the same containers and sandboxes used to run untrusted code), not something hand-rolled. As Anthropic notes, those primitives have survived far more adversarial attention than anything you would build yourself. Match the containment to the user too: a developer who can read and run code and a support rep who cannot are not the same threat model, and the more powerful the tools, the tighter the box.
Guardrail 8: Is the agent's network egress limited and scoped by capability? This is the control that stopped the phishing attack above. But a simple "allowed domains" list is not enough on its own. Anthropic learned the hard way that attackers exfiltrated files through an allowed domain by routing them to their own account on it, so think of egress rules as capability grants (what the agent is allowed to do) rather than just a list of addresses it may reach.
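One way to picture the difference between an address list and a capability grant: each rule names not just a host but the method and the specific account path the agent may use. The hosts, paths, and the `acme` account below are hypothetical, and real enforcement belongs in an egress proxy or firewall outside the agent, so treat this as a sketch of the rule shape only.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class EgressGrant:
    host: str
    method: str
    path_prefix: str


# Hypothetical grants: read orders, and write notes only to our own account.
GRANTS = (
    EgressGrant("api.example.com", "GET", "/v1/orders/"),
    EgressGrant("api.example.com", "POST", "/v1/accounts/acme/"),
)


def egress_allowed(method: str, url: str, grants: tuple = GRANTS) -> bool:
    """Allow a request only if host, method, and path all match one grant."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    if ".." in parts.path.split("/"):
        return False  # reject simple path traversal out of the granted prefix
    return any(
        parts.hostname == grant.host
        and method.upper() == grant.method
        and parts.path.startswith(grant.path_prefix)
        for grant in grants
    )
```

A plain domain list would allow a `POST` to `/v1/accounts/attacker/notes` on the same host. The grant above does not, because the attacker's account is not part of what was granted.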
Guardrail 9: Is every action logged at the action level for audit? You cannot govern what you cannot see. Every action the agent takes, especially the high-risk ones, should be recorded with enough detail to reconstruct what happened, who or what triggered it, and what it touched. Action-level logs are how you catch a slow problem before it becomes a headline, and how you tighten the other eight guardrails over time.
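An action-level record needs to answer three questions: what happened, who or what triggered it, and what it touched. A minimal sketch, with field names that are our own choice rather than any logging standard:

```python
import json
from datetime import datetime, timezone


def audit_record(
    agent_id: str,
    action: str,
    risk: str,
    triggered_by: str,
    target: str,
    outcome: str,
) -> str:
    """Build one JSON line describing a single agent action."""
    return json.dumps(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent_id": agent_id,          # the unique identity from guardrail 5
            "action": action,              # what happened
            "risk": risk,
            "triggered_by": triggered_by,  # who or what triggered it
            "target": target,              # what it touched
            "outcome": outcome,
        },
        sort_keys=True,
    )
```

Writing one such line per action to append-only storage that the agent itself cannot modify is what makes the log usable for reconstruction later.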
What does the finished checklist look like?
Here are the nine in one place, each phrased as a yes or no you can verify. A no is a gap to close before launch, not a footnote.
| # | Guardrail | You can launch when |
|---|---|---|
| 1 | Action inventory and risk rating | Every action is listed and rated low, medium, or high |
| 2 | Human gate on high-risk actions | Every irreversible or costly action pauses for approval |
| 3 | Rare, plain-language approvals | The gate fires only on real risk and defaults to "no" |
| 4 | Failure-threshold escalation | The agent halts and asks for help after repeated failures |
| 5 | Unique agent identity | The agent has its own identity, no shared admin keys |
| 6 | Least-privilege permissions | It can reach only what its job needs |
| 7 | Sandbox | It runs in established, isolated infrastructure |
| 8 | Limited, capability-scoped egress | Network access is fenced by capability, not just by an allowed-domains list |
| 9 | Action-level audit logs | Every action is recorded and reviewable |
Notice the shape. The first four are decisions you make about risk; the last five are controls you build into the environment. The first four can be fooled. The last five cannot be talked out of doing their job, which is why they are non-negotiable even when the model is excellent.
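The launch rule itself (any no, "mostly," or "we are not sure" blocks launch) is simple enough to write down. The guardrail keys below are shorthand we made up for the nine rows of the table.

```python
GUARDRAILS = [
    "action_inventory_rated",
    "human_gate_on_high_risk",
    "rare_plain_language_approvals",
    "failure_threshold_escalation",
    "unique_agent_identity",
    "least_privilege_permissions",
    "sandbox",
    "capability_scoped_egress",
    "action_level_audit_logs",
]


def ready_to_launch(answers: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (ready, gaps). Only an explicit "yes" counts; a missing answer is a gap."""
    gaps = [
        name
        for name in GUARDRAILS
        if answers.get(name, "not sure").strip().lower() != "yes"
    ]
    return (not gaps, gaps)
```

Calling `ready_to_launch` with eight answers of "yes" and one of "mostly" returns `False` along with the name of the guardrail still to close.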
How do I run this audit against a vendor's agent?
The same nine questions work just as well on someone else's agent, and they are the fastest way to tell a contained product from a confident demo. A demo proves the agent works on a good day. The checklist proves what happens on a bad one.
Ask the vendor to answer these in plain language:
- Which of the agent's actions require human approval before they run? A vendor who cannot name them has not rated them.
- Does the agent get its own identity with least-privilege access, or does it use a shared key into my systems? The second answer is a red flag.
- Is it sandboxed, and what can it reach on the network? "It can reach the internet" is not an answer.
- Is every action logged, and can I see those logs? If you cannot audit it, you cannot govern it.
- What is the blast radius if the agent is tricked? The honest answer is a list of what it can touch, not a promise that it never will be.
If the answers are vague, or rest entirely on "the model is well-behaved," the agent is not contained, no matter how good the demo looked. A good vendor will have answers ready, because these are the same questions they should have asked themselves.
Why does this matter now?
Because the gap between teams that pass this audit and teams that skip it is about to show up in the numbers. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, with inadequate risk controls named among the causes, and predicts that by 2028, 25% of enterprise generative-AI applications will suffer at least five minor security incidents a year, up from 9% in 2025. Guardrails are moving from a config setting to a budgeted control plane, and the audit above is how you stay on the right side of that shift.
The encouraging part is that none of this is exotic. Map every action to a risk rating and gate the irreversible. Give the agent its own identity with the least access it needs, in a sandbox, with the network fenced and every action logged. Keep the human gate rare enough that people still read it. Do that, and an attacker who slips past your filter still runs into a wall of capability they were never granted.
If you want these nine guardrails built, verified, and run for you before your first agent goes live, that is exactly the work we do inside other companies' stacks. Book a free consultation below and we will run this checklist against your agent together.