Trust & Safety · June 8, 2026 · 9 min read

The AI Agent Safety Checklist: 9 Guardrails to Verify Before You Let an Agent Act on Its Own

A 9-point pre-launch audit you can run against your own AI agent or a vendor's: rate every action by risk, then verify the environmental controls that contain it.

Key Facts

Before you let an AI agent act on its own, verify nine guardrails. First, map every action it can take to a risk rating (read-only versus write access, reversibility, account permissions, financial impact) so the irreversible, sensitive, and costly ones always pause for a human. Then confirm the five environmental controls generic checklists skip: a unique agent identity, least-privilege permissions, a sandbox, limited network egress, and action-level audit logs. The action map decides what is safe to automate; the environment is what actually contains the agent when a filter is fooled, which it will be, or when an approval is waved through, which is common: users rubber-stamp roughly 93% of approval prompts.

Mahmoud Zalt

Founder & AI Strategist · Sistava

Before you let an AI agent act on its own, verify nine guardrails. The first four map every action the agent can take to a risk rating, so the irreversible, sensitive, and costly ones always pause for a person. The next five confirm the environment that contains the agent when a filter is fooled: a unique agent identity, least-privilege permissions, a sandbox, limited network egress, and action-level audit logs. The action map tells you what is safe to automate. The environment is what actually holds the line, because the content filters most checklists lean on are the layer that fails exactly when it matters. Run all nine as yes or no questions, and if any answer is no or "we are not sure," the agent is not ready to run unsupervised.

This is the pre-launch audit we run before we let an agent we built act on its own inside another company, written so a non-technical owner can run it too, on their own agent or on a vendor's. If you would rather we do this for you, see how we run responsible AI governance and risk. Everything below is yours to use.

How should I use this checklist?

Treat it like a pre-flight check, not a philosophy. Take the agent you are about to deploy (or the one a vendor is selling you) and answer each of the nine points with a hard yes or no. A "mostly" is a no. The goal is to surface the gaps before the agent touches anything real.

The nine split into two groups that do different jobs:

Points 1 to 4 (the action map). These decide what the agent is allowed to do without asking, and what must stop for a human. This is judgment about risk, and it is the part most guides cover.
Points 5 to 9 (the environment). These cap what the agent can reach and do at all, regardless of what it was told to do. This is the part most checklists skip, and it is the part that contains an agent when a filter is fooled.

Both groups matter, but they fail differently. The action map is about deciding correctly. The environment is about surviving a wrong decision. You need both, because no filter catches everything: Anthropic's production auto-mode classifier, one of the best there is, still misses around 17% of overeager agent actions even when tuned to almost never block a legitimate command. A 17% miss rate is fine with another layer behind it and reckless when it is the only thing between the agent and your money.

Points 1 to 4: have you mapped every action to a risk rating?

You cannot gate what you have not listed. Start by writing down every action the agent can take, then rate each one. OpenAI's agent guide gives the cleanest rating method: score each action low, medium, or high on four factors.

Factor	Ask	High-risk signal
Write access	Does it only read, or does it change something?	It writes, sends, or deletes
Reversibility	Can the result be undone?	It cannot be taken back
Account permissions	What access does it need to do this?	Admin, financial, or customer-data scope
Financial impact	What does it cost if it goes wrong?	Real money, or lost trust

Guardrail 1: Have you listed every action and rated it low, medium, or high? If there is an action on the list nobody rated, that is the one that hurts you. The rating sorts almost everything into three buckets:

Risk	Examples	Rule
Low (read-only, reversible)	Look up an order, summarize a ticket, draft a reply	Let it run. Log it. Review after the fact.
Medium (writes, but recoverable)	Update a record, post an internal note, create a draft	Allow within tight limits. Alert a human.
High (irreversible, sensitive, costly)	Refund or payment, delete records, grant access, send an external message	Require human approval before it runs. Always.
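
As a rough illustration of guardrail 1 and the three buckets above, the inventory can live in a small table in code. This is a minimal Python sketch, not OpenAI's method: the `ActionRating` class, the `rule_for` and `unrated` helpers, and the choice to let the riskiest single factor set the overall rating are all assumptions made for the example.

```python
from dataclasses import dataclass

LEVELS = {"low": 0, "medium": 1, "high": 2}

# Rules copied from the three risk buckets in the table above.
RULES = {
    "low": "let it run, log it, review after the fact",
    "medium": "allow within tight limits, alert a human",
    "high": "require human approval before it runs",
}


@dataclass(frozen=True)
class ActionRating:
    """One row of the action inventory; each factor is low, medium, or high."""
    name: str
    write_access: str
    reversibility: str
    account_permissions: str
    financial_impact: str

    def overall(self) -> str:
        # Assumption: the riskiest single factor sets the overall rating.
        factors = (self.write_access, self.reversibility,
                   self.account_permissions, self.financial_impact)
        return max(factors, key=LEVELS.__getitem__)


def rule_for(action: ActionRating) -> str:
    """Look up the handling rule for an action's overall rating."""
    return RULES[action.overall()]


def unrated(all_actions: list[str], rated: list[ActionRating]) -> list[str]:
    """Return actions the agent can take that nobody has rated yet."""
    rated_names = {r.name for r in rated}
    return [a for a in all_actions if a not in rated_names]
```

Running `unrated` against the full list of things the agent can do is the useful part: any name it returns is an action on the list nobody rated.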

Guardrail 2: Does every high-risk action pause for explicit human approval before it runs? This is the single rule that prevents the worst outcomes. OpenAI names exactly these as the actions that warrant human sign-off: canceling orders, authorizing large refunds, and making payments. Add deletions, access grants, and any message that leaves the building. A refund agent can read every order it likes, but it must never move money above a small threshold without a person clicking approve.
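
One way to picture the gate in guardrail 2 is as a check that sits in front of every action and refuses to run a high-risk one without a recorded approval. The sketch below assumes hypothetical action names, a hypothetical `REFUND_AUTO_LIMIT`, and a hypothetical `ApprovalRequired` exception; a real system would pull these from its own action inventory.

```python
from typing import Callable

# Hypothetical action names, taken from the high-risk examples above.
ALWAYS_GATED = {"payment", "cancel_order", "delete_records",
                "grant_access", "send_external_message"}

# Hypothetical "small threshold" below which a refund may run unattended.
REFUND_AUTO_LIMIT = 25.00


class ApprovalRequired(Exception):
    """Raised when an action must pause for a person before it runs."""


def needs_approval(action: str, amount: float = 0.0) -> bool:
    if action == "refund":
        return amount > REFUND_AUTO_LIMIT
    return action in ALWAYS_GATED


def execute(action: str, run: Callable[[], str],
            approved: bool = False, amount: float = 0.0) -> str:
    """Run the action only if it is ungated or a person has approved it."""
    if needs_approval(action, amount) and not approved:
        raise ApprovalRequired(f"{action} is waiting for human approval")
    return run()
```

The point of raising instead of returning a warning is that the action body never executes until someone has said yes.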

Guardrail 3: Is the approval gate rare enough that people still read it? The trap is making the agent ask about everything. Do that, and you have built a worse system, not a safer one. Anthropic measured that users approve roughly 93% of permission prompts, so a gate that fires forty times a day is theater: the human is nominally in the loop but has stopped looking. Verify three things: the agent prompts only for genuinely risky actions, the prompt states the consequence in plain language ("Refund $1,000 to account X"), and silence defaults to "no," never "yes."
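
Two of the three checks in guardrail 3 are easy to express in code: the prompt states the consequence in plain language, and anything other than an explicit approval counts as "no." A minimal sketch, assuming hypothetical `describe` and `decide` helpers and the literal reply "approve" as the only accepted yes:

```python
from typing import Optional


def describe(action: str, amount: float, target: str) -> str:
    """State the consequence in plain language for the approver."""
    return f"{action.capitalize()} ${amount:,.0f} to account {target}"


def decide(response: Optional[str]) -> bool:
    """Default to deny: None models silence or a timed-out prompt."""
    return response is not None and response.strip().lower() == "approve"
```

For example, `describe("refund", 1000, "X")` produces the prompt text quoted above, and `decide(None)` is a denial.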

Guardrail 4: Does the agent stop and ask for help when it keeps failing? A confused agent looping on retries is its own kind of risk. Set a failure threshold: after a fixed number of failed attempts, the agent halts and escalates instead of thrashing. OpenAI lists exactly two triggers for human intervention: exceeding failure thresholds, and high-risk actions. You have just covered both.
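
The failure threshold in guardrail 4 can be sketched as a retry wrapper that counts failures and hands off once the limit is hit. The `Escalation` exception, the function name, and the default of three attempts are assumptions for illustration; pick the number that suits the task.

```python
from typing import Callable


class Escalation(Exception):
    """Raised when the agent should stop and hand off to a person."""


def run_with_failure_threshold(step: Callable[[], str],
                               max_failures: int = 3) -> str:
    """Retry a step, but halt and escalate after max_failures failures."""
    failures = 0
    while True:
        try:
            return step()
        except Exception as exc:
            failures += 1
            if failures >= max_failures:
                # Stop thrashing: surface the last error to a human.
                raise Escalation(
                    f"halted after {failures} failed attempts: {exc}"
                ) from exc
```

The wrapper never retries past the limit, so a confused agent stops after a bounded number of attempts instead of looping.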

Points 5 to 9: have you verified the environment that contains it?

Here is the part the generic checklists skip, and it is the part that does the heavy lifting. Points 1 to 4 assume the agent decides correctly. Points 5 to 9 assume it sometimes will not, and cap the damage when that happens.

The reason this group matters is the most under-covered fact in agent safety: content filters fail precisely when the harmful instruction looks legitimate. Anthropic ran a test where an employee was phished, so the malicious instruction arrived from the trusted user the agent was built to serve. Across 25 retries the agent completed credential theft 24 times, because, in their words, when the user types the instruction there is nothing anomalous for a classifier to catch. The only thing that reliably stopped it was environmental: blocking the agent's network egress so the stolen data had nowhere to go. Filters guess at intent. The environment removes capability. Capability is what you can actually control.

Guardrail 5: Does the agent have its own identity, not a shared admin key? Every agent should run under a unique identity, never a shared credential and never a human's admin login. A shared key means you cannot tell which agent did what, and a compromise spreads everywhere that key reaches. A unique identity is also what makes the audit log in guardrail 9 meaningful.

Guardrail 6: Is that identity scoped to least privilege? The agent gets the narrowest set of permissions its job actually needs, and nothing more. A support agent that issues refunds should not also be able to export your customer database or change payroll. If it is tricked, the damage is bounded by what it was granted, not by whether a filter caught the trick. Use the least-powerful access that works: read-only beats read-write, and read-write-without-delete beats full access.
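
Guardrails 5 and 6 together amount to a named identity carrying an explicit, deny-by-default set of scopes. The sketch below is illustrative only: the `AgentIdentity` class, the agent ID, and the scope strings such as `refunds:create` are hypothetical, and a real deployment would enforce this in its identity provider or API gateway, not in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    """A unique identity per agent, with only the scopes its job needs."""
    agent_id: str
    scopes: frozenset[str]


def is_allowed(identity: AgentIdentity, scope: str) -> bool:
    # Deny by default: anything not explicitly granted is refused.
    return scope in identity.scopes


# Hypothetical support agent: may read orders and create refunds, nothing else.
support_agent = AgentIdentity(
    agent_id="support-refunds-01",
    scopes=frozenset({"orders:read", "refunds:create"}),
)
```

Here `is_allowed(support_agent, "customers:export")` is false no matter what the agent was told, which is the bound on damage the paragraph above describes.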

Guardrail 7: Does the agent run in a sandbox? The agent should operate inside an isolated environment built on established, battle-tested isolation (the same containers and sandboxes used to run untrusted code), not something hand-rolled. As Anthropic notes, those primitives have survived far more adversarial attention than anything you would build yourself. Match the containment to the user too: a developer who can read and run code and a support rep who cannot are not the same threat model, and the more powerful the tools, the tighter the box.

Guardrail 8: Is the agent's network egress limited and scoped by capability? This is the control that stopped the phishing attack above. But a simple "allowed domains" list is not enough on its own. Anthropic learned the hard way that attackers exfiltrated files through an allowed domain by routing them to their own account on it, so think of egress rules as capability grants (what the agent is allowed to do) rather than just a list of addresses it may reach.
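
The difference between a domain list and a capability grant is easier to see side by side. In this sketch the `EgressGrant` fields (host, method, account), the example hostnames, and the account names are all assumptions chosen for illustration; they are not Anthropic's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressGrant:
    """What the agent may do, not just where it may connect."""
    host: str
    method: str
    account: str  # the account on that host the agent may act for


# Hypothetical grant: upload only to the company's own account.
GRANTS = frozenset({
    EgressGrant("files.example.com", "PUT", "acme-corp"),
})


def domain_allowed(host: str) -> bool:
    """The weaker check: is the host on the allowed list at all?"""
    return host in {g.host for g in GRANTS}


def egress_allowed(host: str, method: str, account: str) -> bool:
    """The stronger check: is this exact capability granted?"""
    return EgressGrant(host, method, account) in GRANTS
```

An upload to an attacker's account on `files.example.com` passes `domain_allowed` but fails `egress_allowed`, which is the gap a plain allowed-domains list leaves open.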

Guardrail 9: Is every action logged at the action level for audit? You cannot govern what you cannot see. Every action the agent takes, especially the high-risk ones, should be recorded with enough detail to reconstruct what happened, who or what triggered it, and what it touched. Action-level logs are how you catch a slow problem before it becomes a headline, and how you tighten the other eight guardrails over time.
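
A record with "enough detail to reconstruct what happened" might look like the following. The field names and the one-JSON-object-per-line format are assumptions made for this sketch; the only requirement taken from the text is that each entry captures the action, who or what triggered it, and what it touched.

```python
import json
from datetime import datetime, timezone


def audit_record(agent_id: str, action: str, triggered_by: str,
                 target: str, risk: str, outcome: str) -> str:
    """Build one action-level log entry as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,          # the unique identity from guardrail 5
        "action": action,              # what the agent did
        "triggered_by": triggered_by,  # who or what caused it
        "target": target,              # what it touched
        "risk": risk,                  # the rating from the action map
        "outcome": outcome,            # e.g. completed, denied, escalated
    }
    return json.dumps(record, sort_keys=True)
```

Because the entry carries the agent's own ID, a shared credential would make this log far less useful, which is the dependency on guardrail 5 noted earlier.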

What does the finished checklist look like?

Here are the nine in one place, each phrased as a yes or no you can verify. A no is a gap to close before launch, not a footnote.

#	Guardrail	You can launch when
1	Action inventory and risk rating	Every action is listed and rated low, medium, or high
2	Human gate on high-risk actions	Every irreversible or costly action pauses for approval
3	Rare, plain-language approvals	The gate fires only on real risk and defaults to "no"
4	Failure-threshold escalation	The agent halts and asks for help after repeated failures
5	Unique agent identity	The agent has its own identity, no shared admin keys
6	Least-privilege permissions	It can reach only what its job needs
7	Sandbox	It runs in established, isolated infrastructure
8	Limited, capability-scoped egress	Network access is fenced, not an open allowed-domains list
9	Action-level audit logs	Every action is recorded and reviewable
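
If you want to track the audit in a script, the table above reduces to nine strict yes or no answers. This is a minimal sketch with hypothetical function names; it treats anything other than a literal "yes" (including "mostly," "not sure," or a missing answer) as a gap.

```python
# The nine guardrails, named as in the table above.
GUARDRAILS = [
    "Action inventory and risk rating",
    "Human gate on high-risk actions",
    "Rare, plain-language approvals",
    "Failure-threshold escalation",
    "Unique agent identity",
    "Least-privilege permissions",
    "Sandbox",
    "Limited, capability-scoped egress",
    "Action-level audit logs",
]


def gaps(answers: dict[str, str]) -> list[str]:
    """Return every guardrail whose answer is not a clear 'yes'."""
    return [g for g in GUARDRAILS
            if answers.get(g, "").strip().lower() != "yes"]


def ready_to_launch(answers: dict[str, str]) -> bool:
    """The agent is ready only when there are no gaps at all."""
    return not gaps(answers)
```

Answering "mostly" for the sandbox, for instance, makes `gaps` return that one guardrail and `ready_to_launch` return false.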

Notice the shape. The first four are decisions you make about risk; the last five are controls you build into the environment. The first four can be fooled. The last five cannot be talked out of doing their job, which is why they are non-negotiable even when the model is excellent.

How do I run this audit against a vendor's agent?

The same nine questions work just as well on someone else's agent, and they are the fastest way to tell a contained product from a confident demo. A demo proves the agent works on a good day. The checklist proves what happens on a bad one.

Ask the vendor to answer these in plain language:

Which of the agent's actions require human approval before they run? A vendor who cannot name them has not rated them.
Does the agent get its own identity with least-privilege access, or does it use a shared key into my systems? The second answer is a red flag.
Is it sandboxed, and what can it reach on the network? "It can reach the internet" is not an answer.
Is every action logged, and can I see those logs? If you cannot audit it, you cannot govern it.
What is the blast radius if the agent is tricked? The honest answer is a list of what it can touch, not a promise that it never will be.

If the answers are vague, or rest entirely on "the model is well-behaved," the agent is not contained, no matter how good the demo looked. A good vendor will have answers ready, because these are the same questions they should have asked themselves.

Why does this matter now?

Because the gap between teams that pass this audit and teams that skip it is about to show up in the numbers. Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, with inadequate risk controls named among the causes, and predicts that by 2028, 25% of enterprise generative-AI applications will suffer at least five minor security incidents a year, up from 9% in 2025. Guardrails are moving from a config setting to a budgeted control plane, and the audit above is how you stay on the right side of that shift.

The encouraging part is that none of this is exotic. Map every action to a risk rating and gate the irreversible. Give the agent its own identity with the least access it needs, in a sandbox, with the network fenced and every action logged. Keep the human gate rare enough that people still read it. Do that, and an attacker who slips past your filter still runs into a wall of capability they were never granted.

If you want these nine guardrails built, verified, and run for you before your first agent goes live, that is exactly the work we do inside other companies' stacks. Book a free consultation below and we will run this checklist against your agent together.

Want this built for you?

We plan, build, and run the AI agents inside your business, with the nine guardrails on this checklist wired in and verified before launch. Book a free consultation.

Book your free consultation

Frequently Asked Questions

01. What should I check before letting an AI agent act on its own?

Run a nine-point audit. Four points map every action the agent can take to a risk rating and gate the irreversible and high-stakes ones behind human approval. Five points verify the environment that contains it: a unique agent identity, least-privilege permissions, a sandbox, limited network egress, and action-level audit logs. If you cannot answer all nine with a clear yes, the agent is not ready to run unsupervised.

02. How do I rate the risk of an AI agent's actions?

Rate each action low, medium, or high using four factors from OpenAI's agent guide: read-only versus write access, whether it can be undone, what account permissions it needs, and what it costs if it goes wrong. Read-only and reversible actions are low risk and can run freely with logging. Anything irreversible, sensitive, or costly is high risk and must pause for explicit human approval before it runs.

03. What guardrails do generic AI safety checklists leave out?

Most checklists list input filters, output validation, and "ask a human for risky actions," then stop. They skip the environmental controls that actually contain a tricked agent: a unique agent identity (no shared admin keys), least-privilege permissions, a sandbox, limited and capability-scoped network egress, and action-level audit logs. Filters judge intent and can be fooled; these controls cap capability and hold even when a filter fails.

04. How do I audit a vendor's AI agent before trusting it?

Ask the vendor to answer the same nine points you would check on your own agent. Which actions require human approval before they run? Does the agent have its own identity with least-privilege access or a shared admin key? Is it sandboxed with limited network egress? Is every action logged for audit? What is the blast radius if it is tricked? If the answers are vague or rest on "the model is well-behaved," the agent is not contained.

05. Why is a human approval gate not enough on its own?

Because people stop reading prompts they see too often. Anthropic found users approve roughly 93% of permission prompts, so an agent that asks about everything is not actually supervised. A gate works only when it fires rarely, on genuinely irreversible or high-stakes actions, and is backed by least-privilege access and a sandbox so a tricked agent still cannot reach what it was never granted.

