AI agents break in production for four recurring reasons, and none of them is the model getting dumber. They run out of context mid-task (the short-term memory fills up and the agent loses the thread), they hallucinate tool calls (invent an action or argument that does not exist), they choke on brittle tool interfaces (a vague tool definition produces a confused, error-prone agent), and they drift in memory (lose the plan and the facts over a long run). Every clean explainer describes the agent loop as if it runs itself. It does not. The loop is easy to draw and hard to keep alive, and the gap between those two things is where most agent projects stall.
This article goes into that gap. We will walk each failure mode, show what it looks like in practice, and point at the fix, using the few real numbers that exist on the subject. If you would rather we do this for you, see how we run generative AI architecture, but everything here is yours to use on your own.
What actually breaks: the loop is not self-running
A quick refresher, because the failures map directly onto the parts. An AI agent is a large language model in a loop: it plans (breaks a goal into steps), acts (calls a tool to read data, run code, or message a system), observes the result from the real environment, and repeats until the goal is met. The model reasons, tools are its hands, and memory makes it stateful, because an LLM on its own is stateless and forgets everything once its context window fills.
Each of those parts has a failure mode:
- The plan can be wrong, or the agent can stop adapting it.
- The tool call can be hallucinated, or aimed at a tool that is badly defined.
- The memory can exhaust (short-term) or drift (long-term).
- The observe step can be skipped, so the agent never learns the last action failed.
These failures surprise people because the explainers stop at the diagram. Anthropic, AWS, and Google describe the loop and never mention who keeps it running. Let us do that part.
Why do agents run out of context mid-task?
This is the most common production failure and the easiest to miss in a demo. The context window is the agent's short-term memory: the live record of the conversation, the plan, and every tool result so far. It is finite. On a long, multi-step task, it fills up. When it does, earlier content gets pushed out, and the agent loses the very steps it needs to finish the job. It forgets the plan it made, the customer detail from step two, the result of the tool it called five steps ago.
The naive fix is to keep stuffing everything into the prompt. That is exactly what causes the overflow. The real fix is active context management: clearing stale content as you go so the window holds what matters.
Anthropic shipped a concrete version of this, called context editing, that automatically clears outdated tool calls and results as the model approaches its token limit. The numbers are the clearest signal in the whole field:
- In a 100-turn web-search evaluation, context editing cut token consumption by 84% and let agents finish workflows that would otherwise have failed from context exhaustion.
- Context editing alone improved task performance by 29% on Anthropic's internal multi-step evaluation.
Read that first number again. An agent that finishes the job and one that dies halfway can be the same model, separated only by whether someone managed its context. That is an engineering decision, made before the agent ever talks to a user.
Why do agents hallucinate tool calls?
Tools are how the agent touches the world: it calls a function to fetch a record, run code, send an email, or query a database. A hallucinated tool call is when the model invents a tool that does not exist, calls a real one with arguments it made up, or formats the call so it silently fails. The agent then carries on as if the action succeeded, and the error compounds down the rest of the run.
This clusters around two causes:
- Vague or overlapping tool definitions. If two tools do similar things, or a tool's description does not make its inputs obvious, the model guesses. Anthropic names this directly: one of its three core design principles is to carefully craft the agent-computer interface (the ACI). A vague, badly documented tool produces a confused, error-prone agent. A precise one produces a reliable agent. Tool design is not a checkbox, it is the work.
- No ground truth from the environment. Anthropic stresses that the agent must gain ground truth from the environment at each step. If a failed tool call returns nothing useful, the agent never learns it failed and keeps building on a fantasy. The observe step has to surface real errors, in a form the model can act on.
The fix is an engineering job: name tools clearly, document every argument, return honest results including failures, and validate inputs at the boundary so a malformed call is caught instead of guessed.
What is memory drift, and why does it sneak up on you?
Context exhaustion is the loud failure. Memory drift is the quiet one. Over a long task, or across sessions, the agent gradually loses the thread: it half-remembers the plan, mixes up two customers, repeats a step it already finished, or forgets a rule it was told an hour ago. Nothing crashes. The output just slowly stops being correct.
Drift happens because an LLM is stateless. It does not inherently remember anything between turns, so each step starts from whatever is currently in the window. Short-term memory (the window) is volatile and finite. Long-term memory is what is supposed to hold the durable stuff:
| Memory type | What it holds | What drifts when it is missing |
|---|---|---|
| Short-term | The live window: current conversation and recent tool results | The agent loses the plan and recent steps as the window fills |
| Long-term: episodic | Specific past events | It forgets what happened in a customer's earlier ticket |
| Long-term: semantic | Structured facts, definitions, and rules | It contradicts your policies or your product catalog |
| Long-term: procedural | Learned skills and step-by-step behaviors | It runs your refund process out of order, or skips a step |
The cure is to stop relying on the prompt as the only memory. Anthropic's memory tool is a file-based system: the model can create, read, update, and delete files in a dedicated memory directory that persists across conversations and lives outside the context window. The agent consults it when needed instead of carrying everything in the window at once. Paired with context editing, the combination is where the headline result comes from:
- Memory tool plus context editing improved agentic-search performance by 39% over baseline on Anthropic's internal multi-step evaluation.
A 39% lift is not a tuning detail. It is the difference between an agent you can trust on real work and one that quietly drifts until someone notices the numbers are wrong.
Why do agents make bad plans, or stop adapting them?
Planning failures come in two shapes. The first is a bad plan up front: the agent decomposes the goal into the wrong steps, or the wrong order. The second is subtler and more dangerous: the agent makes a reasonable plan and then refuses to change it when reality disagrees.
IBM draws a useful line here between planning agents, which anticipate future states and generate a full action plan before they execute, and reactive agents, which respond one step at a time. Pure planners commit to a plan and march off a cliff when the world does not match it. Pure reactors never see two steps ahead. Useful agents blend both: sketch a plan, then adapt it as the observe step feeds back ground truth.
That blend only works if the observe step is real. If the agent never genuinely reads the result of its last action, it cannot adapt, and it runs a stale plan to the bitter end. This is why Anthropic's guidance leans toward simplicity: start with the simplest thing that works, often a fixed workflow rather than a fully autonomous agent, because a predictable workflow you control beats an agent that plans confidently and adapts badly. Reaching for full autonomy when a workflow would do is itself a common cause of failure.
So what does it actually take to keep an agent running?
Put the failures next to their fixes and a pattern appears. Every fix is ongoing engineering, not a one-time setup.
| Failure mode | What it looks like | What keeps it from happening |
|---|---|---|
| Context exhaustion | Agent dies or loops near the end of a long task | Active context management (clear stale tokens as you go) |
| Hallucinated tool calls | Agent invents tools or arguments, carries on as if they worked | Clear tool definitions, validated inputs, honest error feedback |
| Memory drift | Output slowly stops being correct over a long run | A persistent long-term memory outside the window |
| Bad or stale plans | Agent decomposes wrong, or ignores new information | Real observe step, and a workflow when full autonomy is not needed |
| No ground truth | Agent builds on a failed action it never noticed | Feedback from the environment surfaced at every step |
None of this is model magic. The model is the same LLM you use in a chat window. What makes an agent reliable is the wrapper around it: the tool interface, the memory architecture, the context-management strategy, and the evaluation loop that tells you whether the agent is improving or quietly drifting. That wrapper is not write-once. Tools change, data changes, the workload grows, and an agent that worked last month starts failing in new ways. Keeping it healthy is a job, not a deploy.
That is the part most companies cannot staff, and it is exactly the work we do. We plan, build, and run the agents (the tool interfaces, the memory, the context strategy, and the evals) inside your business, so you get a system that holds up in production instead of a pilot that stalled at the demo. You can see the shape of that work on our generative AI architecture service.
If you have an agent that works in the demo and breaks in the real world, or you want to skip that stage entirely, book a free consultation below and we will find the failure mode and fix it with you.