To embed AI into the product you ship, start from a specific painful task, not a model, and reach for the simplest mechanism that works: a single well-prompted LLM call with retrieval before a workflow, and a workflow (predefined code paths) before a true agent (the model directs its own steps). Use AI only where deterministic code falls short, build the prototype on the most capable model to set a quality baseline and then trade down for cost, and ground every feature in evaluations plus layered guardrails sized to the cost of being wrong. That is the whole playbook, and the rest of this article is the detail behind it. Most teams stall not because the technology is immature, but because the two halves of the job, deciding what to embed and operating it after launch, are written about by different people who never connect the dots.
If you would rather we do this for you, see how we run AI product integration, where we plan, build, and operate embedded AI features inside other companies' products. Everything below is yours to use on your own first.
Why do so many AI features stall before they ship?
The numbers tell a strange story. Adoption is nearly universal: 88% of organizations now say they use AI in at least one business function, up from 78% a year earlier. Yet value at scale is rare. Fewer than 10% of companies are scaling AI agents in any single function, only about 5.5% attribute more than 5% of their EBIT to AI, and a striking 73% of product-development teams are not using AI agents at all. Usage is everywhere; durable value is not.
The reason is rarely the model. McKinsey's reading of the data is that high performers capture disproportionate value by redesigning the workflow around AI instead of bolting it on: they are nearly three times more likely than other companies to have fundamentally redesigned how the work happens. The lab guides from Anthropic and OpenAI say the same thing from the engineering side: the failures come from over-building, skipping evaluation, and shipping without guardrails, not from a weak model. So this playbook attacks the real failure points: choosing the wrong task, choosing too much machinery, and ignoring the production layer where features live or die.
How do you decide what to embed AI into?
Start from a painful task in your product, then run it through two filters before you write a line of code.
The first is the 30-second rule. If a person can finish the task in under 30 seconds, you almost certainly need better UX, not AI. AI earns its place on work that is slow, judgment-heavy, or buried in unstructured text, not on a button that should have existed already. Adding an LLM to a fast, deterministic task just makes it slower, more expensive, and less predictable.
The second filter is the set of signals that tell you AI is the right tool at all. OpenAI's guidance is to reach for AI, rather than a deterministic script, when you hit one of three things: complex decision-making that needs genuine judgment, rules that have grown too tangled to maintain, or heavy reliance on unstructured data like emails, PDFs, and chat logs. If none of those apply, a normal feature or a plain script will be faster, cheaper, and more reliable. The cheapest AI feature is the one you correctly decide not to build.
Then size the effort by the cost of being wrong. A feature that drafts a reply a human will read and edit can ship light. A feature that sends money, deletes data, or speaks to a customer with no human in the loop needs far more evaluation, far stronger guardrails, and probably an approval step. The cost of being wrong is the single dial that sets how much machinery you put behind the feature, and it is the question most teams skip until something breaks in production.
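The two filters and the cost-of-being-wrong dial can be written down as a short triage function. This is a minimal sketch, not a prescribed method: the `Task` fields, the `Stakes` enum, and the wording of the recommendations are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Stakes(Enum):
    LOW = "low"    # a human reads and edits the output (e.g. a draft reply)
    HIGH = "high"  # unattended action: moves money, deletes data, messages a customer


@dataclass
class Task:
    name: str
    human_seconds: float        # how long a person needs to do it by hand
    needs_judgment: bool        # signal 1: complex decision-making
    rules_unmaintainable: bool  # signal 2: rules too tangled to maintain
    unstructured_input: bool    # signal 3: emails, PDFs, chat logs
    stakes: Stakes


def triage(task: Task) -> str:
    """Return a coarse recommendation for a candidate task (illustrative only)."""
    # Filter 1: the 30-second rule.
    if task.human_seconds < 30:
        return "improve the UX; AI is unlikely to help"
    # Filter 2: at least one of the three signals must apply.
    if not (task.needs_judgment or task.rules_unmaintainable or task.unstructured_input):
        return "build a normal feature or a plain script"
    # Size the machinery by the cost of being wrong.
    if task.stakes is Stakes.HIGH:
        return "use AI with full evals, layered guardrails, and an approval step"
    return "use AI with light evals and guardrails"
```

For example, `triage(Task("draft support reply", 240, True, False, True, Stakes.LOW))` lands on the light path, while the same task with `Stakes.HIGH` is pushed to the full stack.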
What is the simplest mechanism that works?
Once you know the task is worth it, pick the smallest mechanism that does the job. Anthropic's guidance, from the lab that builds these systems, is blunt and worth repeating: find the simplest solution possible and only increase complexity when it demonstrably improves the outcome. There is a natural ladder here, and you should climb it only as far as you must.
- A single LLM call. One well-written prompt, often with retrieval (pulling in the right context from your data) and a few in-context examples. This handles a surprising share of useful features: summarize this thread, classify this ticket, extract these fields, draft this reply. Many successful products never go past this rung.
- A workflow. An LLM and tools run through predefined code paths that you wrote. The model fills in the hard, fuzzy steps; your code controls the flow. Workflows cover predictable multi-step tasks and are where most "AI features" actually belong. Common patterns include prompt chaining (break a task into ordered steps), routing (classify the input, then send it to the right handler), parallelization (run sub-tasks at once and combine), orchestrator-workers (one call plans, others execute), and evaluator-optimizer (one call drafts, another critiques and improves).
- A true agent. Here the model directs its own process and tool use, deciding the steps at runtime rather than following a path you predefined. Agents are for open-ended work where you genuinely cannot enumerate the steps in advance, and they trade higher latency and cost for that flexibility.
The honest framing: most things sold as agents are really workflows, and that is fine. Workflows are predictable, cheaper, and easier to debug. Reach for a real agent only when flexibility is the point and the task cannot be expressed as a path. For the full decision tree, see our companion piece on LLM vs workflow vs agent: which to build.
| Mechanism | Use it when | What controls the steps | Cost and latency |
|---|---|---|---|
| Single LLM call | One bounded task: summarize, classify, extract, draft | Your single prompt | Lowest |
| Workflow | Predictable multi-step task you can map in advance | Your code (predefined paths) | Moderate, predictable |
| Agent | Open-ended task where steps cannot be enumerated | The model, at runtime | Highest, variable |
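To make the middle rung concrete, here is a minimal sketch of a workflow that combines routing and prompt chaining. It assumes a hypothetical `llm` callable that takes a prompt and returns text; the labels, prompts, and the `ESCALATE_TO_HUMAN` sentinel are illustrative and not taken from any vendor API. The point it shows is that the model does the fuzzy steps while your code owns the path.

```python
from typing import Callable

# Hypothetical interface: any function that takes a prompt and returns model text.
LLM = Callable[[str], str]

# Predefined paths: your code, not the model, decides which ones exist.
HANDLER_PROMPTS = {
    "billing": "Draft a reply explaining the charge in this ticket:\n\n",
    "bug": "Summarize the reproduction steps in this ticket for engineering:\n\n",
}


def handle_ticket(ticket: str, llm: LLM) -> str:
    # Step 1 (routing): the model classifies the input.
    label = llm(
        "Classify this ticket as 'billing', 'bug', or 'other'. "
        "Reply with one word.\n\n" + ticket
    ).strip().lower()

    # Any label without a predefined path goes to a fixed fallback
    # instead of letting the model improvise.
    prompt = HANDLER_PROMPTS.get(label)
    if prompt is None:
        return "ESCALATE_TO_HUMAN"

    # Step 2 (chaining): a second call does the fuzzy work on the chosen path.
    return llm(prompt + ticket)
```

Because the flow is ordinary code, it can be unit-tested with a fake `llm` function before any real model is wired in, which is much harder to do with an agent that chooses its own steps.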
How do you pick and use the model?
Build the prototype on the most capable model you can, deliberately, to set a performance baseline. You want to know what "good" looks like when the model is not the bottleneck, so that any quality gap you see later is a problem with your prompt, your data, or your design, not the engine. Once the feature works on the strong model and you have evals (more on those next), try swapping in smaller, cheaper models and watch whether quality holds on your real test cases. Often it does, and you have just cut your run cost without touching the user experience.
This is now normal practice, not a niche optimization. Many teams run several models in production and route each task to the cheapest one that passes their evals: 37% of enterprises already use five or more models, up from 29%. The strong model handles the hard cases, a cheaper one handles the easy majority, and your code routes between them.
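Trading down can be reduced to a simple selection rule: among the models that clear your quality bar on the golden dataset, take the cheapest. The sketch below assumes you have already measured a cost and a pass rate per model; the model names and figures in the usage note are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def cheapest_passing(
    results: Dict[str, Tuple[float, float]],
    min_pass_rate: float,
) -> Optional[str]:
    """Pick the cheapest model whose eval pass rate meets the bar.

    `results` maps a model name to (cost per 1,000 calls, pass rate in [0, 1]).
    Returns None when no model is good enough.
    """
    passing = [
        (cost, name)
        for name, (cost, pass_rate) in results.items()
        if pass_rate >= min_pass_rate
    ]
    if not passing:
        return None
    # Tuples sort by cost first, so min() returns the cheapest passing model.
    return min(passing)[1]
```

With hypothetical results such as `{"large": (40.0, 0.97), "medium": (12.0, 0.95), "small": (3.0, 0.81)}` and a bar of `0.95`, the function selects `"medium"`. The bar itself should come from the cost of being wrong, not from the models on offer.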
The deeper point is that the model is the least defensible part of an AI feature. Any competitor can call the same API. What they cannot easily copy is your proprietary data, your domain knowledge baked into the prompts and retrieval, your evals, and your UX. Spend your effort there, not on chasing the newest model into your stack. Build-versus-buy has shifted for this reason: as the ecosystem matures, more teams buy third-party AI capabilities instead of custom-building (over 90% of enterprises are now testing third-party customer-support apps), and only the work specific to you stays in-house.
How many evals and guardrails do you actually need?
Evals are the part everyone underestimates and the part that decides whether a feature survives. Treat your AI feature like software that needs regression tests, because that is exactly what it is. Without evals you are shipping changes blind, and an LLM feature that worked last week can quietly degrade after a prompt tweak or a model update with no error to warn you.
Build a golden dataset of real test cases, each with a known good outcome. A practical starting mix is about 50% happy path, 20% edge cases, 15% adversarial inputs (people trying to break it), and 15% stress cases, beginning with roughly 50 to 200 cases and growing it as you learn. Then layer your evaluation in three tiers, cheapest first:
- Code assertions. Deterministic checks: did it return valid JSON, include the required fields, stay within the allowed length, and avoid forbidden strings? They are fast, free, and catch the dumb failures.
- LLM-as-judge. A separate model call scores the output against a rubric for the fuzzy qualities code cannot check, like tone, relevance, and faithfulness to the source.
- Human review. In high-stakes domains, sample 5 to 10% of live outputs for a person to inspect. This is your ground truth and the source of new golden cases.
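The golden-dataset mix and the first, cheapest tier are easy to sketch in code. The example below is illustrative: the category names mirror the mix described above, while the function names, field names, and limits are assumptions rather than a standard.

```python
import json
from typing import Dict, List, Set

# Starting mix from the text, as whole percentages.
MIX = {"happy_path": 50, "edge_case": 20, "adversarial": 15, "stress": 15}


def case_counts(total: int) -> Dict[str, int]:
    """Split a golden dataset of `total` cases according to MIX."""
    counts = {kind: total * pct // 100 for kind, pct in MIX.items()}
    # Integer division can leave a remainder; give it to the happy path.
    counts["happy_path"] += total - sum(counts.values())
    return counts


def assert_output(raw: str, required: Set[str], max_chars: int,
                  forbidden: List[str]) -> List[str]:
    """Tier-one code assertions. Returns failure messages; empty means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["JSON is not an object"]

    failures = []
    missing = required - set(data)
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if len(raw) > max_chars:
        failures.append(f"longer than {max_chars} characters")
    for phrase in forbidden:
        if phrase.lower() in raw.lower():
            failures.append(f"contains forbidden string: {phrase!r}")
    return failures
```

Run `assert_output` over every golden case on each prompt or model change and fail the build on any non-empty result. Only outputs that pass this tier need to be sent on to the more expensive LLM judge.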
Guardrails are the runtime defense, and they should be layered the way the labs recommend, not a single check. In production that means input filtering (catch unsafe or out-of-scope requests and PII before they reach the model), output validation (verify the response before it reaches the user or an action), limits on high-risk tool calls, moderation checks, and a human-in-the-loop on anything sensitive. Crucially, size all of this to the cost of being wrong. A draft a human will edit needs little. An action that moves money, deletes records, or talks to a customer unattended needs the full stack plus an approval step. Guardrails and evals are not optional polish; they are the difference between a demo and a product, and skipping them is how a promising feature ends up as a stalled pilot.
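A minimal sketch of three of those layers follows: input filtering, output validation, and a limit on high-risk tool calls. Everything here is an assumption made for illustration. The email regex is deliberately crude, the tool names are hypothetical, and a real deployment would add moderation checks and a proper PII detector.

```python
import re
from dataclasses import dataclass

# Crude email pattern, for illustration only; not a complete PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical tool names that should never run unattended.
HIGH_RISK_TOOLS = {"send_payment", "delete_records"}


@dataclass
class Decision:
    allowed: bool
    reason: str
    needs_approval: bool = False


def redact_pii(text: str) -> str:
    """Input layer: strip email addresses before the text reaches the model."""
    return EMAIL.sub("[email]", text)


def validate_output(text: str, max_chars: int = 2000) -> Decision:
    """Output layer: verify the response before it reaches a user or an action."""
    if not text.strip():
        return Decision(False, "empty output")
    if len(text) > max_chars:
        return Decision(False, "output too long")
    if EMAIL.search(text):
        return Decision(False, "output contains an email address")
    return Decision(True, "ok")


def check_tool_call(tool: str, approved_by_human: bool) -> Decision:
    """Action layer: high-risk tools are blocked until a human approves."""
    if tool in HIGH_RISK_TOOLS and not approved_by_human:
        return Decision(False, "high-risk tool requires approval", needs_approval=True)
    return Decision(True, "ok")
```

Each layer is independent, so a low-stakes drafting feature might use only `validate_output`, while a feature that can call `send_payment` would run all three.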
What does it cost to run after launch?
This is the question the architecture guides never answer, and the part we care about most, because we operate these features, not just build them. The build is cheap. The run is where the money and the failures accumulate, across four ongoing line items:
- Inference (tokens). Every call costs money in proportion to the context you send and the output you generate. This scales with usage, so a feature that is fine at pilot volume can get expensive at full traffic.
- Retrieval and infrastructure. The vector store, search, caching layer, and the glue that pulls the right context into each call.
- Monitoring. Logging every input and output, tracking cost, latency, and failure rates, and watching for drift as your data, prompts, and the underlying model change underneath you.
- Human review. The sampled review and the approvals you keep in the loop for high-stakes actions. This is a real, recurring cost, and it is the right cost when the alternative is an unattended mistake.
The levers to control run cost are the ones you set up earlier. Route easy cases to a cheaper model and reserve the strong one for hard cases. Cache repeated calls. Trim the context you send to what the task needs. And price the feature on the value it delivers, not on tokens, so a feature that saves an hour of skilled work can comfortably carry its inference bill. Treat LLM spend as a permanent budget line, not a one-off experiment: across the market, innovation budgets have fallen from 25% to 7% of LLM spend as AI moved onto permanent budgets, and total LLM budgets are expected to grow about 75% in the year ahead. Plan for the run, not just the launch.
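The inference line item is simple arithmetic, and it is worth running before launch. The sketch below assumes per-million-token pricing and, as a simplification, treats cache hits as free. All prices and volumes in the usage note are hypothetical, so substitute your provider's actual rates.

```python
def monthly_inference_cost(
    calls: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimate monthly token spend for one feature.

    Prices are per million tokens. Cache hits are assumed to cost nothing,
    which is a simplification: real caches have their own infrastructure cost.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cost_per_call = (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )
    billable_calls = calls * (1.0 - cache_hit_rate)
    return billable_calls * cost_per_call
```

With hypothetical figures of 100,000 calls a month, 2,000 input and 300 output tokens per call, and prices of 3 and 15 per million tokens, each call costs about 0.0105 and the month comes to roughly 1,050. A 20% cache hit rate brings that to about 840, and trimming the context shrinks the input term directly.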
A step-by-step plan to embed your first AI feature
Putting the whole playbook in order, here is the sequence that keeps a feature out of the stalled-pilot pile:
1. Pick one painful task. Specific, real, and slow or judgment-heavy. Run the 30-second rule and the three AI signals before committing.
2. Size the cost of being wrong. Decide up front how much evaluation, guardrail, and human oversight the stakes demand.
3. Choose the smallest mechanism. Try a single LLM call with retrieval first. Move to a workflow only when the task is genuinely multi-step, and to an agent only when steps cannot be predefined.
4. Prototype on the strongest model. Set the quality baseline, then prove it on a golden dataset of 50 to 200 real cases.
5. Add layered guardrails and evals. Code assertions, an LLM judge, and sampled human review, all scaled to the stakes.
6. Trade down for cost. Swap in cheaper models where quality holds, add routing and caching, and confirm against the same evals.
7. Instrument and run it. Log everything, watch cost, latency, and failure rates, and expect to maintain it as the model and your data move.
Common mistakes when embedding AI features
- Starting from the model, not the task. "We should add AI" is not a feature. Start from a painful job to be done and let it tell you whether AI even belongs.
- Reaching for an agent first. The most common over-build. A workflow or a single call is usually simpler, cheaper, and more reliable, and most things called agents are workflows anyway.
- Shipping without evals. No golden dataset means no way to know a change made things worse. This is the quiet killer of AI features in production.
- One-size guardrails. Either nothing, which is dangerous, or heavy human review on a low-stakes draft, which is wasteful. Size guardrails to the cost of being wrong.
- Forgetting the run. Budgeting only for the build and getting surprised by token, monitoring, and review costs. The run is the real cost center.
- Over-investing in model choice. The model is the least defensible part. Your data, evals, and UX are the moat.
Build it in-house or have it built for you?
You can prototype any of this fast. The gap that stalls teams is the production layer: wiring the feature to your real, messy systems, writing and maintaining evals, layering guardrails, keeping a human in the loop, and operating it all as cost and quality drift over time. That is the gap behind the 73% of product-development teams that are not using AI agents at all, and it is organizational, not technical. If your team has the engineering depth and the appetite to own evals and the run, build it. If you want the feature shipped and operated reliably without becoming an AI engineering org first, that is when a done-for-you partner earns its place. We cover the full decision in our build vs buy guide for custom AI agents, and the hands-on version in how to build a custom AI agent for your business.
If the production section made this look heavier than you want to carry, that is the honest signal. We plan, build, and run embedded AI features inside other companies' products on the simplest pattern that works, with the evals, guardrails, and monitoring that keep them reliable and affordable after launch. Book a free consultation below and we will map your first AI feature and the right mechanism together.
