B One Consulting

From PoC to production. How to industrialise AI agents.

Through our exchanges with clients, partners and leadership teams across our offices in Paris, Dubai, Singapore and Bali, one situation comes up more often than any other: a working AI pilot that has not made it to production. The technical side is rarely what we end up advising on. The work that decides whether an agent ships and stays shipped happens around the model, in the operating discipline, in the data, and with the people who will live with the system every day.

The pilot that worked. The agent that never shipped.

When a brief lands on one of our desks and we sit down with the team behind it, the conversation tends to follow a similar shape. A client team has spent the better part of a year on an AI initiative. The pilot was shown to the executive committee, the demo was clean, somebody committed to a launch in the next quarter, and several quarters later nothing has made it to production. The proof of concept has not exactly been killed. It is sitting on a developer's laptop, the operators it was supposed to help never logged in, and the leader who sponsored it has moved on to a different priority.

In our exchanges with clients and technology partners we see this pattern often enough that it has stopped surprising anyone in the room. The reasons we encounter are usually a combination of the five described below, and most teams we work with recognise themselves in more than one of them.

The skill set that built the pilot is not the skill set that ships it.

In the laboratory, the model is evaluated on accuracy against a held-out dataset. In production, what matters is latency under load, reliability when an upstream API changes its behaviour, and cost per session when usage doubles. We have walked into rooms with extremely capable data science teams who had not been asked any of those three questions during the pilot phase, and who would not have prioritised them even if they had. The way we typically address this with a client is to pair the data scientist with a platform engineer from the first week of the production work, which feels heavy at first and tends to prove necessary by the second sprint.

Operator trust was never designed into the system.

A model that produces useful answers some of the time is acceptable in a demo and unworkable in daily use. The operators we have sat with describe a consistent threshold for trust: if they cannot see why the agent recommended something, cannot override a wrong answer easily, and cannot trace what the agent did three days ago when something went wrong, they stop using it within a few weeks. The work we end up doing with clients at this point is not about model accuracy. It is about whether the system has been designed to be trustworthy from the inside, which is a different conversation entirely.

The agent had no owner once the pilot phase closed.

Most pilots we see were funded out of an innovation budget and championed by a chief data officer, a head of digital, or a curious leader on the business side. When the pilot phase finishes, that owner gets pulled into the next thing. The operations team that runs the workflow the agent was supposed to change was rarely consulted in the build phase, and they do not accept the agent into their portfolio. There is no service level agreement, no on-call rotation, no improvement budget. Within a quarter the agent has rotted quietly and nobody has the political weight to revive it.

Governance arrived too late to shape the design.

In regulated environments, and increasingly in every environment where personal data, financial decisions or safety considerations are involved, we see the same sequence. Legal, security and compliance are not in the room during the design phase. They show up later with a long review document that surfaces a constraint that should have been a design input from the start. The team is then facing months of rework or a quiet termination. Bringing those functions into the conversation before the first prompt is written is one of the highest-leverage moves we recommend.

Integration is where the model is judged unfairly.

The model is impressive in isolation. Once it is wired into the customer relationship management system, the data warehouse, the ticketing system and the internal authentication layer, edge cases multiply and latencies stack. The agent looks dumber than it actually is because the plumbing around it leaks. We have seen many teams blame the model for problems that were really integration problems, and lose months looking in the wrong place.

None of this is bad luck. It is the consequence of a structure that optimises for the moment of the demo rather than the moment of daily use, and it is the structure we spend most of our time helping clients change.

What production-grade actually means for an AI agent.

When the question comes up in a steering committee or in our discussions with technology partners, the working definition we tend to use has held up across our engagements. An AI agent is in production when an operator who has never heard of it can be onboarded in an afternoon, complete a real task with it, get a result they trust, and the team running the agent can detect and respond to a regression within hours of it happening.

Behind that sentence sit five conditions, and in our experience all five need to hold for an agent to survive its first quarter of real use.

An evaluation suite that runs on every change.

By evaluation suite we mean a test set drawn from real production traffic, with expected outputs, that runs every time the team changes a prompt, swaps a model or updates a retrieval index. We have worked with teams who shipped without one and learned about regressions from angry users a week later. We have worked with teams who invested in the eval suite from the first sprint and were able to swap their underlying model three times during the engagement without losing the trust of the operators. The difference is significant and underestimated.
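To make the idea concrete, here is a minimal sketch of such a suite, assuming the agent can be called as a plain function that takes a prompt and returns text. The names `EvalCase`, `EvalReport`, `run_suite` and `gate` are ours for illustration, and the exact-match scoring is an assumption that most real use cases would replace with something looser.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str   # stable identifier for a case taken from real traffic
    prompt: str    # what the operator asked
    expected: str  # the output operators agreed is correct


@dataclass
class EvalReport:
    passed: List[str]
    failed: List[str]

    @property
    def pass_rate(self) -> float:
        total = len(self.passed) + len(self.failed)
        return len(self.passed) / total if total else 0.0


def run_suite(agent: Callable[[str], str], cases: List[EvalCase]) -> EvalReport:
    """Run every case through the agent and record which ones match."""
    report = EvalReport(passed=[], failed=[])
    for case in cases:
        output = agent(case.prompt)
        # Assumption: exact match after trimming whitespace and lowercasing.
        if output.strip().lower() == case.expected.strip().lower():
            report.passed.append(case.case_id)
        else:
            report.failed.append(case.case_id)
    return report


def gate(report: EvalReport, minimum_pass_rate: float) -> bool:
    """Return True when the change clears the agreed pass-rate threshold."""
    return report.pass_rate >= minimum_pass_rate
```

Run from a continuous integration job, a check like `gate` is what lets a prompt change, a model swap or a retrieval update be blocked before operators ever see it.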

Observability wired in from the first sprint.

The team needs to see what the agent is actually doing. Traces of every call, latencies, token counts, tool invocations, fallback rates, cost per session. When something goes wrong, somebody needs to be able to replay yesterday's conversation. In the briefs we receive after a stalled pilot, treating observability as a feature to add later is one of the most common mistakes we find. It is plumbing, and plumbing is much harder to retrofit than to put in from the start.
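As an illustration only, the sketch below records one trace per call and answers the three questions above: what happened in a given session, what each session cost, and how often the agent fell back. The in-memory `TraceStore`, the `traced` helper and the flat `price_per_token` are assumptions made for the example. A real deployment would write to a proper tracing backend.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CallTrace:
    session_id: str
    step: str           # "model" or the name of a tool
    latency_s: float
    tokens: int
    cost: float
    used_fallback: bool = False


@dataclass
class TraceStore:
    """In-memory stand-in for a real trace backend."""
    traces: List[CallTrace] = field(default_factory=list)

    def record(self, trace: CallTrace) -> None:
        self.traces.append(trace)

    def replay(self, session_id: str) -> List[CallTrace]:
        """Return every recorded step of one session, in call order."""
        return [t for t in self.traces if t.session_id == session_id]

    def cost_per_session(self) -> Dict[str, float]:
        totals: Dict[str, float] = {}
        for t in self.traces:
            totals[t.session_id] = totals.get(t.session_id, 0.0) + t.cost
        return totals

    def fallback_rate(self) -> float:
        if not self.traces:
            return 0.0
        return sum(t.used_fallback for t in self.traces) / len(self.traces)


def traced(store: TraceStore, session_id: str, step: str,
           fn: Callable[[], Tuple[str, int]], price_per_token: float) -> str:
    """Run fn, which returns (result, tokens used), and record a trace for it."""
    start = time.perf_counter()
    result, tokens = fn()
    latency = time.perf_counter() - start
    store.record(CallTrace(session_id, step, latency, tokens, tokens * price_per_token))
    return result
```

Wrapping every model and tool call in something like `traced` from the first sprint is what makes `replay` possible on the day it is needed.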

An operator interface that explains, allows override and audits.

The three patterns that build operator trust faster than any accuracy number are visible reasoning, override paths and audit trails. The agent shows where its answer came from. The operator can disagree, correct the answer and have that correction land back in the evaluation set. Every action the agent takes is logged in a way the operator and the compliance team can both inspect. Clients sometimes ask us whether these patterns slow the system down. In our experience they slow the demo and markedly accelerate adoption.
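A minimal sketch of how the three patterns can fit together in code, under the assumption that answers carry their sources and that corrections are queued by question text. `AgentAnswer`, `AuditEntry` and `OperatorDesk` are hypothetical names chosen for the example, not a reference design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class AgentAnswer:
    answer_id: str
    question: str
    text: str
    sources: List[str]  # where the answer came from, shown to the operator


@dataclass(frozen=True)
class AuditEntry:
    at: datetime
    actor: str
    action: str
    answer_id: str
    detail: str


@dataclass
class OperatorDesk:
    audit_log: List[AuditEntry] = field(default_factory=list)
    # question -> corrected expected output, to be merged into the eval set
    eval_cases: Dict[str, str] = field(default_factory=dict)

    def _log(self, actor: str, action: str, answer: AgentAnswer, detail: str) -> None:
        self.audit_log.append(
            AuditEntry(datetime.now(timezone.utc), actor, action, answer.answer_id, detail)
        )

    def present(self, answer: AgentAnswer) -> str:
        """Render the answer with its sources, and log that it was shown."""
        self._log("agent", "presented", answer, answer.text)
        return f"{answer.text}\nSources: {', '.join(answer.sources)}"

    def override(self, operator: str, answer: AgentAnswer, corrected: str) -> None:
        """Record the operator's correction and queue it as an evaluation case."""
        self._log(operator, "overridden", answer, corrected)
        self.eval_cases[answer.question] = corrected
```

The point of the design is that a disagreement is never lost: the same call that logs the override also produces the next evaluation case.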

Governance as a runtime constraint, not a launch checkbox.

Tool access is least-privileged. There are guardrails on what the agent can write to, who it can email, what it can spend in a session. There is a kill switch that does not require a code deployment to flip, because in our experience the moment you need a kill switch is the moment you cannot afford to schedule a release. There is an audit trail compliance can inspect on a Monday morning without anyone going on call to help them.
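One way to picture governance as a runtime constraint is a policy object consulted before every tool call. The sketch below is an assumption-laden illustration: `RuntimePolicy`, `PolicyViolation` and the `kill_switch_on` callable are names we introduce here, and in practice the callable would read a feature flag or configuration store that operations can change without a release.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet


class PolicyViolation(Exception):
    """Raised when the agent attempts something its policy does not allow."""


@dataclass
class RuntimePolicy:
    allowed_tools: FrozenSet[str]          # least privilege: an explicit allowlist
    max_spend_per_session: float
    kill_switch_on: Callable[[], bool]     # evaluated on every call, not at deploy time
    spent: Dict[str, float] = field(default_factory=dict)

    def authorise(self, session_id: str, tool: str, cost: float) -> None:
        """Raise PolicyViolation unless this tool call is allowed right now."""
        if self.kill_switch_on():
            raise PolicyViolation("agent disabled by kill switch")
        if tool not in self.allowed_tools:
            raise PolicyViolation(f"tool not permitted: {tool}")
        total = self.spent.get(session_id, 0.0) + cost
        if total > self.max_spend_per_session:
            raise PolicyViolation("session spend limit exceeded")
        # Only record the spend once the call has been authorised.
        self.spent[session_id] = total
```

Because the kill switch is checked on each call, flipping the flag stops the agent on its very next action, which is the property that matters when there is no time to schedule a release.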

Cost and latency budgets enforced at runtime.

Every call has a budget. The system degrades gracefully when the budget is exceeded, falling back to a shorter context window, a cheaper model or a simpler response rather than producing an unbounded cost. The clients we work with who ignored this found out from a finance report at the end of the quarter that the agent was consuming a disproportionate part of the AI line item, which is not how anyone wants that conversation to go.
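Graceful degradation can be sketched as an ordered ladder of options, from most to least capable, where each call takes the first option that fits what is left of the budget. The `Tier` type, the `pick_tier` function and the cost and latency estimates are hypothetical, introduced only to show the shape of the logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    est_cost: float       # estimated cost of one call at this tier
    est_latency_s: float  # estimated latency of one call at this tier


def pick_tier(tiers: List[Tier], cost_left: float, latency_left_s: float) -> Optional[Tier]:
    """Return the first tier that fits both remaining budgets.

    Tiers are expected in order from most to least capable. None means
    nothing fits, and the caller should return a simple canned response.
    """
    for tier in tiers:
        if tier.est_cost <= cost_left and tier.est_latency_s <= latency_left_s:
            return tier
    return None


# Illustrative ladder; the figures are placeholders, not measurements.
LADDER = [
    Tier("full_context", est_cost=0.10, est_latency_s=4.0),
    Tier("short_context", est_cost=0.04, est_latency_s=2.0),
    Tier("cheap_model", est_cost=0.01, est_latency_s=1.0),
]
```

With a ladder like this, a session that has nearly spent its budget still gets an answer, only a cheaper one, and the cost per session stays bounded by design rather than by hope.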

When all five conditions are in place, the agent has shifted from a deployment to a system in production. That is the threshold worth aiming for.

The gap between a working notebook and a working agent is the unglamorous engineering that nobody promotes. It is also the part that decides whether your investment in AI returns anything to the business.

Three habits the teams that succeed share.

Across the engagements we run from our offices in Paris, Dubai, Singapore and Bali, and in our work alongside our technology partners, the teams that get an agent into real daily use, and keep it there, tend to share three habits that distinguish them from the teams that do not. None of these are technical innovations. They are habits of how the work is framed.

They start the conversation with the operator, not the data scientist.

The first meeting on a successful agent project is rarely about what data is available. It is about what decision the operator makes every day, where the friction lives in that decision, and what would actually make the next two hours of their work easier. The model is a means. The decision is the end. When the first conversation starts with data instead of with decisions, we tend to end up with an agent that solves a problem nobody on the floor was asking to solve. We have rebuilt the use case shortlist in the first three weeks of more than one engagement, simply because the original shortlist came from a vendor's slide rather than from operator conversations.

They build the evaluation suite before they build the agent.

The evaluation set works as the specification. It is the list of cases the agent must handle, drawn from real operator examples, with the expected outputs. Writing it first forces a level of clarity about what good looks like that prompt engineering on its own does not produce. It also gives the team a feedback signal that survives every prompt rewrite, every model upgrade and every retrieval tweak that will inevitably come during the work.

They hand ownership to operations before launch day.

In the projects that work, the team that will run the agent in production is in the room before the agent exists. They write the runbook with the build team. They co-design the alerting. They name the on-call rotation. They define the service level agreement. When launch day arrives, there is no handover in the formal sense. The agent is already theirs. The projects we have seen fail almost always treat the handover as the final step rather than the first one.

The two prerequisites we always check first: people and data.

Before we get into any of the engineering discipline described above, there are two foundations we always look at first, because in our experience no amount of work on the agent itself can compensate for weakness on either. The first is the data the agent will rely on. The second is the people who will work with it. We have watched excellent engineering teams build agents on data nobody trusted, and we have watched good models deployed into teams that were never consulted. Both kinds of projects stall in roughly the same way.

On the data side, what we look for is trust, not volume.

The pattern we see most often is that the data exists, the volume is sufficient, the technical access can be arranged, and the gap is governance. Who decides what the canonical source is. Who is allowed to feed the model. Who is accountable when the data drifts. We have arrived on more than one engagement where the first useful piece of work was not a model but a data ownership review, surfacing an unresolved disagreement between the team that owns the customer-facing system and the team that owns the warehouse. Until that disagreement is named and resolved at the right level, the agent built on contested data tends not to survive its first compliance review.

For that reason we always recommend resolving questions of data trust before the build begins, even if it adds weeks at the start. The time recovered later, in fewer reworks, fewer compliance escalations and fewer arguments about what the agent should have known, is usually significant.

On the people side, what we look for is inclusion, not training.

Operator inclusion is often treated as a training problem when it is closer to a design problem. The people who will use the agent every day should shape what it does, what it shows, when it interrupts them and how it admits uncertainty. In our discussions with leadership teams who have lived through a stalled rollout, the lesson is usually the same: the most useful early activity is rarely a training plan. It is a small number of operator interviews that surface the friction in the current work, the moments where they would welcome help, and the moments where they would not. From those conversations the use case shortlist usually rewrites itself, sometimes departing significantly from the original specification.

When operators are treated as co-designers, the rollout tends to move faster once it starts. When they are treated as a future audience for a demo, adoption stalls in the first weeks of real use. The investment in training that follows a serious design effort is meaningful. The investment in training that replaces a design effort tends to be wasted.

When the conversation with a client or a partner turns to "what would you advise before we invest seriously in an AI agent", we tend to give the same answer. Spend the first weeks on data ownership and operator inclusion before you spend any time choosing a model or a vendor. The rollouts we have seen succeed had those two foundations in place by the time the build started. The rollouts that stalled were generally missing one of them, sometimes both, and we have not yet seen an exception.

A short checklist we run before production.

When a client team, or one of our technology partners on a joint engagement, is approaching the point of putting an agent in front of real operators, we tend to walk through the following ten questions together. A "no" on any of them is not necessarily a reason to delay. It is a risk worth naming and accepting on purpose, rather than discovering after the fact.

  1. Is there an evaluation set that covers the top cases the agent is expected to handle in the first month?
  2. Is that evaluation set still fed by real production traffic, rather than frozen since the first sprint?
  3. Are cost and latency budgets defined, instrumented and enforced at runtime, not just on a dashboard?
  4. Can someone on the team replay any conversation that happened yesterday and understand what the agent did?
  5. Do operators have a way to disagree with the agent, and does the disagreement land back in the evaluation set?
  6. Is there a kill switch that does not require a code deployment, accessible to operations rather than only engineering?
  7. Are the tools the agent can use least-privileged, and is access audited?
  8. Is there a named owner for the agent past launch day, with a service level agreement, an on-call rotation and an improvement budget?
  9. Were legal, security and compliance brought into the design phase, rather than only into the audit phase at the end?
  10. If the underlying model is deprecated or repriced significantly next quarter, is there a defined fallback?

In these conversations, what we listen for is not the questions the team answers "no" to. It is the questions where they hesitate. Hesitation usually points to a soft spot in the operating model that is worth strengthening before the rollout widens.

Where we tend to start the conversation.

If your team has a pilot that should be in production and has not made the jump, the questions we usually open with in our first exchange are not about the model. We ask about the workflow the agent was supposed to change, who runs that workflow today, and whether they were in the room when the agent was designed. We ask about the data the agent relies on, who decides what the authoritative version is, and what happens when those people disagree. We ask who will be accountable for the agent six months after launch, and whether they have a service level agreement, an on-call rotation and a budget for the improvements they will need to make.

The clients we work with who have a clear answer to those three questions tend to be a few months from production. The clients who do not have clear answers tend to be in a longer conversation than they realised, and that is usually a useful thing to find out early rather than late.

If you would like to compare notes on a pilot that is stuck, the Tech Factory and Consulting teams in our Paris, Dubai, Singapore and Bali offices are reachable through our brief form. We answer within one working day, with the partner who will sit on the file rather than a relationship manager.

Frequently asked questions.

What does it actually mean for an AI agent to be in production?

An AI agent is in production when an operator who has not heard of it can be onboarded, complete a real task, trust the result, and the team running it can detect and respond to a regression within hours. Production-grade requires an evaluation suite, observability, an operator interface that explains, allows override and audits, runtime governance, and cost and latency budgets, along with a named owner with an SLA.

Why do AI agent pilots fail to reach production?

The model is rarely the blocker. Pilots stall because the team did not build an evaluation suite, never wired observability, treated governance as a final-stage gate, did not co-design with operators, or had no owner past the pilot phase. The lab works. The organisation does not.

What is an AI evaluation suite and why does it matter?

An evaluation suite is a growing set of real cases the agent must handle, with expected outputs, that runs on every prompt change, model swap or retrieval update. It is the specification for what the agent should do. Without it, every change is a gamble and regressions are discovered by users, not engineers.

How do we build operator trust in an AI agent?

Three patterns. The agent shows where its answer came from. The operator can disagree and feed the correction back into the evaluation set. Every action is logged for audit. Trust is built by visible reasoning, override paths and accountability, not by accuracy numbers or demos.

Who should own an AI agent once it is in production?

A named owner inside the team that runs the workflow, with an SLA, on-call rotation and an improvement budget. Innovation teams should hand off ownership before launch, not after. Agents without operational owners rot quietly within a quarter.

How do you measure cost and latency for an enterprise AI agent?

Instrument every call with token counts, latency, tool invocations and total cost. Set a budget per session and degrade gracefully when exceeded: shorter context, cheaper model, simpler response. Without runtime enforcement, a small percentage of expensive sessions can consume most of the budget.

