AI does the work. Humans do the validation.
The temptation is to delegate output and judgement to AI in one move. The discipline is to delegate output but keep judgement human, with explicit validation gates designed into the system. This piece is about how to draw those gates without losing the speed gain that motivated the deployment in the first place.
The asymmetry that breaks naive designs.
When a client deploys an AI agent inside a workflow that used to be done by a person, the speed gain is what gets the deployment approved. A response that took twenty minutes now takes twenty seconds. The case for the investment writes itself in the steering committee, and the project moves forward. The version of this story we meet on the follow-up engagement is the one where the speed gain was real, but the quality drift went unnoticed until a customer or a regulator found it three months later.
The root cause is almost always the same. The system was designed to maximise output speed and the human review step was treated as a checkbox to add at the end. The model produces faster than any human can possibly read, the review queue grows, the reviewer falls behind, and within a few weeks the practical reality is that nothing is reviewed properly. Either the queue is skipped under delivery pressure, or it becomes theatre.
The conversation we tend to have with clients at this point is uncomfortable because it implies the original business case was wrong. It was not. The case is sound. What was wrong was the assumption that you can keep the speed gain and add review on top without redesigning either side. In our experience you can keep the speed gain, but only if the review side is designed with as much care as the generation side.
Three validation patterns we use with clients.
Pre-publication review for content and external communications.
When the AI is producing material that will be seen by a customer, a regulator or the market, the review pattern that works is sequential. The model drafts, a human reviewer accepts, edits or rejects, and only the accepted version leaves the system. The temptation here is to remove the human once accuracy looks good. The pattern that holds up in our work with clients is to keep the gate and reduce the time it takes to clear it. A reviewer working with a well-designed interface can clear a queue many times faster than the original analyst could produce the same material. The gain is real, the gate is still there, and the audit trail is clean.
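A minimal sketch of the sequential gate described above, in Python. The names (`Draft`, `Decision`, `review`, `publish`) and the shape of the audit record are assumptions made for illustration, not a description of any particular client system. The point it shows is that publication checks for a recorded human decision rather than trusting the model's output directly.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACCEPT = "accept"
    EDIT = "edit"
    REJECT = "reject"


@dataclass
class Draft:
    draft_id: str
    model_text: str                       # what the model produced
    released_text: Optional[str] = None   # set only by a human decision
    audit_trail: list = field(default_factory=list)


def review(draft: Draft, reviewer: str, decision: Decision,
           edited_text: Optional[str] = None) -> Draft:
    """Record a human decision; only ACCEPT or EDIT makes the draft releasable."""
    if decision is Decision.EDIT and not edited_text:
        raise ValueError("an EDIT decision needs the edited text")
    if decision is Decision.ACCEPT:
        draft.released_text = draft.model_text
    elif decision is Decision.EDIT:
        draft.released_text = edited_text
    else:
        draft.released_text = None
    # Every decision is logged, including rejections (hypothetical record shape).
    draft.audit_trail.append({
        "reviewer": reviewer,
        "decision": decision.value,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return draft


def publish(draft: Draft) -> str:
    """The gate: nothing leaves the system without a recorded human decision."""
    if draft.released_text is None:
        raise PermissionError(f"draft {draft.draft_id} has not been cleared by a reviewer")
    return draft.released_text
```

Keeping the released text separate from the model text means the audit trail can always show what the model wrote, what the reviewer decided, and what actually went out.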
Spot-check sampling for high-volume production.
When the AI is handling thousands of transactions a day, pre-publication review is not realistic. The pattern we use is risk-weighted sampling. Every output above a defined risk threshold is reviewed. A statistical sample of the rest is reviewed regularly, and the sample is redrawn at random each cycle so that which cases get checked cannot be predicted or optimised for. The team monitors a rejection rate and a drift signal, not individual outputs. Done well, this gives the operations team enough visibility to catch a degradation without bottlenecking the throughput. Done badly, it turns into a dashboard nobody reads.
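The selection step can be sketched in a few lines of Python. This is an illustration under stated assumptions: each output is a dict carrying a `risk` score between 0 and 1, the function name `select_for_review` is hypothetical, and the threshold and sample rate are values the team would set for itself. Outputs at or above the threshold count as high risk here; a fresh random draw on every call stands in for rotating the sample.

```python
import math
import random
from typing import Optional


def select_for_review(outputs: list, risk_threshold: float, sample_rate: float,
                      rng: Optional[random.Random] = None) -> list:
    """Return every high-risk output plus a random sample of the rest."""
    rng = rng or random.Random()
    high_risk = [o for o in outputs if o["risk"] >= risk_threshold]
    rest = [o for o in outputs if o["risk"] < risk_threshold]
    # Round up so a small batch still gets at least one spot check.
    sample_size = min(len(rest), math.ceil(len(rest) * sample_rate))
    # A new draw per batch means the checked cases change from cycle to cycle.
    return high_risk + rng.sample(rest, sample_size)
```

Passing a seeded `random.Random` makes a given draw reproducible, which can help when an auditor asks why a specific case was or was not reviewed.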
Human-in-the-loop gates for irreversible actions.
When the AI is taking an action that cannot easily be undone, the rule we apply is firm. Spending money, changing customer status, sending a regulated communication and deleting data all stop and ask for confirmation. The interface should make the confirmation cheap so the operator does not develop confirmation fatigue. The cost of a wrong autonomous action in these categories is almost always higher than the cost of an extra click, and we have not yet encountered a client who regretted putting these gates in place.
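As a rough sketch, the gate can sit in the single function through which the agent executes actions. The action names, the `execute` function and the `confirm` callback below are all hypothetical; in a real system `confirm` would be whatever prompt the operator sees.

```python
from typing import Callable

# Hypothetical action names for the four categories listed above.
IRREVERSIBLE_ACTIONS = {
    "spend_money",
    "change_customer_status",
    "send_regulated_communication",
    "delete_data",
}


def execute(action: str, payload: dict,
            run: Callable[[dict], str],
            confirm: Callable[[str, dict], bool]) -> str:
    """Run reversible actions directly; stop and ask before irreversible ones."""
    if action in IRREVERSIBLE_ACTIONS and not confirm(action, payload):
        return "blocked"
    return run(payload)
```

Because the check lives in one place, adding a new irreversible category is a one-line change, and reversible actions never pay the cost of a confirmation prompt.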
Designing the validation interface.
The most common reason validation fails in production is not policy. It is interface. Reviewers are given a tool that requires three clicks to approve, exposes more information than the eye can scan, and makes editing the output painful. Within a week the reviewer is approving by reflex and the gate has become decorative.
The version that works tends to share a few traits. The output is presented with the underlying source visible, so the reviewer can verify rather than guess. Accepting takes one keystroke. Rejecting takes one keystroke and surfaces a short reason, which feeds back into the evaluation suite. Editing happens in place rather than in a separate screen. The queue is paced so the reviewer is not asked to clear a hundred items in five minutes, which is when fatigue takes over.
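Two of these traits lend themselves to a short Python sketch: the one-keystroke decision whose rejection reason becomes an evaluation case, and the paced queue. Everything named here is an assumption for illustration, including the key bindings, the reason codes and the 30-second minimum per item.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reason codes, one key each, so rejecting stays cheap.
REJECT_REASONS = {"1": "factually wrong", "2": "wrong tone", "3": "not supported by source"}


@dataclass
class ReviewItem:
    item_id: str
    source: str   # shown next to the output so the reviewer can verify
    output: str


def handle_keystroke(item: ReviewItem, key: str, eval_cases: list,
                     reason_key: Optional[str] = None) -> str:
    """'a' accepts; 'r' rejects and files the case for the evaluation suite."""
    if key == "a":
        return "accepted"
    if key == "r":
        if reason_key not in REJECT_REASONS:
            raise ValueError("a rejection needs one of the short reason codes")
        eval_cases.append({
            "item_id": item.item_id,
            "source": item.source,
            "rejected_output": item.output,
            "reason": REJECT_REASONS[reason_key],
        })
        return "rejected"
    raise ValueError(f"unmapped key: {key!r}")


def paced_batch(queue: list, minutes: float, min_seconds_per_item: float = 30) -> list:
    """Hand out only as many items as can be read in the time available."""
    max_items = int(minutes * 60 // min_seconds_per_item)
    return queue[:max_items]
```

With the assumed 30-second floor, `paced_batch(queue, minutes=5)` returns at most ten items, a tenth of the hundred-in-five-minutes pace at which fatigue takes over.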
The teams we work with who design the interface with the reviewer in the room get a usable gate within the first sprint. The teams who design it without the reviewer tend to rebuild it twice before the rollout, and we have rarely seen a project recover that time.
The validation gate is part of the product, not a layer on top of it. Treated as a layer, it gets removed under pressure. Treated as a feature, it survives.
Measuring validation cost so it does not get cut.
Validation has a cost. Reviewer time, queue latency, infrastructure, training. If that cost is not measured it gets cut the first time delivery pressure spikes, usually quietly, often by a team that did not understand why the gate was there. The discipline we recommend is to instrument validation from the start.
Three numbers tend to be enough. Average time per review, broken down by output type. Rejection rate, watched for drift. Queue latency, watched against the service level agreement. When these three numbers are visible to the operations leader, validation stops being a hidden tax and becomes a managed cost. When they are invisible, the gate is one delivery scare away from being removed.
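The three numbers can be computed from a plain review log. The sketch below assumes each log entry is a dict with the hypothetical fields `output_type`, `review_seconds`, `queue_seconds` and `rejected`; the field names and the choice to report both mean latency and the share of items breaching the service level agreement are illustrative.

```python
from collections import defaultdict
from statistics import mean


def validation_metrics(reviews: list, sla_seconds: float) -> dict:
    """Summarise review time by output type, rejection rate and queue latency."""
    if not reviews:
        raise ValueError("no reviews to summarise")
    seconds_by_type = defaultdict(list)
    for r in reviews:
        seconds_by_type[r["output_type"]].append(r["review_seconds"])
    return {
        "avg_review_seconds": {t: mean(v) for t, v in seconds_by_type.items()},
        "rejection_rate": sum(1 for r in reviews if r["rejected"]) / len(reviews),
        "avg_queue_seconds": mean(r["queue_seconds"] for r in reviews),
        # Share of items that waited longer than the agreed service level.
        "sla_breach_rate": sum(1 for r in reviews if r["queue_seconds"] > sla_seconds) / len(reviews),
    }
```

Run over a rolling window, the same function gives the trend lines: a rising rejection rate is the drift signal, and a rising breach rate says the queue is outgrowing the reviewers.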
The clients who measure these numbers tend to be the ones who can have a sensible conversation about removing parts of the loop later. The clients who do not measure them tend to make the removal decision based on intuition, and to regret it within a quarter.
Why this is a design problem, not just a process problem.
The instinct in a regulated environment is to write a policy and assume the policy will hold. In our experience it does not. People work with the system in front of them, not with the document on the intranet. If the system makes review easy, review happens. If the system makes review tedious, review degrades into theatre regardless of what the policy says.
When we discuss this with leadership teams the framing that lands is the simplest. The validation gates are part of the product, not a layer on top of it. They have to be designed, instrumented and operated like any other part of the product. The teams that treat them this way tend to keep the speed gain and the quality. The teams that treat them as compliance overhead tend to lose one or the other within a year.
If a deployment of yours is producing faster than it is being reviewed, the conversation worth having is not about adding reviewers. It is about whether the validation interface, the sampling strategy and the cost instrumentation are doing the job they need to do. The Consulting and Tech Factory teams in our Paris, Dubai, Singapore and Bali offices are reachable from the brief form below. We answer within one working day.
Frequently asked questions.
When can you remove the human validation loop?
When the evaluation suite covers the cases that matter, the rejection rate has been stable below a defined threshold for long enough to be trusted, and the cost of an undetected error is acceptable. The decision is earned, not granted.
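Written as an explicit check, the three conditions look roughly like the Python sketch below. The function name, the weekly granularity and the parameters are assumptions; the threshold and the number of stable weeks are for each team to define. Two of the conditions are human judgements and are passed in as booleans, not computed.

```python
def loop_can_be_relaxed(weekly_rejection_rates: list, threshold: float,
                        min_stable_weeks: int,
                        eval_suite_covers_key_cases: bool,
                        error_cost_accepted: bool) -> bool:
    """True only when all three conditions for removing the loop hold."""
    if min_stable_weeks < 1:
        raise ValueError("min_stable_weeks must be at least 1")
    recent = weekly_rejection_rates[-min_stable_weeks:]
    # Stable means enough history exists and every recent week is below threshold.
    stable = len(recent) == min_stable_weeks and all(r < threshold for r in recent)
    return stable and eval_suite_covers_key_cases and error_cost_accepted
```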
How do you avoid validation theatre?
Measure time per review and rejection rate. Pace the queue so reviewers can actually read what they approve. Design the interface so accepting and rejecting both take one keystroke. If a reviewer can clear a hundred items in five minutes, the gate is decorative.
Does validation slow the speed gain that justified the AI deployment?
It slows it less than people assume if the interface is well designed. A reviewer with a good tool can clear many times the volume the original analyst produced. The gain stays large. The gate stays in place.
What roles handle validation?
Usually the operators closest to the workflow. Sometimes a dedicated review team for higher-risk outputs. We avoid putting validation on the engineering team, who lose context fast, or on management, who tend to approve without reading.
How does this map to the EU AI Act's high-risk requirements?
The EU AI Act expects documented human oversight and traceability, including record-keeping, for high-risk systems. The three validation patterns and the cost instrumentation cover most of what an auditor will ask to see, provided the audit trail is exportable on demand.
How do regulators view spot-check sampling?
Sampling is accepted in most regimes provided it is risk-weighted, documented, and statistically defensible. The detail that matters is whether the sampling rate adjusts when drift is detected. A static rate that never reacts to signal tends to attract questions.
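One simple way to make the rate react to signal is sketched below in Python. It is an assumption-laden illustration, not a regulatory standard: the doubling and halving rule, the tolerance and the default rates are all hypothetical choices.

```python
def adjust_sample_rate(current_rate: float, recent_rejection_rate: float,
                       baseline_rejection_rate: float, base_rate: float = 0.05,
                       max_rate: float = 1.0, tolerance: float = 0.02) -> float:
    """Raise the sampling rate when rejections drift up; relax it when they settle."""
    drifting = recent_rejection_rate > baseline_rejection_rate + tolerance
    if drifting:
        # Double the coverage, up to reviewing everything.
        return min(max_rate, current_rate * 2)
    # No drift: halve back towards the documented base rate, never below it.
    return max(base_rate, current_rate / 2)
```

Logging each adjustment alongside the rejection rate that triggered it gives the documented, defensible trail that a static rate cannot.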
