The pilot-to-production gap is one of the most consistent failure patterns in AI work. A team spends three months building something that works beautifully in a controlled environment, presents it to leadership, gets approval to proceed — and then eighteen months later the system still isn't running in production.
The reasons are predictable. Projects stall not because the problems are hard to anticipate, but because pilot success criteria and production success criteria are fundamentally different, and most AI projects don't acknowledge that until it's too late.
What pilots optimise for
Pilots optimise for impressiveness. The demo is the deliverable: clean inputs, controlled data, the happy path. The system shows what it can do when everything goes right. This is a reasonable way to validate a concept, but it produces systems tuned for the demo rather than for the edge cases that make up a non-trivial fraction of real-world volume.
What production requires
Production requires different things entirely: graceful handling of malformed inputs, integration with systems that behave inconsistently, observable failure modes, and a way for the team that operates it to see when it's doing the wrong thing.
None of these are glamorous. None of them show well in a demo. All of them determine whether the system runs for eighteen months or gets quietly abandoned after the first real-world incident.
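To make "graceful handling of malformed inputs" and "observable failure modes" concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen for illustration: the `text` field, the `MAX_CHARS` limit, the reason labels, and the names `parse_record`, `process_batch` and `RunStats` are assumptions, not part of any particular system. The idea is that a bad record gets rejected with a named reason and counted, instead of crashing the batch or silently producing a wrong answer.

```python
import logging
from collections import Counter
from dataclasses import dataclass, field

logger = logging.getLogger("pipeline")

MAX_CHARS = 10_000  # hypothetical limit; a real one depends on the model


@dataclass
class RunStats:
    """Counts outcomes by reason so operators can see what is failing."""
    outcomes: Counter = field(default_factory=Counter)

    def failure_rate(self) -> float:
        total = sum(self.outcomes.values())
        return 0.0 if total == 0 else 1 - self.outcomes["ok"] / total


def parse_record(raw):
    """Validate one record; raise ValueError carrying a short reason label."""
    if not isinstance(raw, dict):
        raise ValueError("not_a_mapping")
    text = raw.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("missing_or_empty_text")
    if len(text) > MAX_CHARS:
        raise ValueError("text_too_long")
    return text.strip()


def process_batch(records, model, stats):
    """Run `model` on each valid record; rejected records yield None."""
    results = []
    for raw in records:
        try:
            text = parse_record(raw)
        except ValueError as exc:
            # Malformed input: record the reason and keep going.
            stats.outcomes[str(exc)] += 1
            logger.warning("rejected record: %s", exc)
            results.append(None)
            continue
        try:
            results.append(model(text))
            stats.outcomes["ok"] += 1
        except Exception:
            # Model failure: counted separately from input failures.
            stats.outcomes["model_error"] += 1
            logger.exception("model failed on a valid record")
            results.append(None)
    return results
```

The demo version of this loop is a single call to `model` per record. The production version is mostly the other lines, and the counts in `stats.outcomes` are what let the operating team notice that, say, empty-text rejections doubled after an upstream change.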
How to build for production from the start
The answer is not to skip the pilot — it's to define production criteria before the pilot starts. What does this system need to handle that the demo won't? Where are the integration points that will cause problems? What does failure look like, and who sees it?
Build for those criteria in the pilot. The demo might be less impressive. The production system will actually run.