Why Model Deployment Is Different from Software Deployment
Software deployment has a clear success criterion: does the application run without errors? Model deployment has an additional dimension: does the model produce correct outputs? A model can deploy successfully at the infrastructure level while producing predictions that are meaningfully worse than its predecessor. Standard deployment tooling does not detect this. Only business metric monitoring does.
This distinction shapes every deployment strategy decision. The validation gates for model deployment are more complex than for software deployment. The rollback triggers are different. The definition of "working correctly" requires business context, not just technical health checks. Teams that treat model deployment as a subset of software deployment will experience production failures that their monitoring did not predict and their runbooks did not address.
A model deployment strategy is not a technical protocol. It is a risk management framework. The right strategy for a model that informs patient triage decisions is different from the right strategy for a model that personalises product recommendations. Match the strategy to the blast radius.
The Four Core Deployment Strategies
These four strategies — blue/green, canary, shadow mode, and feature-flag rollout — cover the deployment scenarios encountered across enterprise AI programmes. In practice, teams layer them: shadow mode builds confidence before canary, canary validates business metrics before full rollout, blue/green enables instant rollback.
Matching Strategy to Risk Profile
The selection criteria for deployment strategy are not primarily technical. They follow from three questions: what is the blast radius if the new model is worse? How quickly can you detect degradation? And what is the cost of delayed rollout?
| Factor | Blue/Green | Canary | Shadow | Feature Flag |
|---|---|---|---|---|
| High stakes (errors costly or irreversible) | Partial | Partial | Yes | Partial |
| Instant rollback required | Yes | No | N/A (no live traffic) | No |
| Business metric validation needed | No | Yes | No (offline only) | Yes |
| Low additional infrastructure cost | Partial (brief 2x) | Yes | No (sustained 2x) | Yes |
| Works without flag infrastructure | Yes | Yes | Yes | No |
| Cohort-level experiment control | No | No | No | Yes |
In practice, the most common enterprise deployment sequence is: shadow mode for pre-production confidence, followed by canary for production validation, with blue/green as the underlying infrastructure to enable rapid rollback if the canary metrics go negative. Feature flags are layered on for teams that need cohort-level precision beyond what standard canary provides.
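The selection logic in the table above can be sketched as a small decision helper. This is an illustrative mapping, not a standard API: the factor names, strategy labels, and ordering of checks are assumptions made for the sketch.

```python
def select_strategy(high_stakes: bool,
                    needs_instant_rollback: bool,
                    needs_cohort_control: bool,
                    has_flag_infra: bool) -> str:
    """Pick a deployment strategy from the risk factors in the table above."""
    if needs_cohort_control and has_flag_infra:
        # Only feature flags give cohort-level experiment control.
        return "feature-flag"
    if high_stakes:
        # Shadow first: full validation with zero live blast radius,
        # then canary for in-production business metric validation.
        return "shadow-then-canary"
    if needs_instant_rollback:
        return "blue-green"
    return "canary"
```

For example, a high-stakes model without flag infrastructure resolves to `"shadow-then-canary"`, matching the layered sequence described above.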
The Deployment Gates: What to Check Before Each Stage
Every deployment stage transition requires explicit gates. The temptation to skip gates when the model "looks good" is highest precisely when it should be resisted. The model that skipped the evaluation gate is the one that causes the 2am incident.
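An explicit gate can be as simple as a function that returns a named result per check, so a skipped gate is visible rather than silent. A minimal sketch, assuming AUC as the offline evaluation metric and a p99 latency SLO; the function name, thresholds, and metric choices are illustrative:

```python
def evaluate_gates(challenger_auc: float,
                   champion_auc: float,
                   p99_latency_ms: float,
                   latency_slo_ms: float = 100.0,
                   max_auc_regression: float = 0.005) -> tuple[dict, bool]:
    """Run the promotion gates and return (per-gate results, overall pass)."""
    gates = {
        # Offline evaluation: challenger must not regress beyond tolerance.
        "offline_eval": challenger_auc >= champion_auc - max_auc_regression,
        # Performance benchmark: p99 latency must stay within the SLO.
        "latency_slo": p99_latency_ms <= latency_slo_ms,
    }
    return gates, all(gates.values())
```

Because the per-gate results are returned alongside the overall verdict, a failed promotion produces a record of exactly which gate blocked it.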
The Rollback Plan: Practice Before You Need It
Every model deployment plan must include a rollback plan that has been tested in staging. A rollback plan that exists only in a document and has never been executed is a false comfort. The sequence below should be practised quarterly as part of the operational readiness programme for every production model.
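A rehearsable rollback is easier to drill when it is a single operation. In this hedged sketch, a plain dict stands in for whatever serving-alias mechanism the platform provides (a model registry alias, a load balancer target, a config entry); the structure and key names are assumptions:

```python
def rollback(registry: dict, alias: str = "production") -> str:
    """Point the serving alias back at the previous model version and record it."""
    history = registry["history"]
    if len(history) < 2:
        raise RuntimeError("no previous version to roll back to")
    failed = history.pop()              # remove the current (bad) version
    registry[alias] = history[-1]       # alias now serves the predecessor
    registry.setdefault("rollbacks", []).append(failed)  # audit trail
    return registry[alias]
```

Recording the rolled-back version gives the retrospective process, and any later audit, a machine-readable rollback history.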
Automating Deployment Decisions
Manual deployment approval gates are the right starting point. As teams build confidence in their metrics and thresholds, progressive automation is possible and eventually necessary to support a portfolio of 20 or more production models.
The components that can be automated safely: data validation before training, offline evaluation against the champion model, performance benchmarking against the latency SLA, and canary stage progression if all metrics are within defined bounds. The components that should retain human approval even in mature programmes: the final gate to 100% traffic for models with high blast radius, any deployment that is outside normal parameters, and any deployment following a recent rollback.
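The split between automated progression and retained human approval can be sketched as a stage controller. The stage percentages, the approval flag, and the return values are assumptions for illustration; the one load-bearing rule, a manual final gate for high-blast-radius models, comes from the text above:

```python
STAGES = [5, 25, 50, 100]  # illustrative canary traffic percentages

def next_action(current_pct: int,
                metrics_ok: bool,
                high_blast_radius: bool,
                human_approved: bool = False) -> tuple[str, int]:
    """Decide whether a canary progresses, holds for approval, or rolls back."""
    if not metrics_ok:
        return ("rollback", 0)
    idx = STAGES.index(current_pct)
    if idx == len(STAGES) - 1:
        return ("done", 100)
    nxt = STAGES[idx + 1]
    # The final gate to 100% traffic stays manual for high-blast-radius models.
    if nxt == 100 and high_blast_radius and not human_approved:
        return ("await_approval", current_pct)
    return ("progress", nxt)
```

A low-risk model flows through every stage unattended; a high-risk model pauses at 50% until a human sets the approval flag.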
The automation framework for deployment decisions integrates with the ML CI/CD pipeline described in the production platform guide. The pipeline makes decisions based on metric thresholds; humans set the thresholds and review exceptions. This separation of concerns scales while maintaining accountability.
Deployment Strategy and Governance
Deployment strategy decisions are increasingly subject to governance requirements. AI regulations in financial services, healthcare, and high-risk AI applications require documented processes for how models are validated before deployment and how degradation is detected and addressed after deployment.
A canary deployment process with defined metric gates and a documented rollback procedure is not just good engineering. It is the evidence that regulators ask for when auditing an AI programme. The deployment runbook, the alert configuration, and the rollback history become compliance artefacts. Design them to serve both purposes from the beginning.
For programmes building governance frameworks alongside technical infrastructure, the AI governance advisory covers how deployment documentation integrates with broader model risk management requirements. The AI data governance guide provides the upstream data controls that complement deployment-level governance.
Summary: The Minimum Viable Deployment Process
For teams deploying their first production models, the minimum viable process is: offline evaluation against a defined metric threshold before promotion, canary deployment to 10% of traffic, 24-hour business metric comparison, and a rollback command that has been tested. Everything beyond this is refinement.
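The 24-hour business metric comparison can be done with nothing beyond the standard library. This sketch treats conversion rate as the business metric, an assumption for the example, and runs a one-sided two-proportion z-test asking whether the canary cohort is significantly worse than control:

```python
import math

def canary_regressed(ctrl_conv: int, ctrl_n: int,
                     canary_conv: int, canary_n: int,
                     alpha: float = 0.05) -> bool:
    """True if the canary conversion rate is significantly worse than control."""
    p1, p2 = ctrl_conv / ctrl_n, canary_conv / canary_n
    pooled = (ctrl_conv + canary_conv) / (ctrl_n + canary_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ctrl_n + 1 / canary_n))
    z = (p2 - p1) / se
    # One-sided p-value for "canary worse", via the normal CDF.
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return p_value < alpha
```

A one-sided test fits the rollback question: only a regression should trigger the rollback command, not an improvement. Sample sizes after 24 hours at 10% traffic should be checked for adequate power before acting on a non-significant result.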
The refinements that matter most at scale are shadow mode for high-stakes models, automated canary progression based on metric gates, and a structured retrospective process for every rollback. Teams that invest in these three refinements reduce their production incident rate significantly faster than teams that invest in more sophisticated tooling without disciplined process.
The AI implementation advisory includes deployment process design as a standard workstream, producing runbooks and metric configurations specific to each model's risk profile. For teams assessing their current deployment maturity, the AI Readiness Assessment benchmarks your current deployment practices against what mature enterprise AI programmes operate.