Why Model Deployment Is Different from Software Deployment

Software deployment has a clear success criterion: does the application run without errors? Model deployment has an additional dimension: does the model produce correct outputs? A model can deploy successfully at the infrastructure level while producing predictions that are meaningfully worse than its predecessor. Standard deployment tooling does not detect this. Only business metric monitoring does.

This distinction shapes every deployment strategy decision. The validation gates for model deployment are more complex than for software deployment. The rollback triggers are different. The definition of "working correctly" requires business context, not just technical health checks. Teams that treat model deployment as a subset of software deployment will experience production failures that their monitoring did not predict and their runbooks did not address.

Core Principle

A model deployment strategy is not a technical protocol. It is a risk management framework. The right strategy for a model that informs patient triage decisions is different from the right strategy for a model that personalises product recommendations. Match the strategy to the blast radius.

The Four Core Deployment Strategies

These four strategies cover the deployment scenarios encountered across enterprise AI programmes. In practice, teams layer them: shadow mode builds confidence before canary, canary validates business metrics before full rollout, and blue/green enables instant rollback.

Strategy 1
Blue/Green Deployment
Instant switching, maximum rollback speed
How
Two identical environments run simultaneously. Traffic switches atomically from blue to green. Blue remains warm for immediate rollback.
Best for
Models where gradual validation is impractical and instant rollback capability is required. High-SLA production systems.
Cost
2x compute during transition period (typically 24 to 72 hours before blue is terminated)
Rollback
Instantaneous. Single traffic routing change.
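The mechanics can be sketched as a router that holds two warm environments and flips a single pointer. This is a minimal illustration, not a specific platform's API; `ModelRouter` and the endpoint callables are hypothetical stand-ins for real serving infrastructure.

```python
# Minimal sketch of blue/green switching: two live environments, an
# atomic pointer flip for cutover, and an instant rollback path.
# ModelRouter and the endpoint callables are illustrative placeholders.

class ModelRouter:
    def __init__(self, blue_endpoint, green_endpoint):
        self.endpoints = {"blue": blue_endpoint, "green": green_endpoint}
        self.active = "blue"    # all traffic goes here
        self.previous = None    # kept warm for rollback

    def switch_to(self, colour):
        """Atomically route all traffic to the named environment."""
        assert colour in self.endpoints
        self.previous, self.active = self.active, colour

    def rollback(self):
        """Single routing change back to the previous environment."""
        if self.previous is None:
            raise RuntimeError("no previous environment to roll back to")
        self.active, self.previous = self.previous, self.active

    def serve(self, request):
        return self.endpoints[self.active](request)

# Blue serves model v1; green serves candidate v2.
router = ModelRouter(lambda r: "v1", lambda r: "v2")
router.switch_to("green")   # cutover: one pointer flip
print(router.serve({}))     # v2
router.rollback()           # instant rollback: one pointer flip back
print(router.serve({}))     # v1
```

The key property is that both cutover and rollback are a single state change, which is why blue/green delivers the instantaneous rollback described above, at the price of running two environments.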
Strategy 2
Canary Deployment
Gradual rollout, business metric validation
How
New model version receives a small traffic percentage (5 to 10%) initially. Percentage increases in stages as business metrics are validated.
Best for
High-volume models where business metric comparison at scale is essential before full promotion. Most production model updates.
Timeline
Typical progression: 5% for 24h, 25% for 24h, 50% for 24h, then 100% if metrics clear at every stage
Rollback
Route 100% of traffic back to champion. Fast but not instantaneous.
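The traffic split is usually implemented with deterministic bucketing rather than per-request randomness, so a given request id stays on the same variant as the percentage widens. A sketch under that assumption (the function names are illustrative, not a specific router's API):

```python
import hashlib

# Sketch of a deterministic canary split: each request id hashes to a
# stable bucket in [0, 100), and buckets below the canary percentage
# go to the challenger. Raising the percentage promotes more traffic
# without reshuffling requests already on the canary.

def bucket(request_id: str) -> int:
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest, 16) % 100

def route(request_id: str, canary_pct: int) -> str:
    return "canary" if bucket(request_id) < canary_pct else "champion"

# A 5% stage sends roughly 5 in 100 ids to the canary; those same ids
# remain on the canary when the stage widens to 25%.
ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(i, 5) == "canary" for i in ids) / len(ids)
print(f"canary share at 5%: {share:.1%}")
assert all(route(i, 25) == "canary" for i in ids if route(i, 5) == "canary")
```

Rolling back is then a configuration change: set the canary percentage to zero and all traffic returns to the champion, which is why canary rollback is fast but not a single atomic flip.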
Strategy 3
Shadow Mode
Zero exposure, maximum safety
How
New model receives the same input as the champion but its outputs are logged, not served. Comparison happens offline with no user exposure.
Best for
High-stakes models where even 5% exposure to a degraded version has unacceptable consequences. Credit, medical, safety-critical applications.
Cost
Double inference cost during shadow period plus logging and analysis infrastructure
Limitation
Cannot validate real business impact until actual exposure. Some failure modes only manifest at serving time.
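The serving path for shadow mode can be sketched as follows; the model callables and log store are illustrative placeholders, and in a real system the challenger call would typically run asynchronously so it cannot add latency to the live response.

```python
# Sketch of shadow serving: the champion's output is returned to the
# caller; the challenger runs on the same input, but its prediction is
# only logged for offline comparison. Model callables and shadow_log
# are illustrative placeholders.

shadow_log = []

def serve_with_shadow(request, champion, challenger):
    served = champion(request)
    try:
        # Never let a shadow failure affect the live response.
        shadowed = challenger(request)
        shadow_log.append(
            {"input": request, "champion": served, "shadow": shadowed}
        )
    except Exception as exc:
        shadow_log.append({"input": request, "error": str(exc)})
    return served  # the user only ever sees the champion's output

response = serve_with_shadow({"x": 1}, lambda r: 0.91, lambda r: 0.87)
print(response)          # 0.91 - the champion's prediction
print(len(shadow_log))   # 1 - one logged pair for offline analysis
```

The double inference cost noted above is visible here: every request invokes both models, and the logged pairs feed the divergence analysis performed before any real traffic exposure.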
Strategy 4
Feature Flag Deployment
Cohort control, experiment integration
How
Model version selection is governed by feature flags. Specific user cohorts receive the new model. Integrates with experiment tracking for attribution.
Best for
A/B tests that require precise cohort definition, staged rollout to specific customer segments, or tight integration with business experiments.
Requires
Existing feature flag infrastructure and experiment tracking integration. More complex to implement than canary for simple updates.
Advantage
Business metric attribution per variant is cleaner than canary. Cohort persistence maintains experimental validity.
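One way to sketch flag-driven model selection: explicit cohort assignments take precedence, and everyone else is bucketed by a stable, experiment-salted hash so a user stays in the same variant for the life of the experiment. The flag structure, experiment name, and cohort ids below are hypothetical.

```python
import hashlib

# Sketch of feature-flag model selection: explicit segment targeting
# plus stable hash bucketing for the remainder. The flag contents are
# hypothetical, not a specific flag platform's schema.

FLAG = {
    "experiment": "recs-v2-rollout",
    "cohorts": {"beta-customers": "challenger"},  # explicit segment targeting
    "challenger_pct": 10,                         # staged rollout for the rest
}

def variant_for(user_id: str, segment: str) -> str:
    if segment in FLAG["cohorts"]:
        return FLAG["cohorts"][segment]
    # Salt with the experiment name so different experiments
    # bucket users independently.
    key = f'{FLAG["experiment"]}:{user_id}'.encode()
    b = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "challenger" if b < FLAG["challenger_pct"] else "champion"

print(variant_for("u-123", "beta-customers"))  # challenger: explicit cohort
# Hash bucketing is deterministic, so cohort membership persists.
assert variant_for("u-456", "general") == variant_for("u-456", "general")
```

Because each request's variant is known at selection time, logging the variant alongside the business event gives the clean per-variant attribution described above.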

Matching Strategy to Risk Profile

The selection criteria for deployment strategy are not primarily technical. They follow from three questions: What is the blast radius if the new model is worse? How quickly can you detect degradation? What is the cost of delayed rollout?

Factor                                       Blue/Green     Canary    Shadow          Feature Flag
High stakes (errors costly or irreversible)  Partial        Partial   Yes             Partial
Instant rollback required                    Yes            No        N/A (not served)  No
Business metric validation needed            No             Yes       No (offline)    Yes
Low additional infrastructure cost           No (brief 2x)  Yes       No (2x)         Yes
Works without flag infrastructure            Yes            Yes       Yes             No
Cohort-level experiment control              No             No        No              Yes

In practice, the most common enterprise deployment sequence is: shadow mode for pre-production confidence, followed by canary for production validation, with blue/green as the underlying infrastructure to enable rapid rollback if the canary metrics go negative. Feature flags are layered on for teams that need cohort-level precision beyond what standard canary provides.

The Deployment Gates: What to Check Before Each Stage

Every deployment stage transition requires explicit gates. The temptation to skip gates when the model "looks good" is highest precisely when it should be resisted. The model that skipped the evaluation gate is the one that causes the 2am incident.

1
Pre-Promotion: Offline Evaluation
Compare challenger against champion on holdout test set with slice-level analysis. Run performance benchmark under simulated production load. Check bias and fairness metrics against defined thresholds.
Gate: challenger must meet quality threshold AND pass bias checks AND meet latency SLA at p99
2
Shadow Period: Divergence Analysis
Run shadow mode for a statistically significant request sample (minimum 10,000 requests or 48 hours, whichever comes later). Compare shadow predictions against champion predictions and any available ground truth.
Gate: prediction divergence rate below defined threshold; no anomalous output distribution
3
Canary: Business Metric Gate
At each canary stage (5%, 25%, 50%), compare business metrics between canary and champion cohorts. Metrics must include at least one that directly measures business impact, not just model accuracy.
Gate: primary business metric within 2% of champion; error rate within 0.5%; latency p99 within SLA
4
Full Rollout: 72-Hour Watch Period
After 100% traffic migration, maintain heightened monitoring for 72 hours. Keep rollback infrastructure warm until the watch period completes. Do not terminate the previous model version until the watch period passes without incident.
Gate: 72 hours with no degradation signal; monitoring alerts clear; on-call team briefed
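The canary gate in stage 3 is concrete enough to express as code. A minimal sketch, assuming a conversion-style primary business metric (the metric names and sample values are illustrative):

```python
# Sketch of the stage-3 canary gate: the challenger advances only if
# its primary business metric is within 2% of the champion, its error
# rate within 0.5 percentage points, and its p99 latency inside the
# SLA. Metric names and values are illustrative.

def canary_gate(champion, canary, latency_sla_ms):
    checks = {
        "business_metric": abs(canary["conversion"] - champion["conversion"])
                           <= 0.02 * champion["conversion"],
        "error_rate": canary["error_rate"] - champion["error_rate"] <= 0.005,
        "latency_p99": canary["latency_p99_ms"] <= latency_sla_ms,
    }
    return all(checks.values()), checks

ok, detail = canary_gate(
    champion={"conversion": 0.120, "error_rate": 0.010, "latency_p99_ms": 180},
    canary={"conversion": 0.119, "error_rate": 0.012, "latency_p99_ms": 195},
    latency_sla_ms=200,
)
print(ok, detail)  # True - all three checks pass for these values
```

Returning the per-check detail alongside the boolean matters operationally: when a gate fails, the log should say which threshold was breached, not just that promotion was blocked.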

The Rollback Plan: Practice Before You Need It

Every model deployment plan must include a rollback plan that has been tested in staging. A rollback plan that exists only in a document and has never been executed is a false comfort. The sequence below should be practiced quarterly as part of the operational readiness programme for every production model.

01
Detect: Alert Triggers
Target: Within 5 minutes of degradation onset
Automated alerts on prediction error rate, business metric deviation, and latency SLA breach. Alerts go to the on-call engineer, not a shared inbox. PagerDuty or equivalent with escalation rules.
02
Triage: Is This a Rollback Situation?
Target: Decision within 10 minutes of alert
On-call engineer checks three metrics: error rate trend, business metric trend, and latency trend. Defined thresholds for each. If any threshold is breached, rollback is initiated without further approval. No committee decision at 2am.
03
Rollback: Execute the Runbook
Target: Traffic restored to previous version within 5 minutes
A single command or UI action that routes traffic back to the previous model version. Previous version must still be running (blue/green) or must be in the model registry and deployable in under 5 minutes (canary). Document the exact command.
04
Confirm: Verify Restoration
Target: Confirmation within 15 minutes of rollback initiation
Verify that error rates, business metrics, and latency have returned to pre-deployment levels. Log the rollback event with timestamp, trigger, and impact estimate. Notify stakeholders via defined communication channel.
05
Retrospective: What Went Wrong
Target: Retrospective within 48 hours
Structured blameless retrospective. Root cause analysis. Update the deployment checklist to prevent recurrence. Share findings with the platform team. Track open items to closure before the next deployment attempt.
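The triage rule in step 02 is deliberately mechanical, which makes it straightforward to encode. A sketch with illustrative thresholds (the baseline/current metric names are assumptions, not a specific monitoring system's schema):

```python
# Sketch of the step-02 triage rule: three metric trends, a defined
# threshold for each, and an automatic rollback decision if any single
# threshold is breached. Threshold values are illustrative.

THRESHOLDS = {
    "error_rate_increase": 0.005,   # absolute increase vs pre-deploy baseline
    "business_metric_drop": 0.02,   # relative drop vs pre-deploy baseline
    "latency_p99_ms": 200,          # hard SLA ceiling
}

def should_roll_back(baseline, current):
    breaches = []
    if current["error_rate"] - baseline["error_rate"] > THRESHOLDS["error_rate_increase"]:
        breaches.append("error_rate")
    if (baseline["conversion"] - current["conversion"]) / baseline["conversion"] \
            > THRESHOLDS["business_metric_drop"]:
        breaches.append("business_metric")
    if current["latency_p99_ms"] > THRESHOLDS["latency_p99_ms"]:
        breaches.append("latency")
    # Any single breach triggers rollback - no committee decision at 2am.
    return bool(breaches), breaches

decision, why = should_roll_back(
    baseline={"error_rate": 0.010, "conversion": 0.120, "latency_p99_ms": 150},
    current={"error_rate": 0.013, "conversion": 0.110, "latency_p99_ms": 160},
)
print(decision, why)  # True ['business_metric'] - conversion fell over 8%
```

Codifying the rule this way is what allows the on-call engineer to act inside the 10-minute target: the decision is read off the breach list, not debated.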

Automating Deployment Decisions

Manual deployment approval gates are the right starting point. As teams build confidence in their metrics and thresholds, progressive automation is possible and eventually necessary to support a portfolio of 20 or more production models.

The components that can be automated safely: data validation before training, offline evaluation against the champion model, performance benchmarking against the latency SLA, and canary stage progression if all metrics are within defined bounds. The components that should retain human approval even in mature programmes: the final gate to 100% traffic for models with high blast radius, any deployment that is outside normal parameters, and any deployment following a recent rollback.

The automation framework for deployment decisions integrates with the ML CI/CD pipeline described in the production platform guide. The pipeline makes decisions based on metric thresholds; humans set the thresholds and review exceptions. This separation of concerns scales while maintaining accountability.
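The division of labour described above can be sketched as a stage driver: the pipeline walks the canary ladder and promotes or rolls back on metric gates, while the final gate to 100% is reserved for a human. `gate_passes` and `set_traffic` are hypothetical stand-ins for the real gate evaluation and the serving platform's traffic API.

```python
# Sketch of automated canary progression: the pipeline advances through
# traffic stages on passing metric gates, rolls back on any failure,
# and optionally holds the final gate for human approval. gate_passes
# and set_traffic are illustrative stand-ins for real platform APIs.

STAGES = [5, 25, 50, 100]

def progress_canary(gate_passes, set_traffic, require_human_at_100=True):
    for pct in STAGES:
        if pct == 100 and require_human_at_100:
            return "awaiting-human-approval"
        set_traffic(pct)
        if not gate_passes(pct):
            set_traffic(0)  # route everything back to the champion
            return f"rolled-back-at-{pct}"
    return "promoted"

history = []
outcome = progress_canary(
    gate_passes=lambda pct: pct <= 25,   # simulate a gate failure at 50%
    set_traffic=lambda pct: history.append(pct),
)
print(outcome, history)  # rolled-back-at-50 [5, 25, 50, 0]
```

Humans own the threshold values and the `require_human_at_100` policy; the pipeline owns the mechanical execution. That is the separation of concerns that scales while keeping accountability with the team.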

Deployment Strategy and Governance

Deployment strategy decisions are increasingly subject to governance requirements. AI regulations in financial services, healthcare, and high-risk AI applications require documented processes for how models are validated before deployment and how degradation is detected and addressed after deployment.

A canary deployment process with defined metric gates and a documented rollback procedure is not just good engineering. It is the evidence that regulators ask for when auditing an AI programme. The deployment runbook, the alert configuration, and the rollback history become compliance artefacts. Design them to serve both purposes from the beginning.

For programmes building governance frameworks alongside technical infrastructure, the AI governance advisory covers how deployment documentation integrates with broader model risk management requirements. The AI data governance guide provides the upstream data controls that complement deployment-level governance.

Summary: The Minimum Viable Deployment Process

For teams deploying their first production models, the minimum viable process is: offline evaluation against a defined metric threshold before promotion, canary deployment to 10% of traffic, 24-hour business metric comparison, and a rollback command that has been tested. Everything beyond this is refinement.

The refinements that matter most at scale are shadow mode for high-stakes models, automated canary progression based on metric gates, and a structured retrospective process for every rollback. Teams that invest in these three refinements reduce their production incident rate significantly faster than teams that invest in more sophisticated tooling without disciplined process.

The AI implementation advisory includes deployment process design as a standard workstream, producing runbooks and metric configurations specific to each model's risk profile. For teams assessing their current deployment maturity, the AI Readiness Assessment benchmarks your current deployment practices against what mature enterprise AI programmes operate.