Pattern 9 of 10 · Safe deployment for probabilistic systems
CI/CD for GenAI
Traditional CI/CD assumes deterministic outputs — same input, same output, assertEquals works. GenAI doesn't. You need a pipeline that scores probabilistic outputs against a golden dataset, validates guardrails still behave, and deploys via canary so degradation can be rolled back before most users see it.
Architecture diagram
[Diagram: CI/CD for GenAI · left-to-right deployment pipeline with GenAI-specific gates]
How data flows
A commit (new prompt, new Lambda logic, new guardrail config, new model version) triggers CodePipeline. CodeBuild packages the change. Then the pipeline hits the GenAI-specific test stage — prompt regression against a golden dataset, guardrail behavior tests, prompt injection tests, quality thresholds. If any gate fails, the pipeline blocks and notifies the developer.
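A minimal sketch of what that test-stage script could look like when CodeBuild runs it, assuming a JSONL golden dataset and an embedding-similarity scorer. The dataset file name, model IDs, and the 0.85 threshold are illustrative, not part of the pattern:

```python
"""Prompt-regression gate: run every golden-dataset case through the candidate
prompt/model and fail the build if similarity to the reference answer drops.

Illustrative sketch -- the dataset file, model IDs, and the 0.85 threshold are
assumptions, not part of the pattern."""
import json
import math
import sys

import boto3

bedrock = boto3.client("bedrock-runtime")

CANDIDATE_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"  # assumed model under test
EMBED_MODEL = "amazon.titan-embed-text-v2:0"                # assumed embedding model
THRESHOLD = 0.85                                            # assumed quality gate

def generate(prompt: str) -> str:
    resp = bedrock.converse(
        modelId=CANDIDATE_MODEL,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def main() -> None:
    failures = []
    with open("golden_dataset.jsonl") as f:  # assumed file: {"prompt": ..., "reference": ...} per line
        for line in f:
            case = json.loads(line)
            score = cosine(embed(generate(case["prompt"])), embed(case["reference"]))
            if score < THRESHOLD:
                failures.append({"prompt": case["prompt"][:60], "score": round(score, 3)})
    if failures:
        print(f"{len(failures)} golden cases below {THRESHOLD}: {failures}")
        sys.exit(1)  # non-zero exit fails the CodeBuild stage and blocks the pipeline
    print("Prompt regression gate passed.")

if __name__ == "__main__":
    main()
```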
If tests pass, the change deploys as a canary — a small percentage of production traffic. Metrics are watched for a period (latency, error rate, quality scores, cost). If the canary holds up against the baseline, full rollout proceeds. If metrics degrade, automatic rollback.
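One way the canary watch could look, sketched with boto3 against a Lambda whose traffic CodeDeploy is shifting between two versions. The function name, version numbers, deployment ID, topic ARN, and the two-point error-rate tolerance are all assumptions:

```python
"""Canary watcher: compare the error rate of the newly shifted Lambda version
against the previous one and roll the CodeDeploy deployment back on degradation.

Illustrative sketch -- function name, version numbers, deployment ID, topic ARN,
and the 2-point tolerance are assumptions."""
from datetime import datetime, timedelta, timezone

import boto3

cw = boto3.client("cloudwatch")
codedeploy = boto3.client("codedeploy")
sns = boto3.client("sns")

FUNCTION = "genai-chat-handler"                   # assumed Lambda function name
CANARY_VERSION, BASELINE_VERSION = "42", "41"     # assumed versions behind the alias
DEPLOYMENT_ID = "d-EXAMPLE1234"                   # assumed in-flight CodeDeploy deployment
ALERT_TOPIC = "arn:aws:sns:us-east-1:111122223333:genai-deploy-alerts"  # assumed

def metric_sum(metric: str, version: str) -> float:
    """Sum a Lambda metric for one executed version over the last 5 minutes."""
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[
            {"Name": "FunctionName", "Value": FUNCTION},
            {"Name": "ExecutedVersion", "Value": version},
        ],
        StartTime=end - timedelta(minutes=5),
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])

def error_rate(version: str) -> float:
    invocations = metric_sum("Invocations", version) or 1.0
    return metric_sum("Errors", version) / invocations

canary, baseline = error_rate(CANARY_VERSION), error_rate(BASELINE_VERSION)

if canary > baseline + 0.02:  # assumed tolerance: 2 percentage points worse than baseline
    # Stop the canary and let CodeDeploy route 100% of traffic back to the old version.
    codedeploy.stop_deployment(deploymentId=DEPLOYMENT_ID, autoRollbackEnabled=True)
    sns.publish(
        TopicArn=ALERT_TOPIC,
        Message=f"Canary rolled back: error rate {canary:.2%} vs baseline {baseline:.2%}",
    )
else:
    print("Canary healthy; traffic shift continues.")
```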
What's actually different from traditional CI/CD
Traditional CI/CD tests are deterministic — same input, same output, pass/fail. GenAI outputs are probabilistic. You can't assert equality on FM outputs. Instead you score them (LLM-as-a-Judge, similarity to reference, classifier on tone) and gate on thresholds. That's the whole trick.
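For example, an LLM-as-a-Judge gate might look roughly like this, using the Bedrock Converse API; the judge model, rubric wording, and the score-of-4 passing threshold are assumptions:

```python
"""Score a candidate answer with an LLM-as-a-Judge instead of asserting equality.

Illustrative sketch -- the judge model, rubric, and passing threshold are assumptions."""
import re

import boto3

bedrock = boto3.client("bedrock-runtime")
JUDGE_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed judge model

RUBRIC = (
    "You are grading a customer-support answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Rate the candidate 1-5 for factual agreement with the reference. "
    "Reply with the number only."
)

def judge_score(question: str, reference: str, candidate: str) -> int:
    resp = bedrock.converse(
        modelId=JUDGE_MODEL,
        messages=[{
            "role": "user",
            "content": [{"text": RUBRIC.format(
                question=question, reference=reference, candidate=candidate)}],
        }],
    )
    verdict = resp["output"]["message"]["content"][0]["text"]
    match = re.search(r"[1-5]", verdict)
    return int(match.group()) if match else 1  # unparseable verdict counts as a failure

# Gate on a threshold, never on string equality.
assert judge_score(
    "How do I reset my password?",
    "Use the 'Forgot password' link on the sign-in page.",
    "Click 'Forgot password' on the login screen and follow the emailed link.",
) >= 4, "Quality gate failed: judge score below threshold"
```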
AWS services used
AWS CodePipeline: Orchestrates the whole deployment flow. Each stage is a manual or automated gate.
AWS CodeBuild: Packages the Lambda, Agent definition, or prompt templates. Also runs the GenAI test scripts.
Bedrock Model Evaluations: Run as part of the test stage. Score the changed pipeline against a golden dataset and compare to baseline.
Amazon CloudWatch: Metrics for the canary stage (latency, error rate, token usage, custom quality scores).
AWS AppConfig: Feature flags for the canary ramp. Gradually shift traffic percentage without redeploying (see the sketch after this list).
AWS CodeDeploy: Handles the traffic-shifting mechanics for Lambda canary deployments.
Amazon SNS: Alerts on pipeline failures, canary degradation, rollback events.
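The AppConfig sketch referenced above: the runtime side of a flag-driven ramp, reading the canary percentage on each request so the ramp can change without a redeploy. The application, environment, profile, and flag names are invented for illustration:

```python
"""Read the canary traffic percentage from AWS AppConfig so the ramp can change
without a redeploy.

Illustrative sketch -- the application, environment, profile, and flag names
are invented for this example."""
import json
import random

import boto3

appconfig = boto3.client("appconfigdata")

# One session per runtime environment; each poll returns a fresh token.
session = appconfig.start_configuration_session(
    ApplicationIdentifier="genai-app",               # assumed identifiers
    EnvironmentIdentifier="prod",
    ConfigurationProfileIdentifier="canary-flags",
)
_token = session["InitialConfigurationToken"]
_cached_pct = 0

def canary_percentage() -> int:
    """Latest canary percentage; AppConfig returns an empty body when unchanged."""
    global _token, _cached_pct
    resp = appconfig.get_latest_configuration(ConfigurationToken=_token)
    _token = resp["NextPollConfigurationToken"]
    payload = resp["Configuration"].read()
    if payload:
        _cached_pct = json.loads(payload)["canary_percentage"]  # assumed flag name
    return _cached_pct

def route_request() -> str:
    """Send roughly N% of requests to the candidate prompt/model version."""
    return "candidate" if random.uniform(0, 100) < canary_percentage() else "baseline"
```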
When to use this pattern
✓ Use CI/CD for GenAI when…
Prompts / agents are production artifacts: Anything in front of users needs a deploy pipeline. Prompts are code — they need tests, review, rollback.
You iterate on prompts frequently: Every prompt tweak should go through the pipeline. Catching regressions before prod saves customer pain.
Multiple people / teams ship changes: Review gates + automated tests prevent "Alice's prompt change broke Bob's use case."
Compliance / audit requires deployment traceability: Regulated industries need "what went out, who approved, when." Pipelines produce this naturally.
Model versions change: Pin model versions in config; bump them through the pipeline so you test before shipping. Never ship an untested model upgrade (a minimal sketch follows this list).
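The model-pinning sketch promised above. The pattern only says "config"; SSM Parameter Store is one reasonable place for the pin, and the parameter name here is an assumption:

```python
"""Resolve the pinned model version from configuration instead of hardcoding it,
so a model upgrade is a reviewed, tested config change.

Illustrative sketch -- the parameter name and example value are assumptions."""
import boto3

ssm = boto3.client("ssm")

def pinned_model_id() -> str:
    # The pipeline updates this parameter; the application only ever reads it.
    resp = ssm.get_parameter(Name="/genai-app/prod/model-id")  # assumed parameter name
    return resp["Parameter"]["Value"]  # e.g. "anthropic.claude-3-haiku-20240307-v1:0"
```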
✗ Skip GenAI CI/CD when…
Prototype or PoC stage: Building CI/CD for a 2-week experiment is premature. Build the thing first; wrap it in a pipeline when it's real.
No test data exists: Golden datasets are the foundation of GenAI testing. If you don't have one, build it before the pipeline. A pipeline without tests is theater.
No quality scoring capability: If you can't programmatically score FM output, you can't gate on quality. Do a manual eval process first, automate later.
Single developer, no stakes: Internal tool used by one team with low-risk outputs. A light "run tests locally" process may be sufficient.
All changes require human eval: Some creative use cases genuinely need human review per deploy. CI/CD can still gate, but expect longer cycles.
Exam angle
Pattern-match shortcuts
When a stem asks "what testing step is unique to GenAI pipelines?" — the answer is prompt regression against a golden dataset or guardrail validation. Unit tests, integration tests, IAM scans all exist in any pipeline; these two don't.
The "assertEquals(output, 'expected')" trap
Traditional tests fail on GenAI. FM outputs are probabilistic — the same input can produce slightly different outputs. Test options that say "assert the FM returns exactly X" are wrong. Correct options use scoring against a reference, LLM-as-a-Judge, or threshold-based quality metrics.
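To make the contrast concrete, a toy example; difflib's ratio and the 0.6 threshold stand in for whatever scorer the real pipeline uses (embedding similarity, LLM-as-a-Judge, or a classifier):

```python
"""Contrast the exact-match trap with a threshold-based gate.

Illustrative sketch -- the scorer (difflib ratio) and the 0.6 threshold stand in
for whichever scoring method the pipeline actually uses."""
from difflib import SequenceMatcher

reference = "Your refund will arrive in 5-7 business days."
model_output = "Refunds typically arrive within 5 to 7 business days."

# Wrong: exact equality on a probabilistic output -- this will flake between runs.
# assert model_output == reference

# Right: score similarity to the reference and gate on a threshold.
score = SequenceMatcher(None, model_output.lower(), reference.lower()).ratio()
assert score >= 0.6, f"Quality gate failed: similarity {score:.2f}"
```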
Canary is the right mental model
Blue/green for GenAI is hard — you can't validate "the new version is equivalent" before swap. Canary shifts traffic gradually while watching metrics; if something's off, you pull back before most users are affected. Much safer for probabilistic systems.
Keywords that point here
CI/CD · prompt regression · golden dataset · quality gate · canary deployment · automated rollback · guardrail validation · Bedrock Model Evaluations