Architecture diagram

— Model cascading · try cheap first, escalate if quality falls short —
[Diagram: 👤 request → CLASSIFY Lambda (complexity router) → TIER 1 · CHEAP (Haiku / Nova Micro — fast, low cost per call) → quality gate: meets threshold? YES ✓ → return response to user; NO → escalate to TIER 2 · POWERFUL (Sonnet / Opus — higher quality, higher cost) → final response (either tier). The classifier may route obviously complex queries straight to Tier 2.]

How data flows

The request first hits a classifier (a Lambda function or a small FM that decides complexity). For most queries, it routes to Tier 1: a cheap, fast model (Haiku, Nova Micro, Nova Lite). A quality gate then evaluates the response — if it's good enough, return it; if not, escalate to Tier 2, a more powerful (and expensive) model.

The quality gate can be deterministic (output passes schema validation, confidence score above threshold, JSON parsable) or model-based (LLM-as-a-Judge scoring the Tier 1 answer). For obviously complex queries, the classifier can skip Tier 1 and go straight to Tier 2 — no need to waste a cheap call on something that clearly needs the big guns.
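A deterministic gate can be as simple as "does the Tier 1 output parse, carry the fields we need, and report enough confidence?" A minimal sketch — the field names and threshold here are illustrative placeholders, not from any AWS API:

```python
import json

REQUIRED_FIELDS = {"answer", "confidence"}  # illustrative output schema
CONFIDENCE_THRESHOLD = 0.8                  # tune from production metrics

def passes_quality_gate(raw_output: str) -> bool:
    """Return True if the Tier 1 response is good enough to ship."""
    try:
        parsed = json.loads(raw_output)       # must be valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or not REQUIRED_FIELDS.issubset(parsed):
        return False                          # schema check
    conf = parsed.get("confidence")
    return isinstance(conf, (int, float)) and conf >= CONFIDENCE_THRESHOLD

# Ship the cheap answer, or escalate to the expensive model:
print(passes_quality_gate('{"answer": "Call 555-0100", "confidence": 0.93}'))  # True
print(passes_quality_gate('{"answer": "maybe?", "confidence": 0.41}'))         # False
```

The same function doubles as the "escalation decision" step in the orchestrator: gate fails → invoke Tier 2.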

The economic model — If 80% of queries are simple, Tier 1 handles them at 1/10th the cost of Tier 2, and Tier 1 clears the quality gate 95% of the time, you pay roughly a third of what you'd pay by always using Tier 2 — for similar overall quality. The more traffic the cheap tier resolves, the deeper the savings.
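The blended cost is easy to derive: every query pays for the tier it lands on, and escalated queries pay for both the failed cheap call and the expensive retry. A quick sketch — all percentages and relative costs are illustrative:

```python
def blended_cost(tier1_share: float, pass_rate: float,
                 tier1_cost: float = 0.1, tier2_cost: float = 1.0) -> float:
    """Average cost per query, relative to always using Tier 2 (cost 1.0).

    tier1_share: fraction of traffic the classifier sends to Tier 1
    pass_rate:   fraction of Tier 1 answers that clear the quality gate
    Escalated queries pay for BOTH the failed Tier 1 call and Tier 2.
    """
    tier2_traffic = (1 - tier1_share) + tier1_share * (1 - pass_rate)
    return tier1_share * tier1_cost + tier2_traffic * tier2_cost

# 80% of traffic tried on the cheap tier, 95% of that passes the gate:
print(round(blended_cost(0.80, 0.95), 2))  # 0.32 → ~32% of all-Tier-2 spend
# Resolve more traffic cheaply and the savings deepen:
print(round(blended_cost(0.95, 0.95), 2))  # 0.19
```

This is why the pass rate matters so much: the formula charges every gate failure twice.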

AWS services used

  • Lambda (classifier) — Routes requests based on simple heuristics (length, keywords) or an embedding-based classifier.
  • Bedrock (multiple models) — Tier 1 uses small models (Claude Haiku, Nova Micro, Titan Text Lite); Tier 2 uses powerful models (Claude Sonnet/Opus, Nova Pro).
  • Step Functions — Orchestrates the cascade: runs the quality-gate step, handles the escalation decision, retries.
  • CloudWatch Metrics — Tracks escalation rate, cost per tier, and quality scores. Tune the quality gate based on these metrics over time.
  • Bedrock Model Evaluations — Run periodically to validate that the cheap tier is actually performing well enough on production-representative traffic.
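In Step Functions, the cascade maps naturally onto Task and Choice states. A hypothetical Amazon States Language skeleton — function names, payload paths, and model IDs below are placeholders, not a tested workflow:

```json
{
  "Comment": "Model cascade sketch - all names and paths are placeholders",
  "StartAt": "Classify",
  "States": {
    "Classify": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "classify-complexity", "Payload.$": "$" },
      "ResultPath": "$.route",
      "Next": "RouteChoice"
    },
    "RouteChoice": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.route.Payload.tier", "NumericEquals": 2, "Next": "InvokeTier2" }
      ],
      "Default": "InvokeTier1"
    },
    "InvokeTier1": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": { "ModelId": "anthropic.claude-3-haiku-20240307-v1:0", "Body.$": "$.modelInput" },
      "ResultPath": "$.tier1",
      "Next": "QualityGate"
    },
    "QualityGate": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "score-response", "Payload.$": "$.tier1" },
      "ResultPath": "$.gate",
      "Next": "GateChoice"
    },
    "GateChoice": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.gate.Payload.pass", "BooleanEquals": true, "Next": "Done" }
      ],
      "Default": "InvokeTier2"
    },
    "InvokeTier2": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": { "ModelId": "anthropic.claude-3-sonnet-20240229-v1:0", "Body.$": "$.modelInput" },
      "ResultPath": "$.tier2",
      "Next": "Done"
    },
    "Done": { "Type": "Succeed" }
  }
}
```

The two Choice states are the classifier routing and the quality gate; retries and metrics emission would hang off the Task states.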
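The classifier doesn't have to be a model at all — simple heuristics often cover the "obviously complex" cases. A minimal sketch; the keywords and length threshold are illustrative placeholders you would tune against a golden dataset:

```python
COMPLEX_KEYWORDS = {"compare", "analyze", "summarize", "legal"}  # placeholders
LENGTH_THRESHOLD = 500  # characters; tune against a golden dataset

def route_tier(query: str) -> int:
    """Return 1 (cheap tier) or 2 (powerful tier) for a query."""
    words = set(query.lower().split())
    if len(query) > LENGTH_THRESHOLD or words & COMPLEX_KEYWORDS:
        return 2  # obviously complex: skip the cheap call entirely
    return 1      # default: try the cheap tier first

print(route_tier("What's your phone number?"))                    # 1
print(route_tier("analyze these contracts and compare clauses"))  # 2
```

An embedding-based classifier replaces the keyword check with a nearest-centroid or small logistic model, but the contract stays the same: take a query, return a tier.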

When to use this pattern

Use Model Cascading when…

  • High volume + variable complexity — Customer support, doc Q&A, content classification. Most queries are simple; a few are hard. Don't pay Opus prices for "what's your phone number?".
  • Cost optimization is a primary goal — This is the #1 GenAI cost lever. It can reduce FM spend by 60-80% at scale with minimal quality loss.
  • Tier 1 model is correct most of the time — You need a >80% pass rate on the cheap tier for the economics to work. Test with a golden dataset first.
  • You have a reliable quality signal — JSON schema validation, a confidence threshold, or an LLM-as-a-Judge that can score the Tier 1 output. Without this, you can't gate correctly.
  • Occasional higher latency is acceptable — When escalation happens, total latency = Tier 1 + quality check + Tier 2. Typically 2-3x slower for the ~20% of escalated queries.

Do NOT use Cascading when…

  • All queries require the powerful model's reasoning — Legal analysis, complex multi-step math, long-context synthesis. Tier 1 will fail nearly every time — you just add overhead without savings.
  • Tight latency SLA across the board — The escalation path is too slow. Either commit to Tier 2 for consistency, or redesign the tiers.
  • Low volume — If you make 1,000 calls/day, the engineering complexity isn't worth the savings. Cascading pays off at >10k calls/day.
  • Quality is non-negotiable and easily broken — Medical diagnosis, financial decisions. The ~5% of Tier 1 failures are unacceptable.
  • You can't reliably detect Tier 1 failures — No way to score the cheap output? Then escalation is random — you'll escalate correct answers and ship wrong ones.
  • Tier 1 output drives downstream automation — If Tier 1's answer triggers real-world actions, you need very high confidence on every call. Cascading adds complexity to get there.

Exam angle

Pattern-match shortcuts — When a stem mentions "reduce FM costs while maintaining quality," "most queries are simple," or "use the cheaper model when possible," model cascading is the answer. Look for option phrasing like "route simple queries to Haiku, escalate complex ones to Sonnet."
The "just use Haiku for everything" trap — An option may suggest "switch all traffic to the cheap model." That eliminates cost but also eliminates the quality you need for the hard 20%. Cascading keeps the option open — you only pay for power when power is needed.
The "fine-tune a small model instead" trap — Fine-tuning a small model to match a larger model's quality sounds appealing, but it is expensive up front, requires retraining as data changes, and rarely catches up on reasoning tasks. Cascading gets you the economics without the retraining burden.

Keywords that point here

reduce costs · model tier · cheap model first · escalate · cascading · Haiku then Sonnet · cost-effective · quality threshold

Related patterns

For caching-based cost optimization (which pairs well with cascading): see semantic caching in the cheat sheet.
For routing by content type (not complexity), use Pattern 6: GenAI Gateway.
For evaluation of the Tier 1 quality over time, see Domain 5: Testing & Validation.