~ diagnose the cost shape, then pick the matching lever ~
STEP 1 · DIAGNOSE
Where is the money actually going? What's your dominant cost shape? → open Cost Explorer, filter by Bedrock and SageMaker for the last 30 days (query sketch below), then pick the branch that matches your pattern:

repeats · Users ask the same questions over and over (>30% query overlap) → Semantic Caching: cache responses keyed by query embedding similarity ⇒ cuts 50-70%
mixed difficulty · Most queries are simple, a few need real power (80/20 workload) → Model Cascading: Haiku first, escalate to Sonnet only when needed ⇒ cuts 60-80%
steady load · Steady high-volume, predictable inference (24/7 production load) → Provisioned / Batch: reserved capacity at a flat rate, or Batch API = 50% off ⇒ cuts 30-50%

💡 stack the levers, they compound:
semantic cache + cascading: cache hits skip the FM entirely; misses flow through the cascade
provisioned + cascading: commit a tier-1 baseline, burst overflow on-demand
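For the diagnose step, here's a minimal sketch of that 30-day pull through the Cost Explorer API with boto3. The SERVICE dimension strings for Bedrock and SageMaker are assumptions based on how the services typically appear; confirm them against your own account's data.

```python
# Sketch: last 30 days of Bedrock + SageMaker spend, grouped by usage type,
# to reveal which cost shape dominates. Assumes Cost Explorer is enabled.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
end = date.today()
start = end - timedelta(days=30)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {
        "Key": "SERVICE",
        # Assumed service-name strings; verify in your Cost Explorer console.
        "Values": ["Amazon Bedrock", "Amazon SageMaker"],
    }},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for day in resp["ResultsByTime"]:
    for group in day["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if cost > 0:
            print(day["TimePeriod"]["Start"], group["Keys"][0], f"${cost:.2f}")
```

A flat, steady daily line points at the provisioned branch; the other two shapes need per-query analysis of what users are actually asking.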

Why these three levers

Semantic Caching
Embed the user's query and compare it to the embeddings of recently cached queries. Above a similarity threshold (typically 0.95+), return the cached answer instead of calling the FM. Works brilliantly for support chatbots, FAQ apps, and anywhere users ask variations of the same questions. Cache storage (DynamoDB / Redis) costs are negligible next to the FM calls saved.
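A minimal sketch of that loop, assuming Titan Text Embeddings v2 for the query embedding and an in-memory list standing in for the DynamoDB/Redis cache; the model IDs are illustrative.

```python
# Semantic cache sketch. Assumptions: Titan Text Embeddings v2 for query
# embeddings, an in-memory list in place of DynamoDB/Redis, Claude Haiku
# (illustrative model ID) as the FM called on a miss.
import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime")
SIMILARITY_THRESHOLD = 0.95
cache: list[tuple[list[float], str]] = []  # (query embedding, cached response)


def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def call_fm(query: str) -> str:
    # Miss path: ordinary FM invocation.
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": query}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]


def answer(query: str) -> str:
    q_emb = embed(query)
    for emb, response in cache:
        if cosine(q_emb, emb) >= SIMILARITY_THRESHOLD:
            return response  # hit: the FM call is skipped entirely
    response = call_fm(query)
    cache.append((q_emb, response))
    return response
```

In production the linear scan gives way to a vector index or DynamoDB lookup, and entries need a TTL so stale answers age out.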
Model Cascading
Route the majority of queries to a cheap, fast model (Haiku, Nova Micro). Evaluate the output (LLM-as-a-Judge, confidence score, schema validation) and escalate failures to Sonnet/Opus. At high volume with mixed complexity, this is the single biggest lever. See Pattern 5 for the architecture.
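A sketch of the routing logic via the Converse API; the model IDs are examples, and is_good_enough() is a placeholder for whichever gate you actually use (judge model, confidence score, schema validation).

```python
# Cascade sketch using the Bedrock Converse API. Model IDs are illustrative;
# is_good_enough() stands in for a real evaluation gate (see Pattern 5).
import boto3

bedrock = boto3.client("bedrock-runtime")
CHEAP = "anthropic.claude-3-haiku-20240307-v1:0"
STRONG = "anthropic.claude-3-5-sonnet-20240620-v1:0"


def ask(model_id: str, prompt: str) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]


def is_good_enough(answer: str) -> bool:
    # Placeholder gate: in practice, validate against a schema or have a
    # judge model score the draft; escalate only on failure.
    return len(answer.strip()) > 0


def cascade(prompt: str) -> str:
    draft = ask(CHEAP, prompt)   # tier 1: cheap, fast; most traffic stops here
    if is_good_enough(draft):
        return draft
    return ask(STRONG, prompt)   # tier 2: escalate only the failures
```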
Provisioned Throughput / Batch API
If traffic is steady 24/7, Provisioned Throughput gives you reserved capacity at a flat rate, cheaper than on-demand once utilization is high. If your workload is async (document processing, nightly insights, batch classification), the Bedrock Batch Inference API gives a flat ~50% discount over on-demand pricing.
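Submitting a batch job is one control-plane call once the input records are staged as JSONL in S3; everything named below (bucket, role, job name) is a placeholder.

```python
# Bedrock batch inference sketch. Bucket, role ARN, and job name are
# placeholders; input must be a JSONL file of invocation records in S3.
import boto3

bedrock = boto3.client("bedrock")  # control plane, not bedrock-runtime

job = bedrock.create_model_invocation_job(
    jobName="nightly-classification",                       # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockBatch",  # placeholder
    modelId="anthropic.claude-3-haiku-20240307-v1:0",       # illustrative
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-in/records.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-out/"}
    },
)
print(job["jobArn"])  # poll get_model_invocation_job(jobIdentifier=...) for status
```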
Exam angle
Three keyword tells: (1) "many similar queries" → caching. (2) "simple vs complex" or "route cheap first" → cascading. (3) "predictable high volume" or "async / batch" → provisioned or batch. Match the phrasing to the lever.
Before pulling any lever: right-size prompts first
Audit your prompts before optimizing infra. Long system prompts and verbose few-shot examples waste tokens on every call. Prompt compression often yields 10-20% savings with no architectural changes. Cheapest win, always step one.
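One way to quantify the win, assuming the Converse API (which reports token usage on every response): run the same query under the verbose and the compressed system prompt and compare input tokens. The prompts and model ID below are illustrative.

```python
# Prompt-size audit sketch: the Converse API returns per-call token usage,
# so a system prompt's cost can be measured directly.
import boto3

bedrock = boto3.client("bedrock-runtime")

VERBOSE = ("You are an extremely helpful, friendly, and thorough customer "
           "support assistant who always explains everything step by step...")
COMPRESSED = "You are a concise customer support assistant."


def input_tokens(system_prompt: str, user_msg: str) -> int:
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_msg}]}],
    )
    return resp["usage"]["inputTokens"]


delta = input_tokens(VERBOSE, "reset my password") \
      - input_tokens(COMPRESSED, "reset my password")
print(f"tokens saved per call: {delta}")  # this delta repeats on every call
```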

Related trees

Tree 1: Vector Store Selection · Tree 2: Deployment Choice · Tree 4: RAG Troubleshooting
Full architectural detail: Pattern 5: Model Cascading