GenAI bills add up fast. When the finance team says "cut this," the right lever depends on the shape of your cost. Diagnose first, then pick.
~ diagnose the cost shape, then pick the matching lever ~
Why these three levers
Semantic Caching
Embed the user's query and compare it to the embeddings of recently cached queries. Above a similarity threshold (typically 0.95+), return that query's stored answer instead of calling the FM. Works brilliantly for support chatbots, FAQ apps, and anywhere users ask variations of the same questions. Cache storage (DynamoDB / Redis) is negligible next to the FM costs saved.
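A minimal sketch of the lookup path, assuming Bedrock with Titan Text Embeddings V2 and an in-memory list standing in for DynamoDB/Redis. The model IDs and helper names (`embed`, `call_fm`, `answer`) are illustrative, not prescribed:

```python
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

SIMILARITY_THRESHOLD = 0.95
cache = []  # (query_embedding, answer) pairs; stand-in for DynamoDB/Redis


def embed(text: str) -> np.ndarray:
    # Titan Text Embeddings V2; swap in your embedding model of choice.
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    vec = np.array(json.loads(resp["body"].read())["embedding"])
    return vec / np.linalg.norm(vec)  # unit vectors: dot product == cosine


def call_fm(query: str) -> str:
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": query}],
        }),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]


def answer(query: str) -> str:
    q = embed(query)
    for vec, cached_answer in cache:
        if float(q @ vec) >= SIMILARITY_THRESHOLD:
            return cached_answer  # cache hit: zero FM cost
    result = call_fm(query)
    cache.append((q, result))  # miss: pay once, serve repeats for free
    return result
```

A production version would add TTLs and return the best match above the threshold rather than the first.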
Model Cascading
Route the majority of queries to a cheap, fast model (Haiku, Nova Micro). Evaluate the output (LLM-as-a-Judge, a confidence score, or schema validation). Escalate failures to Sonnet/Opus. At high volume with mixed complexity, this is the single biggest lever. See Pattern 5 for the architecture.
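A sketch of the routing logic using schema validation as the cheapest possible judge. The model IDs, the JSON schema, and the `acceptable` helper are assumptions for illustration:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

CHEAP_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
STRONG_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"


def invoke(model_id: str, prompt: str) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]


def acceptable(output: str) -> bool:
    # Deterministic judge: does the output match the expected schema?
    # Swap in an LLM-as-a-Judge call or a confidence check as needed.
    try:
        parsed = json.loads(output)
        return {"category", "confidence"} <= parsed.keys()
    except json.JSONDecodeError:
        return False


def classify(ticket: str) -> str:
    prompt = (
        "Classify this support ticket. Respond with JSON only: "
        '{"category": "...", "confidence": 0.0}\n\n' + ticket
    )
    draft = invoke(CHEAP_MODEL, prompt)  # most traffic stops here
    if acceptable(draft):
        return draft
    return invoke(STRONG_MODEL, prompt)  # escalate only on failure
```

If 80-90% of traffic clears the cheap path, the blended per-query cost approaches the cheap model's rate.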
Provisioned Throughput / Batch API
If traffic is steady 24/7, Provisioned Throughput gives you reserved capacity at a flat rate — cheaper than on-demand at high utilization. If your workload is async (document processing, nightly insights, batch classification), the Bedrock Batch Inference API gives a flat ~50% discount.
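Submitting a batch job is a single control-plane call. A sketch using boto3's create_model_invocation_job; the job name, role ARN, and bucket paths are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock")  # control plane, not bedrock-runtime

# Input is a JSONL file in S3, one record per line:
# {"recordId": "001", "modelInput": { ...same body you'd pass to InvokeModel... }}
job = bedrock.create_model_invocation_job(
    jobName="nightly-classification",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",  # placeholder
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch/input/"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch/output/"}
    },
)
print(job["jobArn"])  # poll get_model_invocation_job until status is Completed
```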
Exam angle
Three keyword tells: (1) "many similar queries" → caching. (2) "simple vs complex" or "route cheap first" → cascading. (3) "predictable high volume" or "async / batch" → provisioned or batch. Match the phrasing to the lever.
Before pulling any lever — right-size prompts first
Audit your prompts before optimizing infra. Long system prompts and verbose few-shot examples waste tokens on every call. Prompt compression often yields 10-20% savings with no architectural changes. Cheapest win, always step one.
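A back-of-envelope way to size the win before touching anything else. The chars/4 heuristic, price, and volume below are illustrative stand-ins for your real tokenizer and on-demand rate:

```python
# Rough prompt audit: estimate daily savings from trimming the system prompt.
PRICE_PER_1K_INPUT_TOKENS = 0.003  # placeholder $/1K input tokens
CALLS_PER_DAY = 100_000            # placeholder volume


def est_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic; real tokenizers differ


current = "You are a helpful assistant. " * 40  # stand-in for a bloated prompt
trimmed = "You are a support bot for Acme. Answer in <= 3 sentences."

saved_per_call = est_tokens(current) - est_tokens(trimmed)
daily_savings = saved_per_call / 1000 * PRICE_PER_1K_INPUT_TOKENS * CALLS_PER_DAY
print(f"~{saved_per_call} tokens/call saved -> ~${daily_savings:,.2f}/day")
```

Because the system prompt rides along on every call, even a modest trim compounds across the full request volume.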