~ diagnose the cost shape, then pick the matching lever ~
STEP 1 · DIAGNOSE
Where is the money actually going? What's your dominant cost shape? → open Cost Explorer, filter by Bedrock and SageMaker for the last 30 days (query sketch below), then pick the branch that matches your pattern:

repeats · Users ask the same questions over and over (>30% query overlap) → Semantic Caching: cache responses keyed by query embedding similarity ⇒ cuts 50-70%
mixed difficulty · Most queries are simple, a few need real power (80/20 workload) → Model Cascading: Haiku first, escalate to Sonnet only when needed ⇒ cuts 60-80%
steady load · Steady high-volume, predictable inference (24/7 production load) → Provisioned / Batch: reserved capacity at a flat rate, or Batch API = 50% off ⇒ cuts 30-50%

💡 stack the levers, they compound:
semantic cache + cascading: cache hits skip the FM entirely; misses flow through the cascade
provisioned + cascading: commit a tier-1 baseline, burst overflow on-demand
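For the diagnose step, here's a minimal sketch of that 30-day pull through the Cost Explorer API with boto3. The SERVICE dimension strings for Bedrock and SageMaker are assumptions based on how the services typically appear; confirm them against your own account's data.

```python
# Sketch: last 30 days of Bedrock + SageMaker spend, grouped by usage type,
# to reveal which cost shape dominates. Assumes Cost Explorer is enabled.
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
end = date.today()
start = end - timedelta(days=30)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {
        "Key": "SERVICE",
        # Assumed service-name strings; verify in your Cost Explorer console.
        "Values": ["Amazon Bedrock", "Amazon SageMaker"],
    }},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for day in resp["ResultsByTime"]:
    for group in day["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if cost > 0:
            print(day["TimePeriod"]["Start"], group["Keys"][0], f"${cost:.2f}")
```

A flat, steady daily line points at the provisioned branch; the other two shapes need per-query analysis of what users are actually asking.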

Why these three levers

Semantic Caching
Embed the user's query and compare it to the embeddings of recently cached queries. Above a similarity threshold (typically 0.95+), return the cached answer instead of calling the FM. Works brilliantly for support chatbots, FAQ apps, and anywhere users ask variations of the same questions. Cache storage (DynamoDB / Redis) costs are negligible next to the FM calls saved.
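A minimal sketch of that loop, assuming Titan Text Embeddings v2 for the query embedding and an in-memory list standing in for the DynamoDB/Redis cache; the model IDs are illustrative.

```python
# Semantic cache sketch. Assumptions: Titan Text Embeddings v2 for query
# embeddings, an in-memory list in place of DynamoDB/Redis, Claude Haiku
# (illustrative model ID) as the FM called on a miss.
import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime")
SIMILARITY_THRESHOLD = 0.95
cache: list[tuple[list[float], str]] = []  # (query embedding, cached response)


def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def call_fm(query: str) -> str:
    # Miss path: ordinary FM invocation.
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": query}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]


def answer(query: str) -> str:
    q_emb = embed(query)
    for emb, response in cache:
        if cosine(q_emb, emb) >= SIMILARITY_THRESHOLD:
            return response  # hit: the FM call is skipped entirely
    response = call_fm(query)
    cache.append((q_emb, response))
    return response
```

In production the linear scan gives way to a vector index or DynamoDB lookup, and entries need a TTL so stale answers age out.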
Model Cascading
Route the majority of queries to a cheap, fast model (Haiku, Nova Micro). Evaluate the output (LLM-as-a-Judge, confidence score, schema validation) and escalate failures to Sonnet/Opus. At high volume with mixed complexity, this is the single biggest lever. See Pattern 5 for the architecture.
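A sketch of the routing logic via the Converse API; the model IDs are examples, and is_good_enough() is a placeholder for whichever gate you actually use (judge model, confidence score, schema validation).

```python
# Cascade sketch using the Bedrock Converse API. Model IDs are illustrative;
# is_good_enough() stands in for a real evaluation gate (see Pattern 5).
import boto3

bedrock = boto3.client("bedrock-runtime")
CHEAP = "anthropic.claude-3-haiku-20240307-v1:0"
STRONG = "anthropic.claude-3-5-sonnet-20240620-v1:0"


def ask(model_id: str, prompt: str) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]


def is_good_enough(answer: str) -> bool:
    # Placeholder gate: in practice, validate against a schema or have a
    # judge model score the draft; escalate only on failure.
    return len(answer.strip()) > 0


def cascade(prompt: str) -> str:
    draft = ask(CHEAP, prompt)   # tier 1: cheap, fast; most traffic stops here
    if is_good_enough(draft):
        return draft
    return ask(STRONG, prompt)   # tier 2: escalate only the failures
```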
Provisioned Throughput / Batch API
If traffic is steady 24/7, Provisioned Throughput gives you reserved capacity at a flat rate, cheaper than on-demand once utilization is high. If your workload is async (document processing, nightly insights, batch classification), the Bedrock Batch Inference API gives a flat ~50% discount over on-demand pricing.
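Submitting a batch job is one control-plane call once the input records are staged as JSONL in S3; everything named below (bucket, role, job name) is a placeholder.

```python
# Bedrock batch inference sketch. Bucket, role ARN, and job name are
# placeholders; input must be a JSONL file of invocation records in S3.
import boto3

bedrock = boto3.client("bedrock")  # control plane, not bedrock-runtime

job = bedrock.create_model_invocation_job(
    jobName="nightly-classification",                       # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockBatch",  # placeholder
    modelId="anthropic.claude-3-haiku-20240307-v1:0",       # illustrative
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-in/records.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-out/"}
    },
)
print(job["jobArn"])  # poll get_model_invocation_job(jobIdentifier=...) for status
```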
Exam angle
Three keyword tells: (1) "many similar queries" → caching. (2) "simple vs complex" or "route cheap first" → cascading. (3) "predictable high volume" or "async / batch" → provisioned or batch. Match the phrasing to the lever.
Before pulling any lever: right-size prompts first
Audit your prompts before optimizing infra. Long system prompts and verbose few-shot examples waste tokens on every call. Prompt compression often yields 10-20% savings with no architectural changes. Cheapest win, always step one.
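One way to quantify the win, assuming the Converse API (which reports token usage on every response): run the same query under the verbose and the compressed system prompt and compare input tokens. The prompts and model ID below are illustrative.

```python
# Prompt-size audit sketch: the Converse API returns per-call token usage,
# so a system prompt's cost can be measured directly.
import boto3

bedrock = boto3.client("bedrock-runtime")

VERBOSE = ("You are an extremely helpful, friendly, and thorough customer "
           "support assistant who always explains everything step by step...")
COMPRESSED = "You are a concise customer support assistant."


def input_tokens(system_prompt: str, user_msg: str) -> int:
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_msg}]}],
    )
    return resp["usage"]["inputTokens"]


delta = input_tokens(VERBOSE, "reset my password") \
      - input_tokens(COMPRESSED, "reset my password")
print(f"tokens saved per call: {delta}")  # this delta repeats on every call
```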

Related trees

Tree 1: Vector Store Selection · Tree 2: Deployment Choice · Tree 4: RAG Troubleshooting
Full architectural detail: Pattern 5: Model Cascading