Task 4.1 — Cost optimization & resource efficiency
Token economics — the fundamentals
Bedrock charges you by input tokens + output tokens, not per invocation. Roughly 4 English characters = 1 token. A long system prompt multiplied across millions of requests dwarfs the cost of the actual queries. This is the most important concept in the domain.
Exam angle — token cost levers
Every cost-reduction question tests which lever applies: input tokens (prompt compression, context pruning) · output tokens (response size controls, concise instructions) · invocation count (caching) · per-token price (smaller model, cascading).
Token efficiency levers
Token estimation
Measure before you cut
- Count tokens before sending
- CloudWatch tracks usage
- Baseline your expensive calls
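A minimal sketch of "measure before you cut": estimate input tokens up front with the rough 4-characters-per-token heuristic, then read the authoritative counts back from the Converse API's usage block. The model ID is an example; substitute whatever your account uses.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def estimate_tokens(text: str) -> int:
    # Rough heuristic only: ~4 English characters per token.
    return max(1, len(text) // 4)

prompt = "Summarize our refund policy in two sentences."
print("estimated input tokens:", estimate_tokens(prompt))

# Example model ID; use a model you have access to.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)

# The usage block gives the authoritative counts to baseline against.
usage = response["usage"]
print("input tokens:", usage["inputTokens"], "| output tokens:", usage["outputTokens"])
```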
Context optimization
Only include what matters
- Don't dump whole documents
- RAG top-k tuning
- Remove redundancy
Response controls
Cap the output
- max_tokens parameter
- "Be concise" instructions
- Structured output (shorter than prose)
Prompt compression
Shorten without losing meaning
- Remove preamble
- Use abbreviations in context
- Strip markdown if not needed
Context pruning
Drop old conversation turns
- Keep recent + summarize old
- Sliding window
- Summary-of-summaries pattern
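One way the sliding-window-plus-summary pattern can look in code; a sketch only, where summarize_with_cheap_model is a placeholder you would implement with a small, cheap FM.

```python
def prune_context(turns: list[dict], keep_recent: int = 6) -> list[dict]:
    """Keep the last N turns verbatim; fold older turns into one summary turn."""
    if len(turns) <= keep_recent:
        return turns

    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Placeholder: call a cheap model (Haiku / Nova Micro) to compress the old turns.
    summary_text = summarize_with_cheap_model(old)
    summary_turn = {
        "role": "user",
        "content": [{"text": f"Conversation so far (summary): {summary_text}"}],
    }
    # Recent turns stay verbatim; everything older costs only a few summary tokens.
    return [summary_turn] + recent
```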
Response limiting
Tell the model to be brief
- "Answer in 2 sentences"
- "Return only the answer, no explanation"
Cost-effective model selection
Cost-capability
Cost-capability tradeoff — don't use the most expensive model for simple tasks. Haiku/Nova Micro for classification; reserve Opus/Sonnet for reasoning.
Tiered
Tiered FM usage — route simple queries to cheap models and complex ones to powerful models; an upfront classifier decides the route.
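A sketch of that tiered routing: a cheap triage call decides whether the query needs the expensive model. Model IDs and the triage prompt are illustrative assumptions.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

CHEAP_MODEL = "amazon.nova-micro-v1:0"                        # example IDs only
POWERFUL_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def route_and_answer(query: str) -> str:
    # Step 1: cheap classification call (a few output tokens at most).
    triage = bedrock.converse(
        modelId=CHEAP_MODEL,
        messages=[{"role": "user", "content": [{
            "text": f"Answer SIMPLE or COMPLEX only. Is this query simple or complex?\n{query}"}]}],
        inferenceConfig={"maxTokens": 5, "temperature": 0},
    )
    label = triage["output"]["message"]["content"][0]["text"].strip().upper()

    # Step 2: stay on the cheap model unless the classifier escalates.
    model_id = POWERFUL_MODEL if "COMPLEX" in label else CHEAP_MODEL
    answer = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": query}]}],
    )
    return answer["output"]["message"]["content"][0]["text"]
```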
Quality balance
Inference cost vs. response quality — find the sweet spot per use case through A/B testing.
Ratio tracking
Price-to-performance ratio — measure quality-per-dollar across models, not just cost.
High-performance FM systems — throughput
Batching
Batch inference — submit many requests as a single asynchronous job. Bedrock batch inference gives a ~50% discount vs. on-demand for non-real-time workloads.
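A hedged sketch of submitting a Bedrock batch inference job; the role ARN, S3 URIs, and model ID are placeholders, and the input is a JSONL file of records staged in S3.

```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

# All ARNs/URIs below are placeholders.
job = bedrock.create_model_invocation_job(
    jobName="nightly-summaries",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-output/"}},
)
print(job["jobArn"])  # poll the job status; results land in the output S3 prefix
```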
Capacity planning
Estimate token volume (not request count). Provision to peak; auto-scale to zero for spiky workloads.
Utilization monitoring
Track how much of provisioned capacity you're actually using. Right-size quarterly.
Auto-scaling
SageMaker endpoints auto-scale on request volume; Bedrock on-demand capacity is scaled on the managed (AWS) side.
Provisioned Throughput
Right-size Bedrock Provisioned Throughput — reserved capacity for predictable workloads. Required for custom models.
Intelligent caching — the layer cake
- Layer 1: Edge caching (CloudFront at POPs)
- Layer 2: Exact-match cache (deterministic hash)
- Layer 3: Semantic cache (similar queries)
- Layer 4: Prompt caching (Bedrock prefix cache)
- Miss: Bedrock invocation (full inference)
Cache type cheat sheet
- Semantic caching: cache by meaning ("what's your refund policy" ≈ "how do I get a refund").
- Result fingerprinting: hash request characteristics.
- Edge caching: CloudFront near the user.
- Deterministic request hashing: identical inputs → guaranteed hit.
- Prompt caching: Bedrock caches the processed prefix (system prompt, long context) across invocations; big wins for chatbots with stable personas.
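A minimal semantic-cache sketch: embed the incoming query (a Titan embeddings model is assumed here), compare cosine similarity against cached entries, and only invoke the FM on a miss. The call_model helper and threshold are placeholders.

```python
import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")
cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached response)

def embed(text: str) -> np.ndarray:
    # Example embedding model; swap in whichever your stack uses.
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

def answer(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    for cached_vec, cached_response in cache:
        sim = float(q @ cached_vec / (np.linalg.norm(q) * np.linalg.norm(cached_vec)))
        if sim >= threshold:         # "refund policy" ≈ "how do I get a refund"
            return cached_response   # cache hit: no invocation, no token cost
    result = call_model(query)       # placeholder for the full Bedrock invocation
    cache.append((q, result))
    return result
```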
Task 4.2 — Application performance optimization
Latency optimization patterns
Pre-computation
Predictable queries
- Generate responses ahead of time
- Cache popular FAQ answers
- Zero inference latency at read
Latency-optimized models
Bedrock variants
- Lower latency, slightly reduced quality
- Time-sensitive apps
- Must be requested explicitly (Bedrock latency-optimized inference)
Parallel requests
Complex workflows
- Run independent FM calls concurrently
- Gather results then synthesize
- Reduces wall-clock time
Streaming
Perceived latency
- Tokens shown as generated
- Time-to-first-token is what users feel
- Chat UX essential
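A sketch of streaming with ConverseStream that prints tokens as they arrive and measures time-to-first-token; the model ID is an example.

```python
import time
import boto3

bedrock = boto3.client("bedrock-runtime")
start = time.perf_counter()
first_token_at = None

stream = bedrock.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model
    messages=[{"role": "user", "content": [{"text": "Explain prompt caching briefly."}]}],
)

for event in stream["stream"]:
    if "contentBlockDelta" in event:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start   # what users actually feel
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)

print(f"\ntime to first token: {first_token_at:.2f}s")
```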
FM parameter tuning
Temperature
Controls randomness. 0.0–0.3 = deterministic/factual (good for SQL generation, extraction). 0.7–1.0 = creative/varied (brainstorming, writing).
Top-k
Limits sampling to top k most probable tokens. Lower = more focused output.
Top-p (nucleus)
Limits sampling to tokens whose cumulative probability exceeds p. 0.1 = very focused, 0.9 = diverse.
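How these parameters map onto a Converse call: temperature and topP (plus maxTokens) go in inferenceConfig, while top-k is model-specific and passed via additionalModelRequestFields. The Anthropic-style "top_k" field and model ID shown here are assumptions about the target model family.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # example model
    messages=[{"role": "user", "content": [{"text": "Generate SQL for: total sales by region"}]}],
    inferenceConfig={
        "temperature": 0.1,   # low: deterministic/factual (SQL generation, extraction)
        "topP": 0.9,
        "maxTokens": 300,     # also caps output-token spend
    },
    # top-k is not part of inferenceConfig; it is passed through to the model.
    additionalModelRequestFields={"top_k": 50},
)
print(response["output"]["message"]["content"][0]["text"])
```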
A/B testing
Compare parameter configurations on real traffic. Use Bedrock Prompt Management variants.
Trap — temperature selection
Question: "The SQL-generating model sometimes produces incorrect syntax" → answer: lower temperature (0.0–0.2). Don't pick "switch to a larger model" if the fix is parameter tuning.
Retrieval performance
Index tuning
HNSW parameters — ef_search (query-time accuracy/speed), ef_construction (build quality). Higher = better recall, higher latency.
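What those knobs look like in practice, using an OpenSearch k-NN index as an example vector store. Parameter names follow the OpenSearch k-NN plugin; the endpoint, dimension, and values are illustrative starting points, not recommendations.

```python
from opensearchpy import OpenSearch  # assumes the opensearch-py client

client = OpenSearch(hosts=["https://my-search-endpoint"])  # placeholder endpoint

client.indices.create(
    index="docs",
    body={
        "settings": {
            "index.knn": True,
            "index.knn.algo_param.ef_search": 100,   # query-time accuracy vs. speed
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 1024,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "parameters": {"ef_construction": 128, "m": 16},  # build quality
                    },
                },
            }
        },
    },
)
```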
Query preprocessing
Clean and expand queries before vector search; remove noise, normalize.
Hybrid scoring
Hybrid search with custom scoring — weight keyword vs. semantic results based on query type.
Throughput optimization
- Token processing optimization — optimize prompt length for throughput (shorter prompts = more throughput per unit of provisioned capacity)
- Batch inference strategies — for non-real-time workloads, use Bedrock batch for the ~50% discount
- Concurrent invocation management — manage parallelism without exceeding rate limits; use semaphores, queues
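A sketch of bounding concurrency with a semaphore so parallel invocations stay under your rate limits; the limit of 5 and the call_model helper are assumptions.

```python
import concurrent.futures
import threading

MAX_CONCURRENCY = 5                       # keep below your account's rate limit
semaphore = threading.Semaphore(MAX_CONCURRENCY)

def invoke_with_limit(prompt: str) -> str:
    with semaphore:                       # blocks if MAX_CONCURRENCY calls are in flight
        return call_model(prompt)         # placeholder for a bedrock-runtime converse call

prompts = [f"Summarize document {i}" for i in range(50)]
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(invoke_with_limit, prompts))
```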
Task 4.3 — Monitoring systems for GenAI
The GenAI observability stack
Metrics
CloudWatch — token usage, latency, error rates, throughput, custom metrics. Build a dashboard per workload.
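A sketch of pulling daily input-token counts from CloudWatch; the metric, namespace, and dimension names follow the AWS/Bedrock namespace as documented, but verify them against your account, and the model ID is an example.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",   # OutputTokenCount for the other side of the bill
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=86400,                   # one datapoint per day
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]))
```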
Invocation logs
Bedrock Model Invocation Logs — detailed request/response logging to CloudWatch Logs or S3. Off by default; must be enabled explicitly.
Query
CloudWatch Logs Insights — query prompts and responses at scale.
Tracing
X-Ray — distributed tracing. Identify latency bottlenecks across the RAG pipeline.
Audit
CloudTrail — API-level audit of who invoked which model.
Cost
Cost Explorer + Cost Anomaly Detection — spending trends and alerts.
Dashboards
Managed Grafana — unified dashboards across AWS services.
GenAI-specific KPIs (not in traditional ML)
Token usage
Cost driver #1
- Input vs. output tokens per request
- Per-user, per-team attribution
Prompt effectiveness
Response quality trend
- Quality scores over time
- Regression detection
Hallucination rate
Factual accuracy
- Measure vs. golden dataset
- Alert on spikes
Response quality
Relevance, coherence
- LLM-as-a-Judge scoring
- Human feedback ratings
Token burst detection
Anomaly patterns
- Sudden spikes → runaway agent?
- Response drift
Cost anomalies
Unexpected spend
- Cost Anomaly Detection
- Per-service alerts
Tool & vector store operations
Tool call patterns
Track which tools get called, how often, in what order. Identify inefficient agent behaviors.
Per-tool metrics
Latency and error rates per tool. One slow tool can tank agent performance.
Multi-agent coordination
Tracing for agent-to-agent handoffs. Where do multi-agent workflows stall?
Usage baselines
Establish normal patterns; alert on deviations.
Vector store perf
Query latency, result count, recall/precision on golden queries.
Index optimization
Automated routines to rebuild/reshard indexes when performance degrades.
GenAI-specific failure mode detection
- Golden datasets — curated questions with known correct answers. Run periodically; flag regressions.
- Output diffing — compare responses for the same input over time. Catches silent drift.
- Reasoning path tracing — for agents, log every thought/action/observation; identify where the reasoning goes off track.
- Specialized observability pipelines — custom logging for GenAI-specific issues (context overflow, guardrail blocks, hallucination flags).
Exam angle — golden dataset
"Detect if a model update degraded responses" = golden dataset regression test. "Detect silent behavior changes" = output diffing. "Diagnose why an agent gave a wrong answer" = reasoning path tracing / Bedrock Agent Tracing.
Domain 4 summary — what to remember
Cost levers in priority order
- 1 Smaller model — cheapest way to cut unit cost
- 2 Prompt compression — cut input tokens
- 3 Response limiting — cut output tokens
- 4 Model cascading — cheap-first, escalate
- 5 Semantic caching — skip invocations
- 6 Prompt caching — Bedrock prefix cache
- 7 Batch inference — ~50% discount
- 8 Provisioned throughput — predictable volumes
The observability map
- Metrics: CloudWatch
- Logs: Bedrock Model Invocation Logs + Logs Insights
- Tracing: X-Ray (distributed) + Agent Tracing (reasoning)
- Audit: CloudTrail
- Cost: Cost Explorer + Cost Anomaly Detection
- Dashboards: Managed Grafana
- Quality: Golden dataset + LLM-as-a-Judge
Next up
Continue to Domain 5 — Testing, Validation & Troubleshooting (11% — the smallest domain, but essential). Or see the model cascading pattern diagrammed end-to-end.