Task 4.1 — Cost optimization & resource efficiency

Token economics — the fundamentals

Bedrock charges you by input tokens + output tokens, not per invocation. Roughly 4 English characters = 1 token. A long system prompt multiplied across millions of requests dwarfs the cost of the actual queries. This is the most important concept in the domain.
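To make the arithmetic concrete, here is a minimal back-of-the-envelope estimator using the ~4-characters-per-token heuristic; the per-1K-token prices are placeholders, not actual Bedrock pricing.

```python
# Rough cost estimator for an FM call using the ~4 chars/token heuristic.
# Prices are illustrative placeholders -- look up real per-model pricing.
PRICE_PER_1K_INPUT = 0.003    # USD per 1,000 input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.015   # USD per 1,000 output tokens (placeholder)

def estimate_tokens(text: str) -> int:
    """~4 English characters per token; real tokenizers vary by model."""
    return max(1, len(text) // 4)

def estimate_cost(system_prompt: str, user_query: str, expected_output_chars: int) -> float:
    input_tokens = estimate_tokens(system_prompt) + estimate_tokens(user_query)
    output_tokens = max(1, expected_output_chars // 4)
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A 6,000-character system prompt attached to every one of 1M requests/month:
per_call = estimate_cost("x" * 6000, "x" * 200, expected_output_chars=800)
print(f"~${per_call:.5f}/call, ~${per_call * 1_000_000:,.0f}/month at 1M calls")
```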

Exam angle — token cost levers. Every cost-reduction question tests which lever applies: input tokens (prompt compression, context pruning) · output tokens (response size controls, concise instructions) · invocation count (caching) · per-token price (smaller model, cascading).

Token efficiency levers

Token estimation

Measure before you cut
  • Count tokens before sending
  • CloudWatch tracks usage
  • Baseline your expensive calls
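The "measure before you cut" step is easy to operationalize: the Converse API returns a usage block with input/output token counts, which you can forward to CloudWatch as custom metrics for per-team attribution. A minimal sketch (the namespace, metric names, and Team dimension are illustrative choices):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
cloudwatch = boto3.client("cloudwatch")

def invoke_and_record(model_id: str, prompt: str, team: str) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    usage = resp["usage"]  # Converse reports inputTokens / outputTokens per call
    cloudwatch.put_metric_data(
        Namespace="GenAI/Usage",  # custom namespace -- name is illustrative
        MetricData=[
            {"MetricName": "InputTokens", "Value": usage["inputTokens"],
             "Unit": "Count", "Dimensions": [{"Name": "Team", "Value": team}]},
            {"MetricName": "OutputTokens", "Value": usage["outputTokens"],
             "Unit": "Count", "Dimensions": [{"Name": "Team", "Value": team}]},
        ],
    )
    return resp["output"]["message"]["content"][0]["text"]
```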

Context optimization

Only include what matters
  • Don't dump whole documents
  • RAG top-k tuning
  • Remove redundancy
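For the RAG top-k lever, Bedrock Knowledge Bases let you cap how many chunks come back per query, which directly bounds the context tokens you pay for. A sketch, assuming an existing Knowledge Base (the ID is a placeholder):

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    """Pull only the top_k most relevant chunks instead of dumping whole documents."""
    resp = agent_runtime.retrieve(
        knowledgeBaseId="KB12345678",  # placeholder Knowledge Base ID
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [r["content"]["text"] for r in resp["retrievalResults"]]
```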

Response controls

Cap the output
  • max_tokens parameter
  • "Be concise" instructions
  • Structured output (shorter than prose)
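A sketch of the output-side controls above, combining a hard maxTokens cap with a concise-answer system instruction via the Converse API (the model ID is illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    system=[{"text": "Answer in at most two sentences. Return JSON only, no prose."}],
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
    inferenceConfig={
        "maxTokens": 150,   # hard cap on billed output tokens
        "temperature": 0.2,
    },
)
print(resp["output"]["message"]["content"][0]["text"])
```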

Prompt compression

Shorten without losing meaning
  • Remove preamble
  • Use abbreviations in context
  • Strip markdown if not needed

Context pruning

Drop old conversation turns
  • Keep recent + summarize old
  • Sliding window
  • Summary-of-summaries pattern
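A minimal sliding-window sketch of the pruning pattern above: keep the last few turns verbatim and fold everything older into a short summary produced by a cheap model (window size and model ID are arbitrary choices):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
WINDOW = 6  # number of recent turns kept verbatim

def prune_history(turns: list[dict]) -> list[dict]:
    """turns: [{'role': 'user'|'assistant', 'text': ...}, ...], oldest first."""
    if len(turns) <= WINDOW:
        return turns
    old, recent = turns[:-WINDOW], turns[-WINDOW:]
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in old)
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # cheap model for summarization
        messages=[{"role": "user", "content": [{
            "text": f"Summarize this conversation in under 100 words:\n{transcript}"}]}],
        inferenceConfig={"maxTokens": 200},
    )
    summary = resp["output"]["message"]["content"][0]["text"]
    # Re-inject the summary as one compact turn ahead of the recent window.
    return [{"role": "user", "text": f"(Summary of earlier conversation) {summary}"}] + recent
```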

Response limiting

Tell the model to be brief
  • "Answer in 2 sentences"
  • "Return only the answer, no explanation"

Cost-effective model selection

Cost-capability tradeoff — don't use the most expensive model for simple tasks. Haiku/Nova Micro for classification; reserve Opus/Sonnet for reasoning.
Tiered FM usage — route simple queries to cheap models, complex ones to powerful models. An upfront classifier decides (sketch below).
Inference cost vs. response quality — find the sweet spot per use case through A/B testing.
Price-to-performance ratio — measure quality-per-dollar across models, not just cost.
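A sketch of the tiered pattern: a cheap classifier call labels the query, and only queries labeled complex reach the expensive model (the model IDs and two-label scheme are illustrative, not prescribed):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
CHEAP = "anthropic.claude-3-haiku-20240307-v1:0"        # illustrative model IDs
POWERFUL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def ask(model_id: str, prompt: str, max_tokens: int = 512) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.0},
    )
    return resp["output"]["message"]["content"][0]["text"]

def route(query: str) -> str:
    # Cheap upfront classification, then route to the appropriate tier.
    label = ask(CHEAP, f"Classify as SIMPLE or COMPLEX. Reply with one word.\nQuery: {query}", 5)
    model = POWERFUL if "COMPLEX" in label.upper() else CHEAP
    return ask(model, query)
```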

High-performance FM systems — throughput

Batching
Batch inference — group many requests into a single asynchronous job. Bedrock offers batch inference at roughly a 50% discount for non-real-time workloads (see the job sketch after this list).
Capacity planning
Estimate token volume (not request count). Provision for peak on steady workloads; scale to zero for spiky ones.
Utilization monitoring
Track how much of provisioned capacity you're actually using. Right-size quarterly.
Auto-scaling
SageMaker endpoints auto-scale on request volume. Bedrock on-demand scaling is handled on the managed side.
Provisioned Throughput
Right-size Bedrock Provisioned Throughput — reserved capacity for predictable workloads. Required for custom models.
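For the batching row above, Bedrock batch inference runs as an asynchronous job over a JSONL file in S3 rather than a real-time call. A minimal submission sketch (bucket, role ARN, job name, and model ID are placeholders):

```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

job = bedrock.create_model_invocation_job(
    jobName="nightly-summaries-2024-06-01",                      # placeholder job name
    modelId="anthropic.claude-3-haiku-20240307-v1:0",            # illustrative model ID
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",   # placeholder role
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-output/"}},
)
print(job["jobArn"])  # poll get_model_invocation_job(jobIdentifier=...) for status
```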

Intelligent caching — the layer cake

  • Layer 1: Edge caching (CloudFront at POPs)
  • Layer 2: Exact-match cache (deterministic hash)
  • Layer 3: Semantic cache (similar queries)
  • Layer 4: Prompt caching (Bedrock prefix cache)
  • Miss: falls through to a full Bedrock invocation
Cache type cheat sheet. Semantic caching = cache by meaning ("what's your refund policy" ≈ "how do I get a refund"). Result fingerprinting = hash request characteristics. Edge caching = CloudFront near the user. Deterministic request hashing = identical inputs → guaranteed hit. Prompt caching = Bedrock caches the processed prefix (system prompt, long context) across invocations — big wins for chatbots with stable personas.
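A minimal sketch of Layer 2, deterministic request hashing: hash everything that affects the output (model, prompt, parameters) and serve identical requests from the cache with zero inference cost. The in-memory dict stands in for whatever store you would actually use (ElastiCache, DynamoDB):

```python
import hashlib
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
_cache: dict[str, str] = {}  # stand-in for ElastiCache / DynamoDB

def cached_converse(model_id: str, prompt: str, params: dict) -> str:
    # Deterministic hash of everything that affects the output.
    key = hashlib.sha256(
        json.dumps({"m": model_id, "p": prompt, "cfg": params}, sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:                      # exact-match hit -> no invocation, no token cost
        return _cache[key]
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig=params,
    )
    text = resp["output"]["message"]["content"][0]["text"]
    _cache[key] = text
    return text
```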

Task 4.2 — Application performance optimization

Latency optimization patterns

Pre-computation

Predictable queries
  • Generate responses ahead of time
  • Cache popular FAQ answers
  • Zero inference latency at read

Latency-optimized models

Bedrock variants
  • Lower latency, slightly reduced quality
  • Time-sensitive apps
  • Must be requested explicitly (Bedrock latency-optimized inference)

Parallel requests

Complex workflows
  • Run independent FM calls concurrently
  • Gather results then synthesize
  • Reduces wall-clock time

Streaming

Perceived latency
  • Tokens shown as generated
  • Time-to-first-token is what users feel
  • Chat UX essential
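A streaming sketch with the Converse API: chunks are printed as they arrive, so the user experiences time-to-first-token rather than total generation time (the model ID is illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    messages=[{"role": "user", "content": [{"text": "Explain semantic caching briefly."}]}],
)
for event in resp["stream"]:
    # Each contentBlockDelta event carries the next chunk of generated text.
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)
print()
```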

FM parameter tuning

Temperature
Controls randomness. 0.0–0.3 = deterministic/factual (good for SQL generation, extraction). 0.7–1.0 = creative/varied (brainstorming, writing).
Top-k
Limits sampling to top k most probable tokens. Lower = more focused output.
Top-p (nucleus)
Limits sampling to tokens whose cumulative probability exceeds p. 0.1 = very focused, 0.9 = diverse.
A/B testing
Compare parameter configurations on real traffic. Use Bedrock Prompt Management variants.
Trap — temperature selection. Question: "The SQL-generating model sometimes produces incorrect syntax" → answer: lower the temperature (0.0–0.2). Don't pick "switch to a larger model" if the fix is parameter tuning.
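Per the trap above, the fix for flaky SQL is usually the inference parameters, not a bigger model. A sketch of a near-deterministic configuration (model ID, stop sequence, and prompt are illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    system=[{"text": "Return only a valid ANSI SQL statement, no commentary."}],
    messages=[{"role": "user", "content": [{"text": "Monthly revenue by region for 2024."}]}],
    inferenceConfig={
        "temperature": 0.0,      # near-deterministic output for syntax-sensitive tasks
        "topP": 0.1,             # keep sampling tightly focused
        "maxTokens": 300,
        "stopSequences": [";"],  # stop once the statement is complete
    },
)
print(resp["output"]["message"]["content"][0]["text"])
```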

Retrieval performance

Index tuning HNSW parameters: ef_search (query-time accuracy/speed), ef_construction (build quality). Higher = better recall, higher latency (see the index sketch after this list).
Query preprocessing Clean and expand queries before vector search; remove noise, normalize.
Hybrid scoring Hybrid search with custom scoring — weight keyword vs. semantic results based on query type.
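To show where the ef_search and ef_construction knobs actually live, here is an illustrative OpenSearch k-NN index body; exact parameter names and values are assumptions to verify against your engine's documentation.

```python
# OpenSearch k-NN index body illustrating the HNSW knobs named above.
# Higher ef_search / ef_construction -> better recall, more latency / build cost.
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 256,   # query-time accuracy/speed tradeoff
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,              # must match your embedding model
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 512,  # build-time graph quality
                        "m": 16,                 # graph connectivity
                    },
                },
            }
        }
    },
}
# e.g. with opensearch-py: client.indices.create(index="docs", body=index_body)
```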

Throughput optimization

  • Token processing optimization — optimize prompt length for throughput (shorter prompts = more throughput per unit of provisioned capacity)
  • Batch inference strategies — for non-real-time workloads, use Bedrock batch for the ~50% discount
  • Concurrent invocation management — manage parallelism without exceeding rate limits; use semaphores, queues
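A sketch of the last bullet: an asyncio semaphore bounds in-flight invocations so parallel calls stay under account rate limits (the limit of 5 and the model ID are arbitrary):

```python
import asyncio
import boto3

bedrock = boto3.client("bedrock-runtime")
MAX_IN_FLIGHT = 5  # arbitrary cap; keep it under your account's invocation rate limits

def _converse(prompt: str) -> str:
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200},
    )
    return resp["output"]["message"]["content"][0]["text"]

async def run_all(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)          # at most MAX_IN_FLIGHT calls at once
    async def bounded(prompt: str) -> str:
        async with sem:
            return await asyncio.to_thread(_converse, prompt)  # boto3 is sync; offload to a thread
    return await asyncio.gather(*(bounded(p) for p in prompts))

# answers = asyncio.run(run_all(["q1", "q2", "q3"]))
```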

Task 4.3 — Monitoring systems for GenAI

The GenAI observability stack

Metrics CloudWatch — token usage, latency, error rates, throughput, custom metrics. Build dashboards for each (see the query sketch after this list).
Invocation logs Bedrock Model Invocation Logs — detailed request/response logging to CloudWatch Logs or S3. Enable once per Region in the Bedrock settings.
Query CloudWatch Logs Insights — query prompts and responses at scale.
Tracing X-Ray — distributed tracing. Identify latency bottlenecks across the RAG pipeline.
Audit CloudTrail — API-level audit of who invoked which model.
Cost Cost Explorer + Cost Anomaly Detection — spending trends and alerts.
Dashboards Managed Grafana — unified dashboards across AWS services.
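A sketch of pulling Bedrock's runtime metrics into a report: CloudWatch publishes per-model token and latency metrics under the AWS/Bedrock namespace (verify metric and dimension names for your Region; the model ID is illustrative):

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",   # also: OutputTokenCount, Invocations, InvocationLatency
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,                    # hourly buckets
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```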

GenAI-specific KPIs (not in traditional ML)

Token usage

Cost driver #1
  • Input vs. output tokens per request
  • Per-user, per-team attribution

Prompt effectiveness

Response quality trend
  • Quality scores over time
  • Regression detection

Hallucination rate

Factual accuracy
  • Measure vs. golden dataset
  • Alert on spikes

Response quality

Relevance, coherence
  • LLM-as-a-Judge scoring
  • Human feedback ratings

Token burst detection

Anomaly patterns
  • Sudden spikes → runaway agent?
  • Response drift
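A sketch of a burst alert on those same metrics: a static CloudWatch alarm on hourly input tokens that notifies an SNS topic when usage spikes (threshold, period, and topic ARN are placeholders; an anomaly-detection alarm is the more adaptive option):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="bedrock-input-token-burst",
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=3600,                       # hourly window
    EvaluationPeriods=1,
    Threshold=5_000_000,               # placeholder: ~5M input tokens/hour
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:genai-alerts"],  # placeholder SNS topic
)
```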

Cost anomalies

Unexpected spend
  • Cost Anomaly Detection
  • Per-service alerts

Tool & vector store operations

Tool call patterns
Track which tools get called, how often, in what order. Identify inefficient agent behaviors.
Per-tool metrics
Latency and error rates per tool. One slow tool can tank agent performance.
Multi-agent coordination
Tracing for agent-to-agent handoffs. Where do multi-agent workflows stall?
Usage baselines
Establish normal patterns; alert on deviations.
Vector store perf
Query latency, result count, recall/precision on golden queries.
Index optimization
Automated routines to rebuild/reshard indexes when performance degrades.

GenAI-specific failure mode detection

  • Golden datasets — curated questions with known correct answers. Run periodically; flag regressions.
  • Output diffing — compare responses for the same input over time. Catches silent drift.
  • Reasoning path tracing — for agents, log every thought/action/observation; identify where reasoning goes off.
  • Specialized observability pipelines — custom logging for GenAI-specific issues (context overflow, guardrail blocks, hallucination flags).
Exam angle — golden dataset. "Detect if a model update degraded responses" = golden dataset regression test. "Detect silent behavior changes" = output diffing. "Diagnose why an agent gave a wrong answer" = reasoning path tracing / Bedrock Agent Tracing.
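A minimal golden-dataset regression sketch: re-ask curated questions after every model or prompt change and flag answers that drop the expected facts. Substring matching is a crude stand-in; LLM-as-a-Judge or embedding similarity is the usual upgrade. The golden entries, model ID, and checks are toy examples:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # illustrative model ID

GOLDEN = [  # toy entries: curated questions plus a fact the answer must contain
    {"q": "How many days does a customer have to request a refund?", "must_include": "30"},
    {"q": "Which Region hosts the primary Knowledge Base?", "must_include": "us-east-1"},
]

def answer(question: str) -> str:
    resp = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.0},
    )
    return resp["output"]["message"]["content"][0]["text"]

failures = [g["q"] for g in GOLDEN if g["must_include"] not in answer(g["q"])]
if failures:
    print(f"Regression: {len(failures)}/{len(GOLDEN)} golden answers degraded: {failures}")
```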

Domain 4 summary — what to remember

Cost levers in priority order

  • 1 Smaller model — cheapest way to cut unit cost
  • 2 Prompt compression — cut input tokens
  • 3 Response limiting — cut output tokens
  • 4 Model cascading — cheap-first, escalate
  • 5 Semantic caching — skip invocations
  • 6 Prompt caching — Bedrock prefix cache
  • 7 Batch inference — ~50% discount
  • 8 Provisioned throughput — predictable volumes

The observability map

  • Metrics CloudWatch
  • Logs Bedrock Model Invocation Logs + Logs Insights
  • Tracing X-Ray (distributed) + Agent Tracing (reasoning)
  • Audit CloudTrail
  • Cost Cost Explorer + Cost Anomaly Detection
  • Dashboards Managed Grafana
  • Quality Golden dataset + LLM-as-a-Judge
Next up. Continue to Domain 5 — Testing, Validation & Troubleshooting (11% — the smallest domain, but essential). Or see the model cascading pattern diagrammed end-to-end.