Task 4.1 — Cost optimization & resource efficiency

Token economics — the fundamentals

Bedrock charges you by input tokens + output tokens, not per invocation. Roughly 4 English characters = 1 token. A long system prompt multiplied across millions of requests dwarfs the cost of the actual queries. This is the most important concept in the domain.
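To make the arithmetic concrete, here is a minimal back-of-the-envelope estimator using the ~4-characters-per-token heuristic; the per-1K-token prices are placeholders, not actual Bedrock pricing.

```python
# Rough cost estimator for an FM call using the ~4 chars/token heuristic.
# Prices are illustrative placeholders -- look up real per-model pricing.
PRICE_PER_1K_INPUT = 0.003    # USD per 1,000 input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.015   # USD per 1,000 output tokens (placeholder)

def estimate_tokens(text: str) -> int:
    """~4 English characters per token; real tokenizers vary by model."""
    return max(1, len(text) // 4)

def estimate_cost(system_prompt: str, user_query: str, expected_output_chars: int) -> float:
    input_tokens = estimate_tokens(system_prompt) + estimate_tokens(user_query)
    output_tokens = max(1, expected_output_chars // 4)
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A 6,000-character system prompt attached to every one of 1M requests/month:
per_call = estimate_cost("x" * 6000, "x" * 200, expected_output_chars=800)
print(f"~${per_call:.5f}/call, ~${per_call * 1_000_000:,.0f}/month at 1M calls")
```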

Exam angle — token cost levers. Every cost-reduction question tests which lever applies: input tokens (prompt compression, context pruning) · output tokens (response size controls, concise instructions) · invocation count (caching) · per-token price (smaller model, cascading).

Token efficiency levers

Token estimation

Measure before you cut
  • Count tokens before sending
  • CloudWatch tracks usage
  • Baseline your expensive calls
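The "measure before you cut" step is easy to operationalize: the Converse API returns a usage block with input/output token counts, which you can forward to CloudWatch as custom metrics for per-team attribution. A minimal sketch (the namespace, metric names, and Team dimension are illustrative choices):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
cloudwatch = boto3.client("cloudwatch")

def invoke_and_record(model_id: str, prompt: str, team: str) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    usage = resp["usage"]  # Converse reports inputTokens / outputTokens per call
    cloudwatch.put_metric_data(
        Namespace="GenAI/Usage",  # custom namespace -- name is illustrative
        MetricData=[
            {"MetricName": "InputTokens", "Value": usage["inputTokens"],
             "Unit": "Count", "Dimensions": [{"Name": "Team", "Value": team}]},
            {"MetricName": "OutputTokens", "Value": usage["outputTokens"],
             "Unit": "Count", "Dimensions": [{"Name": "Team", "Value": team}]},
        ],
    )
    return resp["output"]["message"]["content"][0]["text"]
```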

Context optimization

Only include what matters
  • Don't dump whole documents
  • RAG top-k tuning
  • Remove redundancy
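For the RAG top-k lever, Bedrock Knowledge Bases let you cap how many chunks come back per query, which directly bounds the context tokens you pay for. A sketch, assuming an existing Knowledge Base (the ID is a placeholder):

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    """Pull only the top_k most relevant chunks instead of dumping whole documents."""
    resp = agent_runtime.retrieve(
        knowledgeBaseId="KB12345678",  # placeholder Knowledge Base ID
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [r["content"]["text"] for r in resp["retrievalResults"]]
```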

Response controls

Cap the output
  • max_tokens parameter
  • "Be concise" instructions
  • Structured output (shorter than prose)
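A sketch of the output-side controls above, combining a hard maxTokens cap with a concise-answer system instruction via the Converse API (the model ID is illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    system=[{"text": "Answer in at most two sentences. Return JSON only, no prose."}],
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
    inferenceConfig={
        "maxTokens": 150,   # hard cap on billed output tokens
        "temperature": 0.2,
    },
)
print(resp["output"]["message"]["content"][0]["text"])
```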

Prompt compression

Shorten without losing meaning
  • Remove preamble
  • Use abbreviations in context
  • Strip markdown if not needed

Context pruning

Drop old conversation turns
  • Keep recent + summarize old
  • Sliding window
  • Summary-of-summaries pattern
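A minimal sliding-window sketch of the pruning pattern above: keep the last few turns verbatim and fold everything older into a short summary produced by a cheap model (window size and model ID are arbitrary choices):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
WINDOW = 6  # number of recent turns kept verbatim

def prune_history(turns: list[dict]) -> list[dict]:
    """turns: [{'role': 'user'|'assistant', 'text': ...}, ...], oldest first."""
    if len(turns) <= WINDOW:
        return turns
    old, recent = turns[:-WINDOW], turns[-WINDOW:]
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in old)
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # cheap model for summarization
        messages=[{"role": "user", "content": [{
            "text": f"Summarize this conversation in under 100 words:\n{transcript}"}]}],
        inferenceConfig={"maxTokens": 200},
    )
    summary = resp["output"]["message"]["content"][0]["text"]
    # Re-inject the summary as one compact turn ahead of the recent window.
    return [{"role": "user", "text": f"(Summary of earlier conversation) {summary}"}] + recent
```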

Response limiting

Tell the model to be brief
  • "Answer in 2 sentences"
  • "Return only the answer, no explanation"

Cost-effective model selection

Cost-capability tradeoff — don't use the most expensive model for simple tasks. Haiku/Nova Micro for classification; reserve Opus/Sonnet for reasoning.
Tiered FM usage — route simple queries to cheap models, complex ones to powerful models. An upfront classifier decides (sketch below).
Inference cost vs. response quality — find the sweet spot per use case through A/B testing.
Price-to-performance ratio — measure quality-per-dollar across models, not just cost.
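A sketch of the tiered pattern: a cheap classifier call labels the query, and only queries labeled complex reach the expensive model (the model IDs and two-label scheme are illustrative, not prescribed):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
CHEAP = "anthropic.claude-3-haiku-20240307-v1:0"        # illustrative model IDs
POWERFUL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def ask(model_id: str, prompt: str, max_tokens: int = 512) -> str:
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.0},
    )
    return resp["output"]["message"]["content"][0]["text"]

def route(query: str) -> str:
    # Cheap upfront classification, then route to the appropriate tier.
    label = ask(CHEAP, f"Classify as SIMPLE or COMPLEX. Reply with one word.\nQuery: {query}", 5)
    model = POWERFUL if "COMPLEX" in label.upper() else CHEAP
    return ask(model, query)
```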

High-performance FM systems — throughput

Batching
Batch inference — group many requests into a single asynchronous job. Bedrock offers batch inference at roughly a 50% discount for non-real-time workloads (see the job sketch after this list).
Capacity planning
Estimate token volume (not request count). Provision for peak on steady workloads; scale to zero for spiky ones.
Utilization monitoring
Track how much of provisioned capacity you're actually using. Right-size quarterly.
Auto-scaling
SageMaker endpoints auto-scale on request volume. Bedrock on-demand scaling is handled on the managed side.
Provisioned Throughput
Right-size Bedrock Provisioned Throughput — reserved capacity for predictable workloads. Required for custom models.
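For the batching row above, Bedrock batch inference runs as an asynchronous job over a JSONL file in S3 rather than a real-time call. A minimal submission sketch (bucket, role ARN, job name, and model ID are placeholders):

```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

job = bedrock.create_model_invocation_job(
    jobName="nightly-summaries-2024-06-01",                      # placeholder job name
    modelId="anthropic.claude-3-haiku-20240307-v1:0",            # illustrative model ID
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",   # placeholder role
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch-input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch-output/"}},
)
print(job["jobArn"])  # poll get_model_invocation_job(jobIdentifier=...) for status
```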

Intelligent caching — the layer cake

  • Layer 1: Edge caching (CloudFront at POPs)
  • Layer 2: Exact-match cache (deterministic hash)
  • Layer 3: Semantic cache (similar queries)
  • Layer 4: Prompt caching (Bedrock prefix cache)
  • Miss: falls through to a full Bedrock invocation
Cache type cheat sheet. Semantic caching = cache by meaning ("what's your refund policy" ≈ "how do I get a refund"). Result fingerprinting = hash request characteristics. Edge caching = CloudFront near the user. Deterministic request hashing = identical inputs → guaranteed hit. Prompt caching = Bedrock caches the processed prefix (system prompt, long context) across invocations — big wins for chatbots with stable personas.
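A minimal sketch of Layer 2, deterministic request hashing: hash everything that affects the output (model, prompt, parameters) and serve identical requests from the cache with zero inference cost. The in-memory dict stands in for whatever store you would actually use (ElastiCache, DynamoDB):

```python
import hashlib
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
_cache: dict[str, str] = {}  # stand-in for ElastiCache / DynamoDB

def cached_converse(model_id: str, prompt: str, params: dict) -> str:
    # Deterministic hash of everything that affects the output.
    key = hashlib.sha256(
        json.dumps({"m": model_id, "p": prompt, "cfg": params}, sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:                      # exact-match hit -> no invocation, no token cost
        return _cache[key]
    resp = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig=params,
    )
    text = resp["output"]["message"]["content"][0]["text"]
    _cache[key] = text
    return text
```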

Task 4.2 — Application performance optimization

Latency optimization patterns

Pre-computation

Predictable queries
  • Generate responses ahead of time
  • Cache popular FAQ answers
  • Zero inference latency at read

Latency-optimized models

Bedrock variants
  • Lower latency, slightly reduced quality
  • Time-sensitive apps
  • Must be requested explicitly (Bedrock latency-optimized inference)

Parallel requests

Complex workflows
  • Run independent FM calls concurrently
  • Gather results then synthesize
  • Reduces wall-clock time

Streaming

Perceived latency
  • Tokens shown as generated
  • Time-to-first-token is what users feel
  • Chat UX essential
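A streaming sketch with the Converse API: chunks are printed as they arrive, so the user experiences time-to-first-token rather than total generation time (the model ID is illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    messages=[{"role": "user", "content": [{"text": "Explain semantic caching briefly."}]}],
)
for event in resp["stream"]:
    # Each contentBlockDelta event carries the next chunk of generated text.
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)
print()
```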

FM parameter tuning

Temperature
Controls randomness. 0.0–0.3 = deterministic/factual (good for SQL generation, extraction). 0.7–1.0 = creative/varied (brainstorming, writing).
Top-k
Limits sampling to top k most probable tokens. Lower = more focused output.
Top-p (nucleus)
Limits sampling to tokens whose cumulative probability exceeds p. 0.1 = very focused, 0.9 = diverse.
A/B testing
Compare parameter configurations on real traffic. Use Bedrock Prompt Management variants.
Trap — temperature selection. Question: "The SQL-generating model sometimes produces incorrect syntax" → answer: lower the temperature (0.0–0.2). Don't pick "switch to a larger model" if the fix is parameter tuning.
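Per the trap above, the fix for flaky SQL is usually the inference parameters, not a bigger model. A sketch of a near-deterministic configuration (model ID, stop sequence, and prompt are illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

resp = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    system=[{"text": "Return only a valid ANSI SQL statement, no commentary."}],
    messages=[{"role": "user", "content": [{"text": "Monthly revenue by region for 2024."}]}],
    inferenceConfig={
        "temperature": 0.0,      # near-deterministic output for syntax-sensitive tasks
        "topP": 0.1,             # keep sampling tightly focused
        "maxTokens": 300,
        "stopSequences": [";"],  # stop once the statement is complete
    },
)
print(resp["output"]["message"]["content"][0]["text"])
```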

Retrieval performance

Index tuning HNSW parameters: ef_search (query-time accuracy/speed), ef_construction (build quality). Higher = better recall, higher latency (see the index sketch after this list).
Query preprocessing Clean and expand queries before vector search; remove noise, normalize.
Hybrid scoring Hybrid search with custom scoring — weight keyword vs. semantic results based on query type.
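To show where the ef_search and ef_construction knobs actually live, here is an illustrative OpenSearch k-NN index body; exact parameter names and values are assumptions to verify against your engine's documentation.

```python
# OpenSearch k-NN index body illustrating the HNSW knobs named above.
# Higher ef_search / ef_construction -> better recall, more latency / build cost.
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 256,   # query-time accuracy/speed tradeoff
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,              # must match your embedding model
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "parameters": {
                        "ef_construction": 512,  # build-time graph quality
                        "m": 16,                 # graph connectivity
                    },
                },
            }
        }
    },
}
# e.g. with opensearch-py: client.indices.create(index="docs", body=index_body)
```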

Throughput optimization

  • Token processing optimization — optimize prompt length for throughput (shorter prompts = more throughput per unit of provisioned capacity)
  • Batch inference strategies — for non-real-time workloads, use Bedrock batch for the ~50% discount
  • Concurrent invocation management — manage parallelism without exceeding rate limits; use semaphores, queues
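A sketch of the last bullet: an asyncio semaphore bounds in-flight invocations so parallel calls stay under account rate limits (the limit of 5 and the model ID are arbitrary):

```python
import asyncio
import boto3

bedrock = boto3.client("bedrock-runtime")
MAX_IN_FLIGHT = 5  # arbitrary cap; keep it under your account's invocation rate limits

def _converse(prompt: str) -> str:
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200},
    )
    return resp["output"]["message"]["content"][0]["text"]

async def run_all(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)          # at most MAX_IN_FLIGHT calls at once
    async def bounded(prompt: str) -> str:
        async with sem:
            return await asyncio.to_thread(_converse, prompt)  # boto3 is sync; offload to a thread
    return await asyncio.gather(*(bounded(p) for p in prompts))

# answers = asyncio.run(run_all(["q1", "q2", "q3"]))
```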

Task 4.3 — Monitoring systems for GenAI

The GenAI observability stack

Metrics CloudWatch — token usage, latency, error rates, throughput, custom metrics. Build dashboards for each (see the query sketch after this list).
Invocation logs Bedrock Model Invocation Logs — detailed request/response logging to CloudWatch Logs or S3. Enable once per Region in the Bedrock settings.
Query CloudWatch Logs Insights — query prompts and responses at scale.
Tracing X-Ray — distributed tracing. Identify latency bottlenecks across the RAG pipeline.
Audit CloudTrail — API-level audit of who invoked which model.
Cost Cost Explorer + Cost Anomaly Detection — spending trends and alerts.
Dashboards Managed Grafana — unified dashboards across AWS services.
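A sketch of pulling Bedrock's runtime metrics into a report: CloudWatch publishes per-model token and latency metrics under the AWS/Bedrock namespace (verify metric and dimension names for your Region; the model ID is illustrative):

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",   # also: OutputTokenCount, Invocations, InvocationLatency
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=3600,                    # hourly buckets
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))
```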

GenAI-specific KPIs (not in traditional ML)

Token usage

Cost driver #1
  • Input vs. output tokens per request
  • Per-user, per-team attribution

Prompt effectiveness

Response quality trend
  • Quality scores over time
  • Regression detection

Hallucination rate

Factual accuracy
  • Measure vs. golden dataset
  • Alert on spikes

Response quality

Relevance, coherence
  • LLM-as-a-Judge scoring
  • Human feedback ratings

Token burst detection

Anomaly patterns
  • Sudden spikes → runaway agent?
  • Response drift
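A sketch of a burst alert on those same metrics: a static CloudWatch alarm on hourly input tokens that notifies an SNS topic when usage spikes (threshold, period, and topic ARN are placeholders; an anomaly-detection alarm is the more adaptive option):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="bedrock-input-token-burst",
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=3600,                       # hourly window
    EvaluationPeriods=1,
    Threshold=5_000_000,               # placeholder: ~5M input tokens/hour
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:genai-alerts"],  # placeholder SNS topic
)
```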

Cost anomalies

Unexpected spend
  • Cost Anomaly Detection
  • Per-service alerts

Tool & vector store operations

Tool call patterns
Track which tools get called, how often, in what order. Identify inefficient agent behaviors.
Per-tool metrics
Latency and error rates per tool. One slow tool can tank agent performance.
Multi-agent coordination
Tracing for agent-to-agent handoffs. Where do multi-agent workflows stall?
Usage baselines
Establish normal patterns; alert on deviations.
Vector store perf
Query latency, result count, recall/precision on golden queries.
Index optimization
Automated routines to rebuild/reshard indexes when performance degrades.

GenAI-specific failure mode detection

  • Golden datasets — curated questions with known correct answers. Run periodically; flag regressions.
  • Output diffing — compare responses for the same input over time. Catches silent drift.
  • Reasoning path tracing — for agents, log every thought/action/observation; identify where reasoning goes off.
  • Specialized observability pipelines — custom logging for GenAI-specific issues (context overflow, guardrail blocks, hallucination flags).
Exam angle — golden dataset. "Detect if a model update degraded responses" = golden dataset regression test. "Detect silent behavior changes" = output diffing. "Diagnose why an agent gave a wrong answer" = reasoning path tracing / Bedrock Agent Tracing.
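A minimal golden-dataset regression sketch: re-ask curated questions after every model or prompt change and flag answers that drop the expected facts. Substring matching is a crude stand-in; LLM-as-a-Judge or embedding similarity is the usual upgrade. The golden entries, model ID, and checks are toy examples:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # illustrative model ID

GOLDEN = [  # toy entries: curated questions plus a fact the answer must contain
    {"q": "How many days does a customer have to request a refund?", "must_include": "30"},
    {"q": "Which Region hosts the primary Knowledge Base?", "must_include": "us-east-1"},
]

def answer(question: str) -> str:
    resp = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.0},
    )
    return resp["output"]["message"]["content"][0]["text"]

failures = [g["q"] for g in GOLDEN if g["must_include"] not in answer(g["q"])]
if failures:
    print(f"Regression: {len(failures)}/{len(GOLDEN)} golden answers degraded: {failures}")
```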

Domain 4 summary — what to remember

Cost levers in priority order

  • 1 Smaller model — cheapest way to cut unit cost
  • 2 Prompt compression — cut input tokens
  • 3 Response limiting — cut output tokens
  • 4 Model cascading — cheap-first, escalate
  • 5 Semantic caching — skip invocations
  • 6 Prompt caching — Bedrock prefix cache
  • 7 Batch inference — ~50% discount
  • 8 Provisioned throughput — predictable volumes

The observability map

  • Metrics CloudWatch
  • Logs Bedrock Model Invocation Logs + Logs Insights
  • Tracing X-Ray (distributed) + Agent Tracing (reasoning)
  • Audit CloudTrail
  • Cost Cost Explorer + Cost Anomaly Detection
  • Dashboards Managed Grafana
  • Quality Golden dataset + LLM-as-a-Judge
Next up. Continue to Domain 5 — Testing, Validation & Troubleshooting (11% — the smallest domain, but essential). Or see the model cascading pattern diagrammed end-to-end.