Task 5.1 — Evaluation systems for GenAI

Why traditional ML metrics don't work

There's no "accuracy" score for a free-form summary. You need a new taxonomy of quality dimensions.

Relevance

Does it answer the question?
  • Topical match to query
  • Addresses stated intent

Factual accuracy

Are the claims true?
  • Verify against ground truth
  • Catch hallucinations

Consistency

Same Q → similar A?
  • Stability across runs
  • Temperature = 0 helps
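
Where run-to-run stability matters, pinning sampling parameters is the first lever. A minimal sketch with the Bedrock Converse API (region, model ID, and prompt are placeholders; greedy decoding reduces but does not fully eliminate variation):

    import boto3

    # Region, model ID, and prompt are illustrative placeholders.
    runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
        inferenceConfig={"temperature": 0.0, "topP": 1.0, "maxTokens": 512},
    )
    print(response["output"]["message"]["content"][0]["text"])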

Fluency

Well-written, coherent?
  • Grammar, flow, readability

Groundedness

Supported by retrieved docs? (RAG)
  • Every claim traces to a source
  • Guardrails contextual grounding checks this
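
A hedged sketch of invoking that check directly with the ApplyGuardrail API (the guardrail ID, version, and texts are placeholders, and the guardrail is assumed to already have a contextual grounding policy enabled):

    import boto3

    runtime = boto3.client("bedrock-runtime")

    # Illustrative inputs; in a real pipeline these come from the RAG flow.
    retrieved_chunk = "Refunds are available within 30 days of purchase."
    user_question = "What is the refund window?"
    model_answer = "You can get a refund within 90 days."

    # Placeholder guardrail ID/version.
    result = runtime.apply_guardrail(
        guardrailIdentifier="gr-EXAMPLEID",
        guardrailVersion="1",
        source="OUTPUT",   # checking a model response, not user input
        content=[
            {"text": {"text": retrieved_chunk, "qualifiers": ["grounding_source"]}},
            {"text": {"text": user_question, "qualifiers": ["query"]}},
            {"text": {"text": model_answer, "qualifiers": ["guard_content"]}},
        ],
    )
    if result["action"] == "GUARDRAIL_INTERVENED":
        print("Answer is not grounded in the retrieved source.")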

Harmlessness

Free of bias, toxicity, bad content
  • Content filter evaluation
  • Fairness across groups

Systematic model evaluation with Bedrock

Bedrock Model Evaluations Built-in evaluation framework. Automatic metrics (accuracy, robustness, toxicity) + human evaluation jobs + custom metrics.
A/B testing Route traffic to two models/configurations. Compare quality, cost, latency on real data.
Canary Route a small percentage of traffic to a new model. Ramp up if metrics hold; roll back if they degrade (see the routing sketch below).
Multi-model Side-by-side evaluation of multiple FMs on the same inputs. Token-efficiency, latency-to-quality, cost-per-outcome.
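
The A/B and canary rows reduce to weighted routing between model identifiers plus per-arm metrics. A minimal, framework-free sketch (the model IDs, the 5% weight, and the metric hook are illustrative):

    import random

    # Illustrative model IDs and canary weight; in practice these come from config.
    STABLE_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
    CANARY_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    CANARY_FRACTION = 0.05   # start small, ramp up if quality/cost/latency hold

    def pick_model() -> str:
        """Route a small share of traffic to the canary model."""
        return CANARY_MODEL if random.random() < CANARY_FRACTION else STABLE_MODEL

    def record_outcome(model_id: str, latency_ms: float, quality_score: float) -> None:
        # Emit per-model metrics (e.g. to CloudWatch) so the two arms can be compared.
        print(f"{model_id}: latency={latency_ms:.0f}ms quality={quality_score:.2f}")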

The RAG evaluation triad

RAG systems have three independent failure points, so you evaluate each separately:

1. Context relevance: are the retrieved chunks relevant to the query?
2. Groundedness: is the answer supported by the retrieved context?
3. Answer relevance: does the answer address the query?
Diagnostic logic:
  • Relevant docs retrieved + hallucinated answer = grounding failure (use Guardrails contextual grounding).
  • Wrong docs retrieved = retrieval failure (tune embeddings, chunking, or add hybrid search).
  • Relevant docs + grounded answer that misses the question = answer relevance failure (tune the prompt).
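
That triage can be written down as a tiny decision function over the three triad scores; the 0.7 threshold is illustrative, and in practice the scores would come from an LLM judge or a managed RAG evaluation:

    def triage_rag_failure(context_relevance: float,
                           groundedness: float,
                           answer_relevance: float,
                           threshold: float = 0.7) -> str:
        """Map the RAG triad scores (0-1) to the component that needs fixing."""
        if context_relevance < threshold:
            return "retrieval failure: tune embeddings/chunking or add hybrid search"
        if groundedness < threshold:
            return "grounding failure: enable Guardrails contextual grounding"
        if answer_relevance < threshold:
            return "answer relevance failure: tune the prompt/instructions"
        return "all triad checks pass"

    print(triage_rag_failure(0.9, 0.4, 0.8))  # -> grounding failure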

Retrieval quality testing (for RAG)

Relevance scoring
Are retrieved documents actually relevant to the query?
Context matching
Does the retrieved context contain the answer?
Retrieval latency
How fast is the vector search returning results?
Recall@k
Of all relevant documents, how many appear in the top k results?
Precision@k
Of the top k results, how many are actually relevant?
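
Both ranking metrics are easy to compute once each query has a labeled set of relevant document IDs; a minimal sketch:

    def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of the top-k retrieved docs that are relevant."""
        return sum(doc in relevant for doc in retrieved[:k]) / k

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of all relevant docs that appear in the top-k results."""
        return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

    retrieved = ["d1", "d7", "d3", "d9", "d2"]
    relevant = {"d1", "d2", "d4"}
    print(precision_at_k(retrieved, relevant, k=5))  # 0.4  (2 of the top 5 are relevant)
    print(recall_at_k(retrieved, relevant, k=5))     # 0.67 (2 of the 3 relevant docs found)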

LLM-as-a-Judge — the scalable eval pattern

Input (question + response from the primary FM) → Judge (a secondary FM scores it against a rubric) → Score (0-10 or pass/fail, per metric) → Aggregate (dashboard, trend over time) → Alert (regression detected, block the deploy).

Bedrock Model Evaluations supports LLM-as-a-Judge natively. Scalable substitute for human evaluation at orders of magnitude lower cost.
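
A minimal hand-rolled judge over the Converse API, to make the pattern concrete (the rubric, judge model ID, and 0-10 integer format are illustrative; Bedrock Model Evaluations packages the same idea as a managed job):

    import boto3

    runtime = boto3.client("bedrock-runtime")
    JUDGE_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"   # placeholder judge FM

    RUBRIC = (
        "Rate the RESPONSE to the QUESTION for relevance and factual accuracy.\n"
        "Reply with a single integer from 0 (unusable) to 10 (excellent), nothing else."
    )

    def judge(question: str, response: str) -> int:
        prompt = f"{RUBRIC}\n\nQUESTION: {question}\n\nRESPONSE: {response}"
        reply = runtime.converse(
            modelId=JUDGE_MODEL,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": 0.0, "maxTokens": 5},
        )
        return int(reply["output"]["message"]["content"][0]["text"].strip())

Scores from runs like this feed the aggregate and alert stages described above.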

Agent performance evaluation

Task completion rate

Did it finish the job?
  • Binary or graded
  • End-to-end outcome metric

Tool usage efficiency

Wasted calls?
  • Minimal unnecessary tool calls
  • Short paths through problem space

Bedrock Agent evals

Built-in
  • Framework for agent workflows
  • Tracks reasoning steps

Reasoning quality

Multi-step logic correctness
  • Judge the intermediate steps
  • Not just final answer
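
Given a batch of agent traces, the headline metrics reduce to simple aggregates. A sketch that assumes each trace is a dict with a success flag and the list of tool calls made (real Bedrock Agents traces carry much more detail):

    # Assumed trace shape: {"succeeded": bool, "tool_calls": ["search", "calculator", ...]}
    def task_completion_rate(traces: list[dict]) -> float:
        return sum(t["succeeded"] for t in traces) / len(traces)

    def avg_tool_calls(traces: list[dict]) -> float:
        """Lower is better once completion rate is held constant."""
        return sum(len(t["tool_calls"]) for t in traces) / len(traces)

    traces = [
        {"succeeded": True,  "tool_calls": ["search", "calculator"]},
        {"succeeded": True,  "tool_calls": ["search", "search", "search", "calculator"]},
        {"succeeded": False, "tool_calls": ["search"]},
    ]
    print(task_completion_rate(traces))  # ~0.67
    print(avg_tool_calls(traces))        # ~2.33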

Quality gates & deployment validation

Continuous eval Automated daily/weekly runs against the golden dataset.
Regression testing Every prompt/model change runs the golden set; block if scores drop.
Automated quality gates CI/CD pipeline blocks deploy if eval scores below threshold.
Synthetic users Simulated user workflows test end-to-end before real users see changes.
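
A minimal gate for a CI step: load baseline and candidate scores from the golden-set run and exit non-zero on regression so the pipeline blocks the deploy (file paths, metric names, and the 0.02 tolerance are placeholders):

    import json
    import sys

    MAX_DROP = 0.02   # block the deploy if any metric regresses by more than 0.02 absolute

    def load_scores(path: str) -> dict[str, float]:
        with open(path) as f:
            return json.load(f)   # e.g. {"groundedness": 0.91, "answer_relevance": 0.88}

    def gate(baseline_path: str, candidate_path: str) -> None:
        baseline, candidate = load_scores(baseline_path), load_scores(candidate_path)
        failures = [m for m, base in baseline.items()
                    if candidate.get(m, 0.0) < base - MAX_DROP]
        if failures:
            print(f"Quality gate FAILED, regressed metrics: {failures}")
            sys.exit(1)           # non-zero exit blocks the CI/CD deploy stage
        print("Quality gate passed.")

    if __name__ == "__main__":
        gate("eval/baseline_scores.json", "eval/candidate_scores.json")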

Task 5.2 — Troubleshoot GenAI applications

Content handling issues

Context window overflow
Input + system prompt + retrieved context exceeds the model's limit. Symptoms: truncated responses, missing info, errors. Fix: dynamic chunking, prompt compression, context pruning.
Truncation errors
Critical info cut off because it was at the end of too-long input. Fix: put critical info early; summarize long docs; use hierarchical retrieval.
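
A rough sketch of context pruning: estimate tokens, keep the highest-relevance chunks first, and stop before the window fills (the 4-characters-per-token estimate and the limits are assumptions, not real tokenizer counts):

    def rough_tokens(text: str) -> int:
        return len(text) // 4   # crude estimate; use the model's tokenizer if available

    def fit_context(chunks_by_relevance: list[str], system_prompt: str,
                    question: str, context_limit: int = 8000,
                    reserve_for_output: int = 1000) -> list[str]:
        """Keep the highest-relevance chunks that fit under the context window."""
        budget = (context_limit - reserve_for_output
                  - rough_tokens(system_prompt) - rough_tokens(question))
        kept, used = [], 0
        for chunk in chunks_by_relevance:   # most relevant (most critical) first
            cost = rough_tokens(chunk)
            if used + cost > budget:
                break
            kept.append(chunk)
            used += cost
        return kept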

FM integration issues

Throttling (429)

Too many requests
  • Exponential backoff + jitter
  • Provisioned Throughput
  • Cross-Region Inference
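
A minimal retry sketch for the throttling case (the retry limits are illustrative, and boto3's built-in adaptive retry mode is an alternative to rolling your own):

    import random
    import time
    import boto3
    from botocore.exceptions import ClientError

    runtime = boto3.client("bedrock-runtime")

    def invoke_with_backoff(call, max_retries: int = 5):
        """Retry a Bedrock call on throttling with exponential backoff + jitter."""
        for attempt in range(max_retries):
            try:
                return call()
            except ClientError as err:
                if err.response["Error"]["Code"] != "ThrottlingException":
                    raise                                     # only retry throttles
                time.sleep((2 ** attempt) + random.random())  # exponential backoff + jitter
        raise RuntimeError("Still throttled after retries; consider Provisioned Throughput")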

Timeout (504)

Model took too long
  • Streaming response
  • Shorter max_tokens
  • Async processing via SQS

Content filter blocks

Guardrails caught input/output
  • Check Guardrails trace
  • Tune thresholds if false positive
  • Review denied topics

Invalid model params

Bad request
  • Request validation pre-send
  • Schema check

Prompt engineering problems

Test frameworks Systematic test suites for prompts.
Version comparison Diff outputs across prompt versions to identify regressions (see the sketch after this list).
Systematic refinement Iterative prompt improvement with tracked metrics.

Common issues: ambiguous instructions · conflicting constraints · format inconsistencies · prompt too long, causing earlier instructions to be forgotten.
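
A small sketch of version comparison: run the same test inputs through two prompt templates and diff the outputs (generate is a stand-in for whatever function calls your FM, and the templates are assumed to take an {input} placeholder):

    import difflib

    def compare_prompt_versions(test_inputs: list[str], prompt_v1: str,
                                prompt_v2: str, generate) -> None:
        """Diff model outputs for two prompt versions over the same test inputs."""
        for item in test_inputs:
            out_v1 = generate(prompt_v1.format(input=item))
            out_v2 = generate(prompt_v2.format(input=item))
            diff = difflib.unified_diff(out_v1.splitlines(), out_v2.splitlines(),
                                        fromfile="prompt_v1", tofile="prompt_v2",
                                        lineterm="")
            print(f"--- input: {item!r}")
            print("\n".join(diff) or "(no difference)")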

Retrieval system issues

Embedding quality

Right semantics captured?
  • Domain-fit evaluation
  • Sample queries → manual review

Drift monitoring

Embedding fit degrades over time
  • Data distribution shifts
  • Periodic re-eval needed

Vectorization issues

Indexing problems
  • Wrong embedding model
  • Dimension mismatches
  • Incorrect chunking

Search performance

Slow or low-relevance
  • Tune HNSW ef_search
  • Add metadata filters
  • Try hybrid search
Exam angle — HNSW ef_search "Users report missing relevant docs that exist in the KB. Embeddings and chunking are good in isolation." → increase ef_search. This tunes query-time accuracy at the cost of latency.
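
If the knowledge base's vector store is OpenSearch with the k-NN plugin, ef_search is an index-level setting. A hedged sketch with opensearch-py (endpoint, index name, auth, and the value 512 are placeholders):

    from opensearchpy import OpenSearch

    # Placeholder endpoint; in practice use the domain/collection endpoint and proper auth.
    client = OpenSearch(
        hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
        use_ssl=True,
    )

    # Raise query-time search breadth: better recall of relevant docs, higher latency.
    client.indices.put_settings(
        index="kb-chunks",
        body={"index": {"knn.algo_param.ef_search": 512}},
    )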

Prompt maintenance issues

Template testing
Test prompts with diverse inputs to catch edge cases.
CloudWatch Logs
Diagnose prompt confusion (model misinterprets instructions).
X-Ray prompt observability
Trace the full pipeline from template to FM response.
Schema validation
Detect format inconsistencies in parameterized prompts (see the sketch after this list).
Systematic refinement
Regular review and improvement cycles; version control prompts.
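
A minimal sketch of validating template parameters before rendering a prompt (the schema and template are illustrative; jsonschema is one common choice for the check):

    from string import Template
    from jsonschema import validate   # raises ValidationError on bad/missing params

    PARAM_SCHEMA = {
        "type": "object",
        "required": ["customer_name", "question"],
        "properties": {
            "customer_name": {"type": "string", "minLength": 1},
            "question": {"type": "string", "minLength": 1},
        },
        "additionalProperties": False,
    }

    PROMPT = Template("You are a support assistant for $customer_name.\nQuestion: $question")

    def render_prompt(params: dict) -> str:
        validate(instance=params, schema=PARAM_SCHEMA)
        return PROMPT.substitute(params)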

Domain 5 summary — what to remember

Eval patterns by situation

  • Compare models Bedrock Model Evaluations + A/B
  • New deploy risk Canary routing
  • RAG quality Triad: context / grounding / answer relevance
  • Agent quality Bedrock Agent Evaluations
  • Scale eval LLM-as-a-Judge
  • Detect regression Golden dataset + automated gates
  • Human label SageMaker Ground Truth · A2I
  • Bias SageMaker Clarify

Troubleshooting quick map

  • Response truncated → context overflow → compress/prune
  • Throttled (429) → backoff / Provisioned / Cross-Region
  • Hallucination despite RAG → Guardrails contextual grounding
  • Missing relevant retrieval → increase ef_search / hybrid search
  • Model gives wrong format → JSON schema / structured outputs
  • Inconsistent responses → lower temperature
  • Agent goes wrong → Agent Tracing to inspect reasoning
  • Silent quality drop → golden dataset regression + output diffing
You've covered all five domains. Now head to the Architecture Patterns page to see how all of this assembles into end-to-end designs, or start running practice questions.