Task 5.1 — Evaluation systems for GenAI

Why traditional ML metrics don't work

There's no "accuracy" score for a free-form summary. You need a new taxonomy of quality dimensions.

Relevance

Does it answer the question?
  • Topical match to query
  • Addresses stated intent

Factual accuracy

Are the claims true?
  • Verify against ground truth
  • Catch hallucinations

Consistency

Same Q → similar A?
  • Stability across runs
  • Temperature = 0 helps
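
Where run-to-run stability matters, pinning sampling parameters is the first lever. A minimal sketch with the Bedrock Converse API (region, model ID, and prompt are placeholders; greedy decoding reduces but does not fully eliminate variation):

    import boto3

    # Region, model ID, and prompt are illustrative placeholders.
    runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
        inferenceConfig={"temperature": 0.0, "topP": 1.0, "maxTokens": 512},
    )
    print(response["output"]["message"]["content"][0]["text"])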

Fluency

Well-written, coherent?
  • Grammar, flow, readability

Groundedness

Supported by retrieved docs? (RAG)
  • Every claim traces to a source
  • Guardrails contextual grounding checks this
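
A hedged sketch of invoking that check directly with the ApplyGuardrail API (the guardrail ID, version, and texts are placeholders, and the guardrail is assumed to already have a contextual grounding policy enabled):

    import boto3

    runtime = boto3.client("bedrock-runtime")

    # Illustrative inputs; in a real pipeline these come from the RAG flow.
    retrieved_chunk = "Refunds are available within 30 days of purchase."
    user_question = "What is the refund window?"
    model_answer = "You can get a refund within 90 days."

    # Placeholder guardrail ID/version.
    result = runtime.apply_guardrail(
        guardrailIdentifier="gr-EXAMPLEID",
        guardrailVersion="1",
        source="OUTPUT",   # checking a model response, not user input
        content=[
            {"text": {"text": retrieved_chunk, "qualifiers": ["grounding_source"]}},
            {"text": {"text": user_question, "qualifiers": ["query"]}},
            {"text": {"text": model_answer, "qualifiers": ["guard_content"]}},
        ],
    )
    if result["action"] == "GUARDRAIL_INTERVENED":
        print("Answer is not grounded in the retrieved source.")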

Harmlessness

Free of bias, toxicity, bad content
  • Content filter evaluation
  • Fairness across groups

Systematic model evaluation with Bedrock

Bedrock Model Evaluations Built-in evaluation framework. Automatic metrics (accuracy, robustness, toxicity) + human evaluation jobs + custom metrics.
A/B testing Route traffic to two models/configurations. Compare quality, cost, latency on real data.
Canary Route a small percentage of traffic to a new model. Ramp up if metrics hold; roll back if they degrade (see the routing sketch below).
Multi-model Side-by-side evaluation of multiple FMs on the same inputs. Token-efficiency, latency-to-quality, cost-per-outcome.
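
The A/B and canary rows reduce to weighted routing between model identifiers plus per-arm metrics. A minimal, framework-free sketch (the model IDs, the 5% weight, and the metric hook are illustrative):

    import random

    # Illustrative model IDs and canary weight; in practice these come from config.
    STABLE_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
    CANARY_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    CANARY_FRACTION = 0.05   # start small, ramp up if quality/cost/latency hold

    def pick_model() -> str:
        """Route a small share of traffic to the canary model."""
        return CANARY_MODEL if random.random() < CANARY_FRACTION else STABLE_MODEL

    def record_outcome(model_id: str, latency_ms: float, quality_score: float) -> None:
        # Emit per-model metrics (e.g. to CloudWatch) so the two arms can be compared.
        print(f"{model_id}: latency={latency_ms:.0f}ms quality={quality_score:.2f}")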

The RAG evaluation triad

RAG systems have three independent failure points, so you evaluate each separately:

1. Context relevance: are the retrieved chunks relevant to the query?
2. Groundedness: is the answer supported by the retrieved context?
3. Answer relevance: does the answer address the query?
Diagnostic logic:
  • Relevant docs retrieved + hallucinated answer = grounding failure (use Guardrails contextual grounding).
  • Wrong docs retrieved = retrieval failure (tune embeddings, chunking, or add hybrid search).
  • Relevant docs + grounded answer that misses the question = answer relevance failure (tune the prompt).
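
That triage can be written down as a tiny decision function over the three triad scores; the 0.7 threshold is illustrative, and in practice the scores would come from an LLM judge or a managed RAG evaluation:

    def triage_rag_failure(context_relevance: float,
                           groundedness: float,
                           answer_relevance: float,
                           threshold: float = 0.7) -> str:
        """Map the RAG triad scores (0-1) to the component that needs fixing."""
        if context_relevance < threshold:
            return "retrieval failure: tune embeddings/chunking or add hybrid search"
        if groundedness < threshold:
            return "grounding failure: enable Guardrails contextual grounding"
        if answer_relevance < threshold:
            return "answer relevance failure: tune the prompt/instructions"
        return "all triad checks pass"

    print(triage_rag_failure(0.9, 0.4, 0.8))  # -> grounding failure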

Retrieval quality testing (for RAG)

Relevance scoring
Are retrieved documents actually relevant to the query?
Context matching
Does the retrieved context contain the answer?
Retrieval latency
How fast is the vector search returning results?
Recall@k
Of all relevant documents, how many appear in the top k results?
Precision@k
Of the top k results, how many are actually relevant?
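
Both ranking metrics are easy to compute once each query has a labeled set of relevant document IDs; a minimal sketch:

    def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of the top-k retrieved docs that are relevant."""
        return sum(doc in relevant for doc in retrieved[:k]) / k

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        """Fraction of all relevant docs that appear in the top-k results."""
        return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

    retrieved = ["d1", "d7", "d3", "d9", "d2"]
    relevant = {"d1", "d2", "d4"}
    print(precision_at_k(retrieved, relevant, k=5))  # 0.4  (2 of the top 5 are relevant)
    print(recall_at_k(retrieved, relevant, k=5))     # 0.67 (2 of the 3 relevant docs found)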

LLM-as-a-Judge — the scalable eval pattern

Input (question + response from the primary FM) → Judge (a secondary FM scores it against a rubric) → Score (0-10 or pass/fail, per metric) → Aggregate (dashboard, trend over time) → Alert (regression detected, block the deploy).

Bedrock Model Evaluations supports LLM-as-a-Judge natively. Scalable substitute for human evaluation at orders of magnitude lower cost.
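
A minimal hand-rolled judge over the Converse API, to make the pattern concrete (the rubric, judge model ID, and 0-10 integer format are illustrative; Bedrock Model Evaluations packages the same idea as a managed job):

    import boto3

    runtime = boto3.client("bedrock-runtime")
    JUDGE_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"   # placeholder judge FM

    RUBRIC = (
        "Rate the RESPONSE to the QUESTION for relevance and factual accuracy.\n"
        "Reply with a single integer from 0 (unusable) to 10 (excellent), nothing else."
    )

    def judge(question: str, response: str) -> int:
        prompt = f"{RUBRIC}\n\nQUESTION: {question}\n\nRESPONSE: {response}"
        reply = runtime.converse(
            modelId=JUDGE_MODEL,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": 0.0, "maxTokens": 5},
        )
        return int(reply["output"]["message"]["content"][0]["text"].strip())

Scores from runs like this feed the aggregate and alert stages described above.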

Agent performance evaluation

Task completion rate

Did it finish the job?
  • Binary or graded
  • End-to-end outcome metric

Tool usage efficiency

Wasted calls?
  • Minimal unnecessary tool calls
  • Short paths through problem space

Bedrock Agent evals

Built-in
  • Framework for agent workflows
  • Tracks reasoning steps

Reasoning quality

Multi-step logic correctness
  • Judge the intermediate steps
  • Not just final answer
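
Given a batch of agent traces, the headline metrics reduce to simple aggregates. A sketch that assumes each trace is a dict with a success flag and the list of tool calls made (real Bedrock Agents traces carry much more detail):

    # Assumed trace shape: {"succeeded": bool, "tool_calls": ["search", "calculator", ...]}
    def task_completion_rate(traces: list[dict]) -> float:
        return sum(t["succeeded"] for t in traces) / len(traces)

    def avg_tool_calls(traces: list[dict]) -> float:
        """Lower is better once completion rate is held constant."""
        return sum(len(t["tool_calls"]) for t in traces) / len(traces)

    traces = [
        {"succeeded": True,  "tool_calls": ["search", "calculator"]},
        {"succeeded": True,  "tool_calls": ["search", "search", "search", "calculator"]},
        {"succeeded": False, "tool_calls": ["search"]},
    ]
    print(task_completion_rate(traces))  # ~0.67
    print(avg_tool_calls(traces))        # ~2.33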

Quality gates & deployment validation

Continuous eval Automated daily/weekly runs against the golden dataset.
Regression testing Every prompt/model change runs the golden set; block if scores drop.
Automated quality gates CI/CD pipeline blocks deploy if eval scores below threshold.
Synthetic users Simulated user workflows test end-to-end before real users see changes.
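
A minimal gate for a CI step: load baseline and candidate scores from the golden-set run and exit non-zero on regression so the pipeline blocks the deploy (file paths, metric names, and the 0.02 tolerance are placeholders):

    import json
    import sys

    MAX_DROP = 0.02   # block the deploy if any metric regresses by more than 0.02 absolute

    def load_scores(path: str) -> dict[str, float]:
        with open(path) as f:
            return json.load(f)   # e.g. {"groundedness": 0.91, "answer_relevance": 0.88}

    def gate(baseline_path: str, candidate_path: str) -> None:
        baseline, candidate = load_scores(baseline_path), load_scores(candidate_path)
        failures = [m for m, base in baseline.items()
                    if candidate.get(m, 0.0) < base - MAX_DROP]
        if failures:
            print(f"Quality gate FAILED, regressed metrics: {failures}")
            sys.exit(1)           # non-zero exit blocks the CI/CD deploy stage
        print("Quality gate passed.")

    if __name__ == "__main__":
        gate("eval/baseline_scores.json", "eval/candidate_scores.json")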

Task 5.2 — Troubleshoot GenAI applications

Content handling issues

Context window overflow
Input + system prompt + retrieved context exceeds the model's limit. Symptoms: truncated responses, missing info, errors. Fix: dynamic chunking, prompt compression, context pruning.
Truncation errors
Critical info cut off because it was at the end of too-long input. Fix: put critical info early; summarize long docs; use hierarchical retrieval.
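
A rough sketch of context pruning: estimate tokens, keep the highest-relevance chunks first, and stop before the window fills (the 4-characters-per-token estimate and the limits are assumptions, not real tokenizer counts):

    def rough_tokens(text: str) -> int:
        return len(text) // 4   # crude estimate; use the model's tokenizer if available

    def fit_context(chunks_by_relevance: list[str], system_prompt: str,
                    question: str, context_limit: int = 8000,
                    reserve_for_output: int = 1000) -> list[str]:
        """Keep the highest-relevance chunks that fit under the context window."""
        budget = (context_limit - reserve_for_output
                  - rough_tokens(system_prompt) - rough_tokens(question))
        kept, used = [], 0
        for chunk in chunks_by_relevance:   # most relevant (most critical) first
            cost = rough_tokens(chunk)
            if used + cost > budget:
                break
            kept.append(chunk)
            used += cost
        return kept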

FM integration issues

Throttling (429)

Too many requests
  • Exponential backoff + jitter
  • Provisioned Throughput
  • Cross-Region Inference
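
A minimal retry sketch for the throttling case (the retry limits are illustrative, and boto3's built-in adaptive retry mode is an alternative to rolling your own):

    import random
    import time
    import boto3
    from botocore.exceptions import ClientError

    runtime = boto3.client("bedrock-runtime")

    def invoke_with_backoff(call, max_retries: int = 5):
        """Retry a Bedrock call on throttling with exponential backoff + jitter."""
        for attempt in range(max_retries):
            try:
                return call()
            except ClientError as err:
                if err.response["Error"]["Code"] != "ThrottlingException":
                    raise                                     # only retry throttles
                time.sleep((2 ** attempt) + random.random())  # exponential backoff + jitter
        raise RuntimeError("Still throttled after retries; consider Provisioned Throughput")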

Timeout (504)

Model took too long
  • Streaming response
  • Shorter max_tokens
  • Async processing via SQS

Content filter blocks

Guardrails caught input/output
  • Check Guardrails trace
  • Tune thresholds if false positive
  • Review denied topics

Invalid model params

Bad request
  • Request validation pre-send
  • Schema check

Prompt engineering problems

Test frameworks Systematic test suites for prompts.
Version comparison Diff outputs across prompt versions to identify regressions (see the sketch after this list).
Systematic refinement Iterative prompt improvement with tracked metrics.

Common issues: ambiguous instructions · conflicting constraints · format inconsistencies · prompt too long, causing earlier instructions to be forgotten.
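
A small sketch of version comparison: run the same test inputs through two prompt templates and diff the outputs (generate is a stand-in for whatever function calls your FM, and the templates are assumed to take an {input} placeholder):

    import difflib

    def compare_prompt_versions(test_inputs: list[str], prompt_v1: str,
                                prompt_v2: str, generate) -> None:
        """Diff model outputs for two prompt versions over the same test inputs."""
        for item in test_inputs:
            out_v1 = generate(prompt_v1.format(input=item))
            out_v2 = generate(prompt_v2.format(input=item))
            diff = difflib.unified_diff(out_v1.splitlines(), out_v2.splitlines(),
                                        fromfile="prompt_v1", tofile="prompt_v2",
                                        lineterm="")
            print(f"--- input: {item!r}")
            print("\n".join(diff) or "(no difference)")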

Retrieval system issues

Embedding quality

Right semantics captured?
  • Domain-fit evaluation
  • Sample queries → manual review

Drift monitoring

Embedding fit degrades over time
  • Data distribution shifts
  • Periodic re-eval needed

Vectorization issues

Indexing problems
  • Wrong embedding model
  • Dimension mismatches
  • Incorrect chunking

Search performance

Slow or low-relevance
  • Tune HNSW ef_search
  • Add metadata filters
  • Try hybrid search
Exam angle — HNSW ef_search "Users report missing relevant docs that exist in the KB. Embeddings and chunking are good in isolation." → increase ef_search. This tunes query-time accuracy at the cost of latency.
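
If the knowledge base's vector store is OpenSearch with the k-NN plugin, ef_search is an index-level setting. A hedged sketch with opensearch-py (endpoint, index name, auth, and the value 512 are placeholders):

    from opensearchpy import OpenSearch

    # Placeholder endpoint; in practice use the domain/collection endpoint and proper auth.
    client = OpenSearch(
        hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
        use_ssl=True,
    )

    # Raise query-time search breadth: better recall of relevant docs, higher latency.
    client.indices.put_settings(
        index="kb-chunks",
        body={"index": {"knn.algo_param.ef_search": 512}},
    )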

Prompt maintenance issues

Template testing
Test prompts with diverse inputs to catch edge cases.
CloudWatch Logs
Diagnose prompt confusion (model misinterprets instructions).
X-Ray prompt observability
Trace the full pipeline from template to FM response.
Schema validation
Detect format inconsistencies in parameterized prompts (see the sketch after this list).
Systematic refinement
Regular review and improvement cycles; version control prompts.
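
A minimal sketch of validating template parameters before rendering a prompt (the schema and template are illustrative; jsonschema is one common choice for the check):

    from string import Template
    from jsonschema import validate   # raises ValidationError on bad/missing params

    PARAM_SCHEMA = {
        "type": "object",
        "required": ["customer_name", "question"],
        "properties": {
            "customer_name": {"type": "string", "minLength": 1},
            "question": {"type": "string", "minLength": 1},
        },
        "additionalProperties": False,
    }

    PROMPT = Template("You are a support assistant for $customer_name.\nQuestion: $question")

    def render_prompt(params: dict) -> str:
        validate(instance=params, schema=PARAM_SCHEMA)
        return PROMPT.substitute(params)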

Domain 5 summary — what to remember

Eval patterns by situation

  • Compare models Bedrock Model Evaluations + A/B
  • New deploy risk Canary routing
  • RAG quality Triad: context / grounding / answer relevance
  • Agent quality Bedrock Agent Evaluations
  • Scale eval LLM-as-a-Judge
  • Detect regression Golden dataset + automated gates
  • Human label SageMaker Ground Truth · A2I
  • Bias SageMaker Clarify

Troubleshooting quick map

  • Response truncated → context overflow → compress/prune
  • Throttled (429) → backoff / Provisioned / Cross-Region
  • Hallucination despite RAG → Guardrails contextual grounding
  • Missing relevant retrieval → increase ef_search / hybrid search
  • Model gives wrong format → JSON schema / structured outputs
  • Inconsistent responses → lower temperature
  • Agent goes wrong → Agent Tracing to inspect reasoning
  • Silent quality drop → golden dataset regression + output diffing
You've covered all five domains. Now head to the Architecture Patterns page to see how all of this assembles into end-to-end designs, or start running practice questions.