Task 5.1 — Evaluation systems for GenAI
Why traditional ML metrics don't work
There's no "accuracy" score for a free-form summary. You need a new taxonomy of quality dimensions.
Relevance
Does it answer the question?
- Topical match to query
- Addresses stated intent
Factual accuracy
Are the claims true?
- Verify against ground truth
- Catch hallucinations
Consistency
Same Q → similar A?
- Stability across runs
- Temperature = 0 helps
Fluency
Well-written, coherent?
- Grammar, flow, readability
Groundedness
Supported by retrieved docs? (RAG)
- Every claim traces to a source
- Bedrock Guardrails' contextual grounding check covers this
Harmlessness
Free of bias, toxicity, bad content
- Content filter evaluation
- Fairness across groups
Systematic model evaluation with Bedrock
Bedrock Model Eval
Amazon Bedrock Model Evaluations — built-in framework. Automatic metrics (accuracy, robustness, toxicity) + human evaluation jobs + custom metrics.
A/B testing
Route traffic to two models/configurations. Compare quality, cost, latency on real data.
Canary
Route a small percentage of traffic to a new model. Ramp up if metrics hold. Roll back if they degrade.
Multi-model
Side-by-side evaluation of multiple FMs on the same inputs. Token-efficiency, latency-to-quality, cost-per-outcome.
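Cost-per-outcome is the simplest of these side-by-side metrics to compute. A minimal sketch, with hypothetical model names and run numbers:

```python
def cost_per_outcome(total_cost_usd, successful_outcomes):
    """Cost efficiency: total spend divided by the number of successful task outcomes."""
    if successful_outcomes == 0:
        return float("inf")  # no successes: infinitely expensive per outcome
    return total_cost_usd / successful_outcomes

# Hypothetical side-by-side run of two FMs on the same 100 inputs.
runs = {
    "model-a": {"cost": 4.20, "successes": 84},   # cheaper per token, lower quality
    "model-b": {"cost": 9.80, "successes": 98},   # pricier, higher quality
}
for name, r in runs.items():
    print(name, round(cost_per_outcome(r["cost"], r["successes"]), 4))
```

A cheaper model can still lose on cost-per-outcome if its success rate is low enough, which is why the comparison runs on the same inputs.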
The RAG evaluation triad
RAG systems have three independent failure points, so you evaluate each separately:
1. Context relevance: are the retrieved chunks relevant to the query?
2. Groundedness: is the answer supported by the retrieved context?
3. Answer relevance: does the answer address the query?
Diagnostic logic
- Relevant docs retrieved, but answer hallucinated → grounding failure (use Guardrails contextual grounding).
- Wrong docs retrieved → retrieval failure (tune embeddings, chunking, or add hybrid search).
- Right docs and a grounded answer, but off-topic → answer relevance failure (tune the prompt).
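The diagnostic logic can be sketched as a small decision function. The 0–1 score scale and the threshold are illustrative assumptions, not a Bedrock API:

```python
def diagnose_rag(context_relevance, groundedness, answer_relevance, threshold=0.7):
    """Map the three triad scores (0-1, assumed) to the most likely failure point.

    Checks retrieval first: a grounding or relevance score is meaningless
    if the retrieved context was wrong to begin with.
    """
    if context_relevance < threshold:
        return "retrieval failure: tune embeddings/chunking or add hybrid search"
    if groundedness < threshold:
        return "grounding failure: enable Guardrails contextual grounding"
    if answer_relevance < threshold:
        return "answer relevance failure: tune the prompt"
    return "healthy"

print(diagnose_rag(0.9, 0.3, 0.9))  # good retrieval, hallucinated answer
```

Evaluating the three scores in this order localizes the failure to one stage instead of reporting a single blended quality number.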
Retrieval quality testing (for RAG)
Relevance scoring
Are retrieved documents actually relevant to the query?
Context matching
Does the retrieved context contain the answer?
Retrieval latency
How fast is the vector search returning results?
Recall@k
Of all relevant documents, how many appear in the top k results?
Precision@k
Of the top k results, how many are actually relevant?
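The two ranking metrics above are a few lines each. A sketch with hypothetical document IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Of the top-k retrieved docs, what fraction are actually relevant?"""
    top_k = retrieved[:k]
    return sum(1 for d in top_k if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Of all relevant docs, what fraction appear in the top-k results?"""
    top_k = retrieved[:k]
    return sum(1 for d in relevant if d in top_k) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9", "d2"]   # ranked search results (hypothetical IDs)
relevant = {"d1", "d2", "d5"}                # ground-truth relevant set
print(precision_at_k(retrieved, relevant, 5))  # 2 of the top 5 are relevant -> 0.4
print(round(recall_at_k(retrieved, relevant, 5), 2))  # 2 of 3 relevant docs found
```

Note the trade-off: raising k tends to improve recall@k while diluting precision@k, which is why both are tracked.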
LLM-as-a-Judge — the scalable eval pattern
Input → Judge → Score → Aggregate → Alert
- Input: question + response from the primary FM
- Judge: a secondary FM scores the response against a rubric
- Score: 0-10 or pass/fail, per metric
- Aggregate: dashboard tracking trends over time
- Alert: regression detected → block the deploy
Bedrock Model Evaluations supports LLM-as-a-Judge natively. Scalable substitute for human evaluation at orders of magnitude lower cost.
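When rolling your own judge instead of using the built-in support, the pattern is: build a rubric prompt, call the judge model, then parse the score defensively, since judges sometimes ignore the output format. A sketch with a hypothetical rubric template; the actual model call (e.g. via the bedrock-runtime Converse API) is elided:

```python
import re

# Hypothetical rubric template; the output-format instruction makes parsing reliable.
RUBRIC_PROMPT = """You are an evaluation judge. Score the response 0-10 for {metric}.
Question: {question}
Response: {response}
Reply with only: SCORE: <number>"""

def build_judge_prompt(metric, question, response):
    return RUBRIC_PROMPT.format(metric=metric, question=question, response=response)

def parse_judge_score(judge_output, max_score=10):
    """Extract 'SCORE: n' from the judge's text; None if the judge didn't comply."""
    m = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", judge_output)
    if not m:
        return None  # surface non-compliance instead of guessing a score
    return min(float(m.group(1)), max_score)  # clamp out-of-range scores

print(parse_judge_score("SCORE: 8"))         # 8.0
print(parse_judge_score("I think it's ok"))  # None
```

Returning None for non-compliant judge output lets the aggregation layer count parse failures separately rather than silently skewing the metric.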
Agent performance evaluation
Task completion rate
Did it finish the job?
- Binary or graded
- End-to-end outcome metric
Tool usage efficiency
Wasted calls?
- Minimal unnecessary tool calls
- Short paths through problem space
Bedrock Agent evals
Built-in
- Framework for agent workflows
- Tracks reasoning steps
Reasoning quality
Multi-step logic correctness
- Judge the intermediate steps
- Not just final answer
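Tool usage efficiency can be reduced to a single ratio. A sketch under illustrative assumptions: the trace is a flat list of tool names, and the minimal call count comes from an expert reference path, not a Bedrock schema:

```python
def tool_efficiency(trace, necessary_calls):
    """Ratio of necessary tool calls to total calls in an agent run (1.0 = no waste)."""
    total = len(trace)
    if total == 0:
        return 1.0  # no calls made, nothing wasted
    return min(necessary_calls / total, 1.0)

# Hypothetical agent run: the expert path needs 2 calls, the agent made 4.
trace = ["search_kb", "search_kb", "get_order", "search_kb"]
print(tool_efficiency(trace, necessary_calls=2))  # 0.5
```

A low ratio on a completed task points at reasoning-quality problems (redundant retries, detours) even when the end-to-end outcome metric looks fine.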
Quality gates & deployment validation
Continuous eval
Automated daily/weekly runs against the golden dataset.
Regression testing
Every prompt/model change runs the golden set; block if scores drop.
Automated quality gates
CI/CD pipeline blocks deploy if eval scores below threshold.
Synthetic users
Simulated user workflows test end-to-end before real users see changes.
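The automated gate above is essentially a comparison of current eval scores against floors and a golden-dataset baseline. A minimal sketch, with the thresholds and metric names as illustrative assumptions:

```python
def quality_gate(current, baseline, min_abs=None, max_drop=0.02):
    """Return (passed, failures): block deploy if any metric falls below an
    absolute floor or regresses more than max_drop vs. the golden baseline."""
    min_abs = min_abs or {}
    failures = []
    for metric, score in current.items():
        floor = min_abs.get(metric, 0.0)
        if score < floor:
            failures.append(f"{metric} {score:.2f} below floor {floor:.2f}")
        base = baseline.get(metric)
        if base is not None and base - score > max_drop:
            failures.append(f"{metric} regressed {base:.2f} -> {score:.2f}")
    return (len(failures) == 0, failures)

ok, why = quality_gate(
    current={"groundedness": 0.88, "answer_relevance": 0.91},
    baseline={"groundedness": 0.93, "answer_relevance": 0.90},
)
print(ok, why)  # groundedness dropped 0.05 > 0.02, so the gate blocks
```

In CI/CD, a False result fails the pipeline step, so the change never reaches users without an explicit override.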
Task 5.2 — Troubleshoot GenAI applications
Content handling issues
Context window overflow
Input + system prompt + retrieved context exceeds the model's limit. Symptoms: truncated responses, missing info, errors. Fix: dynamic chunking, prompt compression, context pruning.
Truncation errors
Critical info cut off because it was at the end of too-long input. Fix: put critical info early; summarize long docs; use hierarchical retrieval.
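Both fixes come down to budgeting tokens before the call. A sketch of context pruning under two stated assumptions: a crude ~4-characters-per-token estimate stands in for a real tokenizer, and chunks arrive pre-sorted by relevance:

```python
def estimate_tokens(text):
    """Rough heuristic: ~4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)

def fit_context(system_prompt, question, chunks, window=8000, reserve_output=1000):
    """Keep the highest-ranked chunks that fit the remaining token budget.

    Critical parts (system prompt, question) are budgeted first and never pruned,
    which also keeps them early in the assembled prompt.
    """
    budget = (window - reserve_output
              - estimate_tokens(system_prompt) - estimate_tokens(question))
    kept = []
    for chunk in chunks:  # assumed pre-sorted by relevance, best first
        cost = estimate_tokens(chunk)
        if cost <= budget:
            kept.append(chunk)
            budget -= cost
    return kept
```

Reserving output tokens up front prevents the truncated-response symptom: the model always has room to finish its answer.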
FM integration issues
Throttling (429)
Too many requests
- Exponential backoff + jitter
- Provisioned Throughput
- Cross-Region Inference
Timeout (504)
Model took too long
- Streaming response
- Shorter max_tokens
- Async processing via SQS
Content filter blocks
Guardrails caught input/output
- Check Guardrails trace
- Tune thresholds if false positive
- Review denied topics
Invalid model params
Bad request
- Request validation pre-send
- Schema check
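The throttling mitigation above (exponential backoff with jitter) is a short wrapper. A sketch in which `ThrottledError` is a stand-in for the SDK's throttling exception — boto3 surfaces Bedrock 429s as a botocore `ClientError` with error code `ThrottlingException`:

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for the SDK's throttling exception (illustrative)."""

def call_with_backoff(invoke, max_retries=5, base=0.5, cap=8.0):
    """Retry a throttled zero-arg callable with exponential backoff plus full jitter."""
    for attempt in range(max_retries):
        try:
            return invoke()
        except ThrottledError:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller handle it
            # full jitter: sleep a random amount up to the capped exponential delay
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter (a uniform random delay up to the exponential cap) spreads retries out so that many throttled clients don't all retry at the same instant.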
Prompt engineering problems
Test frameworks
Prompt testing frameworks — systematic test suites for prompts.
Version compare
Version comparison — diff outputs across prompt versions to identify regressions.
Refinement
Systematic refinement — iterative improvement with tracked metrics.
Common issues: ambiguous instructions · conflicting constraints · format inconsistencies · prompt too long, causing earlier instructions to be forgotten.
Retrieval system issues
Embedding quality
Right semantics captured?
- Domain-fit evaluation
- Sample queries → manual review
Drift monitoring
Embedding model degrades
- Data distribution shifts
- Periodic re-eval needed
Vectorization issues
Indexing problems
- Wrong embedding model
- Dimension mismatches
- Incorrect chunking
Search performance
Slow or low-relevance
- Tune HNSW ef_search
- Add metadata filters
- Try hybrid search
Exam angle — HNSW ef_search
"Users report missing relevant docs that exist in the KB. Embeddings and chunking are good in isolation." → increase ef_search. This tunes query-time accuracy at the cost of latency.
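In OpenSearch, ef_search is an index-level setting of the k-NN plugin. A sketch of the settings body you would apply (the index name and the chosen value are hypothetical; apply it via the indices settings API of your OpenSearch client):

```python
# Hypothetical index "kb-index"; the setting path follows the OpenSearch k-NN plugin.
settings_update = {
    "index": {
        "knn.algo_param.ef_search": 512  # larger candidate list at query time:
                                         # better recall, higher search latency
    }
}
```

Because this is a query-time parameter, raising it does not require re-indexing, which makes it the cheapest first lever when relevant documents are missing from results.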
Prompt maintenance issues
Template testing
Test prompts with diverse inputs to catch edge cases.
CloudWatch Logs
Diagnose prompt confusion (model misinterprets instructions).
X-Ray prompt observability
Trace the full pipeline from template to FM response.
Schema validation
Detect format inconsistencies in parameterized prompts.
Systematic refinement
Regular review and improvement cycles; version control prompts.
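Schema validation for parameterized prompts can be as simple as diffing template placeholders against supplied parameters. A sketch using the stdlib's format-string parser; the template and parameter names are illustrative:

```python
import string

def validate_template(template, params):
    """Return (missing, unused): placeholders with no value, and params no
    placeholder consumes. Catches silent format inconsistencies before the FM call."""
    fields = {f for _, f, _, _ in string.Formatter().parse(template) if f}
    missing = fields - set(params)
    unused = set(params) - fields
    return missing, unused

template = "Summarize {doc} for a {audience} audience."
missing, unused = validate_template(
    template, {"doc": "...", "audience": "exec", "tone": "brief"}
)
print(missing, unused)  # set() {'tone'}
```

Running this check in CI alongside template version control catches the case where a prompt edit renames a placeholder but the calling code still passes the old parameter.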
Domain 5 summary — what to remember
Eval patterns by situation
- Compare models → Bedrock Model Evaluations + A/B
- New deploy risk → canary routing
- RAG quality → triad: context / grounding / answer relevance
- Agent quality → Bedrock Agent Evaluations
- Scale eval → LLM-as-a-Judge
- Detect regression → golden dataset + automated gates
- Human label → SageMaker Ground Truth · A2I
- Bias → SageMaker Clarify
Troubleshooting quick map
- Response truncated → context overflow → compress/prune
- Throttled (429) → backoff / Provisioned / Cross-Region
- Hallucination despite RAG → Guardrails contextual grounding
- Missing relevant retrieval → increase ef_search / hybrid search
- Model gives wrong format → JSON schema / structured outputs
- Inconsistent responses → lower temperature
- Agent goes wrong → Agent Tracing to inspect reasoning
- Silent quality drop → golden dataset regression + output diffing
You've covered all five domains
Now head to the Architecture Patterns page to see how all of this assembles into end-to-end designs. Or start running practice questions.