Foundation Model Concepts

Foundation Model (FM)

A large AI model pre-trained on massive datasets that can be adapted to a wide range of tasks. Examples: Claude, GPT, Llama, Amazon Titan. The exam focuses on using FMs, not building them.

Large Language Model (LLM)

An FM specialized for text understanding and generation. Every LLM is an FM, but not every FM is an LLM.

Token

The basic unit of text that FMs process; roughly 4 English characters ≈ 1 token. Pricing is metered per input token and per output token.
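
A back-of-the-envelope sketch of this arithmetic, assuming the 4-characters-per-token heuristic and made-up per-1K-token prices (real prices vary by model):

```python
# Rough token and cost estimation. Prices below are illustrative, not quotes.
def estimate_tokens(text: str) -> int:
    """Heuristic: ~4 English characters per token."""
    return max(1, len(text) // 4)

PRICE_PER_1K_INPUT = 0.003   # hypothetical $/1K input tokens
PRICE_PER_1K_OUTPUT = 0.015  # hypothetical $/1K output tokens

prompt = "Summarize the attached incident report in three bullet points."
in_tokens = estimate_tokens(prompt)
out_tokens = 150  # assumed response length
cost = (in_tokens / 1000) * PRICE_PER_1K_INPUT + (out_tokens / 1000) * PRICE_PER_1K_OUTPUT
print(f"~{in_tokens} input tokens, ~${cost:.5f} per request")
```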

Context Window

Max tokens a model can process in one request (input + output combined). Exceeding it causes truncation or errors.

Inference

Sending input to an FM and getting a response. Each inference consumes billable input and output tokens.

Fine-Tuning

Further training a pre-trained FM on domain-specific data. More expensive than prompt engineering. Use when prompting alone can't achieve goals.

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning. Trains a small set of additional parameters instead of the full model. Cheaper and faster than full fine-tuning.

Adapters

Small trainable modules inserted into a pre-trained model. Similar concept to LoRA — parameter-efficient fine-tuning.

Prompt Engineering

Crafting inputs to FMs to get desired outputs without modifying the model. Cheaper and faster than fine-tuning.

System Prompt

Instructions given to the FM that define its role, behavior, and constraints. The system prompt persists across the conversation and is typically invisible to the end user.

Temperature

Controls randomness in token selection: 0 = deterministic, 1.0 = maximum randomness. Use low values for factual tasks and high values for creative ones.

Top-k Sampling

Limits model to choose from the top k most probable tokens. Lower k = more focused output.

Top-p (Nucleus Sampling)

Samples from the smallest set of tokens whose cumulative probability reaches p. Alternative to top-k.
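
A minimal sketch of setting these sampling parameters on a Bedrock Converse call, assuming boto3 credentials are configured. The model ID and parameter values are illustrative; top-k is model-specific, so for Anthropic models it travels in additionalModelRequestFields:

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Name three RAG chunking strategies."}]}],
    inferenceConfig={
        "temperature": 0.2,  # low: favor factual, repeatable output
        "topP": 0.9,         # nucleus sampling cutoff
        "maxTokens": 256,
    },
    # top-k is not part of the common inferenceConfig; pass it model-specifically
    additionalModelRequestFields={"top_k": 50},
)
print(response["output"]["message"]["content"][0]["text"])
```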

Hallucination

When an FM generates plausible but factually incorrect content. RAG and grounding techniques reduce this.

Grounding

Anchoring FM responses in verifiable source data (via RAG, fact-checking). The standard countermeasure to hallucination.

Multimodal

Models that can process multiple data types (text + images, text + audio).

RAG & Vector Concepts

RAG (Retrieval Augmented Generation)

Architecture: user question → retrieve relevant docs from a knowledge base → retrieved docs + question → FM → grounded answer.
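
A minimal sketch of that flow; embed, vector_db, and llm are hypothetical stand-ins for an embedding model, a vector store, and an FM call:

```python
def answer_with_rag(question: str, vector_db, llm, embed, k: int = 4) -> str:
    query_vec = embed(question)                  # 1. embed the question
    docs = vector_db.search(query_vec, top_k=k)  # 2. retrieve relevant chunks
    context = "\n\n".join(d.text for d in docs)  # 3. assemble grounding context
    prompt = (
        "Answer using ONLY the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                           # 4. grounded generation
```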

Embedding

Numerical vector representation of data. Similar content → similar embeddings. Used for semantic search.
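
A toy illustration of "similar content → similar embeddings" using cosine similarity over made-up 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 = same direction (similar meaning); near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v_cat = [0.9, 0.1, 0.0, 0.3]
v_kitten = [0.8, 0.2, 0.1, 0.3]
v_invoice = [0.0, 0.9, 0.7, 0.1]
print(cosine_similarity(v_cat, v_kitten))   # high: related concepts
print(cosine_similarity(v_cat, v_invoice))  # low: unrelated
```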

Vector Database

Database optimized for storing/querying high-dimensional vectors. Enables semantic similarity search.

Semantic Search

Finding results based on meaning rather than exact keyword matching.

Hybrid Search

Combining traditional keyword search (BM25) with semantic vector search. Often outperforms either alone.
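
One common way to fuse the two ranked result lists is reciprocal rank fusion (RRF); a minimal sketch over toy document IDs (k=60 is the conventional constant):

```python
def rrf(keyword_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of doc IDs by reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# d1 and d3 appear in both lists, so they rise to the top
print(rrf(["d3", "d1", "d7"], ["d1", "d9", "d3"]))
```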

Chunking

Breaking large documents into smaller pieces for embedding and retrieval. Strategies: fixed-size, hierarchical, semantic.
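
A minimal fixed-size chunking sketch with overlap; sizes are in characters for simplicity, though production chunkers usually count tokens:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Slice text into fixed-size chunks; overlap preserves context across boundaries."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

doc = "A" * 2500
print([len(c) for c in chunk_text(doc)])  # [1000, 1000, 900, 100]
```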

Reranking

Second-stage retrieval. After initial search returns N results, a reranker model reorders them by relevance. Improves precision.

k-NN (k-Nearest Neighbors)

Algorithm that finds the k most similar vectors to a query vector. Core of vector search.
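
A brute-force (exact) k-NN sketch with NumPy over toy embeddings; vector databases do the same job at scale with approximate indexes like HNSW:

```python
import numpy as np

def knn(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k vectors most cosine-similar to the query."""
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]  # sort descending by similarity, keep top k

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 8))  # 100 toy 8-dimensional embeddings
print(knn(rng.normal(size=8), corpus))
```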

HNSW (Hierarchical Navigable Small World)

Efficient approximate nearest neighbor algorithm used by OpenSearch. Tuning parameters: ef_search (query-time accuracy/speed trade-off) and ef_construction (index build quality).
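
A sketch of creating an OpenSearch k-NN index with these HNSW parameters via opensearch-py. Exact parameter placement varies by engine and plugin version, so treat the field names and values here as assumptions to verify against the k-NN plugin docs:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # placeholder endpoint

client.indices.create(
    index="docs",
    body={
        "settings": {
            "index.knn": True,
            # query-time accuracy/speed trade-off (placement varies by engine)
            "index.knn.algo_param.ef_search": 256,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 1024,  # must match your embedding model
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        # ef_construction: index build quality; m: graph connectivity
                        "parameters": {"ef_construction": 512, "m": 16},
                    },
                }
            }
        },
    },
)
```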

Dimensionality

Number of dimensions in an embedding vector. Higher = more information but slower search and more storage.

Agentic AI Concepts

Agent

An AI system that can autonomously plan, reason, use tools, and take actions. The core loop: observe → think → act → observe.

ReAct (Reasoning + Acting)

Agent pattern where model alternates between reasoning and taking actions (tool calls).

Tool Use / Function Calling

The FM generates a structured request to call an external function; the tool's result is fed back to the FM so it can continue reasoning.
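
A minimal sketch of tool use with the Bedrock Converse API; the get_weather tool is hypothetical and the model ID is a placeholder:

```python
import boto3

client = boto3.client("bedrock-runtime")
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_weather",  # hypothetical tool
            "description": "Get current weather for a city.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            }},
        }
    }]
}

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Weather in Tokyo?"}]}],
    toolConfig=tool_config,
)

if response["stopReason"] == "tool_use":
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            print("Model requested:", block["toolUse"]["name"], block["toolUse"]["input"])
            # Next step: run the real function, then return a toolResult message to the FM.
```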

MCP (Model Context Protocol)

Open standard for connecting AI agents to tools and data sources. MCP servers expose tools; MCP clients (agents) consume them.

Multi-Agent Systems

Multiple specialized agents collaborating: for example, one researches, another writes, and a third reviews.

Action Group

In Bedrock Agents, a set of related actions/tools defined with OpenAPI schemas and Lambda backing.

Human-in-the-Loop

Design pattern where an agent pauses for human review/approval before high-stakes actions. Implemented via the Step Functions callback (task token) pattern.

Circuit Breaker

Pattern that detects repeated failures and temporarily stops calling the failing service. Prevents cascading failures.
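
A minimal sketch of the pattern: after a threshold of consecutive failures, calls are skipped until a cooldown elapses, then one trial call is allowed through:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            if time.time() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call")
            self.failures = 0  # cooldown elapsed: half-open, allow a trial call
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.time()
            raise
```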

Security & Governance

Prompt Injection

Attack where malicious input tricks the FM into ignoring its system prompt and following attacker instructions.

Jailbreak

Technique to bypass an FM's safety guardrails. Model tricked into generating content it was designed to refuse.

Guardrails

Configurable safety controls that filter FM inputs and outputs. Bedrock Guardrails support six policy types: denied topics, content filters, word filters, sensitive information (PII) filters, contextual grounding checks, and prompt attack detection.

Model Card

Documentation describing a model's purpose, capabilities, limitations, training data, evaluation results, intended use. SageMaker supports programmatic model cards.

Data Lineage

Complete history of data: where it came from, how it was transformed, and where it went. Important for compliance and debugging.

Source Attribution

Linking FM-generated content back to source documents. Critical for trust and compliance.

LLM-as-a-Judge

Using one FM to evaluate another FM's outputs. Scalable alternative to human evaluation. Bedrock supports this natively.

Operational Concepts

Provisioned Throughput

Pre-purchased capacity in Bedrock measured in model units. Guarantees consistent performance. Required for custom models.

On-Demand Inference

Pay-per-token, no reserved capacity. Cost-effective for variable/low-volume workloads but subject to throttling.

Cross-Region Inference

Bedrock feature that automatically routes requests to models in other regions when the primary region is at capacity. Improves availability.

Model Cascading

Try a cheaper/faster model first; escalate to a more expensive model only when needed. Reduces average cost.
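
A minimal sketch, assuming a hypothetical invoke(model_id, question) helper and a deliberately crude quality gate (real systems use better confidence signals, such as a judge model):

```python
def cascaded_answer(question: str, invoke) -> str:
    """Try the cheap model first; escalate only when its answer looks weak."""
    draft = invoke("cheap-fast-model", question)  # hypothetical model IDs
    if "i don't know" not in draft.lower() and len(draft) > 20:
        return draft                              # cheap answer is good enough
    return invoke("expensive-capable-model", question)  # escalate
```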

Prompt Caching

Caching processed system prompts and long context prefixes across requests. The biggest wins come when many requests share the same prefix (e.g., chatbots).

Semantic Caching

Caching responses for semantically similar queries. Same question in different words → cached answer.
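
A minimal sketch of a semantic cache, assuming a hypothetical embed() function and a cosine-similarity threshold for deciding two questions mean the same thing:

```python
import math

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed, self.threshold, self.entries = embed, threshold, []

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def get(self, query: str):
        qv = self.embed(query)
        for vec, answer in self.entries:
            if self._cos(qv, vec) >= self.threshold:
                return answer  # semantically similar question seen before
        return None  # cache miss: caller invokes the FM, then calls put()

    def put(self, query: str, answer: str):
        self.entries.append((self.embed(query), answer))
```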

Batch Inference

Submitting many requests as a single asynchronous job instead of one call per request. ~50% discount in Bedrock for non-real-time workloads.

Streaming

Delivering FM response tokens incrementally. Improves perceived latency.
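
A minimal sketch using the Bedrock converse_stream API to print tokens as they arrive; the model ID is a placeholder:

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Explain RAG in two sentences."}]}],
)
# Print each text delta as it arrives instead of waiting for the full response.
for event in response["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)
```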

Token Drift

Gradual change in model behavior over time. Detected via golden dataset comparison.

Golden Dataset

Curated set of questions with known correct answers used to benchmark FM performance over time. Detects regression and drift.
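
A minimal sketch of a golden-dataset regression check, assuming a hypothetical ask() function and crude substring grading (real evaluations often use LLM-as-a-judge instead):

```python
# Example items; a real golden dataset is larger and domain-specific.
GOLDEN = [
    {"q": "What does RAG stand for?", "a": "retrieval augmented generation"},
    {"q": "What is the approximate Bedrock batch inference discount?", "a": "50%"},
]

def regression_check(ask, baseline: float = 0.9) -> bool:
    """Re-run the golden set; a drop below baseline signals drift or regression."""
    correct = sum(1 for item in GOLDEN if item["a"].lower() in ask(item["q"]).lower())
    accuracy = correct / len(GOLDEN)
    print(f"golden-set accuracy: {accuracy:.0%}")
    return accuracy >= baseline
```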