Architecture diagram

— Streaming chat · tokens flow token-by-token over a persistent connection —
[Architecture diagram: the 👤 user's chat interface holds a persistent, bidirectional WebSocket through API Gateway; connection IDs are stored in DynamoDB; a Lambda handler (or ECS for long-running work) loads / saves chat history in a second DynamoDB table and calls the Bedrock FM via InvokeModelWithResponseStream, relaying tokens as they are generated. Callout: users see the first token in ~300 ms instead of waiting 5-10 s for the full response. Perceived latency ≠ total latency.]

How data flows

The client opens a WebSocket connection to API Gateway — persistent, bidirectional. The connection ID is stored in DynamoDB so the backend can find the user later. When the user sends a message, it goes through the WebSocket to a Lambda handler, which builds the prompt (including conversation history from DynamoDB) and calls InvokeModelWithResponseStream.
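
A minimal sketch of the connect / disconnect handlers, assuming Python with boto3. The table name, environment variable, and attribute names (CONNECTIONS_TABLE, connectionId, expiresAt) are illustrative choices, not prescribed by the pattern:

    import os
    import time

    import boto3

    dynamodb = boto3.resource("dynamodb")
    # Illustrative: table name injected via an environment variable
    connections = dynamodb.Table(os.environ["CONNECTIONS_TABLE"])

    def on_connect(event, context):
        """$connect route: remember which socket belongs to whom."""
        conn_id = event["requestContext"]["connectionId"]
        connections.put_item(Item={
            "connectionId": conn_id,
            # Assumes TTL is enabled on this attribute, so stale entries
            # expire even if $disconnect never fires
            "expiresAt": int(time.time()) + 2 * 60 * 60,
        })
        return {"statusCode": 200}

    def on_disconnect(event, context):
        """$disconnect route: forget the socket."""
        conn_id = event["requestContext"]["connectionId"]
        connections.delete_item(Key={"connectionId": conn_id})
        return {"statusCode": 200}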

Bedrock starts emitting tokens as soon as the model generates them. The handler forwards each token back over the same WebSocket to the user. First token appears in ~300ms; the full answer trickles in over the next few seconds. The user perceives the system as fast because they're reading tokens as they arrive.
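
The message handler itself might look like the sketch below. The two boto3 calls, invoke_model_with_response_stream and post_to_connection, are real APIs; the model ID, the Anthropic Messages payload shape, and the client's {"message": ...} frame format are assumptions for illustration:

    import json

    import boto3

    bedrock = boto3.client("bedrock-runtime")

    def on_message(event, context):
        ctx = event["requestContext"]
        conn_id = ctx["connectionId"]
        # Management API client for pushing frames back down this socket
        gw = boto3.client(
            "apigatewaymanagementapi",
            endpoint_url=f"https://{ctx['domainName']}/{ctx['stage']}",
        )

        prompt = json.loads(event["body"])["message"]  # assumed client frame
        # (Loading prior turns from the history table is omitted here;
        # they would be prepended to "messages" below.)

        response = bedrock.invoke_model_with_response_stream(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            }),
        )

        # Forward each text delta the moment Bedrock emits it
        for item in response["body"]:
            chunk = json.loads(item["chunk"]["bytes"])
            if chunk.get("type") == "content_block_delta":
                gw.post_to_connection(
                    ConnectionId=conn_id,
                    Data=chunk["delta"].get("text", "").encode("utf-8"),
                )
        return {"statusCode": 200}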

AWS services used

  • API Gateway WebSocket API: The persistent bidirectional channel. Connect / message / disconnect events are routed to Lambda handlers.
  • AWS Lambda (or ECS): Handles incoming messages, invokes the Bedrock streaming API, and forwards tokens back. Use ECS when connections are long-running or you need more concurrency than Lambda allows.
  • Bedrock InvokeModelWithResponseStream: The streaming API. Returns tokens incrementally as they're generated, not all at once.
  • DynamoDB (connection IDs): Maps connection IDs to users / sessions. API Gateway doesn't track this mapping for you, so storing it is what makes multi-user routing possible.
  • DynamoDB (history): Stores conversation turns so the handler can rebuild context across messages.
  • CloudFront (optional): For global users; edge locations cut TCP / TLS handshake latency on the way to the WebSocket endpoint.
  • Cognito: Auth. The WebSocket $connect handler validates the token before accepting the connection (see the sketch after this list).
  • Bedrock Guardrails: Guardrails work with streaming; they can be applied to the input and to the streamed output.
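
Tying Cognito and the connection table together, here is a hedged sketch of a token-gated $connect handler. It reuses the connections table object from the earlier sketch; validate_cognito_jwt is a hypothetical helper (say, JWKS-based verification with a JWT library), and carrying the token in a query parameter is one common workaround for browsers not allowing custom headers on the WebSocket upgrade:

    def on_connect(event, context):
        token = (event.get("queryStringParameters") or {}).get("token")
        claims = validate_cognito_jwt(token)  # hypothetical helper
        if claims is None:
            # Any non-2xx response makes API Gateway refuse the connection
            return {"statusCode": 401}
        connections.put_item(Item={
            "connectionId": event["requestContext"]["connectionId"],
            "userId": claims["sub"],  # map the socket to the Cognito user
        })
        return {"statusCode": 200}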

When to use this pattern

Use Streaming Chat when…

  • Interactive chat or conversation UI: Anywhere the user expects a chat-like experience. Streaming is the baseline expectation for modern AI chat.
  • Responses are long (100+ tokens): Long answers feel slow without streaming. With streaming, the first word shows up fast and reading covers the rest of the latency.
  • Perceived latency matters more than total latency: User patience runs out at ~2 seconds of blank screen. Streaming buys you 30+ seconds of generation time because the user is watching it happen.
  • You need bidirectional control (stop, regenerate): WebSocket supports user interrupts mid-generation, like a "stop" button or "try again" mid-stream (see the interrupt sketch after this list).
  • Multi-turn conversations: A persistent connection pairs naturally with ongoing dialog. State stays warm.
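
One possible wiring for the mid-stream "stop" interrupt, under the same assumptions as the earlier sketches: a stop route sets a flag on the connection record, and the streaming loop polls that flag every so often, since checking on every single token would be needlessly chatty:

    def on_stop(event, context):
        """Route for a client frame like {"action": "stop"} (assumed shape)."""
        connections.update_item(
            Key={"connectionId": event["requestContext"]["connectionId"]},
            UpdateExpression="SET stopRequested = :t",
            ExpressionAttributeValues={":t": True},
        )
        return {"statusCode": 200}

    # Inside the streaming loop from the message-handler sketch:
    #     if token_count % 20 == 0:  # poll periodically, not per token
    #         row = connections.get_item(Key={"connectionId": conn_id})["Item"]
    #         if row.get("stopRequested"):
    #             break  # abandon the rest of the Bedrock stream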

Do NOT use Streaming when…

  • Response must be complete before use: Parsing JSON, running validation, feeding the output into another API. Stream the tokens and you'll get a partial result you can't work with yet.
  • Very short responses (classification, yes/no): If the response is 5 tokens total, streaming adds infrastructure complexity without user-perceived benefit. Use synchronous InvokeModel.
  • Background / batch workloads: Nobody's watching, so streaming buys nothing. Use Pattern 7: Event-Driven with batch inference.
  • Stateless REST API design required: WebSocket is stateful by nature. If you must have pure REST, use chunked transfer encoding instead (less common).
  • Team doesn't own connection management: WebSocket connections need careful handling for timeouts, reconnects, and load balancing. Not trivial operationally.

Exam angle

Pattern-match shortcuts: When a stem mentions "chat interface," "real-time response," "as the model generates," or "user sees the response immediately," streaming is the answer. The correct option combines WebSocket API + InvokeModelWithResponseStream.
The InvokeModel vs InvokeModelWithResponseStream trap: InvokeModel is synchronous; the FM generates the complete response before returning. InvokeModelWithResponseStream streams tokens. If the stem says "streaming" or "tokens as generated," it must be the streaming API. Don't fall for options that pair streaming with InvokeModel. Compare the two calls in the sketch below.
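
Side by side, using the same illustrative payload as earlier (MODEL_ID is a placeholder), only the second call yields anything before generation finishes:

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Explain Amazon VPC peering"}],
    })

    # Synchronous: returns only after the full response exists
    resp = bedrock.invoke_model(modelId=MODEL_ID, body=body)
    text = json.loads(resp["body"].read())["content"][0]["text"]

    # Streaming: an event stream you iterate while the model is still generating
    stream = bedrock.invoke_model_with_response_stream(modelId=MODEL_ID, body=body)
    for item in stream["body"]:
        chunk = json.loads(item["chunk"]["bytes"])  # partial deltas, as above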
WebSocket vs Server-Sent Events (SSE): Both can stream. WebSocket is bidirectional; the user can interrupt, regenerate, and send follow-ups on the same connection. SSE is one-way (server → client): simpler, but it can't carry user interruptions on the same stream. For chat UX, WebSocket wins. For notifications-only streaming, SSE is fine.

Keywords that point here

chat interface · streaming · real-time response · WebSocket · InvokeModelWithResponseStream · token-by-token · perceived latency · conversational

Related patterns

For non-interactive async workloads, use Pattern 7: Event-Driven.
Streaming pairs with Pattern 1: Basic RAG — stream the RAG-generated answer.
Safety during streaming: Pattern 10: Defense-in-Depth.