Streaming AI Responses: SSE, Real-Time UI, and Production Patterns

Why Streaming Matters for AI UX

AI models generate text token by token. Without streaming, your server waits for the entire response, then sends it as one payload. With streaming, the first tokens typically reach the user within about a second, and text appears as it's generated.

The underlying mechanism is Server-Sent Events (SSE) — the server holds the HTTP connection open and pushes chunks as they arrive. The Anthropic API, OpenAI API, and most other AI providers use SSE for their streaming endpoints.
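
To make the wire format concrete, here is a minimal sketch of how a single SSE frame is put together. The helper name formatSseEvent is an assumption made for illustration; it is not part of any SDK. Each payload line becomes a "data:" field, and a blank line ends the event.

```typescript
// Hypothetical helper (not from any library): builds one SSE frame.
// Each line of the payload gets its own "data:" field, and the frame
// is terminated by a blank line, which is what "\n\n" produces.
function formatSseEvent(data: string, eventName?: string): string {
  const lines: string[] = [];
  if (eventName !== undefined) lines.push(`event: ${eventName}`);
  for (const line of data.split('\n')) lines.push(`data: ${line}`);
  return lines.join('\n') + '\n\n';
}

// A JSON payload carrying one text delta, as a server might push it.
const frame = formatSseEvent(JSON.stringify({ text: 'Hello' }));
```

Serializing the payload as JSON keeps newlines inside the text from being mistaken for frame boundaries.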

Time to First Token

The relevant latency metric for streamed AI is time to first token (TTFT): how long before the user sees any output. TTFT is roughly constant regardless of response length, while total response time grows roughly linearly with the number of output tokens. Streaming makes long responses feel fast because the user only waits for the TTFT before text starts to appear.
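
A rough way to see the difference is a simplified model in which total time is TTFT plus output tokens divided by generation speed. The function below is only a sketch under that assumption; the name estimateLatency and the numbers in the usage note are hypothetical, not measurements.

```typescript
interface LatencyEstimate {
  perceivedMs: number; // how long the user waits before seeing anything
  totalMs: number;     // how long until the full response has arrived
}

// Simplified model (an assumption): total = TTFT + tokens / rate.
function estimateLatency(
  ttftMs: number,
  outputTokens: number,
  tokensPerSecond: number,
  streaming: boolean,
): LatencyEstimate {
  const totalMs = ttftMs + (outputTokens / tokensPerSecond) * 1000;
  // With streaming the user sees output at TTFT; without it, only at the end.
  return { perceivedMs: streaming ? ttftMs : totalMs, totalMs };
}
```

With illustrative inputs of a 500 ms TTFT, 1000 output tokens, and 50 tokens per second, both modes finish after 20500 ms, but the streamed response is visible after 500 ms.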

Streaming with the Anthropic SDK

The Anthropic SDK provides a messages.stream() helper that returns a stream object you can iterate with for await. Text arrives in content_block_delta events whose delta has type text_delta; append each delta's text to the output.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function streamResponse(userMessage: string): Promise<void> {
  const stream = anthropic.messages.stream({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{ role: 'user', content: userMessage }],
  });

  for await (const event of stream) {
    if (
      event.type === 'content_block_delta' &&
      event.delta.type === 'text_delta'
    ) {
      process.stdout.write(event.delta.text);
    }
  }

  const finalMessage = await stream.finalMessage();
  console.log('\nUsage:', finalMessage.usage);
}

The finalMessage() call at the end resolves to the complete assembled message with usage statistics: input tokens, output tokens, and cache metrics if you're using prompt caching.
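
As a sketch of what usage logging might look like, the helper below turns a usage object into a single log line. The UsageLike type is a simplified local stand-in for the SDK's usage type, and summarizeUsage is a hypothetical name; the cache fields are treated as optional because they only matter when prompt caching is in use.

```typescript
// Simplified stand-in for the usage object on the final message (assumption).
interface UsageLike {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Hypothetical helper: formats token counts for a log line.
function summarizeUsage(usage: UsageLike): string {
  // Missing or null cache counters are reported as zero.
  const cacheWrite = usage.cache_creation_input_tokens ?? 0;
  const cacheRead = usage.cache_read_input_tokens ?? 0;
  return `in=${usage.input_tokens} out=${usage.output_tokens} cache_write=${cacheWrite} cache_read=${cacheRead}`;
}
```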

Using the Text Stream Helper

For simple text-only use cases, the SDK's stream object also emits a text event that delivers only the text deltas:

const stream = anthropic.messages.stream({ ... });

stream.on('text', (textDelta) => {
  process.stdout.write(textDelta);   // just the text, no event unwrapping
});

await stream.finalMessage();   // resolves once the stream has finished

Passing Streaming Through a Node.js Backend

In production you typically don't expose your Anthropic API key to the browser. Instead, your frontend calls your own backend, which calls Anthropic and proxies the stream. The browser receives SSE from your server.

Express SSE Endpoint

import express from 'express';
import Anthropic from '@anthropic-ai/sdk';

const app = express();
app.use(express.json());

const anthropic = new Anthropic();

app.post('/api/chat/stream', async (req, res) => {
  const { message } = req.body;

  // Set SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders();

  const stream = anthropic.messages.stream({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    messages: [{ role: 'user', content: message }],
  });

  for await (const event of stream) {
    if (
      event.type === 'content_block_delta' &&
      event.delta.type === 'text_delta'
    ) {
      // SSE format: "data: ...\n\n"
      res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

Auth and Rate Limiting at the Proxy

The backend endpoint is where you check the user's session, apply per-user rate limits, validate input, and log usage. Never skip this layer — the SSE transport doesn't change the security model.
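
As one illustration of the per-user limit, here is a minimal fixed-window limiter kept in process memory. The class name and parameters are hypothetical, and the approach is an assumption chosen for brevity: it only works for a single server process, so a multi-instance deployment would need a shared store instead.

```typescript
interface WindowState {
  windowStart: number; // timestamp (ms) when the current window opened
  count: number;       // requests seen in the current window
}

// Hypothetical in-memory limiter: at most `limit` requests per `windowMs` per user.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed and records it.
  allow(userId: string, now: number = Date.now()): boolean {
    const state = this.windows.get(userId);
    if (state === undefined || now - state.windowStart >= this.windowMs) {
      // First request, or the previous window has expired: open a new one.
      this.windows.set(userId, { windowStart: now, count: 1 });
      return true;
    }
    if (state.count >= this.limit) return false;
    state.count += 1;
    return true;
  }
}
```

In a handler, the check would run before the SSE headers are sent, so a rejected request can still receive an ordinary 429 response.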

Handling Client Disconnects

If the user navigates away or closes the browser, the HTTP connection closes. Detect this and abort the upstream stream so you stop paying for tokens the user will never see:

app.post('/api/chat/stream', async (req, res) => {
  const abortController = new AbortController();

  // Listen on the response: in recent Node versions the request's 'close'
  // event can fire as soon as the request body has been read.
  res.on('close', () => {
    if (!res.writableEnded) abortController.abort();
  });

  const stream = anthropic.messages.stream(
    { model: 'claude-sonnet-4-6', max_tokens: 1024, messages: [...] },
    { signal: abortController.signal }
  );

  try {
    for await (const event of stream) { ... }
  } catch (err) {
    // Rethrow anything that was not caused by our own abort
    if (!abortController.signal.aborted) throw err;
    // Client disconnected: exit quietly
  }

  res.end();
});

Consuming Streams in React

On the frontend, use the browser's fetch API with a ReadableStream reader. The EventSource API only supports GET requests; for POST requests (with a body), read the response stream directly.

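async function streamChat(
  message: string,
  onChunk: (text: string) => void,
  signal?: AbortSignal,
) {
  const response = await fetch('/api/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
    signal,   // lets the caller cancel the request mid-stream
  });

  if (!response.ok) throw new Error(`HTTP ${response.status}`);

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A network chunk can end mid-line, so keep the unfinished tail buffered
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';

    // Parse complete SSE lines: "data: {...}\n\n"
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice(6).trim();
      if (payload === '[DONE]') return;
      let parsed: { text?: string };
      try {
        parsed = JSON.parse(payload);
      } catch {
        continue;   // skip payloads that are not valid JSON
      }
      if (parsed.text !== undefined) onChunk(parsed.text);
    }
  }
}

React Hook

import { useCallback, useRef, useState } from 'react';

function useStreamingChat() {
  const [output, setOutput] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const abortRef = useRef<AbortController | null>(null);

  const send = useCallback(async (message: string) => {
    // Cancel any request still in flight before starting a new one
    abortRef.current?.abort();
    const controller = new AbortController();
    abortRef.current = controller;
    setOutput('');
    setIsStreaming(true);

    try {
      await streamChat(
        message,
        (text) => setOutput((prev) => prev + text),
        controller.signal,
      );
    } catch (err) {
      // An abort is the expected result of stop() or a newer send()
      if (!controller.signal.aborted) throw err;
    } finally {
      // Only the most recent request may clear the streaming flag
      if (abortRef.current === controller) setIsStreaming(false);
    }
  }, []);

  const stop = useCallback(() => {
    abortRef.current?.abort();
    setIsStreaming(false);
  }, []);

  return { output, isStreaming, send, stop };
}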

SSE vs WebSockets

Both protocols keep a connection open for real-time updates, but they serve different patterns:

Feature	SSE	WebSockets
Direction	Server → client only	Bidirectional
Protocol	HTTP/1.1 or HTTP/2	WS/WSS (protocol upgrade)
Auto-reconnect	Yes (built into EventSource)	Manual
Proxy compatibility	High — standard HTTP	Requires WS-aware proxy
Typical AI use case	Streaming a single response	Persistent chat session, collaborative editing

For most AI streaming use cases — a user sends a message, the AI responds — SSE is the right choice. It's simpler to implement, works through standard HTTP infrastructure, and the one-directional nature matches the pattern well.

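Use WebSockets when you need true bidirectionality: the client has to send messages over the same open connection while the server is still responding (e.g., steering or interrupting a running agent), or multiple clients need to share a live collaborative state. Server-initiated updates alone, such as a background agent reporting that a task finished, still fit SSE.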

Error Handling in Streams

Errors in a streaming context behave differently from request/response errors. The HTTP response has already started (200 OK headers sent) when the error occurs, so you can't change the status code. Instead, send an error event in the stream:

// Server: send an error event in the stream
import type { Response } from 'express';   // Express's Response, not the Fetch API global

function sendError(res: Response, message: string) {
  res.write(`data: ${JSON.stringify({ error: message })}\n\n`);
  res.write('data: [DONE]\n\n');
  res.end();
}

// In your handler:
try {
  for await (const event of stream) { ... }
} catch (err) {
  if (err instanceof Anthropic.APIError) {
    sendError(res, `API error: ${err.status}`);
  } else {
    sendError(res, 'An error occurred. Please try again.');
  }
}

// Client: check for error events
const { text, error } = JSON.parse(payload);
if (error) {
  showErrorToUser(error);
  return;
}
onChunk(text);

Retry on Network Interruption

Network interruptions mid-stream will cause the fetch to throw. Decide at the product level whether to automatically retry (acceptable for read-only queries) or surface the error to the user and let them resend. Automatic retry is risky for write operations or when the model has already partially responded.
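
One way to encode that product decision is a pair of small pure functions: one that decides whether an automatic retry is acceptable, and one that computes an exponential backoff delay. Both names, the RetryContext shape, and the default delays are assumptions made for this sketch.

```typescript
interface RetryContext {
  attempt: number;          // number of attempts already made
  maxAttempts: number;      // upper bound on attempts
  receivedAnyText: boolean; // true once any text has reached the user
  isReadOnly: boolean;      // true if repeating the request has no side effects
}

// Hypothetical policy: retry only side-effect-free requests that have not
// shown any output yet, and only while attempts remain.
function shouldAutoRetry(ctx: RetryContext): boolean {
  if (!ctx.isReadOnly) return false;
  if (ctx.receivedAnyText) return false;
  return ctx.attempt < ctx.maxAttempts;
}

// Exponential backoff: baseMs, 2x, 4x, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

When the policy says no, the caller would surface the error and let the user decide whether to resend.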

Streaming with Tool Use

When your AI calls tools (function calling), the stream emits tool_use blocks in addition to text. You need to handle both:

let toolInput = '';
let currentToolId = '';

for await (const event of stream) {
  if (event.type === 'content_block_start') {
    if (event.content_block.type === 'tool_use') {
      currentToolId = event.content_block.id;
      toolInput = '';
    }
  }

  if (event.type === 'content_block_delta') {
    if (event.delta.type === 'text_delta') {
      process.stdout.write(event.delta.text);   // stream text to user
    }
    if (event.delta.type === 'input_json_delta') {
      toolInput += event.delta.partial_json;    // accumulate tool args
    }
  }

  if (event.type === 'content_block_stop' && currentToolId) {
    const args = JSON.parse(toolInput || '{}');   // guard against an empty buffer (tool with no arguments)
    const result = await executeToolCall(currentToolId, args);
    // Continue conversation with tool result...
    currentToolId = '';
  }
}

You can stream text content to the user immediately while accumulating the tool input separately. Once the tool block closes, execute the tool call and continue the conversation.
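
The accumulation step can also be written as a pure function over a list of events, which is easier to unit test than a live stream. The sketch below tracks blocks by their index so that more than one tool call in a response is handled. StreamEventLike is a simplified local mirror of the event shapes used above, and collectToolCalls is a hypothetical name, not an SDK function.

```typescript
// Simplified local mirror of the streaming event shapes (assumption).
type StreamEventLike =
  | { type: 'content_block_start'; index: number;
      content_block: { type: 'tool_use'; id: string; name: string } | { type: 'text' } }
  | { type: 'content_block_delta'; index: number;
      delta: { type: 'text_delta'; text: string } | { type: 'input_json_delta'; partial_json: string } }
  | { type: 'content_block_stop'; index: number };

interface ToolCall { id: string; name: string; input: unknown; }

// Hypothetical helper: gathers completed tool calls from a sequence of events.
function collectToolCalls(events: StreamEventLike[]): ToolCall[] {
  const pending = new Map<number, { id: string; name: string; json: string }>();
  const completed: ToolCall[] = [];
  for (const event of events) {
    if (event.type === 'content_block_start') {
      if (event.content_block.type === 'tool_use') {
        const { id, name } = event.content_block;
        pending.set(event.index, { id, name, json: '' });
      }
    } else if (event.type === 'content_block_delta') {
      const block = pending.get(event.index);
      if (block !== undefined && event.delta.type === 'input_json_delta') {
        block.json += event.delta.partial_json; // accumulate argument fragments
      }
    } else {
      // content_block_stop: the argument JSON is complete, so parse it now.
      const block = pending.get(event.index);
      if (block !== undefined) {
        completed.push({ id: block.id, name: block.name, input: JSON.parse(block.json || '{}') });
        pending.delete(event.index);
      }
    }
  }
  return completed;
}
```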

Production Considerations

Timeouts

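Long AI responses can take 30–60 seconds or more. Check the timeout at every hop between the model and the browser. Nginx's proxy_read_timeout defaults to 60 seconds and measures the gap between two successive reads from the upstream rather than the whole response, so a stream is cut off when the model goes quiet for longer than that. Load balancers and hosting platforms often add idle or total-request limits of their own. For Nginx: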

# In your Nginx location block for the streaming endpoint
proxy_read_timeout 120s;
proxy_send_timeout 120s;
proxy_buffering off;      # critical — disables Nginx response buffering
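
A complementary technique, offered here as a sketch rather than a recommendation from the configuration above, is to send a periodic SSE comment line so that idle-timeout rules along the path keep seeing traffic while the model is quiet. Lines that start with a colon are comments in the SSE format and carry no data. The names HEARTBEAT_FRAME and startHeartbeat and the 15-second default are assumptions.

```typescript
// SSE comment frame: a line starting with ":" carries no data for the client.
const HEARTBEAT_FRAME = ': ping\n\n';

type WriteFn = (frame: string) => void;

// Hypothetical helper: calls `write` with a heartbeat every `intervalMs`
// and returns a function that stops the timer.
function startHeartbeat(write: WriteFn, intervalMs = 15_000): () => void {
  const timer = setInterval(() => write(HEARTBEAT_FRAME), intervalMs);
  return () => clearInterval(timer);
}

// Possible usage inside a handler (res is the HTTP response):
//   const stopHeartbeat = startHeartbeat((frame) => res.write(frame));
//   ...stream the model output...
//   stopHeartbeat();   // always stop the timer before res.end()
```

A client that only processes lines beginning with "data: " skips these frames without any extra handling.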

Disabling Buffering

HTTP proxies and load balancers often buffer responses by default. Buffering defeats streaming — the user gets one big batch instead of a trickle. Set X-Accel-Buffering: no on the response, and confirm your infrastructure respects it:

res.setHeader('X-Accel-Buffering', 'no');   // tells Nginx not to buffer
res.setHeader('Cache-Control', 'no-cache');
res.flushHeaders();

Backpressure

If your Node.js server receives tokens faster than the network can flush them to the client, you'll accumulate data in memory. Check res.write()'s return value — it returns false when the write buffer is full, signaling you should pause until the drain event fires:

for await (const event of stream) {
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    const canWrite = res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
    if (!canWrite) {
      await new Promise((resolve) => res.once('drain', resolve));
    }
  }
}

For most AI streaming use cases this is a non-issue — tokens arrive slower than they can be sent. But it matters if you're building a high-throughput proxy or aggregating multiple model outputs.

Streaming Implementation Checklist

SSE headers: Content-Type: text/event-stream, Cache-Control: no-cache, X-Accel-Buffering: no.
Client disconnect handling: listen for the connection closing (res.on('close') is the dependable hook across Node versions) and abort the upstream stream.
Error events: send errors as SSE data events after the response has started; you can't change the HTTP status code mid-stream.
Proxy/CDN buffering: confirm your infrastructure does not buffer SSE responses end-to-end.
Timeouts: server, proxy, and CDN all need timeouts longer than your longest expected response.
Frontend abort: expose a stop button; wire it to AbortController on the fetch call.
Usage logging: call finalMessage() after the stream completes to log token counts.
