Why Streaming Matters for AI UX
AI models generate text token by token. Without streaming, your server waits for the entire response, then sends it as one payload. With streaming, the first tokens typically arrive within a second or so and the user sees text appearing as it's generated.
The underlying mechanism is Server-Sent Events (SSE) — the server holds the HTTP connection open and pushes chunks as they arrive. The Anthropic API, OpenAI API, and most other AI providers use SSE for their streaming endpoints.
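To make the wire format concrete, here is a minimal sketch of how a single SSE message is framed. The helper name formatSseMessage is hypothetical and not part of any SDK; the framing it applies (one data: field per payload line, an optional event: field, and a blank line ending the message) follows the SSE format.

```typescript
// Hypothetical helper: frame a payload as one Server-Sent Events message.
function formatSseMessage(data: string, eventName?: string): string {
  const lines: string[] = [];
  if (eventName !== undefined) {
    lines.push(`event: ${eventName}`);
  }
  // Every line of the payload gets its own "data:" field; the client
  // joins the pieces back together with "\n".
  for (const line of data.split(/\r\n|\r|\n/)) {
    lines.push(`data: ${line}`);
  }
  // A blank line marks the end of the message.
  return lines.join('\n') + '\n\n';
}

// Show the raw frame for a JSON payload with the newlines made visible.
console.log(JSON.stringify(formatSseMessage(JSON.stringify({ text: 'Hel' }))));
```

Serializing each payload as JSON keeps it on a single line, which is why the examples in this guide only ever need one data: field per message.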
The relevant latency metric for streamed AI is time to first token (TTFT): how long before the user sees any output. TTFT depends mainly on the prompt, not on how long the response turns out to be, while total response time grows roughly linearly with the number of output tokens. Streaming makes long responses feel fast.
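A small worked example may help here. The sketch below models total response time as TTFT plus generation time; the 500 ms TTFT and 50 tokens per second are assumed figures chosen for illustration, not measurements of any model.

```typescript
// Illustrative latency model. The numbers used below are assumptions.
interface LatencyModel {
  ttftMs: number;          // time to first token
  tokensPerSecond: number; // steady generation rate after the first token
}

// Time until the full response has been generated.
function totalResponseMs(model: LatencyModel, outputTokens: number): number {
  return model.ttftMs + (outputTokens / model.tokensPerSecond) * 1000;
}

// Time until the user sees anything on screen.
function timeToFirstOutputMs(
  model: LatencyModel,
  outputTokens: number,
  streaming: boolean,
): number {
  return streaming ? model.ttftMs : totalResponseMs(model, outputTokens);
}

const assumed: LatencyModel = { ttftMs: 500, tokensPerSecond: 50 };
for (const tokens of [100, 1000]) {
  console.log({
    tokens,
    streamedMs: timeToFirstOutputMs(assumed, tokens, true),
    bufferedMs: timeToFirstOutputMs(assumed, tokens, false),
  });
}
```

Under these assumed numbers, a 1,000-token response takes 20.5 seconds to complete, yet the streamed version shows its first output after half a second.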
Streaming with the Anthropic SDK
The Anthropic SDK provides a stream() method that returns an async iterator of events. Each event contains a text delta you append to the output.
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();
async function streamResponse(userMessage: string): Promise<void> {
const stream = anthropic.messages.stream({
model: 'claude-sonnet-4-6',
max_tokens: 1024,
messages: [{ role: 'user', content: userMessage }],
});
for await (const event of stream) {
if (
event.type === 'content_block_delta' &&
event.delta.type === 'text_delta'
) {
process.stdout.write(event.delta.text);
}
}
const finalMessage = await stream.finalMessage();
console.log('\nUsage:', finalMessage.usage);
}
The finalMessage() call at the end resolves with the complete assembled message, including usage statistics: input tokens, output tokens, and cache metrics if you're using prompt caching.
Using the Text Stream Helper
For simple text-only use cases, the stream object also emits a text event whose listener receives only the text deltas:
const stream = anthropic.messages.stream({ ... });
stream.on('text', (text) => {
  process.stdout.write(text); // just the text, no event unwrapping
});
await stream.done(); // resolves once the stream has finished
Passing Streaming Through a Node.js Backend
In production you typically don't expose your Anthropic API key to the browser. Instead, your frontend calls your own backend, which calls Anthropic and proxies the stream. The browser receives SSE from your server.
Express SSE Endpoint
import express from 'express';
import Anthropic from '@anthropic-ai/sdk';
const app = express();
app.use(express.json());
const anthropic = new Anthropic();
app.post('/api/chat/stream', async (req, res) => {
const { message } = req.body;
// Set SSE headers
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('Connection', 'keep-alive');
res.flushHeaders();
const stream = anthropic.messages.stream({
model: 'claude-sonnet-4-6',
max_tokens: 1024,
messages: [{ role: 'user', content: message }],
});
for await (const event of stream) {
if (
event.type === 'content_block_delta' &&
event.delta.type === 'text_delta'
) {
// SSE format: "data: ...\n\n"
res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
}
}
res.write('data: [DONE]\n\n');
res.end();
});
The backend endpoint is where you check the user's session, apply per-user rate limits, validate input, and log usage. Never skip this layer — the SSE transport doesn't change the security model.
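As one way to picture the checks that belong in this layer, here is a minimal sketch of input validation and a per-user rate limit. Everything in it is an assumption made for illustration: the names validateMessage and SlidingWindowLimiter, the 4,000-character cap, and the in-memory storage, which would not work across multiple server instances.

```typescript
const MAX_MESSAGE_CHARS = 4000; // assumed cap, tune for your product

// Returns the cleaned message, or null if the request body is unusable.
function validateMessage(body: unknown): string | null {
  if (typeof body !== 'object' || body === null) return null;
  const message = (body as { message?: unknown }).message;
  if (typeof message !== 'string') return null;
  const trimmed = message.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_MESSAGE_CHARS) return null;
  return trimmed;
}

// Sliding-window limiter held in process memory (sketch only).
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly maxRequests: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(maxRequests: number, windowMs: number, now: () => number = Date.now) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.now = now; // injectable clock, which makes the limiter easy to test
  }

  // Records the request and returns true if the user is under the limit.
  tryAcquire(userId: string): boolean {
    const current = this.now();
    const cutoff = current - this.windowMs;
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.maxRequests;
    if (allowed) recent.push(current);
    this.hits.set(userId, recent);
    return allowed;
  }
}

// Example: at most 2 requests per user per second.
const limiter = new SlidingWindowLimiter(2, 1000);
console.log(validateMessage({ message: '  hello  ' }), limiter.tryAcquire('user-1'));
```

In the Express handler these checks would run before the SSE headers are sent, so a rejected request can still return a normal 400 or 429 status.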
Handling Client Disconnects
If the user navigates away or closes the browser, the HTTP connection closes. Detect this and abort the upstream stream to avoid continuing to pay for tokens the user will never see:
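app.post('/api/chat/stream', async (req, res) => {
  const abortController = new AbortController();
  // Listen on the response: since Node 16, the request's 'close' event fires
  // once the request has been fully read, not when the client goes away.
  res.on('close', () => {
    abortController.abort(); // harmless if the stream already finished
  });
  const stream = anthropic.messages.stream(
    { model: 'claude-sonnet-4-6', max_tokens: 1024, messages: [...] },
    { signal: abortController.signal }
  );
  try {
    for await (const event of stream) { ... }
  } catch (err) {
    // Rethrow anything that was not caused by our own abort
    if (!abortController.signal.aborted) throw err;
    // Client disconnected: exit quietly
  }
  res.end();
});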
Consuming Streams in React
On the frontend, use the browser's fetch API with a ReadableStream reader. The EventSource API only supports GET requests; for POST requests (with a body), read the response stream directly.
async function streamChat(
  message: string,
  onChunk: (text: string) => void,
  signal?: AbortSignal,
) {
  const response = await fetch('/api/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
    signal, // lets the caller cancel the request
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = ''; // holds a partial line until the rest of it arrives
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Parse SSE lines: "data: {...}\n\n"
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // the last piece may be an incomplete line
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice(6).trim();
      if (payload === '[DONE]') return;
      try {
        const { text } = JSON.parse(payload);
        onChunk(text);
      } catch {
        // Skip payloads that are not valid JSON
      }
    }
  }
}
React Hook
import { useCallback, useRef, useState } from 'react';
function useStreamingChat() {
  const [output, setOutput] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const abortRef = useRef<AbortController | null>(null);
  const send = useCallback(async (message: string) => {
    abortRef.current?.abort(); // cancel any request still in flight
    const controller = new AbortController();
    abortRef.current = controller;
    setOutput('');
    setIsStreaming(true);
    try {
      await streamChat(message, (text) => {
        setOutput((prev) => prev + text);
      }, controller.signal);
    } catch (err) {
      // An abort is expected when the user stops or resends
      if ((err as Error).name !== 'AbortError') throw err;
    } finally {
      // Only clear the flag if a newer request has not replaced this one
      if (abortRef.current === controller) setIsStreaming(false);
    }
  }, []);
  const stop = useCallback(() => {
    abortRef.current?.abort();
    setIsStreaming(false);
  }, []);
  return { output, isStreaming, send, stop };
}
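One detail worth isolating: the chunks returned by reader.read() do not necessarily line up with SSE message boundaries, so a single data: line can arrive split across two reads. The sketch below shows one way to buffer for that. The name createSseParser is hypothetical, and the parser assumes the server sends one data: line per message separated by \n, as the Express endpoint in this guide does.

```typescript
// Incremental parser for "data: ..." lines. Feed it decoded text chunks of
// any size; it calls onData once per complete data line.
function createSseParser(onData: (payload: string) => void) {
  let buffer = '';
  return {
    feed(chunk: string): void {
      buffer += chunk;
      const lines = buffer.split('\n');
      // The last element is '' if the chunk ended on a newline, otherwise it
      // is a partial line. Either way, hold it back for the next call.
      buffer = lines.pop() ?? '';
      for (const line of lines) {
        if (line.startsWith('data: ')) {
          onData(line.slice(6));
        }
      }
    },
  };
}

// Usage: the same message delivered in awkwardly split chunks.
const received: string[] = [];
const parser = createSseParser((payload) => received.push(payload));
parser.feed('data: {"te');
parser.feed('xt":"Hi"}\n\ndata: [DO');
parser.feed('NE]\n\n');
console.log(received);
```

Keeping the parser separate from fetch also makes it straightforward to unit test without a network connection.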
SSE vs WebSockets
Both protocols keep a connection open for real-time updates, but they serve different patterns:
| Feature | SSE | WebSockets |
|---|---|---|
| Direction | Server → client only | Bidirectional |
| Protocol | HTTP/1.1 or HTTP/2 | WS/WSS (protocol upgrade) |
| Auto-reconnect | Yes with EventSource; manual when reading the stream with fetch | Manual |
| Proxy compatibility | High — standard HTTP | Requires WS-aware proxy |
| Typical AI use case | Streaming a single response | Persistent chat session, collaborative editing |
For most AI streaming use cases — a user sends a message, the AI responds — SSE is the right choice. It's simpler to implement, works through standard HTTP infrastructure, and the one-directional nature matches the pattern well.
Use WebSockets when you need true bidirectionality: the server needs to push unsolicited updates (e.g., a background agent completing a task), or multiple clients need to share a live collaborative state.
Error Handling in Streams
Errors in a streaming context behave differently from request/response errors. The HTTP response has already started (200 OK headers sent) when the error occurs, so you can't change the status code. Instead, send an error event in the stream:
import type { Response } from 'express';
// Server: send an error event in the stream
function sendError(res: Response, message: string) {
res.write(`data: ${JSON.stringify({ error: message })}\n\n`);
res.write('data: [DONE]\n\n');
res.end();
}
// In your handler:
try {
for await (const event of stream) { ... }
} catch (err) {
if (err instanceof Anthropic.APIError) {
sendError(res, `API error: ${err.status}`);
} else {
sendError(res, 'An error occurred. Please try again.');
}
}
// Client: check for error events
const { text, error } = JSON.parse(payload);
if (error) {
showErrorToUser(error);
return;
}
onChunk(text);
Retry on Network Interruption
Network interruptions mid-stream will cause the fetch to throw. Decide at the product level whether to automatically retry (acceptable for read-only queries) or surface the error to the user and let them resend. Automatic retry is risky for write operations or when the model has already partially responded.
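That product decision can be written down as a small policy function. The sketch below is one possible policy, not a recommendation: the names, the limit of 3 attempts, and the delay values are all assumptions. It retries only read-only requests that failed before any text reached the user, with exponential backoff between attempts.

```typescript
interface StreamAttempt {
  attempt: number;       // 1-based count of attempts made so far
  receivedText: boolean; // did any text reach the user before the failure?
  readOnly: boolean;     // is the request safe to repeat?
}

const MAX_ATTEMPTS = 3;    // assumed
const BASE_DELAY_MS = 500; // assumed
const MAX_DELAY_MS = 8000; // assumed

// Returns how long to wait before retrying, or null to surface the error.
function nextRetryDelayMs(state: StreamAttempt): number | null {
  // Never replay writes, and never restart a response the user has
  // already started reading.
  if (!state.readOnly || state.receivedText) return null;
  if (state.attempt >= MAX_ATTEMPTS) return null;
  // Exponential backoff: 500 ms, 1000 ms, 2000 ms, ... capped at MAX_DELAY_MS.
  return Math.min(BASE_DELAY_MS * 2 ** (state.attempt - 1), MAX_DELAY_MS);
}

console.log(nextRetryDelayMs({ attempt: 1, receivedText: false, readOnly: true }));
console.log(nextRetryDelayMs({ attempt: 1, receivedText: true, readOnly: true }));
```

A caller would track receivedText inside its onChunk callback and consult this function in the catch block around the fetch.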
Streaming with Tool Use
When your AI calls tools (function calling), the stream emits tool_use blocks in addition to text. You need to handle both:
let toolInput = '';
let currentToolId = '';
for await (const event of stream) {
if (event.type === 'content_block_start') {
if (event.content_block.type === 'tool_use') {
currentToolId = event.content_block.id;
toolInput = '';
}
}
if (event.type === 'content_block_delta') {
if (event.delta.type === 'text_delta') {
process.stdout.write(event.delta.text); // stream text to user
}
if (event.delta.type === 'input_json_delta') {
toolInput += event.delta.partial_json; // accumulate tool args
}
}
if (event.type === 'content_block_stop' && currentToolId) {
const args = JSON.parse(toolInput || '{}'); // tools without arguments stream an empty string
const result = await executeToolCall(currentToolId, args);
// Continue conversation with tool result...
currentToolId = '';
}
}
You can stream text content to the user immediately while accumulating the tool input separately. Once the tool block closes, execute the tool call and continue the conversation.
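The accumulation logic can also be expressed as a pure function over a list of events, which makes it testable without a live stream. This is a sketch under stated assumptions: the StreamEvent type is a simplified stand-in for the SDK's event types (which carry more fields), and collectToolCalls is a hypothetical name. It keys pending tool blocks by the event's index so that more than one tool call in a response is handled.

```typescript
// Simplified event shapes for illustration only.
type StreamEvent =
  | {
      type: 'content_block_start';
      index: number;
      content_block: { type: 'text' } | { type: 'tool_use'; id: string; name: string };
    }
  | {
      type: 'content_block_delta';
      index: number;
      delta:
        | { type: 'text_delta'; text: string }
        | { type: 'input_json_delta'; partial_json: string };
    }
  | { type: 'content_block_stop'; index: number };

interface ToolCall {
  id: string;
  name: string;
  input: unknown;
}

function collectToolCalls(events: StreamEvent[]): { text: string; toolCalls: ToolCall[] } {
  let text = '';
  const toolCalls: ToolCall[] = [];
  // Tool blocks that have started but not yet stopped, keyed by block index.
  const pending = new Map<number, { id: string; name: string; json: string }>();

  for (const event of events) {
    if (event.type === 'content_block_start') {
      if (event.content_block.type === 'tool_use') {
        const { id, name } = event.content_block;
        pending.set(event.index, { id, name, json: '' });
      }
    } else if (event.type === 'content_block_delta') {
      if (event.delta.type === 'text_delta') {
        text += event.delta.text;
      } else {
        const tool = pending.get(event.index);
        if (tool) tool.json += event.delta.partial_json;
      }
    } else {
      const tool = pending.get(event.index);
      if (tool) {
        pending.delete(event.index);
        // An empty buffer means the tool was called with no arguments.
        toolCalls.push({ id: tool.id, name: tool.name, input: JSON.parse(tool.json || '{}') });
      }
    }
  }
  return { text, toolCalls };
}

// Usage with a hand-written event sequence (tool name and id are made up).
const sample: StreamEvent[] = [
  { type: 'content_block_start', index: 0, content_block: { type: 'text' } },
  { type: 'content_block_delta', index: 0, delta: { type: 'text_delta', text: 'Checking. ' } },
  { type: 'content_block_stop', index: 0 },
  { type: 'content_block_start', index: 1, content_block: { type: 'tool_use', id: 'toolu_1', name: 'get_weather' } },
  { type: 'content_block_delta', index: 1, delta: { type: 'input_json_delta', partial_json: '{"city":' } },
  { type: 'content_block_delta', index: 1, delta: { type: 'input_json_delta', partial_json: '"Paris"}' } },
  { type: 'content_block_stop', index: 1 },
];
console.log(collectToolCalls(sample));
```

Tracking the tool name alongside the id matters in practice, since the name is what tells your code which function to run.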
Production Considerations
Timeouts
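Long AI responses can take 30–60 seconds, and a stream can go quiet for a while partway through. Set your server and proxy timeouts accordingly. Nginx's proxy_read_timeout defaults to 60 seconds and measures the gap between two successive reads from the upstream, so a stream that pauses for longer than that is cut off; load balancers and hosting platforms often enforce their own idle limits as well. For Nginx: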
# In your Nginx location block for the streaming endpoint
proxy_read_timeout 120s;
proxy_send_timeout 120s;
proxy_buffering off; # critical — disables Nginx response buffering
Disabling Buffering
HTTP proxies and load balancers often buffer responses by default. Buffering defeats streaming — the user gets one big batch instead of a trickle. Set X-Accel-Buffering: no on the response, and confirm your infrastructure respects it:
res.setHeader('X-Accel-Buffering', 'no'); // tells Nginx not to buffer
res.setHeader('Cache-Control', 'no-cache');
res.flushHeaders();
Backpressure
If your Node.js server receives tokens faster than the network can flush them to the client, you'll accumulate data in memory. Check res.write()'s return value — it returns false when the write buffer is full, signaling you should pause until the drain event fires:
for await (const event of stream) {
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
const canWrite = res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
if (!canWrite) {
await new Promise((resolve) => res.once('drain', resolve));
}
}
}
For most AI streaming use cases this is a non-issue — tokens arrive slower than they can be sent. But it matters if you're building a high-throughput proxy or aggregating multiple model outputs.
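If you do need backpressure handling, one thing to watch is that a client can disconnect while you are waiting for drain, in which case the drain event never fires. The sketch below is one way to guard against that using Node's events.once with an AbortSignal for cleanup. The helper name writeWithBackpressure is hypothetical, and an Express res object is assumed to be usable as the Writable here.

```typescript
import { once } from 'node:events';
import { Writable } from 'node:stream';

// Hypothetical helper: write one chunk and, if the destination's buffer is
// full, wait until it drains or the destination closes, whichever is first.
async function writeWithBackpressure(dest: Writable, chunk: string): Promise<void> {
  if (dest.write(chunk)) return; // buffer has room, keep going

  const cleanup = new AbortController();
  try {
    // once() rejects if the stream emits 'error', so failures propagate.
    await Promise.race([
      once(dest, 'drain', { signal: cleanup.signal }),
      once(dest, 'close', { signal: cleanup.signal }),
    ]);
  } finally {
    cleanup.abort(); // remove the listener for the event that did not fire
  }
}

// Usage with a deliberately tiny buffer so the first write fills it.
const slowSink = new Writable({
  highWaterMark: 1,
  write(_chunk, _encoding, callback) {
    setImmediate(callback); // simulate a slow consumer
  },
});
const pendingWrite = writeWithBackpressure(slowSink, 'hello');
pendingWrite.then(() => console.log('buffer drained'));
```

Inside the streaming loop, the res.write(...) call and the drain wait would collapse into a single await writeWithBackpressure(res, frame).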
Streaming Implementation Checklist
- SSE headers: Content-Type: text/event-stream, Cache-Control: no-cache, X-Accel-Buffering: no.
- Client disconnect handling: listen for the connection's close event and abort the upstream stream.
- Error events: send errors as SSE data events after the response has started; you can't change the HTTP status code mid-stream.
- Proxy/CDN buffering: confirm your infrastructure does not buffer SSE responses end-to-end.
- Timeouts: server, proxy, and CDN all need timeouts longer than your longest expected response.
- Frontend abort: expose a stop button; wire it to AbortController on the fetch call.
- Usage logging: read usage from the final message after the stream completes to log token counts.