Stream LLM Tokens to a React UI Without Melting Your Server
For: A mid-senior frontend or full-stack engineer at a seed-to-Series-A SaaS startup who just shipped a ChatGPT-style text generation feature using fetch() and is now watching their Node server's memory climb under concurrent users because they buffered the entire completion before sending it
Your ChatGPT-style feature works beautifully in the demo. One tab, tokens flowing, cursor blinking, product manager nodding. Then ten users hit it simultaneously and your Node process RSS climbs to 2GB, p95 latency triples, and Datadog tells you nothing useful because the event loop isn't blocked — it's just hoarding bytes.
The problem isn't your model, your network, or your React code. It's that you piped openai.chat.completions.create({ stream: true }) straight into res.write() and never asked Node whether the socket was ready to accept more data. The OpenAI SDK happily produces tokens at ~50/sec per stream. Your client's EventSource drains them at the speed of the user's network. The gap is buffered in your server's memory. Multiply by concurrent users.
This tutorial fixes it. We'll build a streaming endpoint that respects backpressure end-to-end, a React hook that consumes it without dropping frames, and a load test that proves the difference. Total: about 120 lines of code.
What you need before starting
- Node.js 20+ (we use the native
ReadableStreamWeb API) - An OpenAI API key with access to
gpt-4o-minior any streaming model - A React 18+ app (Vite or Next.js — both work)
autocannonfor load testing:npm i -g autocannon
The mental model: OpenAI gives you a ReadableStream. Express gives you a Writable response. Between them, you need to await when the writable says "I'm full." That's it. That's the post.
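In code, the whole idea fits in one tiny helper. This is only a sketch — tokenSource and res stand in for whatever stream and response object you actually have; Step 2 builds the real thing:

import { once } from 'node:events';

// Minimal sketch: pull from any async-iterable token source and pause whenever
// the HTTP response says its buffer is full.
async function pump(tokenSource, res) {
  for await (const token of tokenSource) {
    // write() returns false when the kernel send buffer is full;
    // awaiting 'drain' pauses this loop, which pauses the upstream pull.
    if (!res.write(token)) await once(res, 'drain');
  }
  res.end();
}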
Step 1: Reproduce the broken version first
You have to feel the pain to believe the fix matters. Create server-broken.js:
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

const openai = new OpenAI();

app.post('/chat-broken', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    stream: true,
    messages: [{ role: 'user', content: req.body.prompt }],
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || '';
    // The bug: res.write()'s return value is ignored, so slow clients
    // make Node queue every pending byte in userland memory.
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(3001);
Run it: node server-broken.js. It works. Single user, perfect. Now simulate a slow client with 50 concurrent connections:
autocannon -c 50 -d 30 -m POST \
-H "Content-Type: application/json" \
-b '{"prompt":"Write a 2000 word essay on backpressure"}' \
http://localhost:3001/chat-broken
Watch process.memoryUsage().heapUsed in another terminal. On my MacBook it climbs from 40MB to roughly 600MB and stays there. Why? Each res.write() returns false when the kernel send buffer is full, but we ignored the return value. Node queues the writes in userland. Memory grows linearly with (token rate × concurrent users × client slowness).
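The quickest way to watch that number is to log it from the process itself — a throwaway, debug-only sketch you can paste into server-broken.js while autocannon runs:

// Debug only: print heap and RSS every 2 seconds during the load test.
setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  console.log(`heapUsed=${Math.round(heapUsed / 1e6)}MB rss=${Math.round(rss / 1e6)}MB`);
}, 2000);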
Step 2: Add the three lines that fix everything
Create server.js. The only meaningful change is checking res.write()'s return value and waiting for 'drain' when it says no:
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

const openai = new OpenAI();

// Resolve immediately if the socket accepted the data, otherwise wait for 'drain'.
function writeWithBackpressure(res, data) {
  return new Promise((resolve) => {
    if (res.write(data)) return resolve();
    res.once('drain', resolve);
  });
}

app.post('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders();

  // Cancel the upstream OpenAI request if the client disconnects.
  const abortController = new AbortController();
  req.on('close', () => abortController.abort());

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      stream: true,
      messages: [{ role: 'user', content: req.body.prompt }],
    }, { signal: abortController.signal });

    for await (const chunk of stream) {
      if (abortController.signal.aborted) break;
      const token = chunk.choices[0]?.delta?.content || '';
      if (!token) continue;
      await writeWithBackpressure(res, `data: ${JSON.stringify({ token })}\n\n`);
    }

    await writeWithBackpressure(res, 'data: [DONE]\n\n');
  } catch (err) {
    if (err.name !== 'AbortError') {
      await writeWithBackpressure(res, `data: ${JSON.stringify({ error: err.message })}\n\n`);
    }
  } finally {
    res.end();
  }
});

app.listen(3001);
Three things changed:
- writeWithBackpressure awaits 'drain' when the socket is full. The for-await loop pauses, which means the OpenAI SDK's iterator pauses, which means the underlying HTTP stream from OpenAI stops being pulled — backpressure propagates all the way to the source.
- req.on('close') aborts the OpenAI request when the user closes the tab. Without this, you keep paying for tokens nobody will read.
- res.flushHeaders() sends headers immediately so the client sees the stream open right away.
Run the same autocannon test against /chat. Heap stays under 120MB on my machine across 50 concurrent users. Throughput drops slightly per-user when clients are slow — that's correct behavior. You're trading buffered "speed" for actual stability.
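For copy-paste convenience, that's the Step 1 command with only the path changed:

autocannon -c 50 -d 30 -m POST \
  -H "Content-Type: application/json" \
  -b '{"prompt":"Write a 2000 word essay on backpressure"}' \
  http://localhost:3001/chat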
Step 3: Build the React consumer
EventSource doesn't support POST, so we use fetch with manual SSE parsing. Create useStreamingChat.ts:
import { useState, useRef, useCallback } from 'react';

export function useStreamingChat() {
  const [text, setText] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const abortRef = useRef<AbortController | null>(null);

  const send = useCallback(async (prompt: string) => {
    // Cancel any in-flight stream before starting a new one.
    abortRef.current?.abort();
    const controller = new AbortController();
    abortRef.current = controller;

    setText('');
    setIsStreaming(true);

    try {
      const res = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt }),
        signal: controller.signal,
      });
      if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });

        // SSE messages end with a blank line; keep the trailing partial message for the next chunk.
        const lines = buffer.split('\n\n');
        buffer = lines.pop() || '';

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;
          const payload = line.slice(6);
          if (payload === '[DONE]') return;

          let token: string | undefined;
          let error: string | undefined;
          try {
            ({ token, error } = JSON.parse(payload));
          } catch {
            continue; // malformed or partial frame — skip it
          }
          if (error) throw new Error(error);
          if (token) setText((t) => t + token);
        }
      }
    } catch (err) {
      // Aborting (stop button, new prompt, unmount) is expected, not an error.
      if ((err as Error).name !== 'AbortError') throw err;
    } finally {
      setIsStreaming(false);
    }
  }, []);

  const cancel = useCallback(() => abortRef.current?.abort(), []);

  return { text, isStreaming, send, cancel };
}
Three details that matter:
- Buffer split on \n\n, not \n. SSE messages are double-newline-terminated, and lines.pop() keeps the trailing partial message in the buffer for the next chunk.
- setText((t) => t + token). Functional updates avoid stale closures when tokens arrive faster than React commits.
- AbortController on the fetch. When the user clicks "stop" or navigates away, the server's req.on('close') fires and the OpenAI request is canceled. End-to-end cancellation.
Step 4: Avoid React's render storm
If your model emits 50 tokens/sec and you call setText on every one, React commits 50 times/sec. Fine for a single message. Painful when you also have a syntax-highlighted markdown renderer downstream.
Two fixes, pick one:
Option A — batch with requestAnimationFrame. Accumulate tokens in a ref, flush once per frame:
const pendingRef = useRef('');
const rafRef = useRef<number | null>(null);

const scheduleFlush = () => {
  if (rafRef.current) return;
  rafRef.current = requestAnimationFrame(() => {
    setText((t) => t + pendingRef.current);
    pendingRef.current = '';
    rafRef.current = null;
  });
};

// in the loop, replace setText(...) with:
pendingRef.current += token;
scheduleFlush();
Option B — render the markdown only on stream completion and show plain text during streaming. Simpler, often better UX. Markdown re-parsing on every token is what actually melts the browser, not React itself.
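Wired to the hook from Step 3, Option B looks roughly like this (react-markdown is an assumption here — swap in whatever renderer you already use):

import ReactMarkdown from 'react-markdown';
import { useStreamingChat } from './useStreamingChat';

// Plain text while tokens arrive; parse markdown exactly once when the stream ends.
export function ChatMessage() {
  const { text, isStreaming, send, cancel } = useStreamingChat();

  return (
    <div>
      <button onClick={() => send('Explain backpressure in Node.js')}>Ask</button>
      {isStreaming && <button onClick={cancel}>Stop</button>}
      {isStreaming ? (
        <pre style={{ whiteSpace: 'pre-wrap' }}>{text}</pre>
      ) : (
        <ReactMarkdown>{text}</ReactMarkdown>
      )}
    </div>
  );
}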
Step 5: Verify backpressure works end-to-end
Throttle your network in DevTools to "Slow 3G" and run the streaming chat. In the broken version, the server still happily buffers. In the fixed version, you'll see the OpenAI request itself slow down — that's TCP backpressure traveling all the way from the user's flaky wifi to OpenAI's servers. Beautiful.
To confirm in production, log the writableLength on the response. If it consistently grows above ~64KB per connection, your drain logic isn't connected:
// Debug only: probe the response's internal buffer once a second.
const probe = setInterval(() => {
  console.log('writableLength:', res.writableLength);
}, 1000);
res.on('close', () => clearInterval(probe));
Step 6: Production hardening
A few things that bit us in real deployments:
- Disable proxy buffering. If you're behind Nginx, set proxy_buffering off; and send X-Accel-Buffering: no as a response header. Otherwise Nginx buffers your entire stream and your fancy backpressure does nothing.
- Cloudflare and similar CDNs buffer SSE responses on some plans. Test through your actual production edge, not localhost.
- Heartbeat comments. Some load balancers kill idle connections at 30-60 seconds. If a model takes a while to start emitting, send : ping\n\n every 15 seconds. Comment lines are ignored by SSE parsers (including the fetch-based one in Step 3, which skips non-data lines). A minimal sketch follows this list.
- HTTP/1.1 connection limits. Browsers cap concurrent connections per origin at 6 over HTTP/1.1, and an open SSE stream holds one. Use HTTP/2 in production or you'll lock users out of other API calls while a stream is open.
- Token-level rate limiting. Backpressure protects your server's memory. It does not protect your OpenAI bill. Add a per-user token budget at the application layer.
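Here's a minimal sketch of the heartbeat and the anti-buffering header, dropped into the /chat handler from Step 2 (the 15-second interval is arbitrary — set it below your load balancer's idle timeout):

// Belt and suspenders alongside proxy_buffering off: tell Nginx not to buffer this response.
res.setHeader('X-Accel-Buffering', 'no');

// SSE comment lines (leading colon) keep idle connections alive and are ignored by clients.
const heartbeat = setInterval(() => {
  if (!res.writableEnded) res.write(': ping\n\n');
}, 15_000);
res.on('close', () => clearInterval(heartbeat));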
Common errors and how to read them
The stream works locally but stalls in production
99% of the time this is proxy buffering. Check Nginx, your CDN, and any API gateway. Add X-Accel-Buffering: no and verify the response headers actually reach the browser unmodified.
"Maximum call stack size exceeded" or memory still climbing
You probably forgot to await writeWithBackpressure. Without the await, the for-await loop continues at full speed and you're back to the broken version. Also, if Node prints a MaxListenersExceededWarning for 'drain', listeners are piling up faster than they resolve — another sign the backpressure path isn't actually waiting.
EventSource closes randomly after ~30 seconds
Idle timeout on a load balancer. Add heartbeat comments. AWS ALB defaults to 60s, GCP to 30s, Heroku to 30s.
Tokens arrive in chunks instead of one-by-one
Either Node is buffering small writes (call res.flushHeaders() early and ensure no compression middleware is wrapping the response — compression() will batch your SSE), or your client is behind a corporate proxy doing the same. Compression and SSE don't mix; exclude streaming routes from your compression middleware.
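If you're on the standard compression middleware, one way to carve the streaming route out (assuming the route is /chat, as in Step 2):

import compression from 'compression';

// Compress everything except the SSE route, so tokens flush immediately.
app.use(compression({
  filter: (req, res) => {
    if (req.path === '/chat') return false;
    return compression.filter(req, res); // fall back to the default filter
  },
}));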
Aborting the fetch doesn't stop the OpenAI charge
Your server isn't propagating the abort. Confirm req.on('close') fires and that you're passing the AbortSignal into the OpenAI SDK call. Without that signal, the SDK keeps reading the stream to completion even after the client is gone.
What this approach is bad at
Backpressure-aware SSE is the right default for chat UIs, but it has limits worth naming:
- It's still one-way. If you need the client to send mid-stream signals (regenerate, branch, edit), you want WebSockets or the newer WebTransport. SSE plus a separate POST endpoint is fine for most cases but feels clunky for collaborative tools.
- It doesn't help with cold starts. If your first token latency is 4 seconds, no amount of streaming polish will hide it. That's a model selection or prompt caching problem.
- Reconnection is your problem. Native EventSource reconnects automatically. Fetch-based SSE doesn't — if you need resumable streams across network blips, you're implementing Last-Event-ID handling and server-side replay yourself.
- Browser DevTools are mediocre at SSE. The Network tab shows the stream as a single pending request. Use the EventStream tab in Chrome or just console.log in your reader loop.
For most SaaS chat features — copilots, support assistants, generation UIs — this pattern is what you want. We've shipped variants of it across AI products at CodeNicely, including healthcare workflows like HealthPotli's drug interaction assistant, where keeping the server stable under unpredictable token rates was non-negotiable.
Frequently Asked Questions
Should I use Server-Sent Events or WebSockets for LLM streaming?
SSE for almost all chat UIs. It's HTTP, it works through every proxy and CDN with minor config, and it has built-in reconnection semantics. Use WebSockets when you need bidirectional mid-stream messages — collaborative editors, voice, agents that accept tool-call interrupts.
Why does my Node server's memory grow even when I'm using streaming?
Because res.write() returns a boolean indicating whether the socket accepted the data, and most tutorials ignore it. When the return is false, Node buffers the bytes in userland and you must wait for the 'drain' event before writing more. Without that wait, slow clients cause unbounded memory growth — which is exactly the bug this post fixes.
Does the Vercel AI SDK handle backpressure correctly?
The Vercel AI SDK uses the Web Streams API, which propagates backpressure correctly when you use it end-to-end. The bug usually appears when teams mix the SDK's StreamingTextResponse with custom Express middleware that doesn't honor backpressure, or when they're on a Node runtime older than 18 where the Web Streams shim is incomplete.
How do I test streaming endpoints under realistic load?
Autocannon or k6 with a slow-client simulation. The trick is throttling the consumer side — fast clients won't reproduce the bug. Run a load test where 20% of clients sleep 100ms between reads, and watch RSS over a 5-minute window. If memory grows monotonically, your backpressure is broken.
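Autocannon doesn't easily simulate slow readers, so a hand-rolled slow consumer is sometimes the more honest test. A sketch (Node 20+, run as an ES module; the connection count and delay are arbitrary):

// slow-clients.js — open N streams against the endpoint and read each one slowly.
const CONNECTIONS = 50;
const READ_DELAY_MS = 100;

async function slowClient() {
  const res = await fetch('http://localhost:3001/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: 'Write a 2000 word essay on backpressure' }),
  });
  const reader = res.body.getReader();
  while (true) {
    const { done } = await reader.read();
    if (done) break;
    // The artificial pause between reads is what fills the server's buffers.
    await new Promise((r) => setTimeout(r, READ_DELAY_MS));
  }
}

await Promise.all(Array.from({ length: CONNECTIONS }, slowClient));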
What's the right way to estimate cost and timeline for adding streaming AI to my product?
It depends on your existing stack, model choice, and how much UX polish (markdown, code highlighting, citations, cancel/regenerate) you need. For a tailored estimate, talk to CodeNicely for a personalized assessment based on your specific architecture and traffic profile.
Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.