May 9, 2026 • 11 min read

Stream LLM Tokens to a React UI Without Melting Your Server

For: a mid-senior frontend or full-stack engineer at a seed-to-Series-A SaaS startup who just shipped a ChatGPT-style text generation feature using fetch() and is now watching their Node server's memory climb under concurrent users, because tokens are being written faster than slow clients can drain them

Your ChatGPT-style feature works beautifully in the demo. One tab, tokens flowing, cursor blinking, product manager nodding. Then ten users hit it simultaneously and your Node process RSS climbs to 2GB, p95 latency triples, and Datadog tells you nothing useful because the event loop isn't blocked — it's just hoarding bytes.

The problem isn't your model, your network, or your React code. It's that you piped openai.chat.completions.create({ stream: true }) straight into res.write() and never asked Node whether the socket was ready to accept more data. The OpenAI SDK happily produces tokens at ~50/sec per stream. Your client drains them at the speed of the user's network. The gap is buffered in your server's memory. Multiply by concurrent users.

This tutorial fixes it. We'll build a streaming endpoint that respects backpressure end-to-end, a React hook that consumes it without dropping frames, and a load test that proves the difference. Total: about 120 lines of code.

What you need before starting

  1. Node 18 or newer (native fetch and Web Streams).
  2. An OpenAI API key exported as OPENAI_API_KEY.
  3. A React 18 app to host the hook.
  4. npm install express openai, plus autocannon for the load test.

The mental model: OpenAI gives you a ReadableStream. Express gives you a Writable response. Between them, you need to await when the writable says "I'm full." That's it. That's the post.
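Before Express enters the picture, that mental model fits in a dozen lines of plain Node. A minimal sketch with a made-up slow sink (the highWaterMark and delay values are arbitrary, chosen only to force backpressure; run as an ES module):

```javascript
import { Writable } from 'node:stream';

// A deliberately slow consumer with a tiny buffer, standing in for a slow client.
const slowSink = new Writable({
  highWaterMark: 16, // write() starts returning false past this many buffered bytes
  write(chunk, _enc, done) {
    setTimeout(done, 5); // take 5ms to "send" each chunk
  },
});

// The whole trick: resolve immediately if the sink accepted the chunk,
// otherwise wait for 'drain' before letting the producer continue.
function writeOrWait(writable, data) {
  return new Promise((resolve) => {
    if (writable.write(data)) return resolve();
    writable.once('drain', resolve);
  });
}

let drains = 0;
slowSink.on('drain', () => drains++);

for (let i = 0; i < 20; i++) {
  await writeOrWait(slowSink, 'x'.repeat(8)); // a fast producer, paused by the sink
}
console.log('producer paused for drain', drains, 'times');
```

The producer loop only advances as fast as the sink empties; that pause is what the broken server in Step 1 never does.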

Step 1: Reproduce the broken version first

You have to feel the pain to believe the fix matters. Create server-broken.js:

import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());
const openai = new OpenAI();

app.post('/chat-broken', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    stream: true,
    messages: [{ role: 'user', content: req.body.prompt }],
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content || '';
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(3001);

Run it: node server-broken.js. It works. Single user, perfect. Now simulate a slow client with 50 concurrent connections:

autocannon -c 50 -d 30 -m POST \
  -H "Content-Type: application/json" \
  -b '{"prompt":"Write a 2000 word essay on backpressure"}' \
  http://localhost:3001/chat-broken

Watch process.memoryUsage().heapUsed in another terminal. On my MacBook it climbs from 40MB to roughly 600MB and stays there. Why? res.write() returns false once Node's internal buffer for that socket passes its high-water mark — the kernel send buffer is full and bytes are piling up in userland — but we ignored the return value, so Node just keeps queueing. Memory grows linearly with token rate × concurrent users × client slowness.
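To watch the climb without reaching for Datadog, drop a throwaway heap logger into server-broken.js (the 1-second interval is arbitrary):

```javascript
// Log heap and RSS once a second while autocannon hammers the endpoint.
setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  console.log(`heap ${(heapUsed / 1e6).toFixed(0)} MB  rss ${(rss / 1e6).toFixed(0)} MB`);
}, 1000).unref(); // unref: this timer alone won't keep the process alive
```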

Step 2: Add the three lines that fix everything

Create server.js. The only meaningful change is checking res.write()'s return value and waiting for 'drain' when it says no:

import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());
const openai = new OpenAI();

function writeWithBackpressure(res, data) {
  return new Promise((resolve) => {
    if (res.write(data)) return resolve();
    // Resolve on 'drain', but also on 'close' so a client that disconnects
    // mid-wait can't strand this handler forever.
    const onDrain = () => { res.off('close', onClose); resolve(); };
    const onClose = () => { res.off('drain', onDrain); resolve(); };
    res.once('drain', onDrain);
    res.once('close', onClose);
  });
}

app.post('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders();

  const abortController = new AbortController();
  req.on('close', () => abortController.abort());

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      stream: true,
      messages: [{ role: 'user', content: req.body.prompt }],
    }, { signal: abortController.signal });

    for await (const chunk of stream) {
      if (abortController.signal.aborted) break;
      const token = chunk.choices[0]?.delta?.content || '';
      if (!token) continue;
      await writeWithBackpressure(res, `data: ${JSON.stringify({ token })}\n\n`);
    }
    await writeWithBackpressure(res, 'data: [DONE]\n\n');
  } catch (err) {
    if (err.name !== 'AbortError') {
      await writeWithBackpressure(res, `data: ${JSON.stringify({ error: err.message })}\n\n`);
    }
  } finally {
    res.end();
  }
});

app.listen(3001);

Three things changed:

  1. writeWithBackpressure awaits 'drain' when the socket is full. The for-await loop pauses, which means the OpenAI SDK's iterator pauses, which means OpenAI's HTTP/2 stream stops pulling, which propagates backpressure all the way to the source.
  2. req.on('close') aborts the OpenAI request when the user closes the tab. Without this, you keep paying for tokens nobody will read.
  3. res.flushHeaders() sends headers immediately, so the client sees the stream open without waiting for the first token.

Run the same autocannon test against /chat. Heap stays under 120MB on my machine across 50 concurrent users. Throughput drops slightly per-user when clients are slow — that's correct behavior. You're trading buffered "speed" for actual stability.

Step 3: Build the React consumer

EventSource doesn't support POST, so we use fetch with manual SSE parsing. Create useStreamingChat.ts:

import { useState, useRef, useCallback } from 'react';

export function useStreamingChat() {
  const [text, setText] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const abortRef = useRef<AbortController | null>(null);

  const send = useCallback(async (prompt: string) => {
    abortRef.current?.abort();
    const controller = new AbortController();
    abortRef.current = controller;
    setText('');
    setIsStreaming(true);

    try {
      const res = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt }),
        signal: controller.signal,
      });
      if (!res.ok || !res.body) throw new Error(`HTTP ${res.status}`);

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n\n');
        buffer = lines.pop() || '';

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;
          const payload = line.slice(6);
          if (payload === '[DONE]') return;
          let parsed;
          try {
            parsed = JSON.parse(payload); // partial or malformed frames won't parse
          } catch {
            continue; // skip them; the tail is kept in `buffer` for the next chunk
          }
          if (parsed.error) throw new Error(parsed.error);
          if (parsed.token) setText((t) => t + parsed.token);
        }
      }
    } finally {
      setIsStreaming(false);
    }
  }, []);

  const cancel = useCallback(() => abortRef.current?.abort(), []);
  return { text, isStreaming, send, cancel };
}

Three details that matter:

  1. The split('\n\n') plus lines.pop() keeps any trailing partial frame in buffer. SSE frames can arrive split across network chunks, and parsing half a frame corrupts tokens.
  2. decoder.decode(value, { stream: true }) handles multi-byte UTF-8 characters that straddle chunk boundaries; without the flag, emoji and non-ASCII text get mangled.
  3. abortRef.current?.abort() at the top cancels any in-flight request before starting a new one, so two streams never interleave into the same text state.
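The frame-splitting logic is worth pulling out and unit-testing on its own, since off-by-one bugs here show up as corrupted tokens. A sketch with a hypothetical splitSseFrames helper mirroring the loop's buffer handling:

```javascript
// Split a rolling buffer into complete SSE frames and keep the partial tail.
function splitSseFrames(buffer) {
  const frames = buffer.split('\n\n');
  const rest = frames.pop() ?? ''; // the last piece is always incomplete (or '')
  return { frames, rest };
}

// One frame arriving split across two network chunks:
let buf = 'data: {"token":"Hel';
let { frames, rest } = splitSseFrames(buf);
console.log(frames.length); // 0: nothing complete yet

buf = rest + 'lo"}\n\ndata: [DONE]\n\n';
({ frames, rest } = splitSseFrames(buf));
console.log(frames); // [ 'data: {"token":"Hello"}', 'data: [DONE]' ]
```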

Step 4: Avoid React's render storm

If your model emits 50 tokens/sec and you call setText on every one, React commits 50 times/sec. Fine for a single message. Painful when you also have a syntax-highlighted markdown renderer downstream.

Two fixes, pick one:

Option A — batch with requestAnimationFrame. Accumulate tokens in a ref, flush once per frame:

const pendingRef = useRef('');
const rafRef = useRef<number | null>(null);

const scheduleFlush = () => {
  if (rafRef.current) return;
  rafRef.current = requestAnimationFrame(() => {
    setText((t) => t + pendingRef.current);
    pendingRef.current = '';
    rafRef.current = null;
  });
};

// in the loop, replace setText(...) with:
pendingRef.current += token;
scheduleFlush();

Option B — render the markdown only on stream completion and show plain text during streaming. Simpler, often better UX. Markdown re-parsing on every token is what actually melts the browser, not React itself.
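Option A's coalescing doesn't depend on React at all. Here's the same idea framework-free, with a timer standing in for requestAnimationFrame so it runs under plain Node (makeBatcher is a hypothetical helper; 16ms approximates one frame; run as an ES module):

```javascript
// Many appends, at most one flush per tick.
function makeBatcher(flush, intervalMs = 16) {
  let pending = '';
  let timer = null;
  return (token) => {
    pending += token;
    if (timer) return; // a flush is already scheduled for this tick
    timer = setTimeout(() => {
      flush(pending);
      pending = '';
      timer = null;
    }, intervalMs);
  };
}

let commits = 0;
let text = '';
const push = makeBatcher((batch) => { commits++; text += batch; }, 10);

// Five rapid-fire tokens, one committed state update.
for (const t of ['Hel', 'lo', ', ', 'wor', 'ld']) push(t);
await new Promise((r) => setTimeout(r, 50));
console.log(commits, JSON.stringify(text)); // 1 "Hello, world"
```

In the React hook, flush would be the setText call, and you'd cancel the pending timer (or animation frame) on unmount.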

Step 5: Verify backpressure works end-to-end

Throttle your network in DevTools to "Slow 3G" and run the streaming chat. In the broken version, the server still happily buffers. In the fixed version, you'll see the OpenAI request itself slow down — that's TCP backpressure traveling all the way from the user's flaky wifi to OpenAI's servers. Beautiful.

To confirm in production, log the writableLength on the response. If it consistently grows above ~64KB per connection, your drain logic isn't connected:

const probe = setInterval(() => {
  console.log('writableLength:', res.writableLength);
}, 1000);
res.on('close', () => clearInterval(probe)); // don't leak the timer

Step 6: Production hardening

A few things that bit us in real deployments:

  1. Proxy buffering: Nginx and most CDNs buffer responses by default. Send X-Accel-Buffering: no and confirm the headers reach the browser unmodified.
  2. Compression middleware: compression() batches small writes and breaks SSE, so exclude streaming routes.
  3. Idle timeouts: load balancers kill quiet connections. Write an SSE comment (a line starting with ':') every ~15 seconds as a heartbeat.
  4. Disconnect detection: confirm req.on('close') actually fires behind your proxy, or abandoned tabs keep burning tokens.

Common errors and how to read them

The stream works locally but stalls in production

99% of the time this is proxy buffering. Check Nginx, your CDN, and any API gateway. Add X-Accel-Buffering: no and verify the response headers actually reach the browser unmodified.

"Maximum call stack size exceeded" or memory still climbing

You probably forgot to await writeWithBackpressure. Without the await, the for-await loop continues at full speed and you're back to the broken version. A MaxListenersExceededWarning mentioning 'drain' is the same smell: listeners are being attached faster than the socket drains.

The connection closes randomly after ~30 seconds

Idle timeout on a load balancer. Add heartbeat comments. AWS ALB defaults to 60s, GCP to 30s, Heroku to 30s.
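A heartbeat is just an SSE comment, any line starting with ':', written on a timer. A runnable sketch using a PassThrough as a stand-in for the Express response (in the real handler you'd write to res and use ~15s, not 10ms):

```javascript
import { PassThrough } from 'node:stream';

// Stand-in for the Express response so the sketch runs without a server.
const res = new PassThrough();
let comments = 0;
res.on('data', (chunk) => {
  if (chunk.toString().startsWith(':')) comments++; // SSE parsers skip these lines
});

// ':' lines are invisible to the client UI but reset the LB's idle timer.
const heartbeat = setInterval(() => res.write(':heartbeat\n\n'), 10);

await new Promise((r) => setTimeout(r, 60)); // let a few heartbeats fire
clearInterval(heartbeat);
res.end();
console.log('heartbeats written:', comments);
```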

Tokens arrive in chunks instead of one-by-one

Either Node is buffering small writes (call res.flushHeaders() early and ensure no compression middleware is wrapping the response — compression() will batch your SSE), or your client is behind a corporate proxy doing the same. Compression and SSE don't mix; exclude streaming routes from your compression middleware.
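One way to carve the streaming route out: compression() accepts a filter option, so a predicate like this keeps the stream uncompressed (shouldCompress is our name, and the '/chat' prefix is an assumption about your routing):

```javascript
// Decide per-request whether compression may run. Streaming routes opt out.
function shouldCompress(req, defaultFilter) {
  if (req.path.startsWith('/chat')) return false; // never compress SSE
  return defaultFilter(req); // fall back to the middleware's normal heuristic
}

// Wiring it up would look roughly like:
//   app.use(compression({
//     filter: (req, res) => shouldCompress(req, () => compression.filter(req, res)),
//   }));

console.log(shouldCompress({ path: '/chat' }, () => true));      // false
console.log(shouldCompress({ path: '/api/users' }, () => true)); // true
```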

Aborting the fetch doesn't stop the OpenAI charge

Your server isn't propagating the abort. Confirm req.on('close') fires and that you're passing the AbortSignal into the OpenAI SDK call. Without that signal, the SDK keeps reading the stream to completion even after the client is gone.

What this approach is bad at

Backpressure-aware SSE is the right default for chat UIs, but it has limits worth naming:

  1. It's one-way. If the client needs to send data mid-stream (tool-call interrupts, collaborative edits, voice), you want WebSockets.
  2. Each open stream holds a server connection for the life of the completion, which suits long-running Node processes, not buffered or short-timeout serverless runtimes.
  3. Because we use fetch instead of EventSource, there's no automatic reconnection or Last-Event-ID replay; resuming a half-finished completion is custom work.

For most SaaS chat features — copilots, support assistants, generation UIs — this pattern is what you want. We've shipped variants of it across AI products at CodeNicely, including healthcare workflows like HealthPotli's drug interaction assistant, where keeping the server stable under unpredictable token rates was non-negotiable.

Frequently Asked Questions

Should I use Server-Sent Events or WebSockets for LLM streaming?

SSE for almost all chat UIs. It's HTTP, it works through every proxy and CDN with minor config, and it has built-in reconnection semantics. Use WebSockets when you need bidirectional mid-stream messages — collaborative editors, voice, agents that accept tool-call interrupts.

Why does my Node server's memory grow even when I'm using streaming?

Because res.write() returns a boolean indicating whether the socket accepted the data, and most tutorials ignore it. When the return is false, Node buffers the bytes in userland and you must wait for the 'drain' event before writing more. Without that wait, slow clients cause unbounded memory growth — which is exactly the bug this post fixes.

Does the Vercel AI SDK handle backpressure correctly?

The Vercel AI SDK uses the Web Streams API, which propagates backpressure correctly when you use it end-to-end. The bug usually appears when teams mix the SDK's StreamingTextResponse with custom Express middleware that doesn't honor backpressure, or when they're on a Node runtime older than 18 where the Web Streams shim is incomplete.
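You can see that pull-based behavior in plain Node (18+, where ReadableStream is global): the producer is only pulled when the slow reader has queue room, so it never runs ahead unbounded. Run as an ES module; the numbers are arbitrary:

```javascript
// A producer that counts how often it's asked for data, with a bounded queue.
let pulls = 0;
const stream = new ReadableStream({
  pull(controller) {
    pulls++;
    if (pulls >= 100) return controller.close();
    controller.enqueue('token ');
  },
}, { highWaterMark: 4 }); // keep at most ~4 chunks queued ahead of the reader

const reader = stream.getReader();
let read = 0;
while (true) {
  const { done } = await reader.read();
  if (done) break;
  read++;
  await new Promise((r) => setTimeout(r, 1)); // a deliberately slow consumer
}
console.log({ pulls, read }); // pull count tracks the reader, not wall-clock speed
```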

How do I test streaming endpoints under realistic load?

Autocannon or k6 with a slow-client simulation. The trick is throttling the consumer side — fast clients won't reproduce the bug. Run a load test where 20% of clients sleep 100ms between reads, and watch RSS over a 5-minute window. If memory grows monotonically, your backpressure is broken.

What's the right way to estimate cost and timeline for adding streaming AI to my product?

It depends on your existing stack, model choice, and how much UX polish (markdown, code highlighting, citations, cancel/regenerate) you need. For a tailored estimate, talk to CodeNicely for a personalized assessment based on your specific architecture and traffic profile.

Found this useful? CodeNicely publishes engineering and product playbooks weekly. Browse the archive or tell us what you're building.