Streaming AI Responses with Server-Sent Events: A Complete Developer's Guide

Why Streaming Matters

Standard HTTP requests are request-response: you send a prompt, the server generates the entire response, then sends it back in one chunk. For a 500-word answer, that could mean 5-15 seconds of staring at a loading spinner.

Streaming flips this model. The server keeps the connection open and sends tokens as they are generated, so the user sees the first words as soon as the model produces them rather than after the whole answer is finished.

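The gain in perceived performance is large. Nielsen Norman Group's response-time guidelines put the limit for an interface feeling "instant" at about 100 ms, and the limit for keeping the user's flow of thought uninterrupted at about 1 second. A streamed first token can land at or near that second threshold; a batch response that takes 5-15 seconds misses both.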

SSE: The Protocol

Server-Sent Events use a dead-simple format over HTTP:

data: {"token": "Hello"}

data: {"token": " world"}

data: {"token": "!"}

data: [DONE]

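Each event is made up of one or more lines prefixed with data:, and a blank line ends the event. The closing [DONE] message is not part of SSE itself; it is a sentinel that OpenAI-compatible APIs send to mark the end of the stream. That's it. No WebSocket handshake, no binary framing, no complex protocol negotiation.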
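The framing rules can be sketched as a small parser. This is a minimal illustration, not a full implementation of the SSE specification: the names iter_sse_data and collect_tokens are made up for this example, only the data field is handled (event, id, and retry are ignored), and the {"token": ...} payload shape is the simplified one from the sample stream above.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            # Lines starting with a colon are comments, often used as keep-alives.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one optional space after the colon is dropped
        if field == "data":
            # Several data lines in one event are joined with newlines.
            data_parts.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.


def collect_tokens(lines: Iterable[str]) -> str:
    """Join the "token" fields of the sample stream, stopping at the [DONE] sentinel."""
    text = []
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":
            break
        text.append(json.loads(payload)["token"])
    return "".join(text)


example = [
    'data: {"token": "Hello"}', "",
    'data: {"token": " world"}', "",
    'data: {"token": "!"}', "",
    "data: [DONE]", "",
]
print(collect_tokens(example))  # prints: Hello world!
```

Treating [DONE] as an application-level sentinel, separate from the event framing, keeps the parser reusable for SSE streams that never send it.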

SSE beats WebSockets for this use case because:

  • One-directional: We only need server-to-client streaming

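  • Built on HTTP: Works through proxies, load balancers, and CDNs, provided they are not configured to buffer the response

  • Auto-reconnect: The browser's EventSource API reconnects automatically on disconnect (a fetch-based reader, like the one later in this guide, has to handle reconnection itself)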

  • Simple parsing: Line-based text protocol, not a binary frame format

Implementing the Client (Python)

Here's a streaming client in Python using the httpx library:

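import json

import httpx


async def stream_completion(prompt: str, api_key: str) -> None:
    """Stream a chat completion and print each token as it arrives."""
    # Allow up to 30 seconds between chunks; httpx's default read timeout is 5 seconds.
    timeout = httpx.Timeout(10.0, read=30.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        async with client.stream(
            "POST",
            "https://api.aiwave.live/v1/chat/completions",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            json={
                "model": "deepseek-chat",
                "messages": [{"role": "user", "content": prompt}],
                "stream": True,
            },
        ) as response:
            # Fail fast on 4xx/5xx instead of parsing an error body as SSE.
            response.raise_for_status()
            async for line in response.aiter_lines():
                if not line.startswith("data:"):
                    continue  # skip blank separators and ":" comment lines
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    break  # end-of-stream sentinel
                chunk = json.loads(payload)
                choices = chunk.get("choices") or []
                # Some chunks carry no choices or a null content field.
                token = choices[0]["delta"].get("content") if choices else None
                if token:
                    print(token, end="", flush=True)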

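The two key pieces are "stream": True in the request body and client.stream() on the HTTP layer, which exposes the response incrementally instead of reading it all into memory first. The API responds with SSE-formatted chunks.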

Building the Frontend (JavaScript)

On the browser side, EventSource can't send a POST body or an Authorization header, so the Fetch API with ReadableStream is your friend:

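// apiKey is passed in explicitly. In production, route requests through your
// own backend rather than shipping a provider key to the browser.
async function streamChat(prompt, apiKey) {
  const response = await fetch('https://api.aiwave.live/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + apiKey,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'deepseek-chat',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }),
  });

  // Surface HTTP errors (401, 429, 5xx) instead of parsing them as a stream.
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }

  const output = document.getElementById('output');
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // { stream: true } keeps multi-byte characters intact across chunk boundaries.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for the next read

    for (const rawLine of lines) {
      const line = rawLine.trim(); // also drops a trailing \r
      if (!line.startsWith('data:')) continue;
      const payload = line.slice(5).trim();
      if (payload === '[DONE]') return; // end-of-stream sentinel
      const chunk = JSON.parse(payload);
      const token = chunk.choices?.[0]?.delta?.content || '';
      if (token) output.textContent += token;
    }
  }
}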

No libraries. No abstractions. Just the Fetch API reading chunks as they arrive.

Handling Errors and Reconnection

Streaming introduces failure modes that batch APIs don't have:

  1. Mid-stream timeouts: The server stops sending tokens but doesn't close the connection.

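  2. Partial chunks: A data: line might be split across network reads, which is why the code above keeps a buffer.

  3. Rate limits and server errors: Most rate limits surface as an HTTP 429 before the stream starts, but a provider can also terminate a response mid-stream, leaving you with a truncated answer.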
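The partial-chunk case is worth seeing in isolation. The sketch below is a Python counterpart to the buffer in the JavaScript client; the class name SSELineBuffer is invented for this example. httpx's aiter_lines() already does this buffering for you, so you would only need something like it when reading raw bytes, for instance from aiter_bytes().

```python
import codecs


class SSELineBuffer:
    """Accumulate raw byte chunks and release only complete text lines."""

    def __init__(self) -> None:
        # An incremental decoder holds back a multi-byte character that was
        # cut in half by a chunk boundary until the remaining bytes arrive.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._pending = ""

    def feed(self, chunk: bytes) -> list:
        """Add a chunk and return every line completed by it."""
        self._pending += self._decoder.decode(chunk)
        # Everything before the last newline is complete; the remainder waits.
        *complete, self._pending = self._pending.split("\n")
        return [line.rstrip("\r") for line in complete]


buf = SSELineBuffer()
print(buf.feed(b'data: {"tok'))            # prints: []
print(buf.feed(b'en": "Hi"}\n\nda'))       # prints: ['data: {"token": "Hi"}', '']
print(buf.feed(b"ta: [DONE]\n"))           # prints: ['data: [DONE]']
```

The empty string in the second result is the blank line that ends the event, so the output can be fed straight into a line-based SSE parser.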

Best practices:

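  • Set a read timeout (e.g., 30 seconds between tokens). If no token arrives within that window, close the connection and retry the request. A chat completion generally can't be resumed mid-stream, so a retry starts the generation over.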

  • Buffer aggressively. Never assume a single TCP packet contains a complete SSE event.

  • Implement exponential backoff for reconnection attempts.
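The timeout and backoff advice can be sketched with plain asyncio. This is one possible shape under stated assumptions, not a prescribed implementation: with_idle_timeout, backoff_delays, and run_with_retries are hypothetical helper names, the default delays are arbitrary, and which exceptions count as retryable depends on your HTTP client.

```python
import asyncio
import random
from typing import AsyncIterator, Awaitable, Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


async def with_idle_timeout(stream: AsyncIterator[T], idle_seconds: float) -> AsyncIterator[T]:
    """Re-yield items from stream; raise asyncio.TimeoutError if one takes too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            # The timeout applies to each item separately, not to the whole stream.
            item = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return
        yield item


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 5) -> Iterator[float]:
    """Yield exponentially growing delays with full jitter, capped at cap seconds."""
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        yield random.uniform(0, ceiling)


async def run_with_retries(attempt: Callable[[], Awaitable[T]], delays: Iterable[float],
                           retry_on=(asyncio.TimeoutError, ConnectionError)) -> T:
    """Call attempt once, then once more after each delay, until it succeeds."""
    last_error = None
    for delay in [0.0, *delays]:
        if delay:
            await asyncio.sleep(delay)
        try:
            return await attempt()
        except retry_on as exc:
            last_error = exc
    raise last_error
```

A typical use would wrap the token iterator in with_idle_timeout(tokens, 30) inside the function passed to run_with_retries, so a stalled stream raises, the connection is closed, and the request is reissued after a jittered delay. The jitter keeps many clients from retrying in lockstep after a shared outage.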

Cost Optimization Through Streaming

Streaming doesn't just improve UX; it can also save money.

With models like DeepSeek V4 Pro, you're billed per token. When you stream, you can implement early stopping: if the first 50 tokens make it clear the response is going off-track, you can close the connection and retry with a better prompt. This only saves money if the provider stops generating, and billing, when the client disconnects, so confirm that behavior before relying on it.

This is especially useful for AIWave users accessing DeepSeek's API. Instead of paying for a full 500-token hallucination, you cut it off at 50 tokens, adjust your prompt, and retry. You still pay for the prompt tokens on both attempts, but over thousands of requests the output tokens you avoid add up to significant savings.
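Early stopping can be sketched as a loop that inspects the text once a fixed number of tokens has arrived. The function name collect_with_early_stop and the looks_off_track callback are hypothetical; what counts as "off-track" is application-specific, and the 50-token checkpoint simply mirrors the figure used above.

```python
from typing import Callable, Iterable, Tuple


def collect_with_early_stop(tokens: Iterable[str],
                            looks_off_track: Callable[[str], bool],
                            check_after: int = 50) -> Tuple[str, bool]:
    """Consume tokens and return (text, stopped_early)."""
    parts = []
    for count, token in enumerate(tokens, start=1):
        parts.append(token)
        # Inspect the partial answer once, at the checkpoint.
        if count == check_after and looks_off_track("".join(parts)):
            # Returning here stops pulling from the stream; no further
            # tokens are requested from the underlying iterator.
            return "".join(parts), True
    return "".join(parts), False


def fake_stream():
    """Stand-in for a 500-token response that opens with a refusal."""
    yield "I cannot "
    for i in range(499):
        yield f"word{i} "


text, stopped = collect_with_early_stop(fake_stream(), lambda s: s.startswith("I cannot"))
print(stopped)  # prints: True
```

In the async httpx client shown earlier, the equivalent move is to break out of the aiter_lines() loop, which leaves the client.stream() context and closes the connection.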

A Note on Edge Functions

If you're deploying your streaming endpoint on Cloudflare Workers, Vercel Edge Functions, or Deno Deploy, make sure your runtime supports streaming responses. All three do, but the patterns differ:

  • Cloudflare Workers: Return a Response with a ReadableStream body

  • Vercel Edge: Use the runtime: 'edge' export and return a streaming Response

  • Deno Deploy: Native Response with ReadableStream

The key gotcha: don't buffer the entire upstream response before returning. Pipe the upstream stream directly as your response body.
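The difference between buffering and piping is easiest to see side by side. The edge runtimes listed above are JavaScript environments, so the sketch below is only a language-neutral illustration of the principle, written in Python to match the client example; relay, buffered, and the fake upstream are all hypothetical names.

```python
import asyncio
from typing import AsyncIterator, List


async def buffered(upstream: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Anti-pattern: read the whole upstream response before sending anything."""
    chunks = [chunk async for chunk in upstream]
    for chunk in chunks:
        yield chunk


async def relay(upstream: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Preferred: forward each chunk to the client as soon as it arrives."""
    async for chunk in upstream:
        yield chunk


async def trace(proxy) -> List[str]:
    """Record the order in which chunks are produced upstream and delivered downstream."""
    log: List[str] = []

    async def fake_upstream() -> AsyncIterator[bytes]:
        for n in (1, 2, 3):
            log.append(f"produced {n}")
            yield f"data: {n}\n\n".encode()

    async for _ in proxy(fake_upstream()):
        log.append("delivered")
    return log


print(asyncio.run(trace(relay)))
# prints: ['produced 1', 'delivered', 'produced 2', 'delivered', 'produced 3', 'delivered']
print(asyncio.run(trace(buffered)))
# prints: ['produced 1', 'produced 2', 'produced 3', 'delivered', 'delivered', 'delivered']
```

With relay, the first delivery happens right after the first chunk is produced. With buffered, the client sees nothing until the upstream has finished, which brings back the loading spinner that streaming was meant to remove.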

Conclusion

SSE streaming is the difference between an AI app that feels alive and one that feels like it's thinking too hard. The protocol is simple, the implementation is straightforward, and the UX improvement is immediate.

Start with the code snippets above, wire them up to your favorite LLM provider, and you'll have a streaming chat interface running in under an hour. Your users will notice.