Streaming AI Responses with Server-Sent Events: A Complete Developer's Guide
Why Streaming Matters
Standard HTTP requests are request-response: you send a prompt, the server generates the entire response, then sends it back in one chunk. For a 500-word answer, that could mean 5-15 seconds of staring at a loading spinner.
Streaming flips this model. The server keeps the connection open and sends tokens as they're generated. The user sees the first word in a fraction of a second instead of waiting for the whole answer.
The perceived-performance gain is massive. Nielsen Norman Group's response-time guidelines put the limit for an interface feeling "instant" at about 100 ms. Streaming gets the first token far closer to that threshold than a batch response, which can't show anything until generation finishes.
SSE: The Protocol
Server-Sent Events use a dead-simple format over HTTP:
data: {"token": "Hello"}

data: {"token": " world"}

data: {"token": "!"}

data: [DONE]
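Each event is a data: line followed by a blank line (two consecutive newlines). The final [DONE] message is not part of SSE itself; it is a sentinel that OpenAI-compatible APIs send to mark the end of the stream. That's it. No WebSocket handshake, no binary framing, no complex protocol negotiation.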
SSE beats WebSockets for this use case because:
One-directional: We only need server-to-client streaming
Built on HTTP: Works through proxies, load balancers, and CDNs
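Auto-reconnect: The browser's EventSource API reconnects automatically on disconnect (it only issues GET requests, so the fetch-based POST client later in this guide has to handle reconnection itself)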
Simple parsing: Line-based text protocol, not a binary frame format
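To make the "simple parsing" point concrete, here is a minimal sketch of an SSE parser in Python. The function name iter_sse_data is made up for this example, and it only handles data: fields and comment lines; a full parser would also deal with event:, id:, and retry: fields.

```python
import json
from typing import Iterator, List


def iter_sse_data(text: str) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE text stream."""
    # SSE allows \r\n, \n, or \r as line endings; normalize to \n first.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    data_lines: List[str] = []
    for line in lines:
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event that was never terminated by a blank line is dropped.


sample = 'data: {"token": "Hello"}\n\ndata: {"token": " world"}\n\ndata: [DONE]\n\n'
tokens = [json.loads(p)["token"] for p in iter_sse_data(sample) if p != "[DONE]"]
print("".join(tokens))  # Hello world
```

Multiple data: lines inside one event are joined with newlines, which is why the parser collects them in a list before yielding.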
Implementing the Client (Python)
Here's a streaming client in Python using the httpx library:
import httpx
import json

async def stream_completion(prompt: str, api_key: str):
    """Print tokens from a streaming chat completion as they arrive."""
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "https://api.aiwave.live/v1/chat/completions",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
            json={
                "model": "deepseek-chat",
                "messages": [{"role": "user", "content": prompt}],
                "stream": True,
            },
        ) as response:
            # Fail fast on 4xx/5xx instead of trying to parse an error body as SSE
            response.raise_for_status()
            async for line in response.aiter_lines():
                # Skip blank separator lines and the end-of-stream sentinel
                if line.startswith("data: ") and line != "data: [DONE]":
                    chunk = json.loads(line[6:])  # strip the "data: " prefix
                    token = chunk["choices"][0]["delta"].get("content", "")
                    if token:
                        print(token, end="", flush=True)
The key is "stream": True in the request body and client.stream() on the HTTP layer, which exposes the response body incrementally instead of waiting for it to finish. The API responds with SSE-formatted chunks.
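If you want the parsing step to be testable without a network call, one option is to pull it into a small pure function. The sketch below is an illustration, not part of any library: extract_token is a hypothetical helper, and it assumes the OpenAI-style chunk shape used above (choices[0].delta.content). It also tolerates chunks with an empty choices list or a null content field, which some OpenAI-compatible APIs may send.

```python
import json
from typing import Optional


def extract_token(line: str) -> Optional[str]:
    """Return the text carried by one SSE line, or None if it carries none."""
    if not line.startswith("data:"):
        return None  # blank separators, comments, other fields
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None  # e.g. a chunk that only carries metadata
    # The first chunk often has only a role and no content.
    return choices[0].get("delta", {}).get("content") or None
```

Inside the loop above, the body would then shrink to a call to extract_token(line) followed by a print when the result is not None.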
Building the Frontend (JavaScript)
On the browser side, the Fetch API with ReadableStream is your friend:
async function streamChat(prompt, apiKey) {
  const response = await fetch('https://api.aiwave.live/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + apiKey,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'deepseek-chat',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }),
  });
  // Surface HTTP errors instead of parsing an error body as SSE
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte characters split across chunks intact
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for the next read
    for (const rawLine of lines) {
      const line = rawLine.trimEnd(); // tolerate \r\n line endings
      if (line.startsWith('data: ') && line !== 'data: [DONE]') {
        const chunk = JSON.parse(line.slice(6)); // strip the "data: " prefix
        const token = chunk.choices[0]?.delta?.content || '';
        if (token) {
          document.getElementById('output').textContent += token;
        }
      }
    }
  }
}
No libraries. No abstractions. Just the Fetch API reading chunks as they arrive.
Handling Errors and Reconnection
Streaming introduces failure modes that batch APIs don't have:
Mid-stream timeouts: The server stops sending tokens but doesn't close the connection.
Partial chunks: A data: line might be split across TCP packets, hence the buffer in the code above.
Rate limits: You might hit a rate limit mid-stream, cutting off the response.
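The partial-chunk problem is easy to see in isolation. The sketch below is a Python counterpart to the JavaScript buffer, assuming you read raw bytes yourself; SSELineBuffer is a name invented for this example. In the httpx client shown earlier, aiter_lines() already does this reassembly for you.

```python
import codecs
from typing import List


class SSELineBuffer:
    """Reassemble complete lines from arbitrarily split byte chunks."""

    def __init__(self) -> None:
        # An incremental decoder holds back bytes of a multi-byte character
        # that was cut in half by a chunk boundary.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._pending = ""

    def feed(self, chunk: bytes) -> List[str]:
        """Add a chunk and return every line completed so far."""
        self._pending += self._decoder.decode(chunk)
        # Everything before the last "\n" is complete; the rest waits.
        *complete, self._pending = self._pending.split("\n")
        return [line.rstrip("\r") for line in complete]


buf = SSELineBuffer()
print(buf.feed(b'data: {"tok'))         # []
print(buf.feed(b'en": "Hi"}\n\nda'))    # ['data: {"token": "Hi"}', '']
print(buf.feed(b'ta: [DONE]\n'))        # ['data: [DONE]']
```

The empty string in the second result is the blank line that terminates the event.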
Best practices:
Set a read timeout (e.g., 30 seconds between tokens). If no token arrives within that window, close and reconnect.
Buffer aggressively. Never assume a single TCP packet contains a complete SSE event.
Implement exponential backoff for reconnection attempts.
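The first and third practices can be combined in a few lines of asyncio. This is a sketch under stated assumptions rather than a drop-in implementation: with_idle_timeout, backoff_delay, and stream_with_retry are hypothetical helpers, open_stream stands for any function that starts a fresh request and returns an async iterator of tokens, and the exception list is a guess (with httpx you would probably also catch httpx.TransportError). A retry restarts the whole response, so on_retry gives the caller a chance to clear partial output.

```python
import asyncio
import random
from typing import AsyncIterator, Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound in seconds on the wait after failed attempt `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


async def with_idle_timeout(
    stream: AsyncIterator[str], idle_seconds: float
) -> AsyncIterator[str]:
    """Re-yield items; raise asyncio.TimeoutError if the gap between items is too long."""
    iterator = stream.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return  # upstream finished normally
        yield item


async def stream_with_retry(
    open_stream: Callable[[], AsyncIterator[str]],
    on_token: Callable[[str], None],
    on_retry: Callable[[], None],
    idle_seconds: float = 30.0,
    max_attempts: int = 5,
    base_delay: float = 0.5,
) -> None:
    """Stream tokens to on_token, restarting with backoff after a stall or drop."""
    for attempt in range(max_attempts):
        try:
            async for token in with_idle_timeout(open_stream(), idle_seconds):
                on_token(token)
            return
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide
            on_retry()
            # Random jitter below the exponential bound avoids synchronized retries.
            await asyncio.sleep(random.uniform(0, backoff_delay(attempt, base_delay)))
```

With the defaults, the wait before each retry is capped at 0.5, 1, 2, and 4 seconds, and never exceeds 30 seconds however many attempts you allow.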
Cost Optimization Through Streaming
Streaming doesn't just improve UX — it can save money.
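With models like DeepSeek V4 Pro, you're billed per token. When you stream, you can implement early stopping: if the first 50 tokens make it clear the response is going off-track, you can close the connection and retry with a better prompt. You still pay for the prompt and for whatever was generated before you cancelled, and the saving assumes the provider stops generating once the client disconnects.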
This is especially powerful for AIWave users accessing DeepSeek's API. Instead of paying for a full 500-token hallucination, you cut it off at 50 tokens, adjust your prompt, and retry. Over thousands of requests, this adds up to significant savings.
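Here is roughly what early stopping looks like in code. The sketch is deliberately network-free: consume_with_early_stop and looks_off_track are hypothetical names, the token source is any iterable of strings, and the check itself (a keyword test, a classifier, a regex) is up to you. Streamed chunks do not always map one-to-one to billed tokens, so treat the 50 as an approximation.

```python
from typing import Callable, Iterable, List, Tuple


def consume_with_early_stop(
    tokens: Iterable[str],
    looks_off_track: Callable[[str], bool],
    check_after: int = 50,
) -> Tuple[str, bool]:
    """Read tokens until the stream ends, or abort once the opening fails the check.

    Returns (text_received, completed).
    """
    received: List[str] = []
    for token in tokens:
        received.append(token)
        # Run the check once, when exactly `check_after` tokens have arrived.
        if len(received) == check_after and looks_off_track("".join(received)):
            return "".join(received), False  # stop pulling from the stream
    return "".join(received), True
```

In the httpx client from earlier, returning out of the async with client.stream(...) block at this point closes the connection, which is what ends the request on the client side.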
A Note on Edge Functions
If you're deploying your streaming endpoint on Cloudflare Workers, Vercel Edge Functions, or Deno Deploy, make sure your runtime supports streaming responses. All three do, but the patterns differ:
Cloudflare Workers: Return a Response with a ReadableStream body
Vercel Edge: Use the runtime: 'edge' export and return a streaming Response
Deno Deploy: Native Response with ReadableStream
The key gotcha: don't buffer the entire upstream response before returning. Pipe the upstream stream directly as your response body.
Conclusion
SSE streaming is the difference between an AI app that feels alive and one that feels like it's thinking too hard. The protocol is simple, the implementation is straightforward, and the UX improvement is immediate.
Start with the code snippets above, wire them up to your favorite LLM provider, and you'll have a streaming chat interface running in under an hour. Your users will notice.
