04 - Streaming Responses
Why Streaming
In the previous lesson, we set "stream": false and waited for the entire response. That works, but the user stares at nothing while the model generates. A 500-token response at 30 tokens/second takes 16 seconds of silence.
Streaming sends tokens as they're generated. The user sees words appearing in real time. Every chat interface you've used does this.
Without streaming:
[16 seconds of nothing] → "Go excels at building concurrent..."
With streaming:
"Go" → " excels" → " at" → " building" → " concurrent" → ...
(each token appears instantly)

How Streaming Works
When you set "stream": true, Ollama doesn't return a single JSON response. It returns a stream of newline-delimited JSON objects, one per token. Each chunk contains a partial message.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is Go?"}],
  "stream": true
}'

Each line in the response is a JSON object with one token:
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":","},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" also"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" known"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" as"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" G"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":"olang"},"done":false}
...
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":"."},"done":true}Each line is a complete JSON object with the model name, a timestamp, and a message containing one token. The done field tells you when the model is finished.
Reading a Stream in Go
Use bufio.Scanner to read the response line by line. Parse each line as JSON, extract the content, and print it immediately.
package main

import (
    "bufio"
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

type Message struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

type ChatRequest struct {
    Model    string    `json:"model"`
    Messages []Message `json:"messages"`
    Stream   bool      `json:"stream"`
}

type StreamChunk struct {
    Message Message `json:"message"`
    Done    bool    `json:"done"`
}

func main() {
    req := ChatRequest{
        Model:    "llama3.2",
        Messages: []Message{{Role: "user", Content: "Explain goroutines in 3 sentences."}},
        Stream:   true,
    }
    body, _ := json.Marshal(req)

    resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()

    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk StreamChunk
        if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
            continue // skip anything that isn't a valid JSON chunk
        }
        fmt.Print(chunk.Message.Content) // no newline, tokens flow continuously
        if chunk.Done {
            fmt.Println() // newline at the end
        }
    }
    if err := scanner.Err(); err != nil {
        fmt.Println("Stream error:", err)
    }
}

Run this and you'll see tokens appear one at a time in your terminal, exactly like a chat interface.
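A small caveat if you reuse this loop elsewhere: bufio.Scanner rejects lines longer than 64 KB by default (bufio.MaxScanTokenSize). Ollama's per-token chunks are tiny, so the default is fine here, but for APIs that send much larger JSON lines you can raise the limit before the read loop. A minimal drop-in replacement for the scanner setup above:

scanner := bufio.NewScanner(resp.Body)
// Start with a 64 KB buffer and allow individual lines up to 1 MB.
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)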
Collecting the Full Response
Streaming is great for display, but you often need the complete text afterward, for logging, storing in conversation history, or processing further.
// chatStream streams the reply to the terminal and also returns the full text.
// Note: strings.Builder requires adding "strings" to the import block above.
func chatStream(messages []Message) (string, error) {
    req := ChatRequest{
        Model:    "llama3.2",
        Messages: messages,
        Stream:   true,
    }
    body, _ := json.Marshal(req)

    resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var full strings.Builder
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk StreamChunk
        if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
            continue
        }
        fmt.Print(chunk.Message.Content)        // stream to terminal
        full.WriteString(chunk.Message.Content) // collect full text
        if chunk.Done {
            fmt.Println()
        }
    }
    return full.String(), scanner.Err()
}

Now you have both: real-time display AND the complete response.
reply, err := chatStream([]Message{
    {Role: "user", Content: "What are channels?"},
})
if err != nil {
    fmt.Println("Error:", err)
    return
}
fmt.Printf("\n--- Full response (%d chars) ---\n%s\n", len(reply), reply)

Using Channels for Clean Separation
In a real application, you don't want the HTTP reader and the display logic tangled together. Use a Go channel to separate them.
type StreamEvent struct {
    Text string
    Done bool
    Err  error
}

func chatStreamChan(messages []Message) <-chan StreamEvent {
    ch := make(chan StreamEvent, 16) // buffered so the reader doesn't block waiting on the consumer
    go func() {
        defer close(ch)

        req := ChatRequest{Model: "llama3.2", Messages: messages, Stream: true}
        body, _ := json.Marshal(req)
        resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
        if err != nil {
            ch <- StreamEvent{Err: err}
            return
        }
        defer resp.Body.Close()

        scanner := bufio.NewScanner(resp.Body)
        for scanner.Scan() {
            var chunk StreamChunk
            if json.Unmarshal(scanner.Bytes(), &chunk) != nil {
                continue
            }
            ch <- StreamEvent{Text: chunk.Message.Content, Done: chunk.Done}
            if chunk.Done {
                return
            }
        }
        if err := scanner.Err(); err != nil {
            ch <- StreamEvent{Err: err}
        }
    }()
    return ch
}

The consumer stays clean and simple:
events := chatStreamChan([]Message{
    {Role: "user", Content: "What is a mutex?"},
})

var full strings.Builder
for ev := range events {
    if ev.Err != nil {
        fmt.Println("Error:", ev.Err)
        break
    }
    fmt.Print(ev.Text)
    full.WriteString(ev.Text)
}
fmt.Println()
// full.String() has the complete response

The goroutine reads from the network. The main goroutine consumes events. Neither knows about the other's implementation. This is the same pattern used in production tools.
OpenAI Streaming Format
OpenAI uses Server-Sent Events (SSE) instead of newline-delimited JSON. Each line starts with data: and the stream ends with data: [DONE].
data: {"choices":[{"delta":{"content":"Go"}}]}
data: {"choices":[{"delta":{"content":" is"}}]}
data: {"choices":[{"delta":{"content":" great"}}]}
data: [DONE]The concept is identical. The wire format differs slightly. Ollama's format is simpler to parse.
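If you ever need to read the OpenAI-style stream from Go, the loop looks almost the same: strip the data: prefix, stop at [DONE], and unmarshal the rest of the line. Here's a minimal sketch; readSSE and sseChunk are illustrative names (not part of any SDK), and it assumes "io" and "strings" are imported alongside the packages used earlier.

type sseChunk struct {
    Choices []struct {
        Delta struct {
            Content string `json:"content"`
        } `json:"delta"`
    } `json:"choices"`
}

func readSSE(body io.Reader) (string, error) {
    var full strings.Builder
    scanner := bufio.NewScanner(body)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if !strings.HasPrefix(line, "data:") {
            continue // SSE streams also contain blank keep-alive lines
        }
        payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
        if payload == "[DONE]" {
            break // OpenAI's end-of-stream sentinel
        }
        var chunk sseChunk
        if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
            continue
        }
        if len(chunk.Choices) > 0 {
            fmt.Print(chunk.Choices[0].Delta.Content)
            full.WriteString(chunk.Choices[0].Delta.Content)
        }
    }
    fmt.Println()
    return full.String(), scanner.Err()
}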
Key Takeaways
- Streaming sends tokens as they're generated instead of waiting for the full response
- Ollama streams newline-delimited JSON with a done field marking the end
- Use bufio.Scanner to read the stream line by line
- Collect tokens in a strings.Builder when you need the full response afterward
- Go channels cleanly separate the HTTP reader from the consumer
- OpenAI uses SSE (data: ...) instead of newline-delimited JSON, but the concept is the same