04 - Streaming Responses
Why Streaming
In the previous lesson, we set "stream": false and waited for the entire response. That works, but the user stares at nothing while the model generates. A 500-token response at 30 tokens/second takes 16 seconds of silence.
Streaming sends tokens as they're generated. The user sees words appearing in real time. Every chat interface you've used does this.
Without streaming:
[16 seconds of nothing] → "Go excels at building concurrent..."
With streaming:
"Go" → " excels" → " at" → " building" → " concurrent" → ...
(each token appears instantly)

How Streaming Works
When you set "stream": true, Ollama doesn't return a single JSON response. It returns a stream of newline-delimited JSON objects, one per token. Each chunk contains a partial message.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is Go?"}],
  "stream": true
}'

Each line in the response is a JSON object with one token:
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":","},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" also"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" known"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" as"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":" G"},"done":false}
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":"olang"},"done":false}
...
{"model":"llama3.2","created_at":"...","message":{"role":"assistant","content":"."},"done":true}Each line is a complete JSON object with the model name, a timestamp, and a message containing one token. The done field tells you when the model is finished.
Reading a Stream in Go
Use bufio.Scanner to read the response line by line. Parse each line as JSON, extract the content, and print it immediately.
package main

import (
    "bufio"
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

type Message struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

type ChatRequest struct {
    Model    string    `json:"model"`
    Messages []Message `json:"messages"`
    Stream   bool      `json:"stream"`
}

type StreamChunk struct {
    Message Message `json:"message"`
    Done    bool    `json:"done"`
}

func main() {
    req := ChatRequest{
        Model:    "llama3.2",
        Messages: []Message{{Role: "user", Content: "Explain goroutines in 3 sentences."}},
        Stream:   true,
    }
    body, _ := json.Marshal(req)

    resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()

    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk StreamChunk
        if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
            continue // skip anything that isn't a valid JSON chunk
        }
        fmt.Print(chunk.Message.Content) // no newline, tokens flow continuously
        if chunk.Done {
            fmt.Println() // newline at the end
        }
    }
    if err := scanner.Err(); err != nil {
        fmt.Println("Stream error:", err)
    }
}

Run this and you'll see tokens appear one at a time in your terminal, exactly like a chat interface.
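A small caveat if you reuse this loop elsewhere: bufio.Scanner rejects lines longer than 64 KB by default (bufio.MaxScanTokenSize). Ollama's per-token chunks are tiny, so the default is fine here, but for APIs that send much larger JSON lines you can raise the limit before the read loop. A minimal drop-in replacement for the scanner setup above:

scanner := bufio.NewScanner(resp.Body)
// Start with a 64 KB buffer and allow individual lines up to 1 MB.
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)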
Collecting the Full Response
Streaming is great for display, but you often need the complete text afterward, for logging, storing in conversation history, or processing further.
// chatStream streams the reply to the terminal and also returns the full text.
// Note: strings.Builder requires adding "strings" to the import block above.
func chatStream(messages []Message) (string, error) {
    req := ChatRequest{
        Model:    "llama3.2",
        Messages: messages,
        Stream:   true,
    }
    body, _ := json.Marshal(req)

    resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var full strings.Builder
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        var chunk StreamChunk
        if err := json.Unmarshal(scanner.Bytes(), &chunk); err != nil {
            continue
        }
        fmt.Print(chunk.Message.Content)        // stream to terminal
        full.WriteString(chunk.Message.Content) // collect full text
        if chunk.Done {
            fmt.Println()
        }
    }
    return full.String(), scanner.Err()
}

Now you have both: real-time display AND the complete response.
reply, err := chatStream([]Message{
    {Role: "user", Content: "What are channels?"},
})
if err != nil {
    fmt.Println("Error:", err)
    return
}
fmt.Printf("\n--- Full response (%d chars) ---\n%s\n", len(reply), reply)

Using Channels for Clean Separation
In a real application, you don't want the HTTP reader and the display logic tangled together. Use a Go channel to separate them.
type StreamEvent struct {
    Text string
    Done bool
    Err  error
}

func chatStreamChan(messages []Message) <-chan StreamEvent {
    ch := make(chan StreamEvent, 16) // buffered so the reader doesn't block waiting on the consumer
    go func() {
        defer close(ch)

        req := ChatRequest{Model: "llama3.2", Messages: messages, Stream: true}
        body, _ := json.Marshal(req)
        resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
        if err != nil {
            ch <- StreamEvent{Err: err}
            return
        }
        defer resp.Body.Close()

        scanner := bufio.NewScanner(resp.Body)
        for scanner.Scan() {
            var chunk StreamChunk
            if json.Unmarshal(scanner.Bytes(), &chunk) != nil {
                continue
            }
            ch <- StreamEvent{Text: chunk.Message.Content, Done: chunk.Done}
            if chunk.Done {
                return
            }
        }
        if err := scanner.Err(); err != nil {
            ch <- StreamEvent{Err: err}
        }
    }()
    return ch
}

The consumer stays clean and simple:
events := chatStreamChan([]Message{
    {Role: "user", Content: "What is a mutex?"},
})

var full strings.Builder
for ev := range events {
    if ev.Err != nil {
        fmt.Println("Error:", ev.Err)
        break
    }
    fmt.Print(ev.Text)
    full.WriteString(ev.Text)
}
fmt.Println()
// full.String() has the complete response

The goroutine reads from the network. The main goroutine consumes events. Neither knows about the other's implementation. This is the same pattern used in production tools.
OpenAI Streaming Format
OpenAI uses Server-Sent Events (SSE) instead of newline-delimited JSON. Each line starts with data: and the stream ends with data: [DONE].
data: {"choices":[{"delta":{"content":"Go"}}]}
data: {"choices":[{"delta":{"content":" is"}}]}
data: {"choices":[{"delta":{"content":" great"}}]}
data: [DONE]The concept is identical. The wire format differs slightly. Ollama's format is simpler to parse.
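If you ever need to read the OpenAI-style stream from Go, the loop looks almost the same: strip the data: prefix, stop at [DONE], and unmarshal the rest of the line. Here's a minimal sketch; readSSE and sseChunk are illustrative names (not part of any SDK), and it assumes "io" and "strings" are imported alongside the packages used earlier.

type sseChunk struct {
    Choices []struct {
        Delta struct {
            Content string `json:"content"`
        } `json:"delta"`
    } `json:"choices"`
}

func readSSE(body io.Reader) (string, error) {
    var full strings.Builder
    scanner := bufio.NewScanner(body)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        if !strings.HasPrefix(line, "data:") {
            continue // SSE streams also contain blank keep-alive lines
        }
        payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
        if payload == "[DONE]" {
            break // OpenAI's end-of-stream sentinel
        }
        var chunk sseChunk
        if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
            continue
        }
        if len(chunk.Choices) > 0 {
            fmt.Print(chunk.Choices[0].Delta.Content)
            full.WriteString(chunk.Choices[0].Delta.Content)
        }
    }
    fmt.Println()
    return full.String(), scanner.Err()
}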
Key Takeaways
- Streaming sends tokens as they're generated instead of waiting for the full response
- Ollama streams newline-delimited JSON with a done field marking the end
- Use bufio.Scanner to read the stream line by line
- Collect tokens in a strings.Builder when you need the full response afterward
- Go channels cleanly separate the HTTP reader from the consumer
- OpenAI uses SSE (data: ...) instead of newline-delimited JSON, but the concept is the same