08 - RAG from Scratch
What Is RAG
Ask an LLM about your company's refund policy and it has no idea. It wasn't trained on your data. You could fine-tune a model on your documents, but that's expensive and slow.
RAG (Retrieval-Augmented Generation) is the shortcut. Before asking the model, search your documents for relevant content and paste it into the prompt. The model reads your content and answers from it.
Without RAG:
User: "What's our refund policy?"
Model: "I don't have that information."
With RAG:
1. Search your docs for "refund policy"
2. Find the relevant paragraph
3. Paste it into the prompt
4. Model: "You can request a full refund
   within 14 days of purchase."

The model didn't learn your refund policy. It read it just now, in the prompt you sent.
The Pipeline
RAG has six steps. This lesson builds each one.
Load → Chunk → Embed → Store → Retrieve → Generate

- Load: Read your documents.
- Chunk: Split them into small pieces.
- Embed: Convert each piece into a vector (lesson 07).
- Store: Keep the vectors in memory (or a database).
- Retrieve: When a user asks a question, find the closest chunks.
- Generate: Send those chunks + the question to the LLM.
Step 1: Chunk
You can't embed an entire document at once. Embedding models work best on short, focused text. Split your documents into chunks of 500-800 characters, breaking at paragraph boundaries so you don't cut sentences in half.
func chunk(text string, size int) []string {
    paragraphs := strings.Split(text, "\n\n")
    var chunks []string
    var current string
    for _, p := range paragraphs {
        p = strings.TrimSpace(p)
        if p == "" {
            continue
        }
        // Adding this paragraph would exceed the limit: close the current chunk.
        if len(current)+len(p) > size && current != "" {
            chunks = append(chunks, current)
            current = ""
        }
        if current != "" {
            current += "\n\n"
        }
        current += p
    }
    if current != "" {
        chunks = append(chunks, current)
    }
    return chunks
}

Split on double newlines, group paragraphs until you hit the size limit, then start a new chunk. That's it. A more robust version is in the examples section.
Use it on a file:
content, _ := os.ReadFile("docs/refund-policy.md")
parts := chunk(string(content), 600)
fmt.Printf("%d chunks\n", len(parts))
// 4 chunks

Why 600? Small enough that each chunk is about one topic. Large enough to have useful context. There's no magic number. 500-800 works for most cases.
Step 2: Embed and Store
Convert each chunk into a vector using the embed() function from lesson 07. Store the text, the source filename, and the vector together.
type Chunk struct {
    Text      string
    Source    string
    Embedding []float64
}

func indexDocuments(dir string) ([]Chunk, error) {
    files, err := os.ReadDir(dir)
    if err != nil {
        return nil, err
    }
    var chunks []Chunk
    for _, f := range files {
        if !strings.HasSuffix(f.Name(), ".md") {
            continue
        }
        content, err := os.ReadFile(filepath.Join(dir, f.Name()))
        if err != nil {
            return nil, err
        }
        parts := chunk(string(content), 600)
        for _, part := range parts {
            vec, err := embed(part)
            if err != nil {
                return nil, fmt.Errorf("embed failed: %w", err)
            }
            chunks = append(chunks, Chunk{
                Text:      part,
                Source:    f.Name(),
                Embedding: vec,
            })
        }
    }
    return chunks, nil
}

This reads every .md file in a directory, chunks each one, embeds each chunk, and returns the whole collection. For 20 documents, this takes a few seconds. For 10,000, you'd want a database.
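indexDocuments leans on embed() from lesson 07. If you don't have that lesson handy, here's a minimal sketch against OpenAI's embeddings endpoint; the provider and model name are assumptions, and lesson 07's version may differ. It needs the bytes, encoding/json, net/http, and os packages, plus an OPENAI_API_KEY environment variable.

// Minimal sketch of embed() using the OpenAI embeddings API.
// Provider and model are assumptions; swap in whatever lesson 07 used.
func embed(text string) ([]float64, error) {
    body, _ := json.Marshal(map[string]string{
        "model": "text-embedding-3-small",
        "input": text,
    })
    req, err := http.NewRequest("POST", "https://api.openai.com/v1/embeddings", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var out struct {
        Data []struct {
            Embedding []float64 `json:"embedding"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    if len(out.Data) == 0 {
        return nil, fmt.Errorf("no embedding returned")
    }
    return out.Data[0].Embedding, nil
}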
Step 3: Retrieve
Same idea as the vector search in lesson 07. Embed the user's question, compare it to every chunk, return the top matches.
func retrieve(docs []Chunk, query string, topK int) []Chunk {
    queryVec, _ := embed(query)
    type scored struct {
        chunk Chunk
        score float64
    }
    // Score every chunk against the query.
    results := make([]scored, len(docs))
    for i, doc := range docs {
        results[i] = scored{doc, cosineSimilarity(queryVec, doc.Embedding)}
    }
    // Sort by similarity, highest first.
    sort.Slice(results, func(i, j int) bool {
        return results[i].score > results[j].score
    })
    top := make([]Chunk, 0, topK)
    for i := 0; i < topK && i < len(results); i++ {
        top = append(top, results[i].chunk)
    }
    return top
}

A topK of 3-5 is a good starting point. Too few and you might miss the answer. Too many and you waste tokens and risk confusing the model.
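retrieve() calls cosineSimilarity() from lesson 07. For reference, the standard implementation is a few lines; this sketch assumes both vectors have the same length and needs the math package.

// Cosine similarity: dot product divided by the product of magnitudes.
// Returns a value in [-1, 1]; higher means more similar.
func cosineSimilarity(a, b []float64) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}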
Step 4: Build the Prompt
This is where RAG happens. Take the retrieved chunks and put them in the prompt. Number them so the model can cite sources. Tell the model to only use the provided context.
func buildRAGPrompt(question string, chunks []Chunk) []Message {
    var context strings.Builder
    for i, c := range chunks {
        // Number each chunk so the model can reference [1], [2], etc.
        fmt.Fprintf(&context, "[%d] (%s)\n%s\n\n", i+1, c.Source, c.Text)
    }
    return []Message{
        {Role: "system", Content: `Answer using ONLY the provided context.
If the context doesn't have the answer, say "I don't have that information."
Cite sources using [1], [2], etc.`},
        {Role: "user", Content: fmt.Sprintf("Context:\n%s\nQuestion: %s",
            context.String(), question)},
    }
}

The "ONLY the provided context" instruction is critical. Without it, the model mixes your documents with its training data and you can't tell which parts are real.
Step 5: Generate
Wire it together. Retrieve, build the prompt, call the model using the chat() function from lesson 03.
func rag(docs []Chunk, question string) (string, error) {
    relevant := retrieve(docs, question, 3)
    messages := buildRAGPrompt(question, relevant)
    return chat(messages)
}

Three lines. That's the whole RAG function. Everything else is setup.
answer, _ := rag(docs, "What is the refund policy?")
fmt.Println(answer)
// "You can request a full refund within 14 days
// of purchase [1]. Submit requests through the
// support portal [3]."

The model cites [1] and [3] because we numbered the chunks and told it to cite sources.
Putting It Together
An interactive loop that indexes a directory and answers questions.
func main() {
    docs, err := indexDocuments("./docs")
    if err != nil {
        fmt.Println("Indexing failed:", err)
        return
    }
    fmt.Printf("Indexed %d chunks from ./docs\n", len(docs))

    scanner := bufio.NewScanner(os.Stdin)
    for {
        fmt.Print("\nQuestion: ")
        if !scanner.Scan() {
            break
        }
        question := scanner.Text()
        if question == "" {
            continue
        }
        answer, err := rag(docs, question)
        if err != nil {
            fmt.Println("Error:", err)
            continue
        }
        fmt.Println("\n" + answer)
    }
}

Create a docs/ folder with a few markdown files and try it. Any plain-text content works, but the indexer above only picks up .md files; adjust the suffix check if you want other extensions.
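If you need something to test against, a hypothetical docs/refund-policy.md along these lines gives the pipeline content to retrieve (the text is purely illustrative):

# Refund Policy

You can request a full refund within 14 days of purchase. After 14 days,
refunds are issued as store credit only.

Refund requests must be submitted through the support portal. Include your
order number and the email address used at checkout.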
What Production RAG Adds
This pipeline works but it's minimal. Real systems add layers:
- Overlapping chunks: Each chunk shares some text with the next, so context isn't lost at boundaries (see the sketch after this list)
- Hybrid search: Combine vector search with keyword search for better recall
- Reranking: A second model reorders results by relevance before sending to the LLM
- Conversation history: Include prior messages so the model can resolve "what about that?"
- Evaluation: Measure retrieval quality and answer accuracy (lesson 11)
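Overlapping chunks are the easiest of these to add. A minimal sketch, reusing the chunk() function from step 1 and carrying the last overlap bytes of each chunk into the next (it slices by bytes, so a multi-byte character at the boundary can be split):

// chunkWithOverlap splits text like chunk(), then prepends the last
// `overlap` bytes of each chunk to the chunk that follows it.
func chunkWithOverlap(text string, size, overlap int) []string {
    base := chunk(text, size)
    out := make([]string, len(base))
    for i, c := range base {
        if i > 0 {
            prev := base[i-1]
            if len(prev) > overlap {
                prev = prev[len(prev)-overlap:]
            }
            c = prev + "\n" + c
        }
        out[i] = c
    }
    return out
}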
Key Takeaways
- RAG pastes relevant documents into the prompt so the model can answer from your data
- The pipeline: load, chunk, embed, store, retrieve, generate
- Chunk at paragraph boundaries, 500-800 characters per chunk
- Retrieve the top 3-5 chunks using cosine similarity
- Tell the model to only use the provided context and cite sources
- The embed() and cosineSimilarity() functions from lesson 07 do the heavy lifting
- In-memory storage works for prototyping. Use pgvector for production