08 - RAG from Scratch
What Is RAG
Ask an LLM about your company's refund policy and it has no idea. It wasn't trained on your data. You could fine-tune a model on your documents, but that's expensive and slow.
RAG (Retrieval-Augmented Generation) is the shortcut. Before asking the model, search your documents for relevant content and paste it into the prompt. The model reads your content and answers from it.
Without RAG:
User: "What's our refund policy?"
Model: "I don't have that information."
With RAG:
1. Search your docs for "refund policy"
2. Find the relevant paragraph
3. Paste it into the prompt
4. Model: "You can request a full refund
   within 14 days of purchase."

The model didn't learn your refund policy. It read it just now, in the prompt you sent.
The Pipeline
RAG has six steps. This lesson builds each one.
Load → Chunk → Embed → Store → Retrieve → Generate

- Load: Read your documents.
- Chunk: Split them into small pieces.
- Embed: Convert each piece into a vector (lesson 07).
- Store: Keep the vectors in memory (or a database).
- Retrieve: When a user asks a question, find the closest chunks.
- Generate: Send those chunks + the question to the LLM.
Step 1: Chunk
You can't embed an entire document at once. Embedding models work best on short, focused text. Split your documents into chunks of 500-800 characters, breaking at paragraph boundaries so you don't cut sentences in half.
func chunk(text string, size int) []string {
    paragraphs := strings.Split(text, "\n\n")
    var chunks []string
    var current string
    for _, p := range paragraphs {
        p = strings.TrimSpace(p)
        if p == "" {
            continue
        }
        // Adding this paragraph would exceed the limit: close the current chunk.
        if len(current)+len(p) > size && current != "" {
            chunks = append(chunks, current)
            current = ""
        }
        if current != "" {
            current += "\n\n"
        }
        current += p
    }
    if current != "" {
        chunks = append(chunks, current)
    }
    return chunks
}

Split on double newlines, group paragraphs until you hit the size limit, then start a new chunk. That's it. A more robust version is in the examples section.
Use it on a file:
content, _ := os.ReadFile("docs/refund-policy.md")
parts := chunk(string(content), 600)
fmt.Printf("%d chunks\n", len(parts))
// 4 chunks

Why 600? Small enough that each chunk is about one topic. Large enough to have useful context. There's no magic number. 500-800 works for most cases.
Step 2: Embed and Store
Convert each chunk into a vector using the embed() function from lesson 07. Store the text, the source filename, and the vector together.
type Chunk struct {
    Text      string
    Source    string
    Embedding []float64
}

func indexDocuments(dir string) ([]Chunk, error) {
    files, err := os.ReadDir(dir)
    if err != nil {
        return nil, err
    }
    var chunks []Chunk
    for _, f := range files {
        if !strings.HasSuffix(f.Name(), ".md") {
            continue
        }
        content, err := os.ReadFile(filepath.Join(dir, f.Name()))
        if err != nil {
            return nil, err
        }
        parts := chunk(string(content), 600)
        for _, part := range parts {
            vec, err := embed(part)
            if err != nil {
                return nil, fmt.Errorf("embed failed: %w", err)
            }
            chunks = append(chunks, Chunk{
                Text:      part,
                Source:    f.Name(),
                Embedding: vec,
            })
        }
    }
    return chunks, nil
}

This reads every .md file in a directory, chunks each one, embeds each chunk, and returns the whole collection. For 20 documents, this takes a few seconds. For 10,000, you'd want a database.
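indexDocuments leans on embed() from lesson 07. If you don't have that lesson handy, here's a minimal sketch against OpenAI's embeddings endpoint; the provider and model name are assumptions, and lesson 07's version may differ. It needs the bytes, encoding/json, net/http, and os packages, plus an OPENAI_API_KEY environment variable.

// Minimal sketch of embed() using the OpenAI embeddings API.
// Provider and model are assumptions; swap in whatever lesson 07 used.
func embed(text string) ([]float64, error) {
    body, _ := json.Marshal(map[string]string{
        "model": "text-embedding-3-small",
        "input": text,
    })
    req, err := http.NewRequest("POST", "https://api.openai.com/v1/embeddings", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var out struct {
        Data []struct {
            Embedding []float64 `json:"embedding"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    if len(out.Data) == 0 {
        return nil, fmt.Errorf("no embedding returned")
    }
    return out.Data[0].Embedding, nil
}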
Step 3: Retrieve
Same idea as the vector search in lesson 07. Embed the user's question, compare it to every chunk, return the top matches.
func retrieve(docs []Chunk, query string, topK int) []Chunk {
    queryVec, _ := embed(query)
    type scored struct {
        chunk Chunk
        score float64
    }
    // Score every chunk against the query.
    results := make([]scored, len(docs))
    for i, doc := range docs {
        results[i] = scored{doc, cosineSimilarity(queryVec, doc.Embedding)}
    }
    // Sort by similarity, highest first.
    sort.Slice(results, func(i, j int) bool {
        return results[i].score > results[j].score
    })
    top := make([]Chunk, 0, topK)
    for i := 0; i < topK && i < len(results); i++ {
        top = append(top, results[i].chunk)
    }
    return top
}

A topK of 3-5 is a good starting point. Too few and you might miss the answer. Too many and you waste tokens and risk confusing the model.
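retrieve() calls cosineSimilarity() from lesson 07. For reference, the standard implementation is a few lines; this sketch assumes both vectors have the same length and needs the math package.

// Cosine similarity: dot product divided by the product of magnitudes.
// Returns a value in [-1, 1]; higher means more similar.
func cosineSimilarity(a, b []float64) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}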
Step 4: Build the Prompt
This is where RAG happens. Take the retrieved chunks and put them in the prompt. Number them so the model can cite sources. Tell the model to only use the provided context.
func buildRAGPrompt(question string, chunks []Chunk) []Message {
    var context strings.Builder
    for i, c := range chunks {
        // Number each chunk so the model can reference [1], [2], etc.
        fmt.Fprintf(&context, "[%d] (%s)\n%s\n\n", i+1, c.Source, c.Text)
    }
    return []Message{
        {Role: "system", Content: `Answer using ONLY the provided context.
If the context doesn't have the answer, say "I don't have that information."
Cite sources using [1], [2], etc.`},
        {Role: "user", Content: fmt.Sprintf("Context:\n%s\nQuestion: %s",
            context.String(), question)},
    }
}

The "ONLY the provided context" instruction is critical. Without it, the model mixes your documents with its training data and you can't tell which parts are real.
Step 5: Generate
Wire it together. Retrieve, build the prompt, call the model using the chat() function from lesson 03.
func rag(docs []Chunk, question string) (string, error) {
    relevant := retrieve(docs, question, 3)
    messages := buildRAGPrompt(question, relevant)
    return chat(messages)
}

Three lines. That's the whole RAG function. Everything else is setup.
answer, _ := rag(docs, "What is the refund policy?")
fmt.Println(answer)
// "You can request a full refund within 14 days
// of purchase [1]. Submit requests through the
// support portal [3]."

The model cites [1] and [3] because we numbered the chunks and told it to cite sources.
Putting It Together
An interactive loop that indexes a directory and answers questions.
func main() {
    docs, err := indexDocuments("./docs")
    if err != nil {
        fmt.Println("Indexing failed:", err)
        return
    }
    fmt.Printf("Indexed %d chunks from ./docs\n", len(docs))

    scanner := bufio.NewScanner(os.Stdin)
    for {
        fmt.Print("\nQuestion: ")
        if !scanner.Scan() {
            break
        }
        question := scanner.Text()
        if question == "" {
            continue
        }
        answer, err := rag(docs, question)
        if err != nil {
            fmt.Println("Error:", err)
            continue
        }
        fmt.Println("\n" + answer)
    }
}

Create a docs/ folder with a few markdown files and try it. Any plain-text content works, but the indexer above only picks up .md files; adjust the suffix check if you want other extensions.
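If you need something to test against, a hypothetical docs/refund-policy.md along these lines gives the pipeline content to retrieve (the text is purely illustrative):

# Refund Policy

You can request a full refund within 14 days of purchase. After 14 days,
refunds are issued as store credit only.

Refund requests must be submitted through the support portal. Include your
order number and the email address used at checkout.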
What Production RAG Adds
This pipeline works but it's minimal. Real systems add layers:
- Overlapping chunks: Each chunk shares some text with the next, so context isn't lost at boundaries (see the sketch after this list)
- Hybrid search: Combine vector search with keyword search for better recall
- Reranking: A second model reorders results by relevance before sending to the LLM
- Conversation history: Include prior messages so the model can resolve "what about that?"
- Evaluation: Measure retrieval quality and answer accuracy (lesson 11)
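Overlapping chunks are the easiest of these to add. A minimal sketch, reusing the chunk() function from step 1 and carrying the last overlap bytes of each chunk into the next (it slices by bytes, so a multi-byte character at the boundary can be split):

// chunkWithOverlap splits text like chunk(), then prepends the last
// `overlap` bytes of each chunk to the chunk that follows it.
func chunkWithOverlap(text string, size, overlap int) []string {
    base := chunk(text, size)
    out := make([]string, len(base))
    for i, c := range base {
        if i > 0 {
            prev := base[i-1]
            if len(prev) > overlap {
                prev = prev[len(prev)-overlap:]
            }
            c = prev + "\n" + c
        }
        out[i] = c
    }
    return out
}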
Key Takeaways
- RAG pastes relevant documents into the prompt so the model can answer from your data
- The pipeline: load, chunk, embed, store, retrieve, generate
- Chunk at paragraph boundaries, 500-800 characters per chunk
- Retrieve the top 3-5 chunks using cosine similarity
- Tell the model to only use the provided context and cite sources
- The embed() and cosineSimilarity() functions from lesson 07 do the heavy lifting
- In-memory storage works for prototyping. Use pgvector for production