11 - Evals and Testing

Why Evals Matter

You just built an agent that calls tools and answers questions. How do you know it works? Not just once, but reliably, across different inputs, after every change you make.

You can't use assert output == expected because LLM output is non-deterministic. The same prompt can produce different results each time. You need a different kind of testing.

Evals (evaluations) measure whether your LLM system actually works. Without them, you change a prompt, it seems better on one example, and you have no idea if it broke ten other cases.

The Idea

Traditional tests check exact values. LLM evals check properties: is the answer correct, relevant, safe, in the right format?

Traditional test:
  assert add(2, 3) == 5

LLM eval:
  assert answer contains "2009"
  assert answer doesn't contain "I'm not sure"
  assert answer is valid JSON

You define a dataset of inputs and what you expect, run your system against all of them, and score the results.

A Test Dataset

Start with a list of questions and what the answer should contain. These aren't exact matches. "programming language" just needs to appear somewhere in the response.

type EvalCase struct {
	Input    string
	Expected string
}

var cases = []EvalCase{
	{"What is Go?", "programming language"},
	{"Who created Go?", "Google"},
	{"What year was Go released?", "2009"},
	{"Is Go object-oriented?", "no"},
}

Four cases is enough to start. You'll add more as you find failures.

Scoring: Contains Check

The simplest scorer: does the output contain the expected answer?

func evalContains(output, expected string) bool {
	return strings.Contains(
		strings.ToLower(output),
		strings.ToLower(expected),
	)
}

Run it against every case and print a scorecard:

passed := 0
for _, tc := range cases {
	output, _ := chat([]Message{
		{Role: "user", Content: tc.Input},
	})
	ok := evalContains(output, tc.Expected)
	if ok {
		passed++
		fmt.Printf("✓ %s\n", tc.Input)
	} else {
		fmt.Printf("✗ %s (expected '%s')\n", tc.Input, tc.Expected)
	}
}
fmt.Printf("\nScore: %d/%d\n", passed, len(cases))

The output looks like this:

✓ What is Go?
✓ Who created Go?
✗ What year was Go released? (expected '2009')
✓ Is Go object-oriented?

Score: 3/4

Simple, but powerful. If you change your system prompt and the score drops from 3/4 to 1/4, you know something broke.

LLM-as-Judge

String matching works for factual questions. But what about "Explain goroutines"? The answer could be correct in many different ways. You can't check for one keyword.

Use another LLM to judge the quality. Send it the question, the expected answer, and the actual output. Ask it to decide if the output is correct.

func llmJudge(question, output, expected string) bool {
	messages := []Message{
		{Role: "system", Content: `You are an eval judge.
Given a question, expected answer, and actual output,
decide if the output is correct.
Respond with JSON: {"pass": true/false}`},
		{Role: "user", Content: fmt.Sprintf(
			"Question: %s\nExpected: %s\nActual: %s",
			question, expected, output)},
	}
	reply, _ := chat(messages)
	var result struct{ Pass bool `json:"pass"` }
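	// If the judge replies with anything other than valid JSON,
	// Unmarshal leaves Pass at its zero value and the case fails closed.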
	json.Unmarshal([]byte(reply), &result)
	return result.Pass
}

Use it the same way as evalContains:

ok := llmJudge(
	"Explain goroutines",
	output,
	"Should mention concurrency and the Go runtime",
)

LLM-as-judge is more flexible but slower and more expensive. Use it for open-ended questions where string matching isn't enough.
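
The two scorers can also be mixed in one suite. Here's a sketch of one way to do it, extending EvalCase with a flag that is not part of the struct defined earlier: cheap keyword checks stay the default, and only open-ended cases pay for a judge call.

type EvalCase struct {
	Input    string
	Expected string
	UseJudge bool // true for open-ended questions
}

// Inside the eval loop, pick the scorer per case.
ok := evalContains(output, tc.Expected)
if tc.UseJudge {
	ok = llmJudge(tc.Input, output, tc.Expected)
}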

Evaluating RAG

If you built a RAG pipeline (lesson 08), you need to test two things separately:

Retrieval: Did the system find the right documents? If the answer is in refund-policy.md but the retriever returned faq.md and terms.md, retrieval failed.

Answer: Given the right documents, did the model answer correctly? If retrieval is good but the answer is wrong, your prompt needs work.

Test them independently. If you only test the final answer, you can't tell which part broke.
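
A minimal sketch of a retrieval check, assuming your pipeline can report which documents it retrieved for a query. The retrievedDocs slice and the RetrievalCase type here are illustrative, not an API from lesson 08.

type RetrievalCase struct {
	Query       string
	ExpectedDoc string // the document that actually contains the answer
}

// evalRetrieval passes if the expected document shows up anywhere
// in the retrieved set, regardless of what the final answer says.
func evalRetrieval(retrievedDocs []string, expectedDoc string) bool {
	for _, doc := range retrievedDocs {
		if doc == expectedDoc {
			return true
		}
	}
	return false
}

Score retrieval and answers in separate loops. If retrieval sits at 95% but answers at 60%, the problem is the prompt, not the retriever.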

Tracking Scores Over Time

Run your eval suite before and after every change. Save the results so you can spot regressions.

May 1: 15/20 (75%) prompt:a1b2c3
May 2: 16/20 (80%) prompt:d4e5f6
May 3: 13/20 (65%) prompt:g7h8i9 ← regression!

The prompt change on May 3rd made things worse. Revert it. Without the history, you'd never know.

Even a simple log file works. The format doesn't matter. What matters is that you run evals consistently and compare results.
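
A minimal sketch of that log in Go, assuming you track the prompt version yourself (a git hash or any short label works). It uses os, fmt, and time from the standard library.

// logResult appends one line per eval run so regressions show up
// when you compare runs.
func logResult(passed, total int, promptVersion string) error {
	f, err := os.OpenFile("evals.log",
		os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%s: %d/%d (%d%%) prompt:%s\n",
		time.Now().Format("2006-01-02"),
		passed, total, 100*passed/total, promptVersion)
	return err
}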

Building Good Eval Datasets

Start small. 20 cases is enough. Add a new case every time you find a bug.

Good eval dataset:

  • ✓ Covers common questions (happy path)
  • ✓ Covers edge cases (ambiguous, missing context)
  • ✓ Covers out-of-scope questions (should refuse)
  • ✓ Has clear expected outputs
  • ✓ Grows over time

Bad eval dataset:

  • ✗ Only tests easy cases
  • ✗ Expected outputs are vague
  • ✗ Never updated after creation

Every time a user reports a bad answer, add it to your dataset with the correct expected output. Your eval suite becomes a living record of every bug you've fixed.
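
One way to keep that record growing without recompiling, sketched under the assumption that the cases live in a JSON file (cases.json is just an example name): load them at the start of each run. Go's JSON decoder matches field names case-insensitively, so lowercase keys map onto the EvalCase fields without extra tags.

// cases.json:
// [{"input": "What is Go?", "expected": "programming language"}]
func loadCases(path string) ([]EvalCase, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cases []EvalCase
	if err := json.Unmarshal(data, &cases); err != nil {
		return nil, err
	}
	return cases, nil
}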

Key Takeaways

  • LLM output is non-deterministic, so assert == doesn't work
  • Start with a dataset of inputs and expected outputs
  • Contains check is the simplest scorer and catches most regressions
  • LLM-as-judge handles open-ended questions where keywords aren't enough
  • RAG systems need separate evaluation for retrieval and answer quality
  • Track scores over time. Run evals before and after every change
  • Build your dataset from real failures. Add a case every time you find a bug
  • 20 well-chosen cases are better than no evals at all

