10. Evaluating Model Quality

The Problem with Vibes

Most people evaluate models by "vibes." They try a few prompts, see if the output looks good, and declare a winner. This works for personal use, but it falls apart when you need to make a real decision: which model should power my feature in production?

You need a systematic way to compare models on your specific task. Not benchmarks, not Twitter opinions. Your own data, your own criteria.

What to Evaluate

Model quality isn't one thing. It breaks down into:

  • Accuracy: Does the model get the right answer?
  • Format compliance: Does it follow your output format exactly?
  • Consistency: Does it give the same quality every time, or does it vary?
  • Latency: How fast does it respond?
  • Cost: How much does it cost per request?

A model might be accurate but slow. Another might be fast and cheap but inconsistent. You need to decide which dimensions matter most for your specific use case.

Building a Test Set

The foundation of any good evaluation is a test set: a collection of inputs where you already know the correct output. Think of it like unit testing for AI — you define the input, you define what the correct answer should be, and then you check if the model's output matches. You're not asking the model to grade itself. You already have the right answers; you're measuring how often the model gets there.

Test set for a ticket classifier:

Input: "I can't log in to my account"
Expected: "account-access"

Input: "How do I cancel my subscription?"
Expected: "billing"

Input: "The page loads slowly on mobile"
Expected: "performance"

Input: "Can you add dark mode?"
Expected: "feature-request"

You need roughly 20 to 50 examples to get meaningful results. More is better, but even 20 will show you clear differences between models.
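
A simple way to store the test set is a JSONL file: one JSON object per line, holding the input and the expected answer. A minimal sketch in Python (the file name and field names here are just placeholders):

import json

# tickets_test.jsonl -- one object per line, e.g.
# {"input": "I can't log in to my account", "expected": "account-access"}

def load_test_set(path):
    """Load test cases from a JSONL file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

test_set = load_test_set("tickets_test.jsonl")
print(f"Loaded {len(test_set)} test cases")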

Where to get test data

  • Real data: Pull from your actual production inputs. This is the best source because it reflects what the model will actually see.
  • Manual creation: Write examples yourself. Good for getting started quickly.
  • Edge cases: Intentionally include tricky inputs that you know are hard. This is where models differ most.

Running an Evaluation

The process is straightforward:

  1. Pick 2 to 4 models to compare
  2. Run every test input through each model
  3. Score the outputs
  4. Compare scores

Test: 50 support tickets, classify into 8 categories

Results:
  GPT-4o:      47/50 correct (94%)
  GPT-4o-mini: 45/50 correct (90%)
  Claude Haiku: 44/50 correct (88%)
  Qwen 2.5 7B: 41/50 correct (82%)

Cost per 50 requests:
  GPT-4o:      $0.15
  GPT-4o-mini: $0.01
  Claude Haiku: $0.008
  Qwen 2.5 7B: $0.00 (local)

Now you have data to make a decision. Is the six-point accuracy gap between GPT-4o and Haiku worth roughly 19x the cost? For most classification tasks, no.
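
In code, the whole process is one loop: every test case through every model, then count. A minimal sketch, assuming a call_model(model, text) wrapper for whichever APIs you are comparing and a test_set list like the one loaded above:

def evaluate(models, test_set):
    """Return accuracy per model over the whole test set."""
    accuracy = {}
    for model in models:
        correct = 0
        for case in test_set:
            output = call_model(model, case["input"])  # your own API wrapper
            if output.strip().lower() == case["expected"].strip().lower():
                correct += 1
        accuracy[model] = correct / len(test_set)
    return accuracy

results = evaluate(["gpt-4o", "gpt-4o-mini", "claude-haiku"], test_set)
for model, acc in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{model:15} {acc:.0%}")

The comparison line here is a simple case-insensitive exact match; the next section covers scoring choices in more detail.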

Scoring Methods

Exact match

For classification and extraction tasks, check if the output exactly matches the expected answer.

Expected: "billing"
Model output: "billing"
Score: 1 (correct)

Model output: "Billing"
Score: 0 or 1? (decide upfront whether letter case counts as a match for your use case)
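
Whichever you choose, encode the decision once in a scoring function so every model is judged by the same rule. A minimal sketch:

def exact_match(output, expected, case_sensitive=False):
    """Return True if the model output matches the expected label."""
    a, b = output.strip(), expected.strip()
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return a == b

print(exact_match("Billing", "billing"))                       # True
print(exact_match("Billing", "billing", case_sensitive=True))  # False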

Rubric scoring

For open-ended tasks (summaries, code generation, writing), there's no single "correct" answer to match against. Instead, you define a scoring rubric — a set of criteria with point values that describe what good, okay, and bad outputs look like.

Code generation rubric (0 to 3 points):
  3: Correct, handles edge cases, clean code
  2: Correct for the main case, misses edge cases
  1: Partially correct, has bugs
  0: Wrong approach or doesn't compile

You then score each model output against the rubric. For example, if you ask three models to "write a function that divides two numbers safely":

Model A output: checks for zero division, returns error → Score: 3
Model B output: works but crashes on zero → Score: 2
Model C output: wrong logic entirely → Score: 0

After scoring all test cases, average the scores per model. The model with the highest average wins. This is more subjective than exact match, but it's the only practical way to evaluate tasks where multiple valid answers exist.
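
Once the rubric scores are in (assigned by hand or by a judge model), the comparison is just an average per model. A small sketch with made-up scores:

# Rubric scores (0-3), one entry per test case -- illustrative numbers only
rubric_scores = {
    "model-a": [3, 3, 2, 3, 1],
    "model-b": [2, 3, 2, 2, 2],
    "model-c": [0, 1, 2, 1, 0],
}

for model, scores in rubric_scores.items():
    print(f"{model}: average {sum(scores) / len(scores):.2f} out of 3")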

LLM-as-judge

Use a strong model to evaluate a weaker model's output. This scales better than manual scoring when you have hundreds of test cases.

System: "You are evaluating AI-generated code reviews.
Score each review from 1 to 5 based on:
- Did it find real issues? (not false positives)
- Are the suggestions actionable?
- Is it concise?"

User: "Here is the code: [code]
Here is the review: [model output]
Score it 1 to 5 with a one-line justification."

Use GPT-4o or Claude Sonnet as the judge. Don't use the same model you're evaluating, since it will be biased toward its own style.
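
Here is a sketch of the judge call using the OpenAI Python SDK (an assumption; any provider works the same way), with the same rubric as the prompt above:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_SYSTEM = (
    "You are evaluating AI-generated code reviews. "
    "Score each review from 1 to 5 based on: did it find real issues "
    "(not false positives), are the suggestions actionable, is it concise."
)

def judge(code, review, judge_model="gpt-4o"):
    """Ask a strong model to grade another model's code review."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": (
                f"Here is the code:\n{code}\n\n"
                f"Here is the review:\n{review}\n\n"
                "Score it 1 to 5 with a one-line justification."
            )},
        ],
    )
    return response.choices[0].message.content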

A/B Testing in Production

Once you've picked a model based on offline evaluation, validate it with real users. A/B testing means splitting your real traffic into two groups and giving each group a different model — then measuring which performs better.

Why not just trust the test set? Because your test set is a controlled sample. Real users send weird inputs, typos, edge cases, and use the product in ways you didn't anticipate. A/B testing catches problems your offline evaluation missed.

How to do it:

  1. Group A (90% of traffic): keeps using your current model
  2. Group B (10% of traffic): gets routed to the new model
  3. Measure user satisfaction (thumbs up/down, task completion rate, follow-up questions)
  4. Compare after enough data (a few days/weeks depending on traffic)
  5. If Group B performs as well as or better, roll the new model out to everyone. If it performs worse, revert — only 10% of users saw it, no harm done.

It's the same concept used everywhere in tech — testing two versions of a webpage, a button color, a recommendation algorithm — applied to model selection.
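
The routing itself can be a few lines: hash a stable user ID into a bucket so each user always lands in the same group. A minimal sketch (the 10% share and the model names are placeholders):

import hashlib

def pick_model(user_id, new_model_share=0.10):
    """Deterministically route a fixed share of users to the new model."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket 0-99 for this user
    return "new-model" if bucket < new_model_share * 100 else "current-model"

print(pick_model("user-42"))  # the same user always gets the same answer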

Common Evaluation Mistakes

Testing on too few examples. Five examples isn't enough. You need at least 20 to see patterns. One lucky or unlucky example can skew your entire conclusion.

Not testing edge cases. Models often handle the easy cases similarly. The difference shows up on hard inputs: ambiguous text, unusual formatting, long inputs, multiple languages.

Ignoring consistency. Run the same input 5 times. Does the model give the same answer? Some models are more deterministic than others. Set temperature to 0 for evaluation to remove this variable.
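
A quick consistency check is to repeat the same input a few times and tally the distinct answers, again assuming a call_model wrapper:

from collections import Counter

def consistency_check(model, text, runs=5):
    """Send the same input several times and count the distinct answers."""
    return Counter(call_model(model, text) for _ in range(runs))

# Counter({'billing': 5}) is consistent; Counter({'billing': 3, 'refunds': 2}) is not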

Evaluating the wrong thing. If your task is classification, don't evaluate writing quality. If your task is code generation, don't evaluate how well the model explains its approach. Measure what actually matters for your use case.

Quick Evaluation Script

Here's a minimal approach to compare models:

# test_cases.txt contains: input|expected (one per line)
# Keep "|" and "," out of the inputs -- the script splits on them
# call_model is a placeholder for however you call each API (curl, a CLI wrapper, etc.)
# Results are written to results.csv

echo "input,expected,gpt4o,gpt4o_mini,haiku" > results.csv

while IFS='|' read -r input expected; do
  gpt4o=$(call_model "gpt-4o" "$input")
  mini=$(call_model "gpt-4o-mini" "$input")
  haiku=$(call_model "claude-haiku" "$input")
  echo "$input,$expected,$gpt4o,$mini,$haiku" >> results.csv
done < test_cases.txt

# Then count correct answers per model, e.g.:
awk -F',' 'NR > 1 { if ($3 == $2) a++; if ($4 == $2) b++; if ($5 == $2) c++ }
  END { printf "gpt4o: %d  gpt4o_mini: %d  haiku: %d\n", a, b, c }' results.csv

You don't need a fancy framework. A script that calls each model and compares outputs is enough to start.

Key Takeaways

  • Don't evaluate by vibes. Build a test set with known correct answers.
  • 20 to 50 test examples is enough to see meaningful differences between models.
  • Measure what matters: accuracy, format compliance, consistency, latency, cost.
  • Use exact match for classification. Use rubrics or LLM-as-judge for open-ended tasks.
  • Include edge cases in your test set. That's where models actually differ.
  • Set temperature to 0 during evaluation for consistent, comparable results.
  • Validate in production with A/B testing after offline evaluation.

