14 - Profiling & Benchmarks


You think the JSON marshaling is slow. Your colleague thinks it's the database queries. Neither of you knows. That's the problem. Performance intuition is wrong more often than it's right. The rule: profile first, don't guess.

Go ships with profiling and benchmarking tools in the standard library. No external APM (Application Performance Monitoring) needed for the basics.

Benchmarks with testing.B

Benchmark functions live in _test.go files, just like tests. They start with Benchmark and take *testing.B:

// store_test.go
func BenchmarkBookmarkStore_List(b *testing.B) {
    db := setupTestDB(b) // setupTestDB accepts testing.TB, so it works with both *testing.T and *testing.B
    store := NewBookmarkStore(db)

    // Seed some data
    ctx := context.Background()
    for i := 0; i < 100; i++ {
        store.Create(ctx, fmt.Sprintf("https://example.com/%d", i), fmt.Sprintf("Bookmark %d", i))
    }

    b.ResetTimer() // Don't count setup time

    for b.Loop() {
        store.List(ctx)
    }
}

b.Loop() is the idiomatic way to write benchmark loops since Go 1.24. The framework decides how many iterations to run to get a stable measurement, and it excludes setup done before the loop from the timing on its own, so the b.ResetTimer() above is redundant (though harmless). With the older b.N style, b.ResetTimer() is what excludes setup.

If you're on Go 1.23 or earlier, use for i := 0; i < b.N; i++ instead of b.Loop().
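The older shape looks like this. This is a standalone sketch benchmarking JSON encoding of a local bookmark stand-in (not the project's store), since the loop style is what matters here:

```go
import (
	"encoding/json"
	"testing"
)

type bookmark struct {
	URL   string `json:"url"`
	Title string `json:"title"`
}

// Pre-Go 1.24 loop style: the framework picks b.N and the body runs b.N times.
func BenchmarkMarshalBookmark(b *testing.B) {
	bm := bookmark{URL: "https://go.dev", Title: "Go"} // setup

	b.ResetTimer() // with b.N, the setup above would otherwise be counted
	for i := 0; i < b.N; i++ {
		json.Marshal(bm)
	}
}
```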

Run it:

go test -bench=BenchmarkBookmarkStore_List -benchmem ./api

Output:

BenchmarkBookmarkStore_List-8    5000    230145 ns/op    4096 B/op    52 allocs/op

That's 5000 iterations, ~230 microseconds per call, 4KB allocated, 52 allocations. Now you have numbers, not guesses.

Benchmarking a Handler

Test the full HTTP path including JSON serialization:

func BenchmarkHandleListBookmarks(b *testing.B) {
    db := setupTestDB(b)
    store := NewBookmarkStore(db)

    ctx := context.Background()
    for i := 0; i < 50; i++ {
        store.Create(ctx, fmt.Sprintf("https://example.com/%d", i), fmt.Sprintf("Bookmark %d", i))
    }

    mux := http.NewServeMux()
    registerRoutes(mux, store)

    req := httptest.NewRequest("GET", "/api/bookmarks", nil)
    b.ResetTimer()

    for b.Loop() {
        rec := httptest.NewRecorder()
        mux.ServeHTTP(rec, req)
    }
}

This benchmarks everything: routing, handler logic, database query, JSON encoding, response writing. If it's slow, you can then benchmark individual pieces to find the bottleneck.
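Serial loops understate what happens under real traffic. b.RunParallel runs the loop body from multiple goroutines, which can surface lock contention the serial benchmark hides. A self-contained sketch, with a hypothetical /api/ping route standing in for the bookmark handlers:

```go
import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"
)

func BenchmarkHandleParallel(b *testing.B) {
	mux := http.NewServeMux()
	mux.HandleFunc("GET /api/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})

	b.RunParallel(func(pb *testing.PB) {
		// Each goroutine gets its own request and recorder;
		// pb.Next distributes the iterations across goroutines.
		for pb.Next() {
			req := httptest.NewRequest("GET", "/api/ping", nil)
			rec := httptest.NewRecorder()
			mux.ServeHTTP(rec, req)
		}
	})
}
```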

Useful Flags

# Run all benchmarks
go test -bench=. ./api

# Run benchmarks matching a pattern
go test -bench=Store ./api

# Include memory stats
go test -bench=. -benchmem ./api

# Run for longer (more stable results)
go test -bench=. -benchtime=5s ./api

# Compare before/after with count
go test -bench=. -count=6 ./api > old.txt
# ... make changes ...
go test -bench=. -count=6 ./api > new.txt

For comparing benchmark results, install benchstat:

go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt

It gives you a statistical comparison with confidence intervals. Much better than eyeballing numbers.
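Benchmarks can also emit pprof profiles directly, no running server required. The ./api package path follows the examples above:

```shell
# Write CPU and memory profiles while benchmarking
go test -bench=BenchmarkBookmarkStore_List -cpuprofile=cpu.out -memprofile=mem.out ./api

# Inspect them with the same pprof tooling
go tool pprof cpu.out
go tool pprof mem.out
```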

Runtime Profiling with pprof

Benchmarks measure specific functions. pprof profiles the running application. It answers: where is my program spending its time right now?

Add the pprof handler to your server:

import _ "net/http/pprof"

func main() {
    // ... your routes ...

    // pprof registers on DefaultServeMux automatically.
    // Bind to localhost so it's not exposed publicly.
    go func() {
        slog.Info("pprof listening", "addr", "localhost:6060")
        if err := http.ListenAndServe("localhost:6060", nil); err != nil {
            slog.Error("pprof server failed", "error", err)
        }
    }()

    // ... start main server ...
}

The blank import registers handlers on http.DefaultServeMux. Run it on a separate port, and bind to localhost rather than all interfaces, so it's never reachable from outside the machine. In production, only expose this internally or behind authentication.

Collecting Profiles

With the server running, grab a CPU profile:

# 30-second CPU profile (quote the URL so the shell doesn't interpret the ?)
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'

This starts a 30-second sampling session. While it runs, hit your API with some traffic. Then pprof drops you into an interactive shell:

(pprof) top10
Showing nodes accounting for 2.5s, 83% of 3s total
      flat  flat%   sum%        cum   cum%
     0.8s 26.67% 26.67%      0.8s 26.67%  runtime.cgocall
     0.5s 16.67% 43.33%      0.5s 16.67%  encoding/json.(*encodeState).marshal
     ...

flat is time spent in the function itself. cum (cumulative) includes time in functions it calls. If cum is high but flat is low, the function is slow because of what it calls, not what it does.

Memory Profiles

# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Allocations profile (total allocations, not just live)
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap

In the pprof shell:

(pprof) top10
(pprof) list handleListBookmarks   # source-level annotation
(pprof) web                         # open graph in browser (needs graphviz)

If you prefer a visual UI instead of the interactive shell, use:

go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

This opens a browser with flame graphs, call graphs, and source annotations. Easier than memorizing pprof commands.

Goroutine and Block Profiles

# How many goroutines and what they're doing
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Where goroutines block on synchronization
go tool pprof http://localhost:6060/debug/pprof/block

The goroutine profile is great for finding leaks. If the count keeps growing, something is spawning goroutines without cleaning them up.

The Workflow

  1. Notice something is slow (or get a report)
  2. Write a benchmark for the specific operation
  3. Run with -benchmem to see allocations
  4. If you need more detail, use pprof on the running server
  5. Find the hot spot
  6. Fix it
  7. Run the benchmark again to confirm the improvement

Don't optimize without measuring. Don't measure without a reason. Most code doesn't need to be fast. The 5% that does will show up in profiles.

Applying to Our Project

Add benchmarks to store_test.go and handler_test.go:

// store_test.go
func BenchmarkBookmarkStore_Create(b *testing.B) {
    db := setupTestDB(b)
    store := NewBookmarkStore(db)
    ctx := context.Background()

    for b.Loop() {
        store.Create(ctx, "https://go.dev", "Go")
    }
}

Add pprof to main.go on a debug port:

import _ "net/http/pprof"

// In main(), before the main server starts:
go func() {
    slog.Info("pprof listening", "addr", "localhost:6060")
    if err := http.ListenAndServe("localhost:6060", nil); err != nil {
        slog.Error("pprof server failed", "error", err)
    }
}()

Run benchmarks as part of your development cycle:

go test -bench=. -benchmem ./api

Key Takeaways

  • Profile first, don't guess. Performance intuition is unreliable
  • Benchmark functions start with Benchmark, take *testing.B, and use b.Loop()
  • b.ResetTimer() excludes setup when using the b.N style; b.Loop() excludes it automatically. -benchmem shows allocations
  • net/http/pprof profiles the running application: CPU, memory, goroutines, blocking
  • Run pprof on a separate port. Never expose it publicly
  • go tool pprof gives you an interactive shell with top, list, and web commands
  • Use benchstat to compare before/after results with statistical confidence
  • The workflow: measure, find the hot spot, fix it, measure again


© 2026 ByteLearn.dev. Free courses for developers.