12 - Build: Concurrent Web Scraper


Time to put everything together. You're going to build a concurrent web scraper that crawls a website, extracts links, and follows them — with worker pools, rate limiting, context cancellation, and proper error handling. This is real code you could actually use.

This uses patterns from every lesson in this course. If something feels familiar, good — that's the point.

What We're Building

A CLI tool that:

  1. Takes a starting URL
  2. Fetches the page, extracts all links
  3. Follows links on the same domain
  4. Limits concurrency (worker pool)
  5. Rate limits requests (don't hammer the server)
  6. Stops after a timeout or max pages
  7. Reports all discovered URLs

Project Setup

mkdir scraper && cd scraper
go mod init scraper
go get golang.org/x/time/rate
go get golang.org/x/net/html

Step 1: URL Extraction

Parse HTML and extract all <a href="..."> links.

package main

import (
    "io"
    "net/url"
    "strings"

    "golang.org/x/net/html"
)

func extractLinks(body io.Reader, baseURL *url.URL) []string {
    var links []string
    tokenizer := html.NewTokenizer(body)

    for {
        tt := tokenizer.Next()
        if tt == html.ErrorToken {
            break
        }
        if tt == html.StartTagToken {
            token := tokenizer.Token()
            if token.Data != "a" {
                continue
            }
            for _, attr := range token.Attr {
                if attr.Key != "href" {
                    continue
                }
                link, err := baseURL.Parse(attr.Val)
                if err != nil {
                    continue
                }
                // Only follow HTTP(S) links on the same host
                if link.Host == baseURL.Host && strings.HasPrefix(link.Scheme, "http") {
                    link.Fragment = "" // remove #anchors
                    links = append(links, link.String())
                }
            }
        }
    }

    return links
}

We resolve relative URLs against the base URL and filter to same-domain links only.

Step 2: Fetcher

Fetch a URL with context and timeout.

type FetchResult struct {
    URL   string
    Links []string
    Err   error
}

func fetch(ctx context.Context, client *http.Client, rawURL string) FetchResult {
    req, err := http.NewRequestWithContext(ctx, "GET", rawURL, nil)
    if err != nil {
        return FetchResult{URL: rawURL, Err: err}
    }
    req.Header.Set("User-Agent", "GoScraper/1.0")

    resp, err := client.Do(req)
    if err != nil {
        return FetchResult{URL: rawURL, Err: err}
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return FetchResult{URL: rawURL, Err: fmt.Errorf("status %d", resp.StatusCode)}
    }

    baseURL, _ := url.Parse(rawURL) // rawURL was already validated by NewRequestWithContext
    links := extractLinks(resp.Body, baseURL)

    return FetchResult{URL: rawURL, Links: links}
}

Step 3: Visited Tracker

Track which URLs you've already seen. Without this, the scraper would visit the same page over and over. Must be goroutine-safe since multiple workers check it simultaneously.

type Visited struct {
    mu   sync.Mutex
    urls map[string]bool
}

func NewVisited() *Visited {
    return &Visited{urls: make(map[string]bool)}
}

// Add returns true if the URL was new (not seen before)
func (v *Visited) Add(url string) bool {
    v.mu.Lock()
    defer v.mu.Unlock()
    if v.urls[url] {
        return false
    }
    v.urls[url] = true
    return true
}

func (v *Visited) Count() int {
    v.mu.Lock()
    defer v.mu.Unlock()
    return len(v.urls)
}

Add returns whether the URL was new. This lets workers skip duplicates without a separate check.

Step 4: Worker Pool with Rate Limiting

Workers pull URLs from a channel, fetch them, and send discovered links back.

func worker(
    ctx context.Context,
    id int,
    client *http.Client,
    limiter *rate.Limiter,
    jobs <-chan string,
    results chan<- FetchResult,
    wg *sync.WaitGroup,
) {
    defer wg.Done()

    for {
        select {
        case <-ctx.Done():
            return
        case url, ok := <-jobs:
            if !ok {
                return
            }

            // Rate limit
            if err := limiter.Wait(ctx); err != nil {
                return
            }

            result := fetch(ctx, client, url)
            select {
            case results <- result:
            case <-ctx.Done():
                return
            }
        }
    }
}

Each worker respects the shared rate limiter and context cancellation.

Step 5: Coordinator

The coordinator is the brain of the scraper. It seeds the first URL, processes results, enqueues new URLs, and decides when to stop. This is the most complex part — take your time reading through it.

type Scraper struct {
    startURL   string
    maxPages   int
    workers    int
    rateLimit  rate.Limit
    timeout    time.Duration
}

func (s *Scraper) Run() ([]string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), s.timeout)
    defer cancel()

    client := &http.Client{Timeout: 10 * time.Second}
    limiter := rate.NewLimiter(s.rateLimit, int(s.rateLimit)) // burst = rate; must be >= 1 or Wait always fails
    visited := NewVisited()

    jobs := make(chan string, s.maxPages)
    results := make(chan FetchResult, s.maxPages)

    // Start workers
    var wg sync.WaitGroup
    for i := 0; i < s.workers; i++ {
        wg.Add(1)
        go worker(ctx, i, client, limiter, jobs, results, &wg)
    }

    // Close results when all workers are done
    go func() {
        wg.Wait()
        close(results)
    }()

    // Seed the first URL
    visited.Add(s.startURL)
    jobs <- s.startURL

    // Track pending jobs to know when to stop
    pending := 1

    // Process results
    var discovered []string
    for result := range results {
        pending--

        if result.Err != nil {
            fmt.Printf("ERROR %s: %v\n", result.URL, result.Err)
        } else {
            fmt.Printf("OK    %s (%d links)\n", result.URL, len(result.Links))
            discovered = append(discovered, result.URL)
        }

        // Enqueue new links
        for _, link := range result.Links {
            if visited.Count() >= s.maxPages {
                break
            }
            if visited.Add(link) {
                pending++
                select {
                case jobs <- link:
                case <-ctx.Done():
                    close(jobs)
                    return discovered, ctx.Err()
                }
            }
        }

        // All work done
        if pending == 0 {
            close(jobs)
            break
        }
    }

    // If the crawl stopped because the context expired, report that.
    if err := ctx.Err(); err != nil {
        return discovered, err
    }
    return discovered, nil
}

The pending counter tracks how many URLs are in-flight. When it hits zero, all work is done and we close the jobs channel.

Step 6: Main

func main() {
    if len(os.Args) < 2 {
        fmt.Println("usage: scraper <url>")
        os.Exit(1)
    }

    scraper := &Scraper{
        startURL:  os.Args[1],
        maxPages:  50,
        workers:   5,
        rateLimit: 2, // 2 requests per second
        timeout:   30 * time.Second,
    }

    fmt.Printf("Scraping %s (max %d pages, %d workers, %.0f req/s)\n\n",
        scraper.startURL, scraper.maxPages, scraper.workers, float64(scraper.rateLimit))

    discovered, err := scraper.Run()
    if err != nil {
        fmt.Println("\nstopped:", err)
    }

    fmt.Printf("\n--- Results ---\n")
    fmt.Printf("Pages scraped: %d\n", len(discovered))
    for _, u := range discovered {
        fmt.Println(" ", u)
    }
}

Running It

go run . https://go.dev

Output:

Scraping https://go.dev (max 50 pages, 5 workers, 2 req/s)

OK    https://go.dev (15 links)
OK    https://go.dev/doc/ (23 links)
OK    https://go.dev/blog/ (12 links)
ERROR https://go.dev/dl/: status 403
OK    https://go.dev/learn/ (8 links)
...

--- Results ---
Pages scraped: 34
  https://go.dev
  https://go.dev/doc/
  ...

Patterns Used

  • Context (lesson 2): timeout for the entire crawl, cancellation propagation
  • Pipeline (lesson 3): URLs flow jobs → workers → results → coordinator
  • Worker pool (lesson 5): fixed number of fetch workers
  • Rate limiting (lesson 6): shared rate.Limiter across workers
  • Error handling (lesson 8): errors reported per URL, crawl continues
  • Mutex (lesson 9): visited tracker guarded by sync.Mutex
  • Deadlock prevention (lesson 10): buffered channels, pending counter, clean shutdown

Improvements to Try

If you want to keep going, here are some challenges. Each one teaches you something new:

  1. Respect robots.txt — fetch and parse /robots.txt before crawling
  2. Extract page titles — parse <title> tags alongside links
  3. Save results to JSON — write discovered URLs and metadata to a file
  4. Add depth limiting — track how many hops from the start URL
  5. Retry failed requests — retry with exponential backoff on transient errors
  6. Use errgroup — replace the manual WaitGroup + results pattern

Key Takeaways

  • Real concurrent programs combine multiple patterns — rarely just one
  • The coordinator pattern (seed → process → enqueue) is common in crawlers and job systems
  • Pending counters track in-flight work for clean shutdown
  • Rate limiting is essential when hitting external services
  • Mutex-protected state (visited tracker) is simpler than channel-based alternatives for lookup-heavy data
  • Context timeout prevents the program from running forever
  • Buffered channels prevent deadlocks between the coordinator and workers


© 2026 ByteLearn.dev. Free courses for developers.