Build a File Scanner
Time to put everything together. We're building a CLI tool that scans a directory, reports file statistics, and finds duplicates — all using concurrency. This project touches structs, interfaces, error handling, goroutines, channels, maps, and slices. If you can build this, you understand Go.
What We're Building
A command-line tool that scans a directory and reports:
- Total files and total size
- Breakdown by file extension
- Duplicate files (same size + same content hash)
```
go run main.go ~/Documents

Scanning: /Users/you/Documents
Found 247 files (18.3 MB)

By extension:
  .pdf     82 files   12.1 MB
  .md      64 files   1.4 MB
  .go      53 files   0.9 MB
  .txt     48 files   3.9 MB

Duplicates:
  notes.md (3 copies, 4.2 KB each)
  photo.jpg (2 copies, 1.1 MB each)
```

Try building it yourself first. Then compare with the solution below.
Step 1: Define the Data
```go
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
	"sync"
)

type FileInfo struct {
	Path string
	Size int64
	Ext  string
	Hash string
}

type ExtStats struct {
	Count int
	Size  int64
}
```

`FileInfo` holds everything we need about a single file. `ExtStats` tracks totals per extension.
Step 2: Walk the Directory
```go
func collectFiles(root string) ([]string, error) {
	var paths []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip files we can't read
		}
		if !info.IsDir() {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}
```

`filepath.Walk` recursively visits every file. We skip errors silently — returning `nil` from the callback tells the walk to continue — because a scanner shouldn't crash on one unreadable file.
Step 3: Hash a File
```go
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}
```

MD5 is fast and fine for duplicate detection — it's broken for security, but here we only need a fingerprint that identical files share. `io.Copy` streams the file through the hasher without loading it all into memory. This is `io.Reader` and `io.Writer` in action.
Step 4: Process Files Concurrently
```go
func processFiles(paths []string) []FileInfo {
	results := make(chan FileInfo, len(paths))
	var wg sync.WaitGroup

	for _, p := range paths {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			info, err := os.Stat(path)
			if err != nil {
				return
			}
			hash, _ := hashFile(path) // on error, Hash stays "" and analyze skips it
			results <- FileInfo{
				Path: path,
				Size: info.Size(),
				Ext:  filepath.Ext(path),
				Hash: hash,
			}
		}(p)
	}

	go func() {
		wg.Wait()
		close(results)
	}()

	var files []FileInfo
	for f := range results {
		files = append(files, f)
	}
	return files
}
```

Each file gets its own goroutine. The WaitGroup tracks completion. Results flow through a channel. This is the fan-out/fan-in pattern from the concurrency lesson.
Why `chan FileInfo` and not `chan *FileInfo`?
The channel is typed as a value (`chan FileInfo`), not a pointer. This is intentional:
- Small struct, cheap to copy. `FileInfo` is just a couple of strings and an `int64` — well under 256 bytes. Copying it into the channel buffer costs almost nothing.
- One contiguous allocation. `make(chan FileInfo, len(paths))` allocates a single backing array. With pointers, you'd get `len(paths)` separate heap allocations plus the pointer slots — more work for the garbage collector.
- Isolation for free. Once a value is sent, the sender and receiver have independent copies. No shared memory, no data races. With pointers, both sides reference the same memory, and you'd need to be careful not to mutate after sending.

Use `chan *T` when the struct is large (big slices, nested maps, embedded byte buffers) and copying becomes expensive. For small data types like this, values are simpler, faster, and safer.
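One goroutine per file is fine for a few hundred files, but on a huge tree it can exhaust file descriptors. A common variant — a sketch, not part of the project above — caps concurrency with a fixed pool of workers pulling paths from a jobs channel; the fan-in side is unchanged:

```go
package main

import (
	"fmt"
	"sync"
)

// processBounded fans work out to a fixed number of workers instead of
// one goroutine per item, capping how many files are open at once.
// handle stands in for the per-file work (stat + hash in the scanner).
func processBounded(paths []string, workers int, handle func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string, len(paths))
	var wg sync.WaitGroup

	// Fixed pool: each worker drains the jobs channel until it closes.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- handle(p)
			}
		}()
	}

	// Feed jobs, then close so workers can exit their range loops.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()

	// Same fan-in as before: close results once all workers finish.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := processBounded([]string{"a", "b", "c"}, 2, func(p string) string { return p + "!" })
	fmt.Println(len(out)) // 3
}
```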
Step 5: Analyze and Report
```go
func formatSize(bytes int64) string {
	const (
		KB = 1024
		MB = KB * 1024
	)
	switch {
	case bytes >= MB:
		return fmt.Sprintf("%.1f MB", float64(bytes)/float64(MB))
	case bytes >= KB:
		return fmt.Sprintf("%.1f KB", float64(bytes)/float64(KB))
	default:
		return fmt.Sprintf("%d B", bytes)
	}
}

func analyze(files []FileInfo) {
	var totalSize int64
	extMap := make(map[string]*ExtStats)
	hashMap := make(map[string][]FileInfo)

	for _, f := range files {
		totalSize += f.Size
		if _, ok := extMap[f.Ext]; !ok {
			extMap[f.Ext] = &ExtStats{}
		}
		extMap[f.Ext].Count++
		extMap[f.Ext].Size += f.Size
		if f.Hash != "" {
			hashMap[f.Hash] = append(hashMap[f.Hash], f)
		}
	}

	fmt.Printf("Found %d files (%s)\n\n", len(files), formatSize(totalSize))

	// Sort extensions by file count
	type extEntry struct {
		Ext   string
		Stats *ExtStats
	}
	var exts []extEntry
	for ext, stats := range extMap {
		name := ext
		if name == "" {
			name = "(none)"
		}
		exts = append(exts, extEntry{name, stats})
	}
	sort.Slice(exts, func(i, j int) bool {
		return exts[i].Stats.Count > exts[j].Stats.Count
	})

	fmt.Println("By extension:")
	for _, e := range exts {
		fmt.Printf("  %-8s %4d files   %s\n", e.Ext, e.Stats.Count, formatSize(e.Stats.Size))
	}

	// Find duplicates
	fmt.Println("\nDuplicates:")
	found := false
	for _, group := range hashMap {
		if len(group) < 2 {
			continue
		}
		found = true
		name := filepath.Base(group[0].Path)
		fmt.Printf("  %s (%d copies, %s each)\n", name, len(group), formatSize(group[0].Size))
	}
	if !found {
		fmt.Println("  No duplicates found.")
	}
}
```

Maps do the heavy lifting: one groups stats by extension, another groups files by hash to find duplicates. The comma-ok pattern checks for existing keys.
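The comma-ok idiom used in `analyze`, shown in isolation: the second return value tells you whether the key exists, which is how the code decides when to allocate a fresh `ExtStats`:

```go
package main

import "fmt"

func main() {
	stats := map[string]int{".go": 3}

	// Comma-ok: ok is true only when the key is present.
	if n, ok := stats[".go"]; ok {
		fmt.Println("seen before:", n)
	}

	// Missing key: ok is false, so initialize before counting.
	if _, ok := stats[".md"]; !ok {
		stats[".md"] = 0
	}
	stats[".md"]++
	fmt.Println(stats[".md"]) // 1
}
```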
Step 6: Wire It Up
```go
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "Usage: filescan <directory>")
		os.Exit(1)
	}
	root := os.Args[1]
	fmt.Printf("Scanning: %s\n", root)

	paths, err := collectFiles(root)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error: %v\n", err)
		os.Exit(1)
	}
	if len(paths) == 0 {
		fmt.Println("No files found.")
		return
	}

	files := processFiles(paths)
	analyze(files)
}
```

The Complete File
Put everything above into a single main.go. Then:

```
go run main.go ~/Documents
```

Or build a binary and run it anywhere:

```
go build -o filescan main.go
./filescan ~/Documents
```

That single binary has zero dependencies. Copy it to any machine with the same OS and it just works.
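The "same OS" caveat is looser than it sounds: Go cross-compiles by setting `GOOS` and `GOARCH` at build time, no extra toolchain needed. A sketch — the target platforms and output names here are just examples:

```shell
# Build a Linux/amd64 binary, regardless of the machine you're on
GOOS=linux GOARCH=amd64 go build -o filescan-linux main.go

# Build a Windows binary from the same source
GOOS=windows GOARCH=amd64 go build -o filescan.exe main.go
```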
What You Used
| Concept | Where |
|---|---|
| Structs | `FileInfo`, `ExtStats` for grouping related data |
| Functions | `formatSize` helper using `switch` |
| Error handling | Every file operation checks and handles errors |
| Slices & Maps | Collecting results, grouping by extension and hash |
| Goroutines | One per file for concurrent processing |
| Channels | Fan-in pattern to collect results |
| WaitGroup | Coordinating goroutine completion |
| defer | Closing files after hashing |
| Interfaces | `io.Reader`/`io.Writer` via `io.Copy` for hashing |
Key Takeaways
- Real Go programs combine all the fundamentals — structs, error handling, concurrency, and collections work together
- The fan-out/fan-in pattern (goroutines + channel + WaitGroup) is how you parallelize work in Go
- `io.Copy` with a hasher shows the power of Go's small interfaces — no need to load entire files into memory
- `filepath.Walk` handles recursive directory traversal
- A single `go build` gives you a portable binary with zero dependencies
- Always validate input and handle errors — even in a small CLI tool