Build a File Scanner
Time to put everything together. We're building a CLI tool that scans a directory, reports file statistics, and finds duplicates — all using concurrency. This project touches structs, interfaces, error handling, goroutines, channels, maps, and slices. If you can build this, you understand Go.
What We're Building
A command-line tool that scans a directory and reports:
- Total files and total size
- Breakdown by file extension
- Duplicate files (same size + same content hash)
```
go run main.go ~/Documents

Scanning: /Users/you/Documents
Found 247 files (18.3 MB)

By extension:
  .pdf     82 files   12.1 MB
  .md      64 files   1.4 MB
  .go      53 files   0.9 MB
  .txt     48 files   3.9 MB

Duplicates:
  notes.md (3 copies, 4.2 KB each)
  photo.jpg (2 copies, 1.1 MB each)
```

Try building it yourself first. Then compare with the solution below.
Step 1: Define the Data
```go
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
	"sync"
)

type FileInfo struct {
	Path string
	Size int64
	Ext  string
	Hash string
}

type ExtStats struct {
	Count int
	Size  int64
}
```

`FileInfo` holds everything we need about a single file. `ExtStats` tracks totals per extension.
Step 2: Walk the Directory
```go
func collectFiles(root string) ([]string, error) {
	var paths []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip files we can't read
		}
		if !info.IsDir() {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}
```

`filepath.Walk` recursively visits every file. We skip errors silently — returning `nil` from the callback tells the walk to continue — because a scanner shouldn't crash on one unreadable file.
Step 3: Hash a File
```go
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}
```

MD5 is fast and fine for duplicate detection — it's broken for security, but here we only need a fingerprint that identical files share. `io.Copy` streams the file through the hasher without loading it all into memory. This is `io.Reader` and `io.Writer` in action.
Step 4: Process Files Concurrently
```go
func processFiles(paths []string) []FileInfo {
	results := make(chan FileInfo, len(paths))
	var wg sync.WaitGroup

	for _, p := range paths {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			info, err := os.Stat(path)
			if err != nil {
				return
			}
			hash, _ := hashFile(path) // on error, Hash stays "" and analyze skips it
			results <- FileInfo{
				Path: path,
				Size: info.Size(),
				Ext:  filepath.Ext(path),
				Hash: hash,
			}
		}(p)
	}

	go func() {
		wg.Wait()
		close(results)
	}()

	var files []FileInfo
	for f := range results {
		files = append(files, f)
	}
	return files
}
```

Each file gets its own goroutine. The WaitGroup tracks completion. Results flow through a channel. This is the fan-out/fan-in pattern from the concurrency lesson.
Why `chan FileInfo` and not `chan *FileInfo`?
The channel is typed as a value (`chan FileInfo`), not a pointer. This is intentional:
- Small struct, cheap to copy. `FileInfo` is just a couple of strings and an `int64` — well under 256 bytes. Copying it into the channel buffer costs almost nothing.
- One contiguous allocation. `make(chan FileInfo, len(paths))` allocates a single backing array. With pointers, you'd get `len(paths)` separate heap allocations plus the pointer slots — more work for the garbage collector.
- Isolation for free. Once a value is sent, the sender and receiver have independent copies. No shared memory, no data races. With pointers, both sides reference the same memory, and you'd need to be careful not to mutate after sending.

Use `chan *T` when the struct is large (big slices, nested maps, embedded byte buffers) and copying becomes expensive. For small data types like this, values are simpler, faster, and safer.
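One goroutine per file is fine for a few hundred files, but on a huge tree it can exhaust file descriptors. A common variant — a sketch, not part of the project above — caps concurrency with a fixed pool of workers pulling paths from a jobs channel; the fan-in side is unchanged:

```go
package main

import (
	"fmt"
	"sync"
)

// processBounded fans work out to a fixed number of workers instead of
// one goroutine per item, capping how many files are open at once.
// handle stands in for the per-file work (stat + hash in the scanner).
func processBounded(paths []string, workers int, handle func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string, len(paths))
	var wg sync.WaitGroup

	// Fixed pool: each worker drains the jobs channel until it closes.
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				results <- handle(p)
			}
		}()
	}

	// Feed jobs, then close so workers can exit their range loops.
	go func() {
		for _, p := range paths {
			jobs <- p
		}
		close(jobs)
	}()

	// Same fan-in as before: close results once all workers finish.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	out := processBounded([]string{"a", "b", "c"}, 2, func(p string) string { return p + "!" })
	fmt.Println(len(out)) // 3
}
```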
Step 5: Analyze and Report
```go
func formatSize(bytes int64) string {
	const (
		KB = 1024
		MB = KB * 1024
	)
	switch {
	case bytes >= MB:
		return fmt.Sprintf("%.1f MB", float64(bytes)/float64(MB))
	case bytes >= KB:
		return fmt.Sprintf("%.1f KB", float64(bytes)/float64(KB))
	default:
		return fmt.Sprintf("%d B", bytes)
	}
}

func analyze(files []FileInfo) {
	var totalSize int64
	extMap := make(map[string]*ExtStats)
	hashMap := make(map[string][]FileInfo)

	for _, f := range files {
		totalSize += f.Size
		if _, ok := extMap[f.Ext]; !ok {
			extMap[f.Ext] = &ExtStats{}
		}
		extMap[f.Ext].Count++
		extMap[f.Ext].Size += f.Size
		if f.Hash != "" {
			hashMap[f.Hash] = append(hashMap[f.Hash], f)
		}
	}

	fmt.Printf("Found %d files (%s)\n\n", len(files), formatSize(totalSize))

	// Sort extensions by file count
	type extEntry struct {
		Ext   string
		Stats *ExtStats
	}
	var exts []extEntry
	for ext, stats := range extMap {
		name := ext
		if name == "" {
			name = "(none)"
		}
		exts = append(exts, extEntry{name, stats})
	}
	sort.Slice(exts, func(i, j int) bool {
		return exts[i].Stats.Count > exts[j].Stats.Count
	})

	fmt.Println("By extension:")
	for _, e := range exts {
		fmt.Printf("  %-8s %4d files   %s\n", e.Ext, e.Stats.Count, formatSize(e.Stats.Size))
	}

	// Find duplicates
	fmt.Println("\nDuplicates:")
	found := false
	for _, group := range hashMap {
		if len(group) < 2 {
			continue
		}
		found = true
		name := filepath.Base(group[0].Path)
		fmt.Printf("  %s (%d copies, %s each)\n", name, len(group), formatSize(group[0].Size))
	}
	if !found {
		fmt.Println("  No duplicates found.")
	}
}
```

Maps do the heavy lifting: one groups stats by extension, another groups files by hash to find duplicates. The comma-ok pattern checks for existing keys.
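The comma-ok idiom used in `analyze`, shown in isolation: the second return value tells you whether the key exists, which is how the code decides when to allocate a fresh `ExtStats`:

```go
package main

import "fmt"

func main() {
	stats := map[string]int{".go": 3}

	// Comma-ok: ok is true only when the key is present.
	if n, ok := stats[".go"]; ok {
		fmt.Println("seen before:", n)
	}

	// Missing key: ok is false, so initialize before counting.
	if _, ok := stats[".md"]; !ok {
		stats[".md"] = 0
	}
	stats[".md"]++
	fmt.Println(stats[".md"]) // 1
}
```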
Step 6: Wire It Up
```go
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "Usage: filescan <directory>")
		os.Exit(1)
	}
	root := os.Args[1]
	fmt.Printf("Scanning: %s\n", root)

	paths, err := collectFiles(root)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error: %v\n", err)
		os.Exit(1)
	}
	if len(paths) == 0 {
		fmt.Println("No files found.")
		return
	}

	files := processFiles(paths)
	analyze(files)
}
```

The Complete File
Put everything above into a single main.go. Then:

```
go run main.go ~/Documents
```

Or build a binary and run it anywhere:

```
go build -o filescan main.go
./filescan ~/Documents
```

That single binary has zero dependencies. Copy it to any machine with the same OS and it just works.
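The "same OS" caveat is looser than it sounds: Go cross-compiles by setting `GOOS` and `GOARCH` at build time, no extra toolchain needed. A sketch — the target platforms and output names here are just examples:

```shell
# Build a Linux/amd64 binary, regardless of the machine you're on
GOOS=linux GOARCH=amd64 go build -o filescan-linux main.go

# Build a Windows binary from the same source
GOOS=windows GOARCH=amd64 go build -o filescan.exe main.go
```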
What You Used
| Concept | Where |
|---|---|
| Structs | `FileInfo`, `ExtStats` for grouping related data |
| Functions | `formatSize` helper using `switch` |
| Error handling | Every file operation checks and handles errors |
| Slices & Maps | Collecting results, grouping by extension and hash |
| Goroutines | One per file for concurrent processing |
| Channels | Fan-in pattern to collect results |
| WaitGroup | Coordinating goroutine completion |
| defer | Closing files after hashing |
| Interfaces | `io.Reader`/`io.Writer` via `io.Copy` for hashing |
Key Takeaways
- Real Go programs combine all the fundamentals — structs, error handling, concurrency, and collections work together
- The fan-out/fan-in pattern (goroutines + channel + WaitGroup) is how you parallelize work in Go
- `io.Copy` with a hasher shows the power of Go's small interfaces — no need to load entire files into memory
- `filepath.Walk` handles recursive directory traversal
- A single `go build` gives you a portable binary with zero dependencies
- Always validate input and handle errors — even in a small CLI tool