12 - Object Storage and Uploads

Modern applications handle massive amounts of unstructured data — images, videos, documents, backups. Object storage is the backbone of how we store and serve these files at scale. Understanding the upload patterns and storage strategies is essential for designing systems that handle file operations efficiently.

Object Storage vs Block Storage vs File Storage

There are three fundamental storage paradigms in cloud infrastructure:

Object Storage (S3, GCS, Azure Blob):

Flat namespace — no directories, just keys (e.g., users/123/avatar.png)
Each object = data + metadata + unique key
Immutable — you replace objects, not modify them in place
Accessed via HTTP API (PUT, GET, DELETE)
Virtually unlimited scale — exabytes of data
Cheapest per-GB for large volumes of unstructured data
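
To make the flat namespace concrete, here is a minimal in-memory sketch rather than a real client. `ToyObjectStore` and its methods are hypothetical names. The `prefix` and `delimiter` arguments imitate the way S3-style list calls present "folders" that are never actually stored.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: one flat dict of key -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        # A PUT replaces the whole object; there is no in-place modification.
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def get(self, key):
        data, _metadata = self._objects[key]
        return data

    def list(self, prefix="", delimiter="/"):
        """Return (keys, common_prefixes) directly under `prefix`.

        "Folders" are computed on the fly from key names; nothing is stored for them.
        """
        keys, common_prefixes = [], set()
        for key in sorted(self._objects):
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                keys.append(key)
        return keys, sorted(common_prefixes)


store = ToyObjectStore()
store.put("users/123/avatar.png", b"v1", {"content-type": "image/png"})
store.put("users/123/avatar.png", b"v2")  # full replacement under the same key
store.put("users/456/avatar.png", b"x")
store.put("readme.txt", b"hello")
```

Listing with the prefix `users/` yields the pseudo-folders `users/123/` and `users/456/`, even though the store only ever held three flat keys.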

Block Storage (EBS, Persistent Disks):

Raw storage volumes attached to VMs
Low latency (sub-millisecond), high IOPS
Fixed size — must provision capacity upfront
Supports filesystems (ext4, xfs) — your OS reads/writes blocks
Use case: databases, OS disks, anything needing fast random I/O

File Storage (EFS, NFS, FSx):

Shared filesystem accessible by multiple machines simultaneously
POSIX semantics — directories, permissions, file locks
Higher latency than block, lower than object
Use case: shared config, legacy apps expecting a filesystem, CMS media

What are POSIX semantics? POSIX (Portable Operating System Interface) defines how a filesystem must behave for Unix-like systems. Key guarantees:

Strong consistency: a read after a write always returns the latest data
Atomic renames: rename() completes fully or not at all; no partial state is visible
Hard links: multiple directory entries can reference the same underlying file (inode)
Unix permissions: owner/group/other rwx bits and ACLs
File locking: fcntl() record locks (plus flock(), a BSD call available on most Unix-like systems) for coordinating concurrent access
Sequential ordering: operations from a single process are observed in order by all others

Object storage (S3) does not provide POSIX semantics — there are no directories, no renames, no locks, and no append operations. This is why applications expecting a traditional filesystem cannot use S3 as a drop-in replacement without an adapter layer.
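
As an illustration of what such an adapter layer has to do, the sketch below emulates a rename as a copy followed by a delete. `rename_object` and `RecordingClient` are hypothetical names. The `copy_object` and `delete_object` calls follow the boto3 S3 client's keyword arguments, and the stub stands in for a real client so the example runs without AWS access.

```python
def rename_object(s3_client, bucket, old_key, new_key):
    """Emulate rename() on an object store: copy to the new key, then delete the old one.

    Unlike POSIX rename(), this is two separate requests, so it is not atomic:
    between them both keys exist, and a failure after the copy leaves a duplicate.
    """
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    s3_client.delete_object(Bucket=bucket, Key=old_key)


class RecordingClient:
    """Hypothetical stub that records calls instead of talking to S3."""

    def __init__(self):
        self.calls = []

    def copy_object(self, **kwargs):
        self.calls.append(("copy_object", kwargs))

    def delete_object(self, **kwargs):
        self.calls.append(("delete_object", kwargs))


client = RecordingClient()
rename_object(client, "my-bucket", "drafts/report.pdf", "final/report.pdf")
```

The two recorded calls make the gap visible: any reader that lists the bucket between them sees both keys.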

Block Storage vs File Storage

Block is a raw disk for one machine. File is a shared filesystem for many machines.

Aspect	Block	File
Access	Single VM	Multiple VMs
Interface	Mounted volume + filesystem you manage	Mounted filesystem (ready to use)
Latency	<1ms	~5-10ms
Scaling	Manual resize	Automatic
Sharing	❌	✅
POSIX	Only if you format it with a POSIX FS	Built-in
Cost (per GB-month)	~$0.10	~$0.30

Analogy: Block storage is like buying a raw hard drive — you format it and plug it into one computer. File storage is like a NAS on your network — everyone can access the same files.

Storage Type Selection

Criteria	Object	Block	File
Access pattern	HTTP API	Mounted volume	Mounted filesystem
Latency	~50-100ms	<1ms	~5-10ms
Scalability	Unlimited	TB-scale	PB-scale
Shared access	Yes (HTTP)	Single VM	Multiple VMs
Cost (per GB-month)	$0.023	$0.10	$0.30
Best for	Media, backups, logs	Databases, OS	Shared workloads

Rule of thumb: if your application uploads/downloads files via an API, use object storage.

Pre-Signed URLs

Problem: routing file uploads and downloads through your API servers creates a bandwidth bottleneck. A 500MB video upload ties up a server connection, consumes memory, and wastes compute — your server is just proxying bytes.

Solution: generate a time-limited, cryptographically signed URL that authorizes the client to upload or download directly to/from the object store.

Pre-Signed Upload Flow

Client requests an upload URL from your API (sends filename, content type, size)
Your server validates the request, generates a pre-signed PUT URL for a specific S3 key
Client uploads the file directly to S3 using the signed URL
Client calls your API with the object key, confirming the upload finished (e.g., POST /api/uploads/complete { key, filename, size })
Server calls HeadObject to verify the file exists and matches expected size, then records metadata in the database
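
The two server-side steps of this flow (issuing the URL and confirming the upload) can be sketched as below. The function names, bucket, and `StubS3` are hypothetical. The `generate_presigned_url` and `head_object` calls mirror the boto3 S3 client, which is passed in so the example runs offline against a stub.

```python
def create_upload_url(s3_client, bucket, key, content_type, expires_in=900):
    """Step 2: return a pre-signed PUT URL for exactly one key."""
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": content_type},
        ExpiresIn=expires_in,
    )


def confirm_upload(s3_client, bucket, key, expected_size):
    """Step 5: check that the object exists and has the size the client declared."""
    # With a real boto3 client, head_object raises if the key is missing.
    head = s3_client.head_object(Bucket=bucket, Key=key)
    return head["ContentLength"] == expected_size


class StubS3:
    """Hypothetical stand-in for a boto3 S3 client; returns canned values."""

    def generate_presigned_url(self, ClientMethod, Params=None, ExpiresIn=3600):
        return (
            f"https://{Params['Bucket']}.s3.example.invalid/{Params['Key']}"
            f"?X-Amz-Expires={ExpiresIn}"
        )

    def head_object(self, Bucket, Key):
        return {"ContentLength": 1024}


s3 = StubS3()
url = create_upload_url(s3, "my-bucket", "users/123/avatar.png", "image/png")
ok = confirm_upload(s3, "my-bucket", "users/123/avatar.png", expected_size=1024)
```

Only after `confirm_upload` succeeds would the server record the metadata, so the database never points at an object that was never uploaded.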

Pre-Signed Download Flow

Client requests access to a file
Server checks authorization, generates a pre-signed GET URL (expires in 15 minutes)
Client downloads directly from S3 (or from a CDN in front of it, which validates its own signed URLs)

Signed URL Security Constraints

URL expires after a configurable duration (typically 5-60 minutes)
Scoped to a specific object key — cannot access other files
Scoped to a specific HTTP method (PUT or GET, not both)
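Can pin the Content-Type and an exact Content-Length by signing those headers; enforcing a size range requires a POST policy
Revocation: an individual signed URL cannot be revoked (short of invalidating the credentials that signed it), so keep the expiry short to limit exposure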
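
These constraints follow from what goes into the signature. The sketch below uses a deliberately simplified HMAC scheme, not AWS Signature Version 4. `SECRET`, `sign_url`, `verify_url`, the host name, and the query parameter names are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"server-side-secret"  # hypothetical; never sent to the client


def _signature(method, key, expires_at):
    # Method, key, and expiry are all signed, so changing any of them breaks the URL.
    message = f"{method}\n{key}\n{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(method, key, ttl_seconds, now=None):
    expires_at = int(now if now is not None else time.time()) + ttl_seconds
    query = urlencode({"expires": expires_at, "sig": _signature(method, key, expires_at)})
    return f"https://storage.example.invalid/{key}?{query}"


def verify_url(method, url, now=None):
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    key = parsed.path.lstrip("/")
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if (now if now is not None else time.time()) > expires_at:
        return False  # expired
    return hmac.compare_digest(sig, _signature(method, key, expires_at))


url = sign_url("GET", "users/123/avatar.png", ttl_seconds=900, now=1_000)
```

Because the verifier only recomputes a hash, there is no per-URL state to delete, which is why revoking a single URL is not possible in a scheme like this.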

Pre-Signed POST Policies

Pre-signed URLs are scoped to a single object key and HTTP method. POST policies are more flexible — you define a set of conditions, and the client can upload any object matching those conditions without requesting a new URL each time.

The server generates a policy document (base64-encoded JSON) and a signature computed from that policy. The client includes the policy, signature, and the file in a multipart form POST directly to the S3 bucket endpoint.

POST Policy Conditions

Key prefix: e.g., uploads/user-123/ allows multiple files under that prefix
Content-length range: e.g., 1 byte to 50MB
Content-type: an exact value such as video/mp4, or a starts-with prefix such as image/ (wildcards like image/* are not supported)
Expiration: the policy becomes invalid after a timestamp
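
A policy document carrying these four conditions might look like the sketch below. `build_post_policy` and its parameters are hypothetical names. The condition shapes follow the S3 POST policy format. Signing is left out, and a complete Signature Version 4 policy would also carry credential, algorithm, and date conditions.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def build_post_policy(bucket, key_prefix, max_bytes, content_type_prefix,
                      ttl_minutes, now=None):
    """Return (policy_dict, base64_policy) for an S3-style POST policy (unsigned)."""
    now = now or datetime.now(timezone.utc)
    expiration = now + timedelta(minutes=ttl_minutes)
    policy = {
        "expiration": expiration.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],                     # key prefix
            ["content-length-range", 1, max_bytes],                  # size range in bytes
            ["starts-with", "$Content-Type", content_type_prefix],   # type prefix
        ],
    }
    # The client submits this base64 string as the form's policy field.
    encoded = base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")
    return policy, encoded


policy, encoded = build_post_policy(
    bucket="my-bucket",
    key_prefix="uploads/user-123/",
    max_bytes=50 * 1024 * 1024,
    content_type_prefix="image/",
    ttl_minutes=15,
    now=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

The server would then sign `encoded` and return both values to the client, which can reuse them for every upload that satisfies the conditions until the policy expires.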

POST Policy Use Cases

Batch uploads — mobile app uploading 20 photos without 20 round-trips for signed URLs
Browser form uploads — HTML <form> POSTing directly to S3
User-generated content — let users upload under their own prefix with size/type constraints

Pre-Signed PUT URL vs POST Policy

Aspect	Pre-Signed PUT URL	POST Policy
Scope	Exact key + method	Prefix + conditions
Multiple uploads	One URL per file	One policy, many files
Method	PUT	POST (multipart form)
Complexity	Simple	More setup
Browser forms	❌	✅
Size enforcement	Content-Length header only	Policy-level min/max

PUT URL vs POST Policy Tradeoffs

POST policies are more powerful but harder to debug: a single mismatched condition makes S3 reject the upload with a 403, and the offending condition is not always obvious from the response. Pre-signed PUT URLs are simpler when you know the exact key upfront. Use POST policies when clients need to upload multiple files or when you want server-side size/type enforcement without a proxy.

Multipart Uploads

Problem: uploading large files (>100MB) over unreliable networks fails frequently. A single network hiccup means restarting the entire upload from scratch.

Solution: split the file into parts (5MB–5GB each; the last part may be smaller), upload them independently, then instruct the storage service to assemble them into a single object.

Multipart Upload Flow

Initiate — request a multipart upload, receive an upload ID
Upload parts — upload each chunk in parallel, receive an ETag per part
Complete — send the list of parts + ETags, S3 assembles the final object
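
Before the upload step, the client has to decide where each part starts and ends. The sketch below shows one way to plan that. `plan_parts` and the 8 MiB default are assumptions for illustration, and the 10,000-part cap is S3's documented limit rather than something stated in this chapter.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB minimum for every part except the last
MAX_PARTS = 10_000               # S3's cap on the number of parts per upload


def plan_parts(file_size, part_size=8 * 1024 * 1024):
    """Return a list of (part_number, offset, length); part numbers start at 1."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    parts = []
    offset = 0
    while offset < file_size:
        # The final part simply takes whatever bytes remain.
        length = min(part_size, file_size - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


parts = plan_parts(20 * 1024 * 1024)  # a 20 MiB file split into 8 MiB parts
```

Each tuple tells a worker which byte range to read and which part number to send it under, so parts can be uploaded in any order.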

Multipart Upload Benefits

Parallel uploads — saturate bandwidth by uploading 4-8 parts simultaneously
Resumable — if part 7 of 20 fails, retry only part 7
Required for large files — S3 limits single PUT to 5GB; multipart supports up to 5TB
Progress tracking — report completion percentage based on parts uploaded
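
Parallelism and part-level retries can be combined in a few lines. This is a sketch under assumptions: `upload_parts` is a hypothetical helper, and `upload_part(part_number)` stands for whatever function actually sends one part and returns its ETag. Here it is replaced by a fake that fails once on part 7.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_parts(part_numbers, upload_part, max_workers=4, max_attempts=3):
    """Upload parts in parallel, retrying only the parts that fail."""

    def upload_with_retry(part_number):
        for attempt in range(1, max_attempts + 1):
            try:
                return part_number, upload_part(part_number)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up on this part after the last attempt

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(upload_with_retry, part_numbers))
    # The completion request needs every part number paired with its ETag, in order.
    return [{"PartNumber": n, "ETag": etag} for n, etag in sorted(results)]


attempts = {}


def flaky_upload(part_number):
    """Fake uploader: part 7 fails on its first attempt, everything else succeeds."""
    attempts[part_number] = attempts.get(part_number, 0) + 1
    if part_number == 7 and attempts[part_number] == 1:
        raise ConnectionError("simulated network hiccup")
    return f'"etag-{part_number}"'


completed = upload_parts(range(1, 21), flaky_upload)
```

In this run only part 7 is sent twice; the other 19 parts are uploaded once each. A progress indicator could be driven by counting finished parts against the total.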

Incomplete Upload Cleanup

If an upload is abandoned, parts consume storage. Set a lifecycle rule to auto-delete incomplete multipart uploads after 7 days.
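
Such a rule could be expressed as follows. The helper name and rule ID are hypothetical. The dictionary keys follow the S3 lifecycle configuration format as accepted by boto3's `put_bucket_lifecycle_configuration`.

```python
def abort_incomplete_uploads_rule(days=7, prefix=""):
    """Build one lifecycle rule that aborts multipart uploads unfinished after `days` days."""
    return {
        "ID": f"abort-incomplete-multipart-{days}d",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # an empty prefix applies the rule to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


lifecycle_configuration = {"Rules": [abort_incomplete_uploads_rule()]}

# With a real boto3 client, this dict would be applied roughly like so:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration)
```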

Storage Tiers and Lifecycle

Not all data is accessed equally. Storage tiers let you optimize cost based on access frequency:

Tier	Access frequency	Storage cost (per GB-month)	Retrieval cost	Use Case
Standard	Frequent	$0.023	Free	Active user files
Infrequent Access	Monthly	$0.0125	$0.01/GB	Old reports, logs
Glacier Instant Retrieval	Quarterly	$0.004	$0.03/GB	Compliance archives
Glacier Deep Archive	Yearly	$0.00099	$0.02/GB, 12-48 hr wait	Legal retention
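
To see how storage and retrieval prices trade off, the sketch below applies the table's figures to a hypothetical workload of 10,000 GB stored and 500 GB read back per month. `TIERS` and `monthly_cost` are illustrative names, and request fees and minimum storage durations are ignored.

```python
# Prices copied from the table above: (storage $/GB-month, retrieval $/GB).
TIERS = {
    "Standard": (0.023, 0.0),
    "Infrequent Access": (0.0125, 0.01),
    "Glacier Instant Retrieval": (0.004, 0.03),
    "Glacier Deep Archive": (0.00099, 0.02),
}


def monthly_cost(tier, stored_gb, retrieved_gb=0):
    """Storage cost plus retrieval cost for one month, in dollars."""
    storage_price, retrieval_price = TIERS[tier]
    return stored_gb * storage_price + retrieved_gb * retrieval_price


for tier in TIERS:
    cost = monthly_cost(tier, stored_gb=10_000, retrieved_gb=500)
    print(f"{tier:26} ${cost:8.2f}")
```

With these inputs, Standard comes to about $230 per month and Infrequent Access to about $130. The colder tiers are cheaper still, but retrieval makes up a growing share of their bill, which is why access frequency drives the choice.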

Lifecycle Policies

Automate transitions based on object age:

After 30 days → move to Infrequent Access
After 90 days → move to Glacier Instant Retrieval
After 365 days → move to Glacier Deep Archive
After 7 years → delete (compliance retention met)

This runs automatically — no application code needed.
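
The schedule above might be written as a single lifecycle rule like the one sketched here. The function name and rule ID are hypothetical. The keys and storage class names (`STANDARD_IA`, `GLACIER_IR`, `DEEP_ARCHIVE`) follow the S3 lifecycle configuration format.

```python
def tiering_rule(prefix=""):
    """Lifecycle rule mirroring the 30 / 90 / 365 day schedule, then expiry."""
    return {
        "ID": "tier-down-then-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},    # Infrequent Access
            {"Days": 90, "StorageClass": "GLACIER_IR"},     # Glacier Instant Retrieval
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # Glacier Deep Archive
        ],
        "Expiration": {"Days": 7 * 365},  # roughly 7 years, ignoring leap days
    }


rule = tiering_rule(prefix="reports/")
```

Scoping the rule with a prefix, as in the usage line, keeps frequently read data under other prefixes in the Standard tier.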

CDN Integration

For read-heavy workloads (profile images, course videos, static assets), serve objects through a CDN:

Origin: your S3 bucket
Edge locations: cache popular objects close to users, cutting typical latency from 50-300ms to 5-20ms
Cache invalidation: use versioned keys (avatar-v3.png) instead of purging
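
Versioned keys are easy to generate at upload time. Both helpers below are hypothetical. The first follows the counter style from the example above, and the second is a variation, not mentioned in the text, that derives the suffix from a hash of the content.

```python
import hashlib
import posixpath


def versioned_key(key, version):
    """Insert an explicit version counter before the file extension."""
    stem, ext = posixpath.splitext(key)
    return f"{stem}-v{version}{ext}"


def content_hashed_key(key, data):
    """Variation: derive the suffix from the bytes, so identical content maps to the same key."""
    stem, ext = posixpath.splitext(key)
    digest = hashlib.sha256(data).hexdigest()[:12]
    return f"{stem}-{digest}{ext}"


new_key = versioned_key("users/123/avatar.png", 3)
hashed_key = content_hashed_key("users/123/avatar.png", b"new image bytes")
```

Either way, a changed file gets a new URL, so cached copies of the old object never need purging and simply age out.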

Private Content via CDN

For authenticated content (paid videos, user documents):

Signed URLs: your application signs a short-lived URL per request, and the CDN verifies the signature before serving the object
Signed cookies: grant access to multiple objects (e.g., all videos in a course)
Origin Access Control: block direct S3 access, force all reads through CDN

This gives you both security and performance — users can't bypass the CDN to hit S3 directly.

Key Takeaways

Use object storage for unstructured data — it's cheap, scalable, and accessed via HTTP, making it ideal for user uploads, media, and backups.
Never proxy file uploads through your servers — use pre-signed URLs to let clients upload/download directly to S3, saving bandwidth and compute.
Use multipart uploads for large files — they enable parallel transfer, resumability, and are required for files over 5GB.
Implement lifecycle policies from day one — automatically transition cold data to cheaper tiers; this compounds into significant savings.
Pair object storage with a CDN — for any read-heavy access pattern, edge caching reduces latency and offloads origin traffic.
Signed URLs are your access control layer — short expiry, scoped keys, and method restrictions replace complex proxy authentication.