12 - Object Storage and Uploads
Modern applications handle massive amounts of unstructured data: images, videos, documents, backups. Object storage is the backbone of how we store and serve these files at scale. Understanding the upload patterns and storage strategies is essential for designing systems that handle file operations efficiently.
Object Storage vs Block Storage vs File Storage
There are three fundamental storage paradigms in cloud infrastructure:
Object Storage (S3, GCS, Azure Blob):
- Flat namespace: no directories, just keys (e.g., `users/123/avatar.png`)
- Each object = data + metadata + unique key
- Immutable — you replace objects, not modify them in place
- Accessed via HTTP API (PUT, GET, DELETE)
- Virtually unlimited scale — exabytes of data
- Cheapest per-GB for large volumes of unstructured data
Block Storage (EBS, Persistent Disks):
- Raw storage volumes attached to VMs
- Low latency (sub-millisecond), high IOPS
- Fixed size — must provision capacity upfront
- Supports filesystems (ext4, xfs) — your OS reads/writes blocks
- Use case: databases, OS disks, anything needing fast random I/O
File Storage (EFS, NFS, FSx):
- Shared filesystem accessible by multiple machines simultaneously
- POSIX semantics — directories, permissions, file locks
- Higher latency than block, lower than object
- Use case: shared config, legacy apps expecting a filesystem, CMS media
What are POSIX semantics? POSIX (Portable Operating System Interface) defines how a filesystem must behave for Unix-like systems. Key guarantees:
- Strong consistency: a read after a write always returns the latest data
- Atomic renames: `rename()` completes fully or not at all; no partial state visible
- Hard links: multiple directory entries can reference the same underlying file (inode)
- Unix permissions: owner/group/other `rwx` bits and ACLs
- File locking: `flock()`/`fcntl()` for coordinating concurrent access
- Sequential ordering: operations from a single process are observed in order by all others
Object storage (S3) does not provide POSIX semantics — there are no directories, no renames, no locks, and no append operations. This is why applications expecting a traditional filesystem cannot use S3 as a drop-in replacement without an adapter layer.
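To make the flat-namespace point concrete, here is a minimal in-memory sketch. It is not a real S3 client; the class `FlatObjectStore` and its method names are hypothetical and exist only for illustration. It shows that a "directory listing" is just a prefix filter over keys, and that a "rename" has to be emulated as a copy followed by a delete, which is not atomic.

```python
class FlatObjectStore:
    """Toy in-memory stand-in for an object store: one flat dict of key -> bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Objects are replaced whole; there is no in-place modification or append.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        self._objects.pop(key, None)

    def list_prefix(self, prefix):
        # "Listing a directory" is only a string-prefix filter over all keys.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def rename(self, src, dst):
        # No native rename: copy, then delete. Between the two steps both
        # keys exist, so other readers can observe the intermediate state.
        self.put(dst, self.get(src))
        self.delete(src)


store = FlatObjectStore()
store.put("users/123/avatar.png", b"png-bytes")
store.put("users/123/resume.pdf", b"pdf-bytes")
store.put("users/456/avatar.png", b"other")

user_123_keys = store.list_prefix("users/123/")
store.rename("users/123/avatar.png", "users/123/avatar-v2.png")
```

Adapter layers that present object storage as a filesystem generally have to paper over exactly these gaps, which is why they tend to be slower or weaker in their guarantees than a native POSIX filesystem.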
Block Storage vs File Storage
Block is a raw disk for one machine. File is a shared filesystem for many machines.
| | Block | File |
|---|---|---|
| Access | Single VM | Multiple VMs |
| Interface | Mounted volume + filesystem you manage | Mounted filesystem (ready to use) |
| Latency | <1ms | ~5-10ms |
| Scaling | Manual resize | Automatic |
| Sharing | ❌ | ✅ |
| POSIX | Only if you format it with a POSIX FS | Built-in |
| Cost | ~$0.10/GB-month | ~$0.30/GB-month |
Analogy: Block storage is like buying a raw hard drive — you format it and plug it into one computer. File storage is like a NAS on your network — everyone can access the same files.
Storage Type Selection
| Criteria | Object | Block | File |
|---|---|---|---|
| Access pattern | HTTP API | Mounted volume | Mounted filesystem |
| Latency | ~50-100ms | <1ms | ~5-10ms |
| Scalability | Unlimited | TB-scale | PB-scale |
| Shared access | Yes (HTTP) | Single VM | Multiple VMs |
| Cost (per GB-month) | $0.023 | $0.10 | $0.30 |
| Best for | Media, backups, logs | Databases, OS | Shared workloads |
Rule of thumb: if your application uploads/downloads files via an API, use object storage.
Pre-Signed URLs
Problem: routing file uploads and downloads through your API servers creates a bandwidth bottleneck. A 500MB video upload ties up a server connection, consumes memory, and wastes compute — your server is just proxying bytes.
Solution: generate a time-limited, cryptographically signed URL that authorizes the client to upload or download directly to/from the object store.
Pre-Signed Upload Flow
- Client requests an upload URL from your API (sends filename, content type, size)
- Your server validates the request, generates a pre-signed PUT URL for a specific S3 key
- Client uploads the file directly to S3 using the signed URL
- Client calls your API with the object key, confirming the upload finished (e.g., `POST /api/uploads/complete { key, filename, size }`)
- Server calls HeadObject to verify the file exists and matches expected size, then records metadata in the database
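The server side of this flow can be sketched with an in-memory stand-in for the bucket. Everything here is an assumption made for illustration: `FakeBucket`, `request_upload`, `complete_upload`, the key layout, and the 500MB size cap are hypothetical, and a real server would return a pre-signed PUT URL rather than a bare key. The point is the last step: the server asks the store what was actually uploaded instead of trusting the client's report.

```python
import uuid


class FakeBucket:
    """In-memory stand-in for the object store (hypothetical, for illustration)."""

    def __init__(self):
        self.objects = {}

    def put_object(self, key, body):
        self.objects[key] = body

    def head_object(self, key):
        # Mirrors the idea of HeadObject: metadata only, no body download.
        if key not in self.objects:
            return None
        return {"size": len(self.objects[key])}


bucket = FakeBucket()
pending = {}    # key -> expected size, recorded when the upload URL is issued
database = {}   # key -> metadata, written only after verification


def request_upload(user_id, filename, size):
    # Validate, then choose the key server-side (never accept a client-chosen key).
    if size <= 0 or size > 500 * 1024 * 1024:  # assumed limit for this sketch
        raise ValueError("size out of allowed range")
    key = f"uploads/{user_id}/{uuid.uuid4().hex}/{filename}"
    pending[key] = size
    return key  # a real server would return a pre-signed PUT URL for this key


def complete_upload(key, filename):
    # Trust the store, not the client, about what was uploaded.
    expected = pending.get(key)
    head = bucket.head_object(key)
    if expected is None or head is None or head["size"] != expected:
        return False
    database[key] = {"filename": filename, "size": head["size"]}
    del pending[key]
    return True


key = request_upload("123", "clip.mp4", size=11)
early = complete_upload(key, "clip.mp4")   # False: nothing uploaded yet
bucket.put_object(key, b"hello world")     # the direct client upload (11 bytes)
done = complete_upload(key, "clip.mp4")    # True: object exists and size matches
```

Recording the expected size when the URL is issued gives the completion handler something to compare against, so a client cannot register an object that is missing or does not match what it asked to upload.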
Pre-Signed Download Flow
- Client requests access to a file
- Server checks authorization, generates a pre-signed GET URL (expires in 15 minutes)
- Client downloads directly from S3 (or CDN)
Signed URL Security Constraints
- URL expires after a configurable duration (typically 5-60 minutes)
- Scoped to a specific object key — cannot access other files
- Scoped to a specific HTTP method (PUT or GET, not both)
- Can enforce content-type and content-length constraints
- Revocation: you can't revoke an individual signed URL (short of revoking the credentials that signed it), so keep expiry short to limit exposure
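The constraints above all follow from what the signature covers. The sketch below is a simplified HMAC scheme, not the actual AWS Signature Version 4 algorithm; the secret, the `storage.example.com` host, and the `expires`/`sig` query parameter names are assumptions chosen for illustration. It shows why a URL signed for one method, key, and expiry cannot be reused for another.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, quote, unquote, urlparse

SECRET = b"demo-signing-secret"  # hypothetical; real stores derive keys from credentials


def _signature(method, key, expires):
    # Method, key, and expiry are all covered by the signature, so changing
    # any one of them invalidates the URL.
    message = f"{method}\n{key}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(method, key, ttl_seconds, now=None):
    now = int(time.time()) if now is None else now
    expires = now + ttl_seconds
    sig = _signature(method, key, expires)
    return f"https://storage.example.com/{quote(key)}?expires={expires}&sig={sig}"


def verify(method, url, now=None):
    now = int(time.time()) if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    key = unquote(parsed.path.lstrip("/"))
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False  # expired: no server-side state is needed to enforce this
    return hmac.compare_digest(sig, _signature(method, key, expires))


url = sign_url("GET", "users/123/avatar.png", ttl_seconds=900, now=1_700_000_000)
ok = verify("GET", url, now=1_700_000_100)
wrong_method = verify("PUT", url, now=1_700_000_100)
expired = verify("GET", url, now=1_700_001_000)
other_key = verify("GET", url.replace("avatar.png", "secret.pdf"), now=1_700_000_100)
```

Because verification is stateless, the store has no list of issued URLs to cross out, which is the reason individual URLs cannot be revoked.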
Pre-Signed POST Policies
Pre-signed URLs are scoped to a single object key and HTTP method. POST policies are more flexible — you define a set of conditions, and the client can upload any object matching those conditions without requesting a new URL each time.
The server generates a policy document (base64-encoded JSON) and a signature computed from that policy. The client includes the policy, signature, and the file in a multipart form POST directly to the S3 bucket endpoint.
POST Policy Conditions
- Key prefix: e.g., `uploads/user-123/` allows multiple files under that prefix
- Content-length range: e.g., 1 byte to 50MB
- Content-type: restrict to `image/*` or `video/mp4`
- Expiration: policy becomes invalid after a timestamp
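A policy carrying these conditions might look like the sketch below. The condition syntax (`starts-with`, `content-length-range`, the `expiration` field) follows the S3 POST policy format, where a type restriction such as `image/*` is written as a `starts-with` match on `image/`. The signing step is deliberately simplified: S3 uses a derived Signature Version 4 key and requires additional `x-amz-*` fields in the policy, which are omitted here. The bucket name, secret, and function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone


def build_policy(bucket, key_prefix, max_bytes, type_prefix, expires_at):
    return {
        "expiration": expires_at.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],             # key prefix
            ["starts-with", "$Content-Type", type_prefix],   # content type
            ["content-length-range", 1, max_bytes],          # size in bytes
        ],
    }


def encode_and_sign(policy, secret):
    # The client sends the base64 policy and the signature as form fields.
    policy_b64 = base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")
    # Simplified: plain HMAC over the encoded policy (not the full SigV4 derivation).
    signature = hmac.new(secret, policy_b64.encode("ascii"), hashlib.sha256).hexdigest()
    return policy_b64, signature


expires_at = datetime(2030, 1, 1, 12, 0, tzinfo=timezone.utc)
policy = build_policy("my-bucket", "uploads/user-123/", 50 * 1024 * 1024, "image/", expires_at)
policy_b64, signature = encode_and_sign(policy, b"demo-secret")
```

Since the signature covers the encoded policy rather than a single key, the same pair of form fields stays valid for any upload that satisfies every condition until the expiration time.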
POST Policy Use Cases
- Batch uploads — mobile app uploading 20 photos without 20 round-trips for signed URLs
- Browser form uploads: HTML `<form>` POSTing directly to S3
- User-generated content: let users upload under their own prefix with size/type constraints
Pre-Signed PUT URL vs POST Policy
| | Pre-Signed PUT URL | POST Policy |
|---|---|---|
| Scope | Exact key + method | Prefix + conditions |
| Multiple uploads | One URL per file | One policy, many files |
| Method | PUT | POST (multipart form) |
| Complexity | Simple | More setup |
| Browser forms | ❌ | ✅ |
| Size enforcement | Content-Length header only | Policy-level min/max |
PUT URL vs POST Policy Tradeoffs
POST policies are more powerful but harder to debug: a single mismatched condition causes S3 to reject the whole upload with a 403 that is easy to mistake for a permissions problem. Pre-signed PUT URLs are simpler when you know the exact key upfront. Use POST policies when clients need to upload multiple files or when you want server-side size/type enforcement without a proxy.
Multipart Uploads
Problem: uploading large files (>100MB) over unreliable networks fails frequently. A single network hiccup means restarting the entire upload from scratch.
Solution: split the file into parts (5MB to 5GB each, except the last part, which may be smaller), upload them independently, then instruct the storage service to assemble them into a single object.
Multipart Upload Flow
- Initiate — request a multipart upload, receive an upload ID
- Upload parts — upload each chunk in parallel, receive an ETag per part
- Complete — send the list of parts + ETags, S3 assembles the final object
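The three steps can be sketched against an in-memory stand-in. `FakeMultipartStore` and `plan_parts` are hypothetical names, and the SHA-256 digest is only a placeholder for the per-part ETag the real service returns. The 5 MiB minimum and 10,000-part cap reflect S3's documented limits as commonly stated; check the current documentation before relying on them.

```python
import hashlib

MIN_PART_SIZE = 5 * 1024 * 1024  # minimum for every part except the last
MAX_PARTS = 10_000               # per-upload part limit


def plan_parts(total_size, part_size):
    """Return (part_number, offset, length) tuples; part numbers start at 1."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size below the 5 MiB minimum")
    parts, offset, number = [], 0, 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    return parts


class FakeMultipartStore:
    def __init__(self):
        self.uploads = {}  # upload_id -> {part_number: bytes}
        self.objects = {}

    def initiate(self, key):
        upload_id = f"upload-{len(self.uploads) + 1}"
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        self.uploads[upload_id][part_number] = data
        return hashlib.sha256(data).hexdigest()  # stand-in for the part's ETag

    def complete(self, key, upload_id, parts):
        stored = self.uploads.pop(upload_id)
        for number, etag in parts:
            if hashlib.sha256(stored[number]).hexdigest() != etag:
                raise ValueError(f"ETag mismatch for part {number}")
        # Assemble by part number, regardless of the order parts arrived in.
        self.objects[key] = b"".join(stored[n] for n, _ in sorted(parts))


data = bytes(range(256)) * (12 * 1024 * 1024 // 256)  # 12 MiB of sample data
plan = plan_parts(len(data), part_size=5 * 1024 * 1024)

store = FakeMultipartStore()
upload_id = store.initiate("videos/demo.bin")
etags = []
for number, offset, length in reversed(plan):  # upload order does not matter
    etag = store.upload_part(upload_id, number, data[offset:offset + length])
    etags.append((number, etag))
store.complete("videos/demo.bin", upload_id, etags)
```

Uploading the parts in reverse order here is deliberate: each part is addressed by its number, so parts can be sent in parallel or retried individually without affecting the others.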
Multipart Upload Benefits
- Parallel uploads — saturate bandwidth by uploading 4-8 parts simultaneously
- Resumable — if part 7 of 20 fails, retry only part 7
- Required for large files — S3 limits single PUT to 5GB; multipart supports up to 5TB
- Progress tracking — report completion percentage based on parts uploaded
Incomplete Upload Cleanup
If an upload is abandoned, parts consume storage. Set a lifecycle rule to auto-delete incomplete multipart uploads after 7 days.
Storage Tiers and Lifecycle
Not all data is accessed equally. Storage tiers let you optimize cost based on access frequency:
| Tier | Access | Cost (storage, per GB-month) | Retrieval | Use Case |
|---|---|---|---|---|
| Standard | Frequent | $0.023/GB | Free | Active user files |
| Infrequent Access | Monthly | $0.0125/GB | $0.01/GB | Old reports, logs |
| Glacier Instant | Quarterly | $0.004/GB | $0.03/GB | Compliance archives |
| Glacier Deep | Yearly | $0.00099/GB | $0.02/GB, 12-48hr | Legal retention |
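A short worked example shows how the storage and retrieval columns trade off. The prices are copied from the table and assumed to be per GB per month for storage and per GB for retrieval; the workload (10,000 GB stored, 100 GB read back each month) is hypothetical, and request fees and minimum storage durations are ignored.

```python
# Storage prices in USD per GB-month, copied from the table above.
STORAGE_PRICE = {
    "Standard": 0.023,
    "Infrequent Access": 0.0125,
    "Glacier Instant": 0.004,
    "Glacier Deep": 0.00099,
}
# Retrieval prices in USD per GB, copied from the table above.
RETRIEVAL_PRICE = {
    "Standard": 0.0,
    "Infrequent Access": 0.01,
    "Glacier Instant": 0.03,
    "Glacier Deep": 0.02,
}


def monthly_cost(tier, stored_gb, retrieved_gb):
    return stored_gb * STORAGE_PRICE[tier] + retrieved_gb * RETRIEVAL_PRICE[tier]


# Hypothetical workload: 10,000 GB stored, 100 GB read back per month.
costs = {tier: round(monthly_cost(tier, 10_000, 100), 2) for tier in STORAGE_PRICE}
```

For this workload the monthly totals come to $230.00 (Standard), $126.00 (Infrequent Access), $43.00 (Glacier Instant), and $11.90 (Glacier Deep). The gap narrows as the retrieved volume grows, which is why access frequency, not just data age, should drive tier choice.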
Lifecycle Policies
Automate transitions based on object age:
- After 30 days → move to Infrequent Access
- After 90 days → move to Glacier Instant Retrieval
- After 365 days → move to Glacier Deep Archive
- After 7 years → delete (compliance retention met)
This runs automatically — no application code needed.
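Expressed as configuration, the schedule above could look like the following sketch. The dictionary follows the shape of an S3 lifecycle configuration as accepted by boto3's `put_bucket_lifecycle_configuration`; the rule ID and the choice to apply it to every object are assumptions. It also includes the 7-day cleanup of incomplete multipart uploads mentioned earlier, since that lives in the same configuration.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-down-then-expire",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: apply to every object
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 7 * 365},  # delete after roughly 7 years
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

# With boto3 installed and credentials configured, this would be applied as:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration
# )

rule = lifecycle_configuration["Rules"][0]
transition_days = [t["Days"] for t in rule["Transitions"]]
```

Keeping the rule in version control alongside infrastructure code makes retention decisions reviewable, instead of living only in a console setting.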
CDN Integration
For read-heavy workloads (profile images, course videos, static assets), serve objects through a CDN:
- Origin: your S3 bucket
- Edge locations: cache popular objects close to users (50-300ms → 5-20ms)
- Cache invalidation: use versioned keys (`avatar-v3.png`) instead of purging
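One common way to produce versioned keys without tracking a counter is to derive the version from the content itself. This is a variation on the `avatar-v3.png` style, not the only approach; the function name and the 12-character hash length are arbitrary choices for this sketch.

```python
import hashlib
import posixpath


def versioned_key(key, data):
    """Insert a short content hash before the extension: a/b.png -> a/b.<hash>.png"""
    digest = hashlib.sha256(data).hexdigest()[:12]
    stem, ext = posixpath.splitext(key)
    return f"{stem}.{digest}{ext}"


old = versioned_key("users/123/avatar.png", b"old image bytes")
new = versioned_key("users/123/avatar.png", b"new image bytes")
same = versioned_key("users/123/avatar.png", b"old image bytes")
```

Changed content yields a new key, so the CDN treats it as a new object and the old cached copy simply stops being requested. Unchanged content maps to the same key, so re-uploads do not cause cache misses.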
Private Content via CDN
For authenticated content (paid videos, user documents):
- Signed URLs: your server issues short-lived URLs per request, and the CDN validates the signature at the edge
- Signed cookies: grant access to multiple objects (e.g., all videos in a course)
- Origin Access Control: block direct S3 access, force all reads through CDN
This gives you both security and performance — users can't bypass the CDN to hit S3 directly.
Key Takeaways
- Use object storage for unstructured data — it's cheap, scalable, and accessed via HTTP, making it ideal for user uploads, media, and backups.
- Never proxy file uploads through your servers — use pre-signed URLs to let clients upload/download directly to S3, saving bandwidth and compute.
- Use multipart uploads for large files — they enable parallel transfer, resumability, and are required for files over 5GB.
- Implement lifecycle policies from day one — automatically transition cold data to cheaper tiers; this compounds into significant savings.
- Pair object storage with a CDN — for any read-heavy access pattern, edge caching reduces latency and offloads origin traffic.
- Signed URLs are your access control layer — short expiry, scoped keys, and method restrictions replace complex proxy authentication.