12 - Object Storage and Uploads

Modern applications handle massive amounts of unstructured data — images, videos, documents, backups. Object storage is the backbone of how we store and serve these files at scale. Understanding the upload patterns and storage strategies is essential for designing systems that handle file operations efficiently.

Object Storage vs Block Storage vs File Storage

There are three fundamental storage paradigms in cloud infrastructure:

Object Storage (S3, GCS, Azure Blob):

  • Flat namespace — no directories, just keys (e.g., users/123/avatar.png)
  • Each object = data + metadata + unique key
  • Immutable — you replace objects, not modify them in place
  • Accessed via HTTP API (PUT, GET, DELETE)
  • Virtually unlimited scale — exabytes of data
  • Cheapest per-GB for large volumes of unstructured data
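
To make the flat namespace concrete, here is a minimal Python sketch of how a folder-like view can be derived from plain keys by grouping on a delimiter, which is roughly what a list call with a prefix and delimiter does. The function name list_prefix and the sample keys are illustrative assumptions, not part of any SDK.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Emulate a prefix + delimiter listing over a flat set of keys.

    Returns (objects, common_prefixes): keys directly under the prefix, and
    the folder-like prefixes one level below it.
    """
    objects, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter acts like a subfolder.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)


# Hypothetical keys; there are no real directories, only strings.
KEYS = {
    "users/123/avatar.png",
    "users/123/docs/resume.pdf",
    "users/456/avatar.png",
    "logs/2024-01-01.gz",
}
```

Calling list_prefix(KEYS, "users/123/") returns the avatar as an object and users/123/docs/ as a common prefix, even though no directory was ever created.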

Block Storage (EBS, Persistent Disks):

  • Raw storage volumes attached to VMs
  • Low latency (sub-millisecond), high IOPS
  • Fixed size — must provision capacity upfront
  • Supports filesystems (ext4, xfs) — your OS reads/writes blocks
  • Use case: databases, OS disks, anything needing fast random I/O

File Storage (EFS, NFS, FSx):

  • Shared filesystem accessible by multiple machines simultaneously
  • POSIX semantics — directories, permissions, file locks
  • Higher latency than block, lower than object
  • Use case: shared config, legacy apps expecting a filesystem, CMS media

What are POSIX semantics? POSIX (Portable Operating System Interface) defines how a filesystem must behave for Unix-like systems. Key guarantees:

  • Strong consistency — a read after a write always returns the latest data
  • Atomic renames — rename() completes fully or not at all; no partial state visible
  • Hard links — multiple directory entries can reference the same underlying file (inode)
  • Unix permissions — owner/group/other rwx bits and ACLs
  • File locking — flock() / fcntl() for coordinating concurrent access
  • Sequential ordering — operations from a single process are observed in order by all others

Object storage (S3) does not provide POSIX semantics — there are no directories, no renames, no locks, and no append operations. This is why applications expecting a traditional filesystem cannot use S3 as a drop-in replacement without an adapter layer.
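
As an illustration of the missing rename, the sketch below emulates a move as copy followed by delete. ToyBucket is a hypothetical in-memory stand-in for a bucket, not a real client; the point is only that the two steps are separate operations.

```python
class ToyBucket:
    """In-memory stand-in for a bucket: a dict from key to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Whole-object replace; there is no partial edit or append.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        self._objects.pop(key, None)

    def keys(self):
        return sorted(self._objects)


def move(bucket, src, dst):
    """Emulate rename as copy + delete.

    Between the two steps both keys exist, so unlike POSIX rename() the
    operation is not atomic from another client's point of view.
    """
    bucket.put(dst, bucket.get(src))  # step 1: copy
    bucket.delete(src)                # step 2: delete the original
```

A crash between the two steps leaves both copies behind, which is one reason filesystem adapters over object storage need extra bookkeeping.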

Block Storage vs File Storage

Block is a raw disk for one machine. File is a shared filesystem for many machines.

| Aspect | Block | File |
| --- | --- | --- |
| Access | Single VM | Multiple VMs |
| Interface | Mounted volume + filesystem you manage | Mounted filesystem (ready to use) |
| Latency | <1ms | ~5-10ms |
| Scaling | Manual resize | Automatic |
| Sharing | No | Yes |
| POSIX | Only if you format it with a POSIX FS | Built-in |
| Cost | ~$0.10/GB-month | ~$0.30/GB-month |

Analogy: Block storage is like buying a raw hard drive — you format it and plug it into one computer. File storage is like a NAS on your network — everyone can access the same files.

Storage Type Selection

| Criteria | Object | Block | File |
| --- | --- | --- | --- |
| Access pattern | HTTP API | Mounted volume | Mounted filesystem |
| Latency | ~50-100ms | <1ms | ~5-10ms |
| Scalability | Unlimited | TB-scale | PB-scale |
| Shared access | Yes (HTTP) | Single VM | Multiple VMs |
| Cost (per GB-month) | $0.023 | $0.10 | $0.30 |
| Best for | Media, backups, logs | Databases, OS | Shared workloads |

Rule of thumb: if your application uploads/downloads files via an API, use object storage.

Pre-Signed URLs

Problem: routing file uploads and downloads through your API servers creates a bandwidth bottleneck. A 500MB video upload ties up a server connection, consumes memory, and wastes compute — your server is just proxying bytes.

Solution: generate a time-limited, cryptographically signed URL that authorizes the client to upload or download directly to/from the object store.

Pre-Signed Upload Flow

  1. Client requests an upload URL from your API (sends filename, content type, size)
  2. Your server validates the request, generates a pre-signed PUT URL for a specific S3 key
  3. Client uploads the file directly to S3 using the signed URL
  4. Client calls your API with the object key, confirming the upload finished (e.g., POST /api/uploads/complete { key, filename, size })
  5. Server calls HeadObject to verify the file exists and matches expected size, then records metadata in the database
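
Step 5 of the flow above can be sketched as follows. This is a minimal illustration under stated assumptions: head_object is a hypothetical callable standing in for a real HeadObject request, and db is a plain dict standing in for the metadata table.

```python
from dataclasses import dataclass


@dataclass
class PendingUpload:
    key: str
    filename: str
    expected_size: int


def complete_upload(pending, head_object, db):
    """Verify the object landed in storage before recording it.

    head_object: callable(key) -> dict with "ContentLength", or None if the
                 object is missing (a stand-in for a real HeadObject call).
    db:          dict used here as a toy metadata table.
    """
    head = head_object(pending.key)
    if head is None:
        raise LookupError(f"object {pending.key!r} was never uploaded")
    if head["ContentLength"] != pending.expected_size:
        raise ValueError(
            f"size mismatch: expected {pending.expected_size}, "
            f"found {head['ContentLength']}"
        )
    db[pending.key] = {"filename": pending.filename, "size": head["ContentLength"]}
    return db[pending.key]
```

Checking storage rather than trusting the client's confirmation keeps the database from pointing at objects that were never written or were truncated.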

Pre-Signed Download Flow

  1. Client requests access to a file
  2. Server checks authorization, generates a pre-signed GET URL (expires in 15 minutes)
  3. Client downloads directly from S3 (or CDN)

Signed URL Security Constraints

  • URL expires after a configurable duration (typically 5-60 minutes)
  • Scoped to a specific object key — cannot access other files
  • Scoped to a specific HTTP method (PUT or GET, not both)
  • Can pin the content type, and an exact content length, by including those headers in the signature
  • Revocation: an individual signed URL cannot be revoked on its own (only by invalidating the credentials that signed it), so short expiry is what limits exposure
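
The constraints above follow from what goes into the signature. The sketch below is a deliberately simplified HMAC scheme, not S3's actual Signature Version 4: it signs the method, path, and expiry so that changing any of them invalidates the URL. SECRET, sign_url, and verify_url are assumptions made up for this illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, quote, urlencode, urlsplit

SECRET = b"demo-secret"  # assumption: a server-side key that never reaches clients


def _signature(method, path, expires):
    # Method, path, and expiry are all part of the signed message,
    # so changing any of them invalidates the signature.
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(method, key, ttl_seconds, now=None):
    """Return a URL path that authorizes one method on one key until expiry."""
    now = int(time.time()) if now is None else now
    expires = now + ttl_seconds
    path = "/" + quote(key)
    query = urlencode({"expires": expires, "sig": _signature(method, path, expires)})
    return f"{path}?{query}"


def verify_url(method, url, now=None):
    """Check expiry, then recompute the signature for this method and path."""
    now = int(time.time()) if now is None else now
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False  # expired
    expected = _signature(method, parts.path, expires)
    return hmac.compare_digest(sig, expected)
```

A URL signed for PUT fails verification when replayed as GET, against a different key, or after the expiry time. Nothing is stored server-side, which is also why a single URL cannot be revoked individually.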

Pre-Signed POST Policies

Pre-signed URLs are scoped to a single object key and HTTP method. POST policies are more flexible — you define a set of conditions, and the client can upload any object matching those conditions without requesting a new URL each time.

The server generates a policy document (base64-encoded JSON) and a signature computed from that policy. The client includes the policy, signature, and the file in a multipart form POST directly to the S3 bucket endpoint.

POST Policy Conditions

  • Key prefix — e.g., uploads/user-123/ allows multiple files under that prefix
  • Content-length range — e.g., 1 byte to 50MB
  • Content-type — restrict to a prefix such as image/ (a starts-with match) or an exact value such as video/mp4
  • Expiration — policy becomes invalid after a timestamp
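
The conditions above map onto a policy document roughly like the one built below. This is a sketch, assuming the S3 POST policy condition syntax (starts-with, content-length-range) and a SigV4-style signing key chain; the bucket name, prefix, and secret are placeholders. A complete SigV4 POST also needs the x-amz-algorithm, x-amz-credential, and x-amz-date fields added as conditions and form fields, which are omitted here for brevity. In practice an SDK helper would normally generate all of this.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone


def build_post_policy(bucket, key_prefix, max_bytes, content_type_prefix,
                      ttl_minutes, now=None):
    """Return (policy_dict, base64_policy) for a browser POST upload."""
    now = now or datetime.now(timezone.utc)
    expiration = now + timedelta(minutes=ttl_minutes)
    policy = {
        "expiration": expiration.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],          # key prefix
            ["content-length-range", 1, max_bytes],       # min and max size in bytes
            ["starts-with", "$Content-Type", content_type_prefix],
        ],
    }
    encoded = base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")
    return policy, encoded


def sign_policy(encoded_policy, secret_key, date_stamp, region):
    """Hex HMAC-SHA256 of the base64 policy under a SigV4-style derived key."""
    def _hmac(key, msg):
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    key = _hmac(key, region)
    key = _hmac(key, "s3")
    key = _hmac(key, "aws4_request")
    return hmac.new(key, encoded_policy.encode("utf-8"), hashlib.sha256).hexdigest()
```

The client receives the encoded policy and signature as form fields; it never sees the secret, and any upload outside the prefix, size range, or type prefix is rejected by the storage service.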

POST Policy Use Cases

  • Batch uploads — mobile app uploading 20 photos without 20 round-trips for signed URLs
  • Browser form uploads — HTML <form> POSTing directly to S3
  • User-generated content — let users upload under their own prefix with size/type constraints

Pre-Signed PUT URL vs POST Policy

| Aspect | Pre-Signed PUT URL | POST Policy |
| --- | --- | --- |
| Scope | Exact key + method | Prefix + conditions |
| Multiple uploads | One URL per file | One policy, many files |
| Method | PUT | POST (multipart form) |
| Complexity | Simple | More setup |
| Browser forms | No | Yes |
| Size enforcement | Content-Length header only | Policy-level min/max |

PUT URL vs POST Policy Tradeoffs

POST policies are more powerful but harder to debug: if any single condition does not match, S3 rejects the upload with a 403, and it is not always obvious which condition failed. Pre-signed PUT URLs are simpler when you know the exact key upfront. Use POST policies when clients need to upload multiple files or when you want server-side size/type enforcement without a proxy.

Multipart Uploads

Problem: uploading large files (>100MB) over unreliable networks fails frequently. A single network hiccup means restarting the entire upload from scratch.

Solution: split the file into parts (5MB–5GB each), upload them independently, then instruct the storage service to assemble them into a single object.

Multipart Upload Flow

  1. Initiate — request a multipart upload, receive an upload ID
  2. Upload parts — upload each chunk in parallel, receive an ETag per part
  3. Complete — send the list of parts + ETags, S3 assembles the final object
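
Before step 2 the client has to decide where each part starts and ends. The sketch below plans part boundaries under the limits mentioned above, plus S3's cap of 10,000 parts per upload. The function name plan_parts and the 8 MiB default are assumptions for illustration.

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB   # smallest allowed part, except for the final one
MAX_PARTS = 10_000   # cap on the number of parts per upload


def plan_parts(file_size, part_size=8 * MIB):
    """Return [(part_number, offset, length), ...] covering the whole file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if part_size < MIN_PART:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the file would otherwise need too many parts.
    part_size = max(part_size, math.ceil(file_size / MAX_PARTS))
    parts = []
    offset, number = 0, 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Each tuple gives the part number to send with the upload request and the byte range to read from the file; only the last part may be shorter than the chosen part size.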

Multipart Upload Benefits

  • Parallel uploads — saturate bandwidth by uploading 4-8 parts simultaneously
  • Resumable — if part 7 of 20 fails, retry only part 7
  • Required for large files — S3 limits single PUT to 5GB; multipart supports up to 5TB
  • Progress tracking — report completion percentage based on parts uploaded
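
The parallel and resumable properties can be sketched with a thread pool that retries each failed part on its own. Here upload_part is a hypothetical callable standing in for the real network request, and treating ConnectionError as the transient failure is an assumption of this example.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(parts, upload_part, max_workers=4, max_attempts=3):
    """Upload parts in parallel, retrying each failed part independently.

    parts:       iterable of part numbers.
    upload_part: callable(part_number) -> etag; a stand-in for the real
                 network call, which may raise ConnectionError.
    Returns {part_number: etag}, the list needed for the final complete call.
    """
    def with_retry(number):
        for attempt in range(1, max_attempts + 1):
            try:
                return number, upload_part(number)
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up on this part after the last attempt

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(with_retry, parts))
```

A failure in one part costs one extra request for that part only. Progress can be reported as the number of collected ETags divided by the total part count.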

Incomplete Upload Cleanup

If an upload is abandoned, parts consume storage. Set a lifecycle rule to auto-delete incomplete multipart uploads after 7 days.
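
A rule of that kind might look like the following, written as a Python dict in the shape of an S3 lifecycle configuration. The rule ID is an arbitrary label chosen for this example, and the empty prefix is an assumption that the rule should cover the whole bucket.

```python
import json

ABORT_INCOMPLETE_UPLOADS = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",  # arbitrary label
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: apply to every key
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

# Serialized form, e.g. for review or for passing to a deployment tool.
LIFECYCLE_JSON = json.dumps(ABORT_INCOMPLETE_UPLOADS, indent=2)
```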

Storage Tiers and Lifecycle

Not all data is accessed equally. Storage tiers let you optimize cost based on access frequency:

| Tier | Access | Storage cost (per GB-month) | Retrieval | Use case |
| --- | --- | --- | --- | --- |
| Standard | Frequent | $0.023 | Free | Active user files |
| Infrequent Access | Monthly | $0.0125 | $0.01/GB | Old reports, logs |
| Glacier Instant | Quarterly | $0.004 | $0.03/GB | Compliance archives |
| Glacier Deep | Yearly | $0.00099 | $0.02/GB, 12-48hr | Legal retention |
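
To see what these prices mean in practice, the sketch below computes a monthly bill from the figures in the table above. It is a rough estimate only: request fees, minimum storage durations, and data transfer are ignored, and the function name monthly_cost is an assumption.

```python
# USD per GB-month of storage, copied from the table above.
STORAGE_PRICE = {
    "Standard": 0.023,
    "Infrequent Access": 0.0125,
    "Glacier Instant": 0.004,
    "Glacier Deep": 0.00099,
}

# USD per GB retrieved, copied from the table above.
RETRIEVAL_PRICE = {
    "Standard": 0.0,
    "Infrequent Access": 0.01,
    "Glacier Instant": 0.03,
    "Glacier Deep": 0.02,
}


def monthly_cost(tier, stored_gb, retrieved_gb=0):
    """Storage plus retrieval cost for one month, in USD, rounded to cents."""
    cost = stored_gb * STORAGE_PRICE[tier] + retrieved_gb * RETRIEVAL_PRICE[tier]
    return round(cost, 2)
```

For 10,000 GB that is never read, Standard comes to about $230 per month and Glacier Deep to about $9.90. The gap narrows quickly once data is retrieved regularly, which is why access frequency drives the choice of tier.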

Lifecycle Policies

Automate transitions based on object age:

  • After 30 days → move to Infrequent Access
  • After 90 days → move to Glacier Instant Retrieval
  • After 365 days → move to Glacier Deep Archive
  • After 7 years → delete (compliance retention met)

This runs automatically — no application code needed.
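
The schedule above could be expressed as a single lifecycle rule along these lines. The dict follows the shape of an S3 lifecycle configuration with its storage class names; the rule ID and the helper storage_class_at are assumptions added to show how the thresholds apply to an object of a given age.

```python
LIFECYCLE = {
    "Rules": [
        {
            "ID": "tier-down-then-expire",  # arbitrary label
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 7 * 365},
        }
    ]
}


def storage_class_at(age_days, rule=LIFECYCLE["Rules"][0]):
    """Return the class an object of this age would be in, or None if expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current
```

The helper is only a local model for reasoning about the rule; the storage service applies the real transitions on its own schedule.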

CDN Integration

For read-heavy workloads (profile images, course videos, static assets), serve objects through a CDN:

  • Origin: your S3 bucket
  • Edge locations: cache popular objects close to users, cutting latency from roughly 50-300ms to 5-20ms
  • Cache invalidation: use versioned keys (avatar-v3.png) instead of purging
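
One way to produce versioned keys without tracking a counter is to embed a short content hash in the key. This is a variation on the avatar-v3.png style above, not the same scheme, and the function name versioned_key is an assumption.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_key(key, data, digest_chars=8):
    """Insert a short content hash before the extension.

    Example shape: users/123/avatar.png -> users/123/avatar.<hash>.png
    """
    digest = hashlib.sha256(data).hexdigest()[:digest_chars]
    path = PurePosixPath(key)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

New content yields a new key, so the CDN fetches it as a fresh object while old URLs keep serving the old bytes until they age out. Identical content maps to the same key, so re-uploads do not fragment the cache.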

Private Content via CDN

For authenticated content (paid videos, user documents):

  • Signed URLs: your application issues short-lived URLs per request, and the CDN validates them at the edge
  • Signed cookies: grant access to multiple objects (e.g., all videos in a course)
  • Origin Access Control: block direct S3 access, force all reads through CDN

This gives you both security and performance — users can't bypass the CDN to hit S3 directly.

Key Takeaways

  • Use object storage for unstructured data — it's cheap, scalable, and accessed via HTTP, making it ideal for user uploads, media, and backups.
  • Never proxy file uploads through your servers — use pre-signed URLs to let clients upload/download directly to S3, saving bandwidth and compute.
  • Use multipart uploads for large files — they enable parallel transfer, resumability, and are required for files over 5GB.
  • Implement lifecycle policies from day one — automatically transition cold data to cheaper tiers; this compounds into significant savings.
  • Pair object storage with a CDN — for any read-heavy access pattern, edge caching reduces latency and offloads origin traffic.
  • Signed URLs are your access control layer — short expiry, scoped keys, and method restrictions replace complex proxy authentication.

© 2026 ByteLearn.dev. Free courses for developers.