Sharding is not a performance optimization. It's failure containment once a single node is physically exhausted — and every sharding problem reduces to shard-key regret.
You only earn the right to shard after vertical scaling, caching, and read replicas have been honestly considered. After that, the conversation is no longer about throughput — it's about which coordination tax you're willing to pay forever.
Sharding
"Sharding optimizes writes and destroys reads unless you're careful."
What sharding costs you (forever)
The numbers below are what an L6 calibration looks like — quoted to fork architectural decisions, not to fill a slide.
Why shard?
Sharding is the answer when the problem is physical exhaustion, not slowness. Slowness has cheaper answers — indexing, caching, read replicas, a bigger box. Exhaustion means working set, write IOPS, or storage have exceeded what any single node sells. What you get: linear scale for writes and storage, smaller per-node working sets, fault isolation. What you pay is the whole next section.
When to shard — the constraint tree
Don't compare features. Compare which resource is exhausted. Sharding fixes exactly one bottleneck at a time, and the shape of the shard key has to match the resource being relieved.
flowchart LR
Start["Is a single node insufficient?"] --> Bottleneck{"Which resource is exhausted?"}
Bottleneck -- Memory --> RAM["Working set > node RAM?"]
Bottleneck -- Compute --> CPU["Query load saturating cores?"]
Bottleneck -- Storage --> Disk["Write volume > disk BW?"]
RAM --> RAMFirst["Try first: bigger instance, hot/warm tiering, Redis/Memcached, column pruning"]
CPU --> CPUFirst["Try first: read replicas, query optimization, connection pooling"]
Disk --> DiskFirst["Try first: separate WAL disk, NVMe upgrade, batch writes"]
RAMFirst -- Still stuck --> RAMShard["Shard by access pattern; hot data on fewer shards"]
CPUFirst -- Still stuck --> CPUShard["Shard to distribute query load across cores"]
DiskFirst -- Still stuck --> DiskShard["Shard to distribute write I/O across disks"]
The sharding tax — what you sign for, in writing
- Cross-shard queries. Scatter-gather adds
(N−1) × network_RTTlatency minimum, and tail-latency dominates the result. - Distributed transactions. 2PC or eventual consistency. Pick your poison; both are expensive.
- Rebalancing operations. Moving data between shards is never zero-downtime. The shard that needs splitting is already on fire.
- Schema migrations.
ALTER TABLEnow runs on N shards, often sequentially. - Operational complexity. N databases to monitor, backup, failover. Forever.
- JOIN limitations. Cross-shard JOINs are either impossible or prohibitively expensive.
Capacity planning — how many shards?
Don't guess. Take the per-shard ceiling, your target utilization, and your largest dimension. Take the max, round up to a power-of-two-friendly logical shard count, map to fewer physical nodes.
Worked example — multi-tenant SaaS, 5-year horizon
Expecting 50K writes/sec sustained, 30TB storage in 5 years. Per-shard ceiling: ~10K writes/sec, ~10TB on a comfortable MySQL instance. Target utilization 60% to absorb traffic spikes and leave headroom for splits.
writes shards= 50,000 / (10,000 × 0.6) ≈ 9
storage shards= 30 TB / (10 TB × 0.6) = 5
physical N= max(9, 5) = 9 → round up → 16 nodes (room to grow)
logical N= 1024 (64 per physical) → resharding is remapping, not migration
9 physical shards now, room to scale to 16 without touching data; logical layer absorbs all growth past that.
Sharding architectures
Standard sharding
Single key like user_id or tenant_id. Simple. Breaks the moment one tenant grows beyond a node — the "whale tenant" problem. Now you must shard within that tenant or move them to dedicated hardware.
Hierarchical (Notion / Slack pattern)
Region → cell (pod with app+DB+cache) → shard inside the cell by tenant_id. Each cell is a self-contained failure domain; a bad deploy kills one cell, not the planet. Compliance bonus: data residency falls out of the region split.
flowchart LR
User([User]) --> Geo{Geo route}
Geo -->|Americas| USE[US-East]
Geo -->|Europe| EU[EU-Central]
Geo -->|Asia| AP[AP-South]
USE --> CellA[Cell A]
USE --> CellB[Cell B]
CellA --> Shard{Shard by tenant_id}
Shard --> Data[(Tenant shards)]
Sharding for multi-tenancy
Beyond scale, sharding by tenant gives you four bonuses you'd otherwise build by hand: resource isolation (one tenant's expensive ops don't slow others), permission isolation (access lives at the shard boundary), per-tenant backup/restore (recover one customer without touching others), and regulatory compliance for free (GDPR, data residency, right to deletion all become single-shard operations).
The catches: every tenant must fit on a single node, small tenants make the per-shard overhead embarrassing, and growing tenants will eventually need to migrate between shards.
Shard key selection
Properties of a good shard key
| Criterion | Good sign | Red flag |
|---|---|---|
| Cardinality | Millions of unique values | < 1000 unique values |
| Frequency | Uniform distribution | 80/20 Zipfian skew |
| Monotonicity | Random / UUID-based | Auto-increment, timestamp |
| Query alignment | 90%+ queries include shard key | Frequent cross-shard joins |
| Stability | Rarely changes | User can modify it |
Bad shard keys in the wild: country, status, created_at, anything with celebrity behavior. Good shard keys for SaaS: (tenant_id, user_id) compound — single-shard queries within a tenant, distribution across the population.
10 workloads, 10 keys
Grouped by access-pattern shape. The category is the lesson: most workloads cluster into one of four shapes, and knowing which shape yours resembles is half the answer to picking a key.
The key is the entity. Every request names a tenant, account, user, or author; sharding by that ID makes the request shard-local. The dominant shape — most production workloads land here.
Multi-tenant SaaS (Slack, Notion)
(tenant_id, user_id)
Tenant-scoped queries hit a single shard. Per-tenant backup, GDPR deletion, and cross-tenant isolation fall out for free.
Gotcha: whale tenants (100K+ users) need sub-sharding within the tenant — Slack's #general story. Have the plan before it arrives.
Social feed posts (Twitter, Instagram)
hash on author_id
90% of reads are "show this author's recent posts" — single shard. Fanout for timelines happens above the storage layer.
Gotcha: celebrity authors melt one shard. Cache hot accounts; salt the truly viral ones; accept this is a permanent ops chore.
Banking / payments (Stripe, Square)
account_id
Every transaction touches one account → single-shard ACID for free. Consistency is the product, not the implementation.
Gotcha: cross-account transfers need 2PC or saga. Architect so this stays <5% of traffic — otherwise the design is wrong.
E-commerce orders (Amazon, Shopify)
user_id (not order_id)
90% of order reads are "show this customer's history." Shard by the query pattern, not by the primary key. Order ID is unique, not load-bearing for routing.
Gotcha: rare admin lookups by order_id need a global secondary index or scatter-gather. Acceptable cost.
Single entity ID would grow unbounded — append the time bucket (hour, day, week) to cap any one partition's size. The bucket size is the knob: too small fans out reads, too big rebuilds the hot-partition problem.
Group chat messages (Discord, WhatsApp)
(chat_id, time_bucket)
Messages in one chat are read together; co-locate them. The time bucket caps any single partition's size as the chat grows.
Gotcha: bucket size is the knob — too small fans out reads, too big rebuilds Discord's multi-GB-partition problem.
Time-series metrics (Datadog, Prometheus)
range on (metric_id, timestamp)
Reads are "last N minutes/hours" — range queries dominate. Retention is a range drop. Hash would scatter every query.
Gotcha: "today's" range is always hot — pre-split it; rotate cold ranges to cheaper tier.
IoT telemetry (sensor fleets)
(device_id, time_bucket)
Each device's time series is read independently for dashboards and anomaly detection. Bucketing bounds growth per device.
Gotcha: cardinality grows with the fleet — over-provision logical shards from day one; new device rollouts can otherwise force resharding.
Locality is the access pattern. Co-locate by geographic cell so "find nearby X" stays cheap. Distinct from the others because the shard key encodes physical proximity, not identity.
Ride-sharing dispatch (Uber, Lyft)
geo-cell (H3 / S2, resolution ~8–9)
"Find nearby drivers" is a spatial-locality query — co-locating drivers and riders by cell keeps the match O(cell), not O(fleet).
Gotcha: dense cells (airport, stadium) are hot — H3 lets you recurse into child cells; without that, you're back to celebrity-key territory.
Two completely different reasons to hash and stop. URL shortener: each request is a trivial single-key lookup with no skew. Search: every read fans out across all shards by design — scatter-gather is the access pattern, not a tax.
URL shortener (bit.ly, tinyurl)
hash on short_code
Every request is a single-key lookup. No ranges, no joins, no celebrity skew once codes are randomly assigned.
Gotcha: almost none — this is the "boring and correct" case. Use hash slots and move on to the cache layer.
Search index (Elasticsearch, Solr)
hash on doc_id
Writes spread uniformly; reads are scatter-gather by design — every shard contributes to ranking, coordinator merges top-K. Different mental model: scatter is the feature, not a tax.
Gotcha: use the routing parameter to co-locate related docs (per-tenant search) and skip the scatter where you can.
ID generation in a sharded world
The moment you shard, AUTO_INCREMENT dies. Each shard issues its own monotonic counter; collisions and ordering become your problem. The new question: where does an ID get assigned, and what does it encode?
128 bits, fully random.
Pros: zero coordination, collisions vanishingly rare.
Cons: no sort order (kills range scans on ID), kills B-tree write locality (random inserts thrash the buffer pool), 16 bytes per row × 1B rows = 16 GB just on the PK.
Time-ordered, lexicographically sortable, still random in the suffix.
Pros: sort by ID ≈ sort by time (huge for log-like data); good B-tree locality on inserts.
Cons: still 128 bits; the time-prefix leaks creation time to anyone with the ID.
Reference: ULID spec.
Twitter's pattern: timestamp(41) + machine(10) + sequence(12).
Pros: 64 bits fits a BIGINT, sortable by time, encodes the issuing node.
Cons: needs unique machine IDs (clock-skew protocol or central allocator), sequence overflow at 4096/ms per node.
A central service hands each shard a range of IDs (e.g., 10,000 at a time).
Pros: dense IDs, smallest possible (INT or BIGINT), works with any DB.
Cons: allocator is a SPOF; per-shard caches mean IDs aren't globally sortable; reclaiming unused ranges is an ops chore.
Sharding algorithms
Five mainstream choices. Pick by access pattern, not familiarity.
Ordered keyspace split into contiguous ranges, each owned by one shard. Route via binary search on range boundaries — O(log S).
Pros: efficient range scans (touch only overlapping ranges); natural ordering preserved.
Cons: monotonic keys (timestamps, auto-increment) collapse into one shard. Splitting the hot shard is painful because the shard that needs splitting is already on fire.
Used by: CockroachDB, TiDB, HBase, Spanner.
Picks itself when: high-cardinality non-monotonic keys, range queries dominate, and you have headroom for the split mechanic.
shard_id = hash(key) % N. Uniform if the hash is good; no single-range hotspots.
Fatal flaw: resize changes N ⇒ full re-hash of every key. Cluster grows by one node, ~(N−1)/N of the data moves. Range queries die.
Use only when: dataset is immutable and the cluster size is fixed. In practice, never pick this in 2026. It's pedagogy.
128-bit hash space sliced into contiguous ranges; assign hash ranges to shards. Like range sharding, but applied to hashes rather than raw keys.
Pros: scale-out = split a hot hash range and move only that slice (localized rebalance). Uniform load without virtual nodes. This is the most common modern production choice.
Cons: original key ordering is lost for scans; requires a metadata service for range → shard mapping.
Used by: the Dynamo / Cassandra family. (Dynamo paper · SOSP 2007.)
Fixed number of logical slots (Redis uses 16,384) decoupled from node count.
Key → CRC16(key) % 16384 → Slot → Node
Slots 0–5460 → Node A
Slots 5461–10922 → Node B
Slots 10923–16383 → Node C
Pros: cheap rebalancing (move slots, not rehash keys); tiny metadata footprint (~64KB for 16K slots); operationally simple.
Cons: still breaks range scans; slot migration consumes bandwidth.
L6 insight: pick a slot count divisible by many factors (2520, 5040, 16384) so the dataset can be split cleanly across various node counts.
Reference: Redis Cluster scaling docs.
Map nodes and keys to a hash ring (0 to 2³²−1). Each key belongs to the next clockwise node.
Pros: adding/removing a node moves only ~1/N of keys. Minimal disruption.
Cons: memory-heavy — you need 100–1000 virtual nodes per physical node for even distribution. Lookup is O(log V) where V = total virtual nodes. Harder to reason about under skew.
Used by: DynamoDB, Cassandra, Riak. Modern systems increasingly prefer hash ranges or slots over classic consistent hashing.
Request routing
Once a key has a shard, something has to find that shard. There are really only two routing architectures, with one hybrid.
Smart client · 0 hops
Client library holds the shard map. Hashes locally, connects directly. Pros: lowest latency, no central bottleneck. Cons: infra changes require client updates; 10K clients refreshing metadata simultaneously can DDoS the coordination service.
Proxy / router · +1 hop
Client → proxy (Mongos, Envoy, VTGate). Proxy holds the map. Pros: decoupled — app doesn't know topology; pooling managed centrally. Cons: extra hop; the proxy becomes a SPOF or bottleneck.
Random + forward · hidden hop
Client hits any node; node forwards if it doesn't own the key. Pros: simplest, no client awareness needed. Cons: hidden hop amplifies tail latency. Used by Redis Cluster.
Metadata management
Two architectures, two failure profiles. Pick the one whose failure mode you can afford.
One source of truth · O(1) propagation
Strong consistency. Updates broadcast immediately. Risk: ZK outage = cluster-wide routing failure. Mitigate by caching the map locally with TTL and serving stale during refresh; never put ZK on the read hot path.
No single source · O(log N) convergence
Nodes share state randomly. Eventually consistent. Used by Cassandra, Riak. Risk: convergence time — Node A thinks key is on X, Node B on Y. Reconciled by version vectors and read repair. No single point of failure; no global authoritative answer either.
How real systems route
| System | Routing layer | Mechanism |
|---|---|---|
| Spanner | Client library → Span server | Metadata lookup via cached directory service. Splits transparent. |
| CockroachDB | Any node accepts any request | Range metadata via gossip + range cache. Forwards if wrong; cost = 1 extra hop worst case. |
| Vitess | VTGate proxy | VTGate reads vschema to map vindex → shard → MySQL. |
| Cassandra | Coordinator node | Coordinator hashes partition key, consults token ring. Token-aware client avoids the extra hop. |
| DynamoDB | AWS-managed routers | Request routers consult partition metadata. Fully invisible to caller. |
Sharding compute, not data — Slicer
When candidates say "sharding," they almost always mean data sharding — partitioning rows across DB instances. But there's a parallel problem: partitioning a fleet of stateful workers across requests. Examples that come up constantly:
- Which chat server holds a user's WebSocket?
- Which game-server process is hosting a match?
- Which Flink task owns a keyed window's in-memory state?
- Which session-cache node has a logged-in user's session?
The "shard key" is the user / game / session ID; the "shard" is a worker process; the routing problem is "find the worker that currently owns this key." Whether the data store underneath is itself sharded is a separate question.
Slicer is Google's internal service for this exact problem (OSDI 2016 paper, still in production). It uses a two-tier pattern that has since become the canonical answer:
Assigner — central, authoritative
Small, single logical instance per service. Decides "which worker owns which key range" based on load signals. The source of truth. Off the request hot path.
Distributors — many, cached
Regional fleet that serves assignment lookups to clients at low latency. Subscribe to the Assigner; never authoritative themselves. The actual scale-out tier.
Clients hit the nearest Distributor, get the worker assignment, RPC the worker directly. When the Assigner rebalances or fails a worker over, the change propagates through Distributors to clients. Strong leasing guarantees exactly one worker owns each key at any moment.
How real systems reshard
This is where the architectural personality shows. Range-split systems do it continuously and transparently. Hash-ring systems never truly reshard — they rebalance. Proxy-sharded systems require operational ceremonies. Same problem, five different answers.
Model: range-based splits on primary key, fully automatic. Spanner calls them "splits." The system decides where to cut based on load and size.
How it reshards: transparent. Data moves between Paxos groups without downtime; TrueTime gives globally consistent timestamps so even during a split migration, reads see a consistent snapshot. Split happens at the storage layer; the transaction layer doesn't notice.
Reference: Spanner paper (OSDI 2012).
Model: range-based (64MB default range size), inspired by Spanner. Each range is a Raft group (3 or 5 replicas).
How it reshards: automatic split/merge — but without TrueTime. CockroachDB uses HLCs (hybrid logical clocks) with a 500ms clock-skew bound. Transactions inside the uncertainty interval must wait.
Visible difference vs Spanner: Spanner's uncertainty is ~7ms (atomic clocks); CockroachDB's is up to ~500ms. For most workloads that's fine; for global financial ledgers it's a real constraint.
Reference: CockroachDB transaction-layer docs (HLC, uncertainty intervals).
Model: hash- or range-based on a "vindex." Each shard is a full MySQL instance. VTGate proxy routes queries.
How it reshards: a workflow, not a feature. Provision target shards → VReplication streams binlogs from source → catch up → brief write-stop (1–3 seconds) for cutover → drop old shards.
The honest line: Vitess is the answer to "how do you scale MySQL" — it works, but resharding is an operation. You schedule it, monitor it, and hold your breath during cutover. Cross-shard transactions use advisory 2PC; a VTGate crash mid-2PC needs manual resolution.
Reference: vitess.io.
Model: consistent hashing on partition key; each node owns token ranges via virtual nodes (256 vnodes default).
How it reshards: it doesn't, really. Cassandra never splits a shard — it rebalances token ownership. Your partition key choice is permanent. If you chose badly and have hot partitions, your options are (a) add a synthetic suffix (bucketing), or (b) migrate to a new table.
2024 update: ScyllaDB (the Cassandra-compatible reimplementation) migrated from vnodes to tablets — dynamically-split ranges closer to Spanner's model. If you're picking a new system in this family, Scylla-on-tablets gives you elastic resharding that Cassandra-on-vnodes still doesn't.
References: cassandra.apache.org · scylladb.com.
Model: hash-based, fully managed. DynamoDB auto-splits when a partition exceeds 10GB or sustained throughput limits.
How it reshards: invisible — until it isn't.
The leak in the abstraction: throughput is allocated per-partition, not per-table. Provision 10K WCU on a table with 10 partitions → each partition gets ~1K WCU. Skewed traffic? One partition throttles while nine sit idle. The 2018 adaptive-capacity fix redistributes unused capacity to hot partitions — a direct response to years of "hot partition throttling" complaints. Pre-2018, the recommended pattern was random-suffix partition keys, which made reads harder.
Reference: DynamoDB partitions docs.
Model: hash-based (default) or range-based on a chosen shard key. Each shard is a replica set; mongos proxy routes queries.
How it reshards: historically a nightmare — changing the shard key required dump-and-restore. MongoDB 5.0 (2021) added online resharding: change the shard key without downtime. Behind the scenes it clones data to new shards via change streams and atomically swaps the routing map at cutover. Production-stable as of 6.x / 7.x.
Catch: sharded transactions across shard boundaries use 2PC and are noticeably slower; most MongoDB deployments avoid them by schema-aligning operations to the shard key. The same "design transactions to be shard-local" rule applies as everywhere else.
Where it fits: MongoDB is one of the most common production sharded systems and the easiest sell when the team is already on document storage. The 5.0 online-reshard feature is the single biggest reason MongoDB stopped being a "you'll regret this in 18 months" choice for new sharded deployments.
Reference: MongoDB online resharding docs.
Handling skewed workloads & hot spots
Detect — per shard, not cluster-wide
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| CPU skew (max/avg) | < 1.3× | 1.3–2× | > 2× |
| Latency skew (p99 max/min) | < 1.5× | 1.5–3× | > 3× |
| QPS skew (max/avg) | < 1.5× | 1.5–2.5× | > 2.5× |
| Storage growth skew | < 1.5× | 1.5–2× | > 2× |
Mitigate — three layers, in order
- Cache in front of hot keys. 5-second TTL absorbs ~90% of read-hot cases. Cheapest and reversible.
- Split the hot key. Append a random suffix (
bieber_0…bieber_7) to scatter writes across N shards. Reads now fan out and merge. Trade: write hotspot becomes read fan-out and tail-latency exposure. - Dedicated shard. Last resort. Operationally expensive snowflake.
Salt-factor lifecycle
Changing N from 10 to 50 is a migration in disguise. Old data is salted with old N, new data with new N. Reads must fan out to max(old_N, new_N) during transition. For the handful of entities that need re-salting, prefer eager migration with a brief read-from-both window over a permanent lazy dual-path.
Sharding × replication
In production, each shard is also replicated. That means every shard has leader election, replica lag, and failover behavior — and the failure modes multiply combinatorially.
Your design must specify, per shard: routing (shard → replica → leader), read consistency (leader-only, allow stale, bounded staleness), and failover semantics.
Secondary indexes in a sharded world
This is the most-asked follow-up after "tell me about sharding." There are two shapes, and the trade-off between them is exactly the trade-off between write amplification and read scalability.
Each shard indexes only its own data
Write path: one shard. Fast.
Read path: scatter-gather to all shards. Tail-latency amplification — the slowest shard becomes your p99.
Works well when: you want only some results, not all (LIMIT 10).
Index sharded by the indexed value
Write path: primary + remote index update. Write amplification.
Read path: single shard for exact match. Scales reads.
Trade: eventual consistency (index may lag); may need distributed transactions to keep primary and index in sync.
Rebalancing — the mechanics that determine downtime
The interview trap is saying "we move the data to the new node." That's a goal, not a mechanism. L6 reality: moving data while serving writes requires a five-phase state machine. Describe it.
Provision target. Open replication socket for the slot/range. Target enters IMPORTING state.
Iterate over source range, bulk-transfer snapshot to target. Source keeps serving reads and writes.
Source buffered or double-wrote during cold copy. Target replays the delta until lag < 1 second.
Brief lock (≤100ms). Source returns MOVED. Routing table updated atomically. Writes flow to target.
Source deletes migrated data asynchronously. Keep zombie data 7 days for rollback safety.
Real numbers
Moving a 100GB shard at a throttled 50MB/s: ~25 minutes cold copy, 3–5 minutes catch-up, 50–200ms cutover window. The 7-day zombie window is your rollback insurance — if the destination misbehaves post-migration, flip routing back; the source still has the data.
Why rebalancing kills production
- Competes with live traffic for I/O and network.
- Overloads the destination (cold cache, compaction spike).
- Triggers cascading hot spots when the LB sees inflated latency.
- Migration mid-way + node failure = data in limbo.
Always: throttle (cap to e.g. 50MB/s); rebalance during low-traffic windows; monitor bytes_in_flight, not just shard count; rate-limit on destination health; cap concurrent migrations to one or two.
Backups & snapshots across shards
Per-shard async
Each shard backs up on its own schedule. Cheap, parallel, no coordination. Restore is point-per-shard, not point-in-time globally. Fine if your business logic tolerates the drift — most do. Accept this default unless you've proved you can't.
Coordinated snapshot
A coordinator broadcasts a "snapshot at time T" marker; every shard records its state at T. Chandy-Lamport style if there's no global clock; in-flight cross-shard transactions need fencing. Operationally complex; produces an actually-consistent snapshot.
Global-timestamp snapshot
Spanner-style: read at a specific TrueTime. Snapshot is the union of per-shard reads-as-of-T. Zero coordination overhead because the timestamp is the coordination. Requires hardware (TrueTime) or a tight HLC bound (CockroachDB).
The L6 framing: ask what the restore looks like before you design the backup. Per-shard async restore = "rebuild from these N independent snapshots; expect orphaned references across shards; have a reconciliation job ready." Coordinated restore = "restore is atomic but the backup is slower and more fragile." Pick the failure mode you can afford.
Cross-shard operations
Cross-shard reads — the scatter-gather math
Single-shard query: 1 × (RTT + query_time) ≈ 5ms.
16-shard scatter: 16 × query_time + max(RTT) + merge ≈ 51ms.
The honest cost isn't the mean — it's the tail. The p99 of N independent shards is approximately the (1 − 1/(N×100))-th quantile of any one shard's distribution. With N = 256, you're asking each shard to be in its 99.996th percentile or better — which it isn't.
Mitigation — three real moves
Hedged requests
Fire to N+1 shards (or to replicas), use first N responses. Cuts tail at ~6% extra cluster load. Cheapest big win.
Per-shard timeout budgets
Don't let one shard block the whole response. Return partial results with an explicit completeness signal ("255/256 shards").
Push computation to shards
Aggregate at the edge, not the coordinator. Filter, count, top-K all at the shard. Coordinator merges small results, not raw rows.
Cross-shard writes — pick your poison
| Approach | Consistency | Latency | Complexity | When to use |
|---|---|---|---|---|
| 2PC | Strong | High (2+ RTT) | High — coordinator SPOF | Financial, inventory |
| Saga | Eventual (compensations) | Low (async) | Medium — design compensations | Order processing, bookings |
| Avoid | Strong within shard | Lowest | Lowest | 90% of cases, if designed right |
Distributed joins
The problem: SELECT * FROM orders JOIN users when tables are sharded by different keys. Three strategies, ranked by preference:
- Co-locate. Shard both tables by the same key. Join is shard-local.
- Broadcast. If one table is small (<1GB), replicate to all shards.
- Application-level decomposition. Query one, gather IDs, query the other with
WHERE id IN (...), merge in app.
The spectrum across systems goes from "the database handles it" (Spanner with interleaved tables, distributed SQL push-down + shuffle) to "you handle it" (Cassandra and DynamoDB have no joins — denormalize at write time). The more the database does, the more latency floor those operations pay.
Operational realities — battle stories
Five lessons paid for at scale. Each one is a free piece of context you can quote in an interview.
Slack — the channel-shard problem
Sharded MySQL by workspace. Worked until 100K-member workspaces had a single channel (#general) saturating one MySQL instance.
Lesson: Migrated to Vitess and introduced sub-workspace sharding for large workspaces. Took over a year. Any "single tenant" framing has a ceiling; have the sub-shard plan before you hit it.
Instagram — pre-allocate logical shards
Postgres sharded by user ID with a Snowflake-style ID scheme (timestamp + shard ID + sequence). Pre-allocated thousands of logical shards mapped to fewer physical ones.
Lesson: Over-provision logical shards from day one so resharding becomes remapping, not migrating. The cheapest insurance you'll ever buy on a shard map. (Instagram engineering post.)
Discord — hot partitions on popular servers
Messages in Cassandra, partitioned by (channel_id, bucket). Popular servers blew past the ~100MB practical partition limit, hit multi-GB. Tombstone accumulation caused read amplification and GC pauses.
Lesson: Migrated to ScyllaDB with smaller time-bucket windows; now building a custom Rust engine. Cassandra's partition limit is a real constraint, not a suggestion. (Discord engineering blog.)
Uber — manual resharding doesn't scale with the company
Built Schemaless: MySQL-backed sharded KV store. Resharding required a full table copy. At Uber's growth rate, this became a quarterly operational burden.
Lesson: Replaced it with DocStore (automatic shard splitting). If you're growing 10×/year, manual resharding is a full-time job for a team — and the team will quit.
Notion — sharding-as-cultural-change
Single Postgres → sharded by workspace_id. Many cross-workspace features (shared pages) meant every query path had to be audited; cross-workspace lookups became async fetches.
Lesson: ~2 years from decision to full rollout. Sharding is a product-shape change, not just an infra one — every engineer learns a new query discipline.
Failure modes
Five canonical failure modes. Each gets a name, an impact, a fix, and a sentence you can say out loud.
The celebrity key that melted a shard
Impact: positive feedback loop. CPU saturates → timeouts → retries → worse. p99 decouples from the rest of the cluster.
Fix: cache in front (5s TTL absorbs ~90% of read heat) → key-split with random suffix → dedicated shard as last resort.
"Cache absorbs read heat. Key splitting absorbs write heat. But splitting trades a write hotspot for read fan-out — so you need both."
The scatter-gather that wouldn't finish
Impact: 255 shards in 10ms, one in 3s, coordinator blocked. Memory spikes buffering 256 responses.
Fix: per-shard timeout budgets, hedged requests for the tail, push filtering to shards, reject queries missing the shard key at parse time.
"Scatter-gather latency is the slowest shard, not the average. Budgets, hedges, push-down."
The rebalance that took prod down with it
Impact: migration I/O saturates disks → latency spikes → LB thinks shards are unhealthy → triggers more migrations. Cascade.
Fix: throttle migration, consistent-hashing limits movement to ~1/N, atomic routing switch only after sync + checksum, idempotent + checkpointed migrations, cap concurrency.
"The most dangerous moment isn't a node dying — it's a rebalance. Throttle, checkpoint, atomic switch, never route until confirmed."
The routing table that pointed to a ghost
Impact: stale client keeps writing to a decommissioned shard. Real owner is elsewhere. Silent data divergence.
Fix: epoch fencing on every routing update (storage rejects epoch < current), tombstone decommissioned shards (actively reject + redirect), TTL on routing cache so clients re-validate.
"A zombie shard silently accepting writes is worse than a dead shard. Dead shards throw errors. Zombies corrupt data."
The metadata store that vanished
Impact: ZK down. New connections can't discover shards. Existing cache works — until a failover, then the map is wrong. Flying blind.
Fix: cache routing map locally with TTL (buys minutes to hours); degrade to static mode (no failover promotions, accept partial unavailability); never put ZK on the read hot path.
"A 3-node ZK cluster shouldn't be your actual availability ceiling. Cache, degrade, expire."
Interview traps
"Sharding improves performance."
Wrong. Sharding improves scalability by accepting coordination cost. Performance often gets worse due to network hops and scatter-gather. The right framing: sharding lets you handle more work concurrently, not finish any single piece of work faster.
"Consistent hashing solves hotspots."
It solves rebalancing, not hotspots. If user_id=taylorswift gets 10M QPS, that key still hits one node. Celebrity keys need salting or caching, not a different hash function.
"Hash sharding is better than range sharding."
Depends on access patterns. Hash gives uniform distribution but kills range queries. Range preserves locality but creates hotspots on recent data. Time-series → range with rotation. User data → hash on a high-cardinality key.
"More shards = better performance."
More shards = more scatter-gather overhead. There's an optimal count: too few overload, too many drown in coordination. Rule of thumb: each shard should comfortably handle ~10K QPS.
"We can re-shard later."
Re-sharding is one of the hardest distributed-systems operations. Stripe's took 18 months; Notion's took 2 years. Choose the shard key carefully upfront — high cardinality, aligned with access patterns. If unsure, shard by tenant_id.
"Global secondary indexes are always better."
They trade write complexity for read efficiency. GSIs eliminate scatter-gather reads but add write amplification and eventual-consistency lag. For write-heavy workloads with occasional secondary lookups, local indexes with scatter-gather may be simpler.
The final mental model
Storage engines
CPU vs I/O.
Replication
Consistency vs availability.
Sharding
Simplicity vs scale.
Followups they'll ask
How do you do COUNT(*) across all shards?
Two answers, both honest. (1) Scatter-gather with timeout + partial results, and label the result as approximate. (2) Maintain a pre-aggregated count in a metadata store, updated via async log consumers. For monitoring dashboards, approximate is fine; for billing, you need the materialized count. The #1 follow-up to any sharding discussion — have both ready.
How do you handle a schema migration across 100 shards?
Three rules. (1) Shadow-column pattern: add nullable, backfill async, then set NOT NULL — never a blocking ALTER TABLE on a hot shard. (2) Rolling migration in batches of ~5 shards with health checks between batches. (3) For MySQL, use pt-online-schema-change or gh-ost — shadow table + triggers + atomic rename. Parallel saturates the cluster; sequential takes hours. Batched-with-throttle is the sweet spot.
What if your shard key was wrong and you've been running for a year?
Three paths. (1) If the bad pattern is local to a few hot tenants, isolate those tenants on dedicated shards — surgical, not architectural. (2) If it's systemic, introduce a new shard key on a new physical layout and dual-write during migration — Stripe and Notion both did versions of this; budget months, not weeks. (3) If neither is viable, accept the scatter-gather tax and put a query cache in front to hide it. The honest answer in interviews: "resharding is the work, not the decision."
How does sharding interact with caching?
Two layers. Cache-aside per shard: each shard fronts its own Redis or Memcached node — the cache itself is sharded by the same key, so cache locality matches DB locality. Cross-shard query cache: for scatter-gather results, cache at the coordinator with short TTL since invalidation across N shards is hard. Beware: if a shard fails over, the cache may serve from a stale leader's POV — bound by your replication lag.
How do you implement transactions across shards if you have to?
In order of preference: (1) Don't — redesign so the operation is shard-local. (2) Saga with compensations if eventual consistency is acceptable and you can write idempotent reversals. (3) 2PC if you must have linearizability — and accept that the coordinator is a SPOF, the protocol blocks on coordinator failure, and the latency is at least two RTTs. Spanner-style true distributed transactions exist only because TrueTime makes timestamp ordering globally consistent — without specialized hardware, you're choosing between Saga and 2PC.
How do you handle a hot shard you can't split (e.g., a single tenant exceeded one node)?
Three moves, in order. (1) Vertical scale that one shard's node — buys time, doesn't help write IOPS once you hit the ceiling. (2) Sub-shard within the tenant — pick a sub-key (e.g., user_id within the tenant) and add a second-level shard map for that tenant only. Slack did this for large workspaces. (3) Functional decomposition — peel off a subdomain (e.g., search, messaging, analytics) into its own service with its own sharding. Painful but sometimes the only answer.
How do you reason about consistency on a sharded read replica?
Two clocks. (1) Per-shard consistency — each shard has its own replication lag; reading from replicas means you might see different per-shard "as-of" times. For a cross-shard query, the result is shard-skewed, not just stale. (2) Cross-shard consistency requires a global snapshot (Spanner's TrueTime, CockroachDB's HLC) — without one, scatter-gather across replicas can return causally impossible combinations. The interview line: "replicas give you stale reads per shard; sharded replicas give you skewed reads across shards."
What's the relationship between sharding and your monitoring?
All metrics become per-shard or they become useless. Cluster-wide averages hide the hot shard. Mandatory dashboards: per-shard QPS, CPU, latency p99, storage growth — plus the skew ratio for each (max/avg). Track scatter ratio explicitly. Alert on skew before it becomes throughput collapse. The single most important addition to a sharded system is observability on the routing layer itself: cache hit rate, metadata-store latency, epoch-mismatch counts.