Sharding is not a performance optimization. It's failure containment once a single node is physically exhausted — and every sharding problem reduces to shard-key regret.

You only earn the right to shard after vertical scaling, caching, and read replicas have been honestly considered. After that, the conversation is no longer about throughput — it's about which coordination tax you're willing to pay forever.

Sharding

"Sharding optimizes writes and destroys reads unless you're careful."

What sharding costs you (forever)

The numbers below are what an L6 calibration looks like — quoted to fork architectural decisions, not to fill a slide.

RAM on a single AWS u-12tb1 instance

"Can we just buy a bigger box?" almost always answers yes.

Per month for that 12 TB instance

Less than 3 engineers × 6 months on a sharding project.

Months — Stripe's re-shard, end to end

Choose the shard key once.

Why shard?

Sharding is the answer when the problem is physical exhaustion, not slowness. Slowness has cheaper answers — indexing, caching, read replicas, a bigger box. Exhaustion means working set, write IOPS, or storage have exceeded what any single node sells. What you get: linear scale for writes and storage, smaller per-node working sets, fault isolation. What you pay is the whole next section.

When to shard — the constraint tree

Don't compare features. Compare which resource is exhausted. Sharding fixes exactly one bottleneck at a time, and the shape of the shard key has to match the resource being relieved.

flowchart LR
    Start["Is a single node insufficient?"] --> Bottleneck{"Which resource is exhausted?"}
    Bottleneck -- Memory --> RAM["Working set > node RAM?"]
    Bottleneck -- Compute --> CPU["Query load saturating cores?"]
    Bottleneck -- Storage --> Disk["Write volume > disk BW?"]
    RAM --> RAMFirst["Try first: bigger instance, hot/warm tiering, Redis/Memcached, column pruning"]
    CPU --> CPUFirst["Try first: read replicas, query optimization, connection pooling"]
    Disk --> DiskFirst["Try first: separate WAL disk, NVMe upgrade, batch writes"]
    RAMFirst -- Still stuck --> RAMShard["Shard by access pattern; hot data on fewer shards"]
    CPUFirst -- Still stuck --> CPUShard["Shard to distribute query load across cores"]
    DiskFirst -- Still stuck --> DiskShard["Shard to distribute write I/O across disks"]

The sharding tax — what you sign for, in writing

Cross-shard queries. Scatter-gather adds (N−1) × network_RTT latency minimum, and tail-latency dominates the result.
Distributed transactions. 2PC or eventual consistency. Pick your poison; both are expensive.
Rebalancing operations. Moving data between shards is never zero-downtime. The shard that needs splitting is already on fire.
Schema migrations. ALTER TABLE now runs on N shards, often sequentially.
Operational complexity. N databases to monitor, backup, failover. Forever.
JOIN limitations. Cross-shard JOINs are either impossible or prohibitively expensive.

Capacity planning — how many shards?

Don't guess. Take the per-shard ceiling, your target utilization, and your largest dimension. Take the max, round up to a power-of-two-friendly logical shard count, map to fewer physical nodes.

Worked example — multi-tenant SaaS, 5-year horizon

Expecting 50K writes/sec sustained, 30TB storage in 5 years. Per-shard ceiling: ~10K writes/sec, ~10TB on a comfortable MySQL instance. Target utilization 60% to absorb traffic spikes and leave headroom for splits.

writes shards= 50,000 / (10,000 × 0.6) ≈ 9

storage shards= 30 TB / (10 TB × 0.6) = 5

physical N= max(9, 5) = 9 → round up → 16 nodes (room to grow)

logical N= 1024 (64 per physical) → resharding is remapping, not migration

9 physical shards now, room to scale to 16 without touching data; logical layer absorbs all growth past that.

Sharding architectures

Standard sharding

Single key like user_id or tenant_id. Simple. Breaks the moment one tenant grows beyond a node — the "whale tenant" problem. Now you must shard within that tenant or move them to dedicated hardware.

Hierarchical (Notion / Slack pattern)

Region → cell (pod with app+DB+cache) → shard inside the cell by tenant_id. Each cell is a self-contained failure domain; a bad deploy kills one cell, not the planet. Compliance bonus: data residency falls out of the region split.

flowchart LR
    User([User]) --> Geo{Geo route}
    Geo -->|Americas| USE[US-East]
    Geo -->|Europe| EU[EU-Central]
    Geo -->|Asia| AP[AP-South]
    USE --> CellA[Cell A]
    USE --> CellB[Cell B]
    CellA --> Shard{Shard by tenant_id}
    Shard --> Data[(Tenant shards)]

Sharding for multi-tenancy

Beyond scale, sharding by tenant gives you four bonuses you'd otherwise build by hand: resource isolation (one tenant's expensive ops don't slow others), permission isolation (access lives at the shard boundary), per-tenant backup/restore (recover one customer without touching others), and regulatory compliance for free (GDPR, data residency, right to deletion all become single-shard operations).

The catches: every tenant must fit on a single node, small tenants make the per-shard overhead embarrassing, and growing tenants will eventually need to migrate between shards.

Shard key selection

Properties of a good shard key

Criterion	Good sign	Red flag
Cardinality	Millions of unique values	< 1000 unique values
Frequency	Uniform distribution	80/20 Zipfian skew
Monotonicity	Random / UUID-based	Auto-increment, timestamp
Query alignment	90%+ queries include shard key	Frequent cross-shard joins
Stability	Rarely changes	User can modify it

Bad shard keys in the wild: country, status, created_at, anything with celebrity behavior. Good shard keys for SaaS: (tenant_id, user_id) compound — single-shard queries within a tenant, distribution across the population.

10 workloads, 10 keys

Grouped by access-pattern shape. The category is the lesson: most workloads cluster into one of four shapes, and knowing which shape yours resembles is half the answer to picking a key.

The key is the entity. Every request names a tenant, account, user, or author; sharding by that ID makes the request shard-local. The dominant shape — most production workloads land here.

Multi-tenant SaaS (Slack, Notion)

(tenant_id, user_id)

Tenant-scoped queries hit a single shard. Per-tenant backup, GDPR deletion, and cross-tenant isolation fall out for free.

Gotcha: whale tenants (100K+ users) need sub-sharding within the tenant — Slack's #general story. Have the plan before it arrives.

Social feed posts (Twitter, Instagram)

hash on author_id

90% of reads are "show this author's recent posts" — single shard. Fanout for timelines happens above the storage layer.

Gotcha: celebrity authors melt one shard. Cache hot accounts; salt the truly viral ones; accept this is a permanent ops chore.

Banking / payments (Stripe, Square)

account_id

Every transaction touches one account → single-shard ACID for free. Consistency is the product, not the implementation.

Gotcha: cross-account transfers need 2PC or saga. Architect so this stays <5% of traffic — otherwise the design is wrong.

E-commerce orders (Amazon, Shopify)

user_id (not order_id)

90% of order reads are "show this customer's history." Shard by the query pattern, not by the primary key. Order ID is unique, not load-bearing for routing.

Gotcha: rare admin lookups by order_id need a global secondary index or scatter-gather. Acceptable cost.

Single entity ID would grow unbounded — append the time bucket (hour, day, week) to cap any one partition's size. The bucket size is the knob: too small fans out reads, too big rebuilds the hot-partition problem.

Group chat messages (Discord, WhatsApp)

(chat_id, time_bucket)

Messages in one chat are read together; co-locate them. The time bucket caps any single partition's size as the chat grows.

Gotcha: bucket size is the knob — too small fans out reads, too big rebuilds Discord's multi-GB-partition problem.

Time-series metrics (Datadog, Prometheus)

range on (metric_id, timestamp)

Reads are "last N minutes/hours" — range queries dominate. Retention is a range drop. Hash would scatter every query.

Gotcha: "today's" range is always hot — pre-split it; rotate cold ranges to cheaper tier.

IoT telemetry (sensor fleets)

(device_id, time_bucket)

Each device's time series is read independently for dashboards and anomaly detection. Bucketing bounds growth per device.

Gotcha: cardinality grows with the fleet — over-provision logical shards from day one; new device rollouts can otherwise force resharding.

Locality is the access pattern. Co-locate by geographic cell so "find nearby X" stays cheap. Distinct from the others because the shard key encodes physical proximity, not identity.

Ride-sharing dispatch (Uber, Lyft)

geo-cell (H3 / S2, resolution ~8–9)

"Find nearby drivers" is a spatial-locality query — co-locating drivers and riders by cell keeps the match O(cell), not O(fleet).

Gotcha: dense cells (airport, stadium) are hot — H3 lets you recurse into child cells; without that, you're back to celebrity-key territory.

Two completely different reasons to hash and stop. URL shortener: each request is a trivial single-key lookup with no skew. Search: every read fans out across all shards by design — scatter-gather is the access pattern, not a tax.

URL shortener (bit.ly, tinyurl)

hash on short_code

Every request is a single-key lookup. No ranges, no joins, no celebrity skew once codes are randomly assigned.

Gotcha: almost none — this is the "boring and correct" case. Use hash slots and move on to the cache layer.

Search index (Elasticsearch, Solr)

hash on doc_id

Writes spread uniformly; reads are scatter-gather by design — every shard contributes to ranking, coordinator merges top-K. Different mental model: scatter is the feature, not a tax.

Gotcha: use the routing parameter to co-locate related docs (per-tenant search) and skip the scatter where you can.

ID generation in a sharded world

The moment you shard, AUTO_INCREMENT dies. Each shard issues its own monotonic counter; collisions and ordering become your problem. The new question: where does an ID get assigned, and what does it encode?

128 bits, fully random.

Pros: zero coordination, collisions vanishingly rare.

Cons: no sort order (kills range scans on ID), kills B-tree write locality (random inserts thrash the buffer pool), 16 bytes per row × 1B rows = 16 GB just on the PK.

Time-ordered, lexicographically sortable, still random in the suffix.

Pros: sort by ID ≈ sort by time (huge for log-like data); good B-tree locality on inserts.

Cons: still 128 bits; the time-prefix leaks creation time to anyone with the ID.

Reference: ULID spec.

Twitter's pattern: timestamp(41) + machine(10) + sequence(12).

Pros: 64 bits fits a BIGINT, sortable by time, encodes the issuing node.

Cons: needs unique machine IDs (clock-skew protocol or central allocator), sequence overflow at 4096/ms per node.

A central service hands each shard a range of IDs (e.g., 10,000 at a time).

Pros: dense IDs, smallest possible (INT or BIGINT), works with any DB.

Cons: allocator is a SPOF; per-shard caches mean IDs aren't globally sortable; reclaiming unused ranges is an ops chore.

Sharding algorithms

Five mainstream choices. Pick by access pattern, not familiarity.

Ranges are contiguous; appends pile onto the tail shard.

Ordered keyspace split into contiguous ranges, each owned by one shard. Route via binary search on range boundaries — O(log S).

Pros: efficient range scans (touch only overlapping ranges); natural ordering preserved.

Cons: monotonic keys (timestamps, auto-increment) collapse into one shard. Splitting the hot shard is painful because the shard that needs splitting is already on fire.

Used by: CockroachDB, TiDB, HBase, Spanner.

Picks itself when: high-cardinality non-monotonic keys, range queries dominate, and you have headroom for the split mechanic.

One added node ≈ rehash the world. Why nobody picks this in 2026.

shard_id = hash(key) % N. Uniform if the hash is good; no single-range hotspots.

Fatal flaw: resize changes N ⇒ full re-hash of every key. Cluster grows by one node, ~(N−1)/N of the data moves. Range queries die.

Use only when: dataset is immutable and the cluster size is fixed. In practice, never pick this in 2026. It's pedagogy.

Localized rebalance — only the hot range's keys move.

128-bit hash space sliced into contiguous ranges; assign hash ranges to shards. Like range sharding, but applied to hashes rather than raw keys.

Pros: scale-out = split a hot hash range and move only that slice (localized rebalance). Uniform load without virtual nodes. This is the most common modern production choice.

Cons: original key ordering is lost for scans; requires a metadata service for range → shard mapping.

Used by: the Dynamo / Cassandra family. (Dynamo paper · SOSP 2007.)

Move a slot, not a key. Rebalance ⇒ slot map update, not data re-hash.

Fixed number of logical slots (Redis uses 16,384) decoupled from node count.

Key → CRC16(key) % 16384 → Slot → Node
Slots 0–5460       → Node A
Slots 5461–10922   → Node B
Slots 10923–16383  → Node C

Pros: cheap rebalancing (move slots, not rehash keys); tiny metadata footprint (~64KB for 16K slots); operationally simple.

Cons: still breaks range scans; slot migration consumes bandwidth.

L6 insight: pick a slot count divisible by many factors (2520, 5040, 16384) so the dataset can be split cleanly across various node counts.

Reference: Redis Cluster scaling docs.

Adding/removing a node displaces only ≈1/N of keys.

Map nodes and keys to a hash ring (0 to 2³²−1). Each key belongs to the next clockwise node.

Pros: adding/removing a node moves only ~1/N of keys. Minimal disruption.

Cons: memory-heavy — you need 100–1000 virtual nodes per physical node for even distribution. Lookup is O(log V) where V = total virtual nodes. Harder to reason about under skew.

Used by: DynamoDB, Cassandra, Riak. Modern systems increasingly prefer hash ranges or slots over classic consistent hashing.

Request routing

Once a key has a shard, something has to find that shard. There are really only two routing architectures, with one hybrid.

Smart client · 0 hops

Client library holds the shard map. Hashes locally, connects directly. Pros: lowest latency, no central bottleneck. Cons: infra changes require client updates; 10K clients refreshing metadata simultaneously can DDoS the coordination service.

Proxy / router · +1 hop

Client → proxy (Mongos, Envoy, VTGate). Proxy holds the map. Pros: decoupled — app doesn't know topology; pooling managed centrally. Cons: extra hop; the proxy becomes a SPOF or bottleneck.

Random + forward · hidden hop

Client hits any node; node forwards if it doesn't own the key. Pros: simplest, no client awareness needed. Cons: hidden hop amplifies tail latency. Used by Redis Cluster.

Metadata management

Two architectures, two failure profiles. Pick the one whose failure mode you can afford.

ZK / etcd · centralized

One source of truth · O(1) propagation

Strong consistency. Updates broadcast immediately. Risk: ZK outage = cluster-wide routing failure. Mitigate by caching the map locally with TTL and serving stale during refresh; never put ZK on the read hot path.

Gossip · decentralized

No single source · O(log N) convergence

Nodes share state randomly. Eventually consistent. Used by Cassandra, Riak. Risk: convergence time — Node A thinks key is on X, Node B on Y. Reconciled by version vectors and read repair. No single point of failure; no global authoritative answer either.

How real systems route

System	Routing layer	Mechanism
Spanner	Client library → Span server	Metadata lookup via cached directory service. Splits transparent.
CockroachDB	Any node accepts any request	Range metadata via gossip + range cache. Forwards if wrong; cost = 1 extra hop worst case.
Vitess	VTGate proxy	VTGate reads `vschema` to map vindex → shard → MySQL.
Cassandra	Coordinator node	Coordinator hashes partition key, consults token ring. Token-aware client avoids the extra hop.
DynamoDB	AWS-managed routers	Request routers consult partition metadata. Fully invisible to caller.

Sharding compute, not data — Slicer

Same word, two problems. Slicer solves the right one.

When candidates say "sharding," they almost always mean data sharding — partitioning rows across DB instances. But there's a parallel problem: partitioning a fleet of stateful workers across requests. Examples that come up constantly:

Which chat server holds a user's WebSocket?
Which game-server process is hosting a match?
Which Flink task owns a keyed window's in-memory state?
Which session-cache node has a logged-in user's session?

The "shard key" is the user / game / session ID; the "shard" is a worker process; the routing problem is "find the worker that currently owns this key." Whether the data store underneath is itself sharded is a separate question.

Slicer is Google's internal service for this exact problem (OSDI 2016 paper, still in production). It uses a two-tier pattern that has since become the canonical answer:

Assigner — central, authoritative

Small, single logical instance per service. Decides "which worker owns which key range" based on load signals. The source of truth. Off the request hot path.

Distributors — many, cached

Regional fleet that serves assignment lookups to clients at low latency. Subscribe to the Assigner; never authoritative themselves. The actual scale-out tier.

Clients hit the nearest Distributor, get the worker assignment, RPC the worker directly. When the Assigner rebalances or fails a worker over, the change propagates through Distributors to clients. Strong leasing guarantees exactly one worker owns each key at any moment.

How real systems reshard

This is where the architectural personality shows. Range-split systems do it continuously and transparently. Hash-ring systems never truly reshard — they rebalance. Proxy-sharded systems require operational ceremonies. Same problem, five different answers.

Model: range-based splits on primary key, fully automatic. Spanner calls them "splits." The system decides where to cut based on load and size.

How it reshards: transparent. Data moves between Paxos groups without downtime; TrueTime gives globally consistent timestamps so even during a split migration, reads see a consistent snapshot. Split happens at the storage layer; the transaction layer doesn't notice.

Reference: Spanner paper (OSDI 2012).

Model: range-based (64MB default range size), inspired by Spanner. Each range is a Raft group (3 or 5 replicas).

How it reshards: automatic split/merge — but without TrueTime. CockroachDB uses HLCs (hybrid logical clocks) with a 500ms clock-skew bound. Transactions inside the uncertainty interval must wait.

Visible difference vs Spanner: Spanner's uncertainty is ~7ms (atomic clocks); CockroachDB's is up to ~500ms. For most workloads that's fine; for global financial ledgers it's a real constraint.

Reference: CockroachDB transaction-layer docs (HLC, uncertainty intervals).

Model: hash- or range-based on a "vindex." Each shard is a full MySQL instance. VTGate proxy routes queries.

How it reshards: a workflow, not a feature. Provision target shards → VReplication streams binlogs from source → catch up → brief write-stop (1–3 seconds) for cutover → drop old shards.

The honest line: Vitess is the answer to "how do you scale MySQL" — it works, but resharding is an operation. You schedule it, monitor it, and hold your breath during cutover. Cross-shard transactions use advisory 2PC; a VTGate crash mid-2PC needs manual resolution.

Reference: vitess.io.

Model: consistent hashing on partition key; each node owns token ranges via virtual nodes (256 vnodes default).

How it reshards: it doesn't, really. Cassandra never splits a shard — it rebalances token ownership. Your partition key choice is permanent. If you chose badly and have hot partitions, your options are (a) add a synthetic suffix (bucketing), or (b) migrate to a new table.

2024 update: ScyllaDB (the Cassandra-compatible reimplementation) migrated from vnodes to tablets — dynamically-split ranges closer to Spanner's model. If you're picking a new system in this family, Scylla-on-tablets gives you elastic resharding that Cassandra-on-vnodes still doesn't.

References: cassandra.apache.org · scylladb.com.

Model: hash-based, fully managed. DynamoDB auto-splits when a partition exceeds 10GB or sustained throughput limits.

How it reshards: invisible — until it isn't.

The leak in the abstraction: throughput is allocated per-partition, not per-table. Provision 10K WCU on a table with 10 partitions → each partition gets ~1K WCU. Skewed traffic? One partition throttles while nine sit idle. The 2018 adaptive-capacity fix redistributes unused capacity to hot partitions — a direct response to years of "hot partition throttling" complaints. Pre-2018, the recommended pattern was random-suffix partition keys, which made reads harder.

Reference: DynamoDB partitions docs.

Model: hash-based (default) or range-based on a chosen shard key. Each shard is a replica set; mongos proxy routes queries.

How it reshards: historically a nightmare — changing the shard key required dump-and-restore. MongoDB 5.0 (2021) added online resharding: change the shard key without downtime. Behind the scenes it clones data to new shards via change streams and atomically swaps the routing map at cutover. Production-stable as of 6.x / 7.x.

Catch: sharded transactions across shard boundaries use 2PC and are noticeably slower; most MongoDB deployments avoid them by schema-aligning operations to the shard key. The same "design transactions to be shard-local" rule applies as everywhere else.

Where it fits: MongoDB is one of the most common production sharded systems and the easiest sell when the team is already on document storage. The 5.0 online-reshard feature is the single biggest reason MongoDB stopped being a "you'll regret this in 18 months" choice for new sharded deployments.

Reference: MongoDB online resharding docs.

Handling skewed workloads & hot spots

Detect — per shard, not cluster-wide

Metric	Healthy	Warning	Critical
CPU skew (max/avg)	< 1.3×	1.3–2×	> 2×
Latency skew (p99 max/min)	< 1.5×	1.5–3×	> 3×
QPS skew (max/avg)	< 1.5×	1.5–2.5×	> 2.5×
Storage growth skew	< 1.5×	1.5–2×	> 2×

Mitigate — three layers, in order

Cache in front of hot keys. 5-second TTL absorbs ~90% of read-hot cases. Cheapest and reversible.
Split the hot key. Append a random suffix (bieber_0…bieber_7) to scatter writes across N shards. Reads now fan out and merge. Trade: write hotspot becomes read fan-out and tail-latency exposure.
Dedicated shard. Last resort. Operationally expensive snowflake.

Salting trades a write hotspot for a read fan-out. The bottleneck didn't leave; it changed addresses.

Salt-factor lifecycle

Changing N from 10 to 50 is a migration in disguise. Old data is salted with old N, new data with new N. Reads must fan out to max(old_N, new_N) during transition. For the handful of entities that need re-salting, prefer eager migration with a brief read-from-both window over a permanent lazy dual-path.

Sharding × replication

In production, each shard is also replicated. That means every shard has leader election, replica lag, and failover behavior — and the failure modes multiply combinatorially.

Your design must specify, per shard: routing (shard → replica → leader), read consistency (leader-only, allow stale, bounded staleness), and failover semantics.

Secondary indexes in a sharded world

This is the most-asked follow-up after "tell me about sharding." There are two shapes, and the trade-off between them is exactly the trade-off between write amplification and read scalability.

Local (document-partitioned)

Each shard indexes only its own data

Write path: one shard. Fast.
Read path: scatter-gather to all shards. Tail-latency amplification — the slowest shard becomes your p99.

Works well when: you want only some results, not all (LIMIT 10).

Global (term-partitioned)

Index sharded by the indexed value

Write path: primary + remote index update. Write amplification.
Read path: single shard for exact match. Scales reads.

Trade: eventual consistency (index may lag); may need distributed transactions to keep primary and index in sync.

Rebalancing — the mechanics that determine downtime

The interview trap is saying "we move the data to the new node." That's a goal, not a mechanism. L6 reality: moving data while serving writes requires a five-phase state machine. Describe it.

Prepare

Provision target. Open replication socket for the slot/range. Target enters IMPORTING state.

Cold copy

Iterate over source range, bulk-transfer snapshot to target. Source keeps serving reads and writes.

Catch-up

Source buffered or double-wrote during cold copy. Target replays the delta until lag < 1 second.

Cutover

Brief lock (≤100ms). Source returns MOVED. Routing table updated atomically. Writes flow to target.

Cleanup

Source deletes migrated data asynchronously. Keep zombie data 7 days for rollback safety.

Real numbers

Moving a 100GB shard at a throttled 50MB/s: ~25 minutes cold copy, 3–5 minutes catch-up, 50–200ms cutover window. The 7-day zombie window is your rollback insurance — if the destination misbehaves post-migration, flip routing back; the source still has the data.

Why rebalancing kills production

Competes with live traffic for I/O and network.
Overloads the destination (cold cache, compaction spike).
Triggers cascading hot spots when the LB sees inflated latency.
Migration mid-way + node failure = data in limbo.

Always: throttle (cap to e.g. 50MB/s); rebalance during low-traffic windows; monitor bytes_in_flight, not just shard count; rate-limit on destination health; cap concurrent migrations to one or two.

Backups & snapshots across shards

Per-shard async

Each shard backs up on its own schedule. Cheap, parallel, no coordination. Restore is point-per-shard, not point-in-time globally. Fine if your business logic tolerates the drift — most do. Accept this default unless you've proved you can't.

Coordinated snapshot

A coordinator broadcasts a "snapshot at time T" marker; every shard records its state at T. Chandy-Lamport style if there's no global clock; in-flight cross-shard transactions need fencing. Operationally complex; produces an actually-consistent snapshot.

Global-timestamp snapshot

Spanner-style: read at a specific TrueTime. Snapshot is the union of per-shard reads-as-of-T. Zero coordination overhead because the timestamp is the coordination. Requires hardware (TrueTime) or a tight HLC bound (CockroachDB).

The L6 framing: ask what the restore looks like before you design the backup. Per-shard async restore = "rebuild from these N independent snapshots; expect orphaned references across shards; have a reconciliation job ready." Coordinated restore = "restore is atomic but the backup is slower and more fragile." Pick the failure mode you can afford.

Cross-shard operations

Cross-shard reads — the scatter-gather math

Single-shard query: 1 × (RTT + query_time) ≈ 5ms.
16-shard scatter: 16 × query_time + max(RTT) + merge ≈ 51ms.

The honest cost isn't the mean — it's the tail. The p99 of N independent shards is approximately the (1 − 1/(N×100))-th quantile of any one shard's distribution. With N = 256, you're asking each shard to be in its 99.996th percentile or better — which it isn't.

The slowest shard sets the response time. The mean is a lie.

p99(scatter) ≈ max(p99 of each shard) + merge time

Mitigation — three real moves

Hedged requests

Fire to N+1 shards (or to replicas), use first N responses. Cuts tail at ~6% extra cluster load. Cheapest big win.

Per-shard timeout budgets

Don't let one shard block the whole response. Return partial results with an explicit completeness signal ("255/256 shards").

Push computation to shards

Aggregate at the edge, not the coordinator. Filter, count, top-K all at the shard. Coordinator merges small results, not raw rows.

Cross-shard writes — pick your poison

Approach	Consistency	Latency	Complexity	When to use
2PC	Strong	High (2+ RTT)	High — coordinator SPOF	Financial, inventory
Saga	Eventual (compensations)	Low (async)	Medium — design compensations	Order processing, bookings
Avoid	Strong within shard	Lowest	Lowest	90% of cases, if designed right

Distributed joins

The problem: SELECT * FROM orders JOIN users when tables are sharded by different keys. Three strategies, ranked by preference:

Co-locate. Shard both tables by the same key. Join is shard-local.
Broadcast. If one table is small (<1GB), replicate to all shards.
Application-level decomposition. Query one, gather IDs, query the other with WHERE id IN (...), merge in app.

The spectrum across systems goes from "the database handles it" (Spanner with interleaved tables, distributed SQL push-down + shuffle) to "you handle it" (Cassandra and DynamoDB have no joins — denormalize at write time). The more the database does, the more latency floor those operations pay.

Operational realities — battle stories

Five lessons paid for at scale. Each one is a free piece of context you can quote in an interview.

Slack — the channel-shard problem

Sharded MySQL by workspace. Worked until 100K-member workspaces had a single channel (#general) saturating one MySQL instance.

Lesson: Migrated to Vitess and introduced sub-workspace sharding for large workspaces. Took over a year. Any "single tenant" framing has a ceiling; have the sub-shard plan before you hit it.

Instagram — pre-allocate logical shards

Postgres sharded by user ID with a Snowflake-style ID scheme (timestamp + shard ID + sequence). Pre-allocated thousands of logical shards mapped to fewer physical ones.

Lesson: Over-provision logical shards from day one so resharding becomes remapping, not migrating. The cheapest insurance you'll ever buy on a shard map. (Instagram engineering post.)

Discord — hot partitions on popular servers

Messages in Cassandra, partitioned by (channel_id, bucket). Popular servers blew past the ~100MB practical partition limit, hit multi-GB. Tombstone accumulation caused read amplification and GC pauses.

Lesson: Migrated to ScyllaDB with smaller time-bucket windows; now building a custom Rust engine. Cassandra's partition limit is a real constraint, not a suggestion. (Discord engineering blog.)

Uber — manual resharding doesn't scale with the company

Built Schemaless: MySQL-backed sharded KV store. Resharding required a full table copy. At Uber's growth rate, this became a quarterly operational burden.

Lesson: Replaced it with DocStore (automatic shard splitting). If you're growing 10×/year, manual resharding is a full-time job for a team — and the team will quit.

Notion — sharding-as-cultural-change

Single Postgres → sharded by workspace_id. Many cross-workspace features (shared pages) meant every query path had to be audited; cross-workspace lookups became async fetches.

Lesson: ~2 years from decision to full rollout. Sharding is a product-shape change, not just an infra one — every engineer learns a new query discipline.

Failure modes

Five canonical failure modes. Each gets a name, an impact, a fix, and a sentence you can say out loud.

The celebrity key that melted a shard

Impact: positive feedback loop. CPU saturates → timeouts → retries → worse. p99 decouples from the rest of the cluster.

Fix: cache in front (5s TTL absorbs ~90% of read heat) → key-split with random suffix → dedicated shard as last resort.

"Cache absorbs read heat. Key splitting absorbs write heat. But splitting trades a write hotspot for read fan-out — so you need both."

The scatter-gather that wouldn't finish

Impact: 255 shards in 10ms, one in 3s, coordinator blocked. Memory spikes buffering 256 responses.

Fix: per-shard timeout budgets, hedged requests for the tail, push filtering to shards, reject queries missing the shard key at parse time.

"Scatter-gather latency is the slowest shard, not the average. Budgets, hedges, push-down."

The rebalance that took prod down with it

Impact: migration I/O saturates disks → latency spikes → LB thinks shards are unhealthy → triggers more migrations. Cascade.

Fix: throttle migration, consistent-hashing limits movement to ~1/N, atomic routing switch only after sync + checksum, idempotent + checkpointed migrations, cap concurrency.

"The most dangerous moment isn't a node dying — it's a rebalance. Throttle, checkpoint, atomic switch, never route until confirmed."

The routing table that pointed to a ghost

Impact: stale client keeps writing to a decommissioned shard. Real owner is elsewhere. Silent data divergence.

Fix: epoch fencing on every routing update (storage rejects epoch < current), tombstone decommissioned shards (actively reject + redirect), TTL on routing cache so clients re-validate.

"A zombie shard silently accepting writes is worse than a dead shard. Dead shards throw errors. Zombies corrupt data."

The metadata store that vanished

Impact: ZK down. New connections can't discover shards. Existing cache works — until a failover, then the map is wrong. Flying blind.

Fix: cache routing map locally with TTL (buys minutes to hours); degrade to static mode (no failover promotions, accept partial unavailability); never put ZK on the read hot path.

"A 3-node ZK cluster shouldn't be your actual availability ceiling. Cache, degrade, expire."

Interview traps

"Sharding improves performance."

Wrong. Sharding improves scalability by accepting coordination cost. Performance often gets worse due to network hops and scatter-gather. The right framing: sharding lets you handle more work concurrently, not finish any single piece of work faster.

"Consistent hashing solves hotspots."

It solves rebalancing, not hotspots. If user_id=taylorswift gets 10M QPS, that key still hits one node. Celebrity keys need salting or caching, not a different hash function.

"Hash sharding is better than range sharding."

Depends on access patterns. Hash gives uniform distribution but kills range queries. Range preserves locality but creates hotspots on recent data. Time-series → range with rotation. User data → hash on a high-cardinality key.

"More shards = better performance."

More shards = more scatter-gather overhead. There's an optimal count: too few overload, too many drown in coordination. Rule of thumb: each shard should comfortably handle ~10K QPS.

"We can re-shard later."

Re-sharding is one of the hardest distributed-systems operations. Stripe's took 18 months; Notion's took 2 years. Choose the shard key carefully upfront — high cardinality, aligned with access patterns. If unsure, shard by tenant_id.

"Global secondary indexes are always better."

They trade write complexity for read efficiency. GSIs eliminate scatter-gather reads but add write amplification and eventual-consistency lag. For write-heavy workloads with occasional secondary lookups, local indexes with scatter-gather may be simpler.

The final mental model

Storage engines

CPU vs I/O.

Replication

Consistency vs availability.

Sharding

Simplicity vs scale.

shard early → inherit complexity forever | shard late → pay migration cost once

Followups they'll ask

How do you do COUNT(*) across all shards?

Two answers, both honest. (1) Scatter-gather with timeout + partial results, and label the result as approximate. (2) Maintain a pre-aggregated count in a metadata store, updated via async log consumers. For monitoring dashboards, approximate is fine; for billing, you need the materialized count. The #1 follow-up to any sharding discussion — have both ready.

How do you handle a schema migration across 100 shards?

Three rules. (1) Shadow-column pattern: add nullable, backfill async, then set NOT NULL — never a blocking ALTER TABLE on a hot shard. (2) Rolling migration in batches of ~5 shards with health checks between batches. (3) For MySQL, use pt-online-schema-change or gh-ost — shadow table + triggers + atomic rename. Parallel saturates the cluster; sequential takes hours. Batched-with-throttle is the sweet spot.

What if your shard key was wrong and you've been running for a year?

Three paths. (1) If the bad pattern is local to a few hot tenants, isolate those tenants on dedicated shards — surgical, not architectural. (2) If it's systemic, introduce a new shard key on a new physical layout and dual-write during migration — Stripe and Notion both did versions of this; budget months, not weeks. (3) If neither is viable, accept the scatter-gather tax and put a query cache in front to hide it. The honest answer in interviews: "resharding is the work, not the decision."

How does sharding interact with caching?

Two layers. Cache-aside per shard: each shard fronts its own Redis or Memcached node — the cache itself is sharded by the same key, so cache locality matches DB locality. Cross-shard query cache: for scatter-gather results, cache at the coordinator with short TTL since invalidation across N shards is hard. Beware: if a shard fails over, the cache may serve from a stale leader's POV — bound by your replication lag.

How do you implement transactions across shards if you have to?

In order of preference: (1) Don't — redesign so the operation is shard-local. (2) Saga with compensations if eventual consistency is acceptable and you can write idempotent reversals. (3) 2PC if you must have linearizability — and accept that the coordinator is a SPOF, the protocol blocks on coordinator failure, and the latency is at least two RTTs. Spanner-style true distributed transactions exist only because TrueTime makes timestamp ordering globally consistent — without specialized hardware, you're choosing between Saga and 2PC.

How do you handle a hot shard you can't split (e.g., a single tenant exceeded one node)?

Three moves, in order. (1) Vertical scale that one shard's node — buys time, doesn't help write IOPS once you hit the ceiling. (2) Sub-shard within the tenant — pick a sub-key (e.g., user_id within the tenant) and add a second-level shard map for that tenant only. Slack did this for large workspaces. (3) Functional decomposition — peel off a subdomain (e.g., search, messaging, analytics) into its own service with its own sharding. Painful but sometimes the only answer.

How do you reason about consistency on a sharded read replica?

Two clocks. (1) Per-shard consistency — each shard has its own replication lag; reading from replicas means you might see different per-shard "as-of" times. For a cross-shard query, the result is shard-skewed, not just stale. (2) Cross-shard consistency requires a global snapshot (Spanner's TrueTime, CockroachDB's HLC) — without one, scatter-gather across replicas can return causally impossible combinations. The interview line: "replicas give you stale reads per shard; sharded replicas give you skewed reads across shards."

What's the relationship between sharding and your monitoring?

All metrics become per-shard or they become useless. Cluster-wide averages hide the hot shard. Mandatory dashboards: per-shard QPS, CPU, latency p99, storage growth — plus the skew ratio for each (max/avg). Track scatter ratio explicitly. Alert on skew before it becomes throughput collapse. The single most important addition to a sharded system is observability on the routing layer itself: cache hit rate, metadata-store latency, epoch-mismatch counts.

Revision — flip before the interview

What does sharding actually fix, and what does it not?

Fixes: write throughput, storage capacity, working-set RAM exceeded on a single node. Does not fix: per-query latency, hotspots from celebrity keys, the need for distributed coordination. Sharding lets you handle more work concurrently; it makes any single piece of work slower.

Click to reveal

What's the rule for picking a shard key in one sentence?

"The best shard key appears in your WHERE clause 90% of the time AND distributes data uniformly. When those two goals conflict — and they always do — you denormalize."

Click to reveal

Why is "consistent hashing solves hotspots" wrong?

Consistent hashing solves rebalancing — adding/removing a node moves only ~1/N of keys. It does nothing about traffic skew. If a single key gets 10M QPS, that key still hits one node. Celebrity keys need salting or caching.

Click to reveal

What are the five phases of safely adding a shard?

Prepare (provision + open replication socket) → Cold copy (snapshot while source serves) → Catch-up (replay deltas until lag < 1s) → Cutover (brief pause, atomic routing switch, MOVED redirects) → Cleanup (async delete, keep zombie data 7 days for rollback).

Click to reveal

What's the difference between a local and a global secondary index, and what does each cost?

Local: each shard indexes only its data. Write = 1 shard. Read = scatter-gather to all shards (tail-latency amplification). Global: index sharded by indexed value. Write = primary + remote index update (write amplification, often async, eventual consistency). Read = single shard. Global indexes are distributed transactions in disguise — Spanner makes them strong; DynamoDB GSIs hide them behind eventual consistency.

Click to reveal

Why does Spanner reshard transparently while Cassandra "never" reshards?

Spanner uses range-based splits managed automatically — splits move data between Paxos groups, with TrueTime giving globally consistent snapshots so reads see no anomaly mid-migration. Cassandra uses consistent hashing on partition key — token ownership rebalances when nodes are added, but the partition-key choice itself is permanent. To "reshard" Cassandra you migrate to a new table with a different key.

Click to reveal

Why is the p99 of a scatter-gather over N shards so much worse than the p99 of one shard?

Because p99(scatter) ≈ max of N shard latencies + merge time. If each shard has p99 = X, the N-way scatter's p99 approaches the (1 − 1/(N×100))-th quantile of each shard's distribution — i.e., you're sampling each shard's tail every time. With N=256, you need each shard at its 99.996th percentile or better, which it isn't.

Click to reveal

What three mitigations cut scatter-gather tail latency without changing the shard key?

(1) Hedged requests — send to N+1 (or to replicas), take first N. ~6% extra load. (2) Per-shard timeout budgets with partial results + explicit completeness signal. (3) Push computation to shards — aggregate, filter, top-K at the edge so coordinator merges small payloads, not raw rows.

Click to reveal

What's the "logical shards over physical shards" pattern, and which systems use it?

Pre-allocate many logical shards (Instagram thousands; Redis 16,384 slots) and map N logical shards to M<N physical nodes. Resharding becomes remapping (move logical slots between physical nodes) rather than data migration. Used by Instagram, Vitess, DynamoDB internally, and Redis Cluster. Over-provision logical shards from day one — it's the cheapest insurance you'll buy.

Click to reveal

Why is "we can re-shard later" a Staff-disqualifying answer?

Re-sharding is one of the hardest distributed-systems operations. Stripe's took 18 months; Notion's took 2 years. It requires shard-key migration, dual-write windows, cross-shard query audits, and a cutover. The L6 framing: "resharding is the work, not the decision." Choose the shard key carefully upfront and pre-allocate logical shards — that's how you avoid paying the migration tax later.

Click to reveal

What's the difference between data sharding and compute sharding? Name an example of each.

Data sharding partitions storage — each row has one home shard (Spanner, DynamoDB, Vitess, MongoDB). Compute sharding partitions workers — each session or in-memory state has one home worker (Slicer, Flink keyed state, sticky load balancers). The data store underneath compute sharding may itself be sharded or shared — orthogonal concern. Slicer's Assigner / Distributor split (authoritative center + cached fleet) is the canonical implementation, and the same shape powers Envoy XDS, Consul, and Kafka's group coordinator.

Click to reveal