[Cluster] - 2. Redis Cluster
Overview
- Three common cluster architectures: Redis (sharded cluster), MySQL (master-slave replication), and Kafka (Raft).
- This post walks through Redis's evolution from a single node to Cluster mode.
Evolution History: From Single Node to Cluster
Single Node Era
- This is the familiar single-node (monolithic) deployment of Redis.
| Problem | Single Node |
|---|---|
| Single point of failure (SPOF) | Yes |
| Memory limited to a single machine | Yes |
| Write throughput bottleneck | Yes |
Master-Slave Replication
- For reliability, the natural first step is to add nodes in a master-slave replication setup.
| Problem | Single Node | Master-Slave |
|---|---|---|
| Single point of failure (SPOF) | Yes | Solved |
| Memory limited to a single machine | Yes | Yes |
| Write throughput bottleneck | Yes | Yes |
| No automatic failover | N/A | Yes |
- We can see that master-slave mode only removes the SPOF, and it still lacks automatic failover, so master-slave replication alone is not truly reliable.
Sentinel Mode
- For failover, consensus algorithms such as Raft naturally come to mind.
- Sentinel takes a different route, similar to ZooKeeper: separate Sentinel processes monitor the masters, coordinate failover, and tell clients where the current master is.
| Problem | Single Node | Master-Slave | Sentinel |
|---|---|---|---|
| Single point of failure (SPOF) | Yes | Solved | Solved |
| Memory limited to a single machine | Yes | Yes | Yes |
| Write throughput bottleneck | Yes | Yes | Yes |
| No automatic failover | N/A | Yes | Solved |
- A core goal of clustering is horizontal scaling to increase throughput and capacity.
- Clearly, Sentinel mode does not support that: all writes still go through a single master.
Redis Cluster: The Official Answer
- Redis Cluster divides the key space into 16384 hash slots.
- The natural, surprisingly simple approach to clustering follows from that:
- Split the 16384 slots across multiple master-slave groups (shards).
| Problem | Single Node | Master-Slave | Sentinel | Cluster |
|---|---|---|---|---|
| Single point of failure (SPOF) | Yes | Solved | Solved | Solved |
| Memory limited to a single machine | Yes | Yes | Yes | Solved |
| Write throughput bottleneck | Yes | Yes | Yes | Solved |
| No automatic failover | N/A | Yes | Solved | Solved |
- This opens up interesting possibilities, such as dedicating some slots to a hot cache and others to a cold cache.
- For example, slots 0~99 serve the hot cache, backed by a shard of 1 master + 7 replicas.
- Slots 100~199 serve the cold cache, backed by a shard of 1 master + 1 replica.
- All other data lives in a regular shard of 1 master + 2 replicas.
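A rough sketch of how such a slot split could be assigned by hand, assuming the redis-py package; the addresses are made up, and in practice redis-cli --cluster create/reshard does this for you:

```python
# Hypothetical sketch: assigning slot ranges to different masters with redis-py.
import redis

hot_master  = redis.Redis(host="10.0.0.1", port=6379)   # shard with 1 master + 7 replicas
cold_master = redis.Redis(host="10.0.0.2", port=6379)   # shard with 1 master + 1 replica

# CLUSTER ADDSLOTS takes an explicit list of slot numbers.
hot_master.execute_command("CLUSTER", "ADDSLOTS", *range(0, 100))     # slots 0~99   -> hot cache
cold_master.execute_command("CLUSTER", "ADDSLOTS", *range(100, 200))  # slots 100~199 -> cold cache
```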
Sentinel vs Cluster
Head-to-Head Comparison
| Aspect | Sentinel Mode | Cluster Mode (even single-shard) |
|---|---|---|
| Automatic Failover | Yes (via external Sentinel processes) | Yes (built-in, no extra processes) |
| Deployment Complexity | Need 3+ Sentinel processes + Redis nodes | Just Redis nodes |
| Client SDK | Simple SDK | Smart Client (slightly heavier) |
| Multi-DB (SELECT) | Supported (SELECT 0-15) | Only DB 0 |
| Multi-Key Operations | Full support | Need hash tags for cross-slot |
| Future Scalability | Must migrate to Cluster | Just add nodes |
| Network Overhead | Sentinel heartbeats | Gossip protocol (similar overhead) |
- In my view, even a Cluster made of a single master and two replicas holding all 16384 slots is better than Sentinel in every aspect except multi-DB isolation.
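The "Client SDK" row above is the main practical difference. A minimal sketch of the two client styles, assuming the redis-py package and made-up addresses:

```python
from redis.sentinel import Sentinel
from redis.cluster import RedisCluster

# Sentinel mode: the client asks the Sentinel processes where the current master is.
sentinel = Sentinel([("10.0.0.1", 26379), ("10.0.0.2", 26379)], socket_timeout=0.5)
master = sentinel.master_for("mymaster")
master.set("k", "v")

# Cluster mode: a "smart" client bootstraps from any node, learns the slot map,
# and routes each key to the right shard by itself.
rc = RedisCluster(host="10.0.0.1", port=6379)
rc.set("k", "v")
```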
Cluster Mode Problems
- The core problem of Cluster mode is that related keys may end up on different shards.
Lua Scripts
- Another problem: a Lua script fails (CROSSSLOT error) when it touches keys that map to different slots, and therefore possibly different shards.
- We can solve this with hash tags: the slot is computed by CRC16 over the tag portion only, so tagged keys are guaranteed to land in the same slot.
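A minimal sketch of that fix, assuming the redis-py package and a made-up address; the tag {order:42} is an illustrative choice:

```python
# A multi-key Lua script works in Cluster mode only if all KEYS hash to the same
# slot; sharing a hash tag guarantees that.
from redis.cluster import RedisCluster

rc = RedisCluster(host="10.0.0.1", port=6379)

script = """
local stock = tonumber(redis.call('GET', KEYS[1]) or '0')
if stock > 0 then
    redis.call('DECR', KEYS[1])
    redis.call('RPUSH', KEYS[2], ARGV[1])
    return 1
end
return 0
"""

rc.set("{order:42}:stock", 10)
# Both keys contain the tag {order:42}, so CRC16 is applied to "order:42" only and
# they land in the same slot; without the tag this EVAL would raise CROSSSLOT.
ok = rc.eval(script, 2, "{order:42}:stock", "{order:42}:buyers", "user-1001")
```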
Pub / Sub
Before Redis 7.0: Village Loudspeaker Mode
Why broadcast? To support “dumb clients”:
- Client connects to a random Node C and sends SUBSCRIBE news
- Node C doesn't know who else subscribed to news on other nodes
- When someone publishes on Node A, Node A must broadcast the message to ALL nodes
- Only then can each node deliver the message to its local subscribers
Cost: O(N) network messages per PUBLISH. In a 100-node cluster, every PUBLISH triggers 99 cluster-bus messages!
After Redis 7.0: Sharded Pub/Sub (Precision Mailbox)
Redis 7.0 changed the definition: a sharded channel is now treated like a special key!
| Command | Behavior |
|---|---|
| SSUBSCRIBE news | Slot = CRC16("news") % 16384, connect to the owner node |
| SPUBLISH news msg | Route to the owner node, deliver locally |
Trade-off:
- Clients must be "smart" (like Redisson): they have to know the topology and connect to the correct node
- Can’t just connect to any random node and subscribe anymore
Summary:
- Old logic (before 7.0): “Village loudspeaker” - convenient but wasteful
- New logic (after 7.0): “Precision mailbox” - efficient but requires smart clients
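A small sketch of the 7.0+ commands, assuming the redis-py package and that the made-up address 10.0.0.1:6379 happens to be the master owning the slot for "news"; otherwise the node replies with -MOVED:

```python
import redis

owner = redis.Redis(host="10.0.0.1", port=6379)

# SPUBLISH hashes the channel name like a key (CRC16("news") % 16384) and the
# message is delivered only inside the shard that owns that slot.
owner.execute_command("SPUBLISH", "news", "breaking: sharded pub/sub")

# Subscribers issue SSUBSCRIBE news against the same shard; a smart client
# (e.g. Redisson or a recent redis-py) resolves the owning node automatically.
```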
The Home of Data: Hash Slot and Routing
Traditional Hashing vs. Consistent Hashing vs. Hash Slot
Traditional Hashing
The simplest approach: node = hash(key) % N
Problem: When N changes (add/remove node), almost ALL keys get remapped!
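A quick pure-Python demo of how bad the remapping is when the node count goes from 4 to 5 (hashlib is used instead of Python's randomized built-in hash so the result is reproducible):

```python
import hashlib

def node_for(key: str, n: int) -> int:
    # Deterministic hash of the key, then simple modulo placement.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")   # ~80% of keys move
```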
Consistent Hashing
The industry standard for distributed systems (Cassandra, DynamoDB, etc.)
Core Idea: Both nodes and keys are mapped to a ring (0 ~ 2^32). Each key goes to the first node found walking clockwise.
Benefit: Adding a node only affects ~1/N of keys (the range between new node and its predecessor).
But still has problems for Redis:
- Virtual Nodes Complexity: Need 100-200 virtual nodes per physical node for balance
- Metadata Overhead: Client must store the entire ring (all virtual nodes)
- Migration Granularity: Hard to control exactly which data moves
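A minimal pure-Python ring sketch illustrating the clockwise lookup and the virtual nodes mentioned above; an illustration, not production code:

```python
import bisect
import hashlib

def h(value: str) -> int:
    # Map a string onto the 0 ~ 2^32 ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, vnodes=150):
        # Each physical node contributes `vnodes` points on the ring.
        self._points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [p for p, _ in self._points]

    def node_for(self, key: str) -> str:
        # First point clockwise from the key's position, wrapping around the ring.
        idx = bisect.bisect(self._hashes, h(key)) % len(self._hashes)
        return self._points[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```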
Hash Slot
- Redis’s pragmatic choice: A fixed array of 16384 slots.
Two-Level Mapping:
- Key -> Slot: slot = CRC16(key) % 16384 (fixed, never changes)
- Slot -> Node: configurable, stored in cluster metadata
No Silver Bullet: When Traditional Hash Wins
Beware of “silver bullet” thinking! Consistent Hashing is NOT universally better.
Where Traditional Hash beats Consistent Hash:
| Aspect | Traditional Hash | Consistent Hash |
|---|---|---|
| Uniformity | Mathematically uniform (plain modulo) | Uneven without virtual nodes (a "hack") |
| Time Complexity | O(1), a single modulo instruction | O(log N), binary search in a TreeMap |
| Implementation | 1 line: hash(key) % N | ~50 lines: TreeMap + virtual nodes + ring wrap-around |
| Memory | Zero overhead | A TreeMap holding all virtual nodes |
Best scenarios for Traditional Hash:
- Database Sharding: user_id % 1024 with a fixed table count (rarely changes)
- HashMap/Dict Internals: language-level hash tables use modulo, not consistent hashing
- Any static node count: when you can guarantee N won't change
Best scenarios for Consistent Hash:
- Load balancers with dynamic backends
- Distributed cache (Memcached) with frequent node changes
- Any system where nodes join/leave frequently
Engineering Wisdom
- Use the simplest solution that works. If node count is fixed, traditional hash is faster and simpler. Only use consistent hashing when dynamic scaling is a real requirement.
The Hash Slot Algorithm
The Formula: slot = CRC16(key) % 16384
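The mapping can be reproduced in a few lines of pure Python; a sketch of the key-to-slot calculation, including the hash-tag rule, using the CRC16/XMODEM variant Redis uses (polynomial 0x1021, initial value 0):

```python
def crc16(data: bytes) -> int:
    # Bit-by-bit CRC16/XMODEM.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Hash-tag rule: if the key contains a non-empty {...} section,
    # only that section is hashed.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

assert key_slot("{user:1001}:profile") == key_slot("{user:1001}:orders")
```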
Why 16384 (2^14)?
This is a deliberate design decision by antirez: a "bandwidth vs. granularity" trade-off.
1. The Gossip “Bandwidth Tax”
Every PING/PONG message carries a slots bitmap with one bit per slot, so 16384 slots take 16384 / 8 = 2 KB while 65536 slots would take 8 KB:
| Slots Count | Bitmap Size | TCP Packets (MTU=1500) |
|---|---|---|
| 65536 (2^16) | 8 KB | 6-7 packets |
| 16384 (2^14) | 2 KB | 2 packets |
8KB per heartbeat = massive bandwidth waste + more TCP fragmentation + higher retransmit probability.
Antirez: “Making the message too big would waste a lot of bandwidth.”
2. The 1000-Node Soft Limit
Redis Cluster targets medium-scale clusters, not Google Spanner-level global systems.
| Slots | Nodes | Slots per Node |
|---|---|---|
| 65536 | 1000 | ~65 |
| 16384 | 1000 | ~16 |
16 slots per node is enough for rebalancing. 65 slots adds negligible benefit but 4x bandwidth cost.
3. Memory Overhead
Each node stores the slots bitmap of every other node it knows about:
| Slots | 1000 Nodes Memory |
|---|---|
| 65536 | 1000 x 8KB = 8MB |
| 16384 | 1000 x 2KB = 2MB |
Bottom Line: 16384 is the Goldilocks number: not so big that it wastes bandwidth, not so small that it limits rebalancing granularity. CRC16 can produce 65536 distinct values, but CRC16(key) % 16384 gives us just what we need.
Hash Tags: Forcing Keys to Same Slot
Practical Use Cases:
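For example, keeping all of one user's keys in the same slot so multi-key commands and transactions still work; a minimal sketch assuming the redis-py package and a made-up address:

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="10.0.0.1", port=6379)

rc.set("{user:1001}:profile", "alice")
rc.set("{user:1001}:cart", "book,pen")

# Same tag -> same slot -> the cross-key command succeeds; without the tags
# this MGET would fail with a CROSSSLOT error.
profile, cart = rc.mget("{user:1001}:profile", "{user:1001}:cart")
```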
Warning: Don’t overuse hash tags! If too many keys share the same tag, you create a “hot slot” problem.
The Gossip Protocol: How Nodes Talk Without a Boss
Why Gossip?
In centralized systems like Kafka (before KRaft), ZooKeeper maintains the cluster state. But Redis Cluster has no ZooKeeper. How do nodes know about each other?
Answer: Gossip Protocol — nodes exchange information through periodic “chitchat”.
Message Types
| Message | Purpose |
|---|---|
| PING | “Hey, I’m alive! Here’s what I know about the cluster” |
| PONG | Response to PING with sender’s view of cluster state |
| MEET | “Welcome new node, join our cluster” |
| FAIL | “Node X is confirmed dead” |
| PUBLISH | Pub/Sub message broadcast |
What’s Inside a Gossip Message?
Each PING/PONG contains:
Key Information Exchanged:
- My Slots: 2KB bitmap of which slots I own
- My Epoch: My configuration version (critical for conflict resolution)
- Gossip About Others: What I know about N random other nodes
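A rough, purely illustrative Python model of these fields; the real structure is a C struct inside Redis, not this:

```python
from dataclasses import dataclass

@dataclass
class GossipEntry:             # what the sender says about another node
    node_id: str
    ip: str
    port: int
    flags: int                 # e.g. PFAIL as seen by the sender

@dataclass
class PingMessage:
    sender_id: str
    config_epoch: int          # sender's configuration version
    slots_bitmap: bytes        # 16384 bits = 2 KB: which slots the sender owns
    gossip: list[GossipEntry]  # info about a fraction of the nodes the sender knows
```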
Gossip Frequency and Scale Limits
The Communication Storm Problem:
With N nodes, full mesh = N × (N-1) connections
| Nodes | Connections | Messages/sec (estimated) |
|---|---|---|
| 10 | 90 | ~100 |
| 100 | 9,900 | ~1,000 |
| 1,000 | 999,000 | ~10,000 |
Redis’s Mitigation:
Smart Node Selection (not purely random; see the sketch at the end of this section):
- Each round, randomly pick ~5 nodes from the cluster
- From these 5, choose the one with oldest PONG time (least recently contacted)
- This ensures no node gets “forgotten” while avoiding full mesh
Fallback Mechanism:
- If any node hasn't responded for more than cluster-node-timeout / 2, force-send a PING to it immediately, regardless of the random selection
- Prevents false-positive failure detection
Partial Gossip:
- Each PING only carries info about ~10% of known nodes (not all)
- Reduces message size while still propagating state eventually
Scale Limit:
- Recommended max: ~1000 nodes
- Beyond this, Gossip overhead becomes significant
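A pure-Python sketch of the selection heuristic and fallback described above:

```python
import random
import time

def pick_ping_target(nodes, pong_times, sample_size=5):
    """nodes: list of node ids; pong_times: node id -> last PONG timestamp."""
    sample = random.sample(nodes, min(sample_size, len(nodes)))
    # Contact the sampled node we have heard from least recently.
    return min(sample, key=lambda n: pong_times.get(n, 0.0))

def overdue_nodes(nodes, pong_times, node_timeout):
    """Fallback: nodes silent for more than node_timeout / 2 get pinged immediately."""
    now = time.time()
    return [n for n in nodes if now - pong_times.get(n, 0.0) > node_timeout / 2]
```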
Epoch: The Logical Clock
ConfigEpoch is crucial for consistency — it’s like Raft’s “term” or Paxos’s “ballot number”.
When Epoch Increments:
- Slave wins election → becomes new master with higher epoch
- Slot migration completes → new owner gets higher epoch
- Manual failover → forced epoch bump
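A tiny illustration of how this version number settles disputes: if two nodes ever claim the same slot, the claim carrying the higher configEpoch wins (a pure-Python sketch of the rule, not Redis code):

```python
def resolve_slot_owner(claims):
    """claims: list of (node_id, config_epoch) pairs for the same slot."""
    return max(claims, key=lambda c: c[1])[0]

assert resolve_slot_owner([("node-a", 6), ("node-b", 7)]) == "node-b"
```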
Scaling: Slot Migration Deep Dive
When Do You Need to Scale?
Scale Out (Add Nodes):
- Memory pressure on existing nodes
- CPU bottleneck
- Network bandwidth saturation
Scale In (Remove Nodes):
- Over-provisioned cluster
- Cost optimization
The Migration State Machine
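The state transitions are driven by a handful of commands; a rough sketch of the sequence (roughly what redis-cli --cluster reshard automates), assuming the redis-py package, with made-up addresses and placeholder node IDs:

```python
import redis

SLOT = 100
src = redis.Redis(host="10.0.0.1", port=6379)   # Node A, current owner
dst = redis.Redis(host="10.0.0.2", port=6379)   # Node B, new owner

# 1. Mark the slot IMPORTING on the target and MIGRATING on the source.
dst.execute_command("CLUSTER", "SETSLOT", SLOT, "IMPORTING", "<node-A-id>")
src.execute_command("CLUSTER", "SETSLOT", SLOT, "MIGRATING", "<node-B-id>")

# 2. Move keys in batches until the slot is empty.
while True:
    keys = src.execute_command("CLUSTER", "GETKEYSINSLOT", SLOT, 100)
    if not keys:
        break
    for key in keys:
        # MIGRATE host port key destination-db timeout-ms
        src.execute_command("MIGRATE", "10.0.0.2", 6379, key, 0, 5000)

# 3. Finalize: announce the slot's new owner (this bumps the config epoch).
for node in (src, dst):
    node.execute_command("CLUSTER", "SETSLOT", SLOT, "NODE", "<node-B-id>")
```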
The MIGRATE Command Internals
MIGRATE Behavior:
- Atomic: Key appears on target and disappears from source atomically
- Blocking: By default, blocks the source node during transfer
- Timeout: Configurable timeout to prevent stuck migrations
The Blocking Problem:
For Large Keys: A single large key (big hash, big list) can block the source node for seconds!
Mitigation: Redis 6.0+ supports non-blocking migration for certain data types.
Request Handling During Migration
During migration, Slot 100 is in a transient state — moving from Node A to Node B but not yet complete.
Node States:
| Node | State | Responsibility |
|---|---|---|
| Node A | MIGRATING | Still owns Slot 100, but data is moving out |
| Node B | IMPORTING | Receiving data, but not officially responsible |
| Client | - | Slot Map still points Slot 100 → Node A |
Request Flow:
- Client → Node A: Client sends request based on cached Slot Map
- Node A checks locally:
  - Key exists → process and return the result
  - Key missing → return -ASK <Node B> (the key has already been migrated)
- Client → Node B: must send the ASKING command first, then the original command
- Node B checks the ASKING flag:
  - Flag present → execute the command
  - Flag absent → return -MOVED <Node A>
Why Require ASKING?
The ASKING command prevents routing table corruption:
- Without ASKING: a random client connecting to Node B might incorrectly assume Slot 100 belongs to B
- The client would update its Slot Map prematurely → all future requests go to B
- But migration has only just started → most keys are still on A → severe cache misses
The ASKING flag acts as a one-time authorization token: only clients explicitly redirected by Node A (via -ASK) can access the importing slot.
ASK vs MOVED:
| Aspect | MOVED | ASK |
|---|---|---|
| When | Migration completed | Migration in progress |
| Client action | Update Slot Map | Do NOT update Slot Map |
| Semantics | Permanent redirect | Temporary redirect |
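A sketch of how a smart client reacts to the two redirects, assuming the redis-py package; addresses, the connection pool, and the error parsing are all simplified for illustration:

```python
import redis

def run_with_redirects(conn_pool, slot_map, slot, *command):
    """conn_pool: dict mapping 'host:port' -> redis.Redis; slot_map: slot -> 'host:port'."""
    node = slot_map[slot]                        # cached owner of the slot
    try:
        return conn_pool[node].execute_command(*command)
    except redis.ResponseError as e:
        msg = str(e)
        if msg.startswith("MOVED"):
            # Permanent: the slot has a new owner, update the map and retry.
            target = msg.split()[2]
            slot_map[slot] = target
            return conn_pool[target].execute_command(*command)
        if msg.startswith("ASK"):
            # Temporary: retry once on the target with ASKING, do NOT update the map.
            # (Real clients must send ASKING and the command on the same connection.)
            target = msg.split()[2]
            conn_pool[target].execute_command("ASKING")
            return conn_pool[target].execute_command(*command)
        raise
```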
Failure Detection and Automatic Failover
The Distributed Voting Problem
The Challenge: Without a central authority, how do nodes agree that a node is dead?
The Answer: Quorum-based failure detection through gossip.
PFAIL vs FAIL: The Two-Phase Detection
- PFAIL (Possible Fail): a single node's subjective suspicion, raised when it gets no PONG within cluster-node-timeout
- FAIL: the objective verdict, reached once a majority of masters report the same node as PFAIL; the FAIL state is then broadcast to the whole cluster
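A pure-Python sketch of that promotion rule (an illustration of the quorum check, not Redis code):

```python
def should_mark_fail(pfail_reports: set[str], master_ids: set[str]) -> bool:
    # A node is promoted from PFAIL to FAIL once more than half of the
    # masters have reported it as PFAIL.
    reporters = pfail_reports & master_ids
    return len(reporters) > len(master_ids) // 2

assert should_mark_fail({"m1", "m2"}, {"m1", "m2", "m3"}) is True
```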
The Configuration: cluster-node-timeout (default 15000 ms)
What This Controls:
- PFAIL Trigger: Node marked PFAIL after timeout with no PONG
- Failover Speed: Lower = faster detection, but more false positives
- Network Partition Sensitivity: Too low = frequent unnecessary failovers
Rule of Thumb:
- Production: 15-30 seconds
- Testing: 5-10 seconds
- Never below 5 seconds
Slave Election: Choosing the New Master
- Only replicas of the failed master are candidates; a replica with fresher replication data gets a shorter pre-election delay, so it usually starts first
- The candidate requests votes from the masters; winning a majority makes it the new master, with a higher config epoch
Manual Failover
Sometimes you want to failover deliberately (maintenance, upgrades):
CLUSTER FAILOVER (graceful):
- Slave tells master “stop accepting writes”
- Master stops, slave catches up
- Slave becomes master
- No data loss!
CLUSTER FAILOVER TAKEOVER:
- Doesn’t need master’s consent
- May lose recent writes
- Use only when master is unreachable
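A minimal sketch of triggering it, assuming the redis-py package and a made-up address; note that CLUSTER FAILOVER is sent to the chosen replica, not the master:

```python
import redis

replica = redis.Redis(host="10.0.0.3", port=6379)

replica.execute_command("CLUSTER", "FAILOVER")               # graceful, no data loss
# replica.execute_command("CLUSTER", "FAILOVER", "TAKEOVER")  # only if the master is unreachable
```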
Consistency Trade-offs: What Redis Cluster Sacrifices
CAP Theorem Recap
- Under a network partition (P), a system must choose between consistency (C) and availability (A); Redis Cluster leans toward availability and does not guarantee strong consistency.
Asynchronous Replication: The Data Loss Window
- A master acknowledges a write before its replicas receive it, so writes acknowledged just before a master crash can be lost when a replica is promoted.
The Split-Brain Scenario
- During a partition, a master isolated on the minority side keeps accepting writes until it is demoted (roughly cluster-node-timeout later); those writes are discarded once it rejoins as a replica.
Mitigation: min-replicas-to-write
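A sketch of tightening that safety net at runtime, assuming the redis-py package and a made-up address; the same directives can live in redis.conf:

```python
import redis

master = redis.Redis(host="10.0.0.1", port=6379)
master.config_set("min-replicas-to-write", 1)   # refuse writes with fewer than 1 healthy replica
master.config_set("min-replicas-max-lag", 10)   # ...or if the replica lags more than 10 seconds
```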
Trade-off: Better consistency, but sacrifices availability.