[Cluster] - 3. MySQL Cluster
Overview
- MySQL is one of the most widely deployed relational databases in the world.
- As business grows, a single MySQL instance becomes the bottleneck — both in capacity and reliability.
- This blog traces the evolution of MySQL high availability: from simple replication to consensus-based clustering.
Evolution History: The Long Road to High Availability
Single Node Era
The starting point. One MySQL instance handles everything.
| Problem | Single Node |
|---|---|
| Single point of failure (SPOF) | Yes |
| Storage limited to single disk | Yes |
| Write throughput bottleneck | Yes |
| No automatic failover | N/A |
| Replication lag | N/A |
| Data consistency risk | N/A |
Master-Slave Replication: The First Step
The obvious solution: add replicas.
How It Works:
- Master writes to Binlog (binary log)
- Slave’s IO Thread pulls Binlog from Master
- Slave writes to local Relay Log
- Slave’s SQL Thread replays Relay Log
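The four steps above can be sketched as a toy Python model (the class and method names are illustrative, not MySQL internals):

```python
# Toy model of master-slave replication: the master appends events to a
# binlog, the slave's IO thread copies them into a relay log, and the
# SQL thread replays them against local data.
from collections import deque

class Master:
    def __init__(self):
        self.binlog = []                  # durable, ordered event stream
        self.data = {}

    def execute(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))  # step 1: write to binlog

class Slave:
    def __init__(self):
        self.relay_log = deque()          # local staging area
        self.pos = 0                      # binlog position already fetched
        self.data = {}

    def io_thread(self, master):
        # step 2: pull new binlog events from the master
        for event in master.binlog[self.pos:]:
            self.relay_log.append(event)  # step 3: write to relay log
        self.pos = len(master.binlog)

    def sql_thread(self):
        # step 4: replay relay log events against local data
        while self.relay_log:
            key, value = self.relay_log.popleft()
            self.data[key] = value

master, slave = Master(), Slave()
master.execute("user:1", "alice")
slave.io_thread(master)   # until this runs, the slave is stale: that is lag
slave.sql_thread()
print(slave.data)         # {'user:1': 'alice'}
```

The gap between `master.execute` and the slave's two threads running is exactly the replication-lag window described below.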
| Problem | Single Node | Master-Slave |
|---|---|---|
| Single point of failure (SPOF) | Yes | Partial |
| Storage limited to single disk | Yes | Solved |
| Write throughput bottleneck | Yes | Yes |
| No automatic failover | N/A | Yes |
| Replication lag | N/A | Yes |
| Data consistency risk | N/A | Yes |
Why “Partial” for SPOF?
- Data is now on multiple machines
- But if Master dies, someone must manually promote a Slave
- During this window, writes are blocked
The Lag Problem:
- Replication is asynchronous by default
- User writes data, then immediately reads
- Read goes to a Slave that hasn’t caught up yet
- User sees stale data
Semi-Synchronous Replication: The Compromise
MySQL 5.5 introduced semi-synchronous replication:
- Master writes to Binlog
- Master waits for at least one Slave to acknowledge receipt
- Only then does Master return success to client
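The three-step commit path can be sketched as a toy model (single-threaded and greatly simplified; `FakeSlave` is an invented stand-in, not the real protocol):

```python
# Semi-sync sketch: after writing to its binlog, the master returns
# success only once at least one slave has acknowledged receipt.
class FakeSlave:
    def __init__(self, alive=True):
        self.alive = alive
        self.relay_log = []

    def receive(self, event):
        if self.alive:
            self.relay_log.append(event)   # slave persists to relay log
        return self.alive                  # ACK only if reachable

def semi_sync_commit(event, slaves):
    # 1. master writes the event to its binlog (left implicit here)
    # 2. wait until at least one slave acknowledges receipt
    for slave in slaves:
        if slave.receive(event):
            return "OK"                    # 3. now safe to answer the client
    raise TimeoutError("no ACK: master would degrade to async mode")

print(semi_sync_commit("tx1", [FakeSlave(alive=False), FakeSlave()]))  # OK
```

Note the guarantee: "OK" reaches the client only after the event exists on at least two machines.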
| Problem | Single Node | Master-Slave | Semi-Sync |
|---|---|---|---|
| Single point of failure (SPOF) | Yes | Partial | Partial |
| Storage limited to single disk | Yes | Solved | Solved |
| Write throughput bottleneck | Yes | Yes | Yes |
| No automatic failover | N/A | Yes | Yes |
| Replication lag | N/A | Yes | Reduced |
| Data consistency risk | N/A | Yes | Better |
Replication Mode Comparison:
| Mode | Data Safety | Performance |
|---|---|---|
| Async | Risk of data loss | Fast |
| Semi-sync | At least one copy guaranteed | Slower (1 RTT) |
| Full Sync | All replicas confirm | Slowest |
The Critical Question: What Happens During the Wait?
When Master waits for Slave ACK, what about concurrent transactions?
Lock Protection:
- The waiting transaction holds row locks
- Any other transaction attempting to modify the same rows blocks
- Blocked until the first transaction completes (ACK received or timeout)
Timeout, Degradation, and Recovery
The Problem:
- Slave dies or network breaks
- Master cannot receive ACK
- Without timeout, Master hangs forever
Timeout Mechanism (`rpl_semi_sync_master_timeout`, default 10s):
- Master waits for configured timeout
- No ACK received → Master degrades to Async mode
- Transaction commits locally, returns success to client
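The degradation rule can be sketched as a toy state machine (illustrative only; the timeout mirrors `rpl_semi_sync_master_timeout` and the boolean mirrors `Rpl_semi_sync_master_status` flipping from ON to OFF):

```python
# Degrade-to-async sketch: if no slave ACK arrives within the timeout,
# the master stops waiting and commits unprotected from then on.
class SemiSyncMaster:
    def __init__(self, timeout=10.0):          # default 10s, as in MySQL
        self.timeout = timeout
        self.semi_sync_on = True

    def commit(self, ack_latency):
        """ack_latency: seconds until slave ACK, or None if it never comes."""
        if self.semi_sync_on:
            if ack_latency is None or ack_latency > self.timeout:
                self.semi_sync_on = False      # degrade: stop waiting for ACKs
        return "OK"                            # commit succeeds either way

m = SemiSyncMaster()
m.commit(ack_latency=0.002)   # healthy: ACK well within the timeout
print(m.semi_sync_on)         # True
m.commit(ack_latency=None)    # slave dead: timeout fires
print(m.semi_sync_on)         # False: async mode, later writes unprotected
```

The crucial property is that `commit` never fails the client: availability is preserved at the cost of durability guarantees.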
Async Mode Behavior:
- Master no longer waits for Slave acknowledgment
| Aspect | Semi-Sync Mode | Async Mode (Degraded) |
|---|---|---|
| Wait for Slave? | Yes (ACK required) | No (fire and forget) |
| Write Latency | +1 RTT (network) | Local disk speed only |
| Data Location | Master + at least 1 Slave | Master only |
| If Master Crashes | No data loss (RPO=0) | Data loss (RPO>0) |
The Unprotected Period:
- After degradation, all subsequent transactions (B, C, D…) commit immediately without Slave confirmation
- Data exists only on Master
- If Master crashes during this period, unsynced data is permanently lost
Automatic Recovery:
When Slave reconnects:
- Slave’s IO Thread connects to Master
- Master detects `Rpl_semi_sync_master_clients` increase from 0 to 1
- Master switches `Rpl_semi_sync_master_status` from OFF to ON
- Slave requests binlog from last known position
- Master sends accumulated backlog
- Next new transaction: Master waits for ACK again — protection resumes
Catch-Up Latency:
- During backlog transfer, new transactions may experience higher latency
- Network bandwidth is consumed by historical data replication
AFTER_COMMIT vs AFTER_SYNC: From “Semi-Safe” to “Lossless”
MySQL 5.5 introduced the mechanism, but MySQL 5.7 gave us the correctness.
AFTER_COMMIT (MySQL 5.5-5.6 default) — The Flawed Way:
- Binlog → Commit to InnoDB → Wait for ACK → Return to client
- Problem: Data visible before Slave confirms
Crash Scenario Analysis:
- Master crashes while waiting for ACK
- Slave: Never received the data
- Master: Already committed locally — other clients can see this data
- After failover: Data disappears from the new Master (Slave)
- Result: Phantom Data
AFTER_SYNC (MySQL 5.7+ default) — The Lossless Way:
- Controlled by `rpl_semi_sync_master_wait_point = AFTER_SYNC`
- Binlog → Send to Slave → Wait for ACK → Commit to InnoDB → Return to client
- Data still invisible until ACK received (not committed)
Crash Scenario Analysis:
- Master crashes while waiting for ACK
- Slave: Maybe received it, maybe not
- Master: Not committed — other clients cannot see this data
- After failover: Data missing, but no one ever saw it
- Result: Consistency preserved — Lossless
| Feature | AFTER_COMMIT (5.5) | AFTER_SYNC (5.7+) |
|---|---|---|
| Commit Timing | Before waiting for Slave | After Slave ACK received |
| Data Visibility | Visible during wait | Invisible until ACK |
| Failover Risk | Phantom data possible | Lossless — consistent |
| Verdict | “Semi-Safe” | “Truly Safe” |
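The crash behavior of the two wait points can be contrasted in a toy model (the return strings are descriptive labels, not MySQL output; `ack_ok` simulates whether the slave ACK ever arrives):

```python
# The only difference between the two wait points is whether the local
# InnoDB commit (which makes data visible) happens before or after the
# slave ACK. Simulate a master crash while waiting for the ACK.
def after_commit(ack_ok):
    visible = True            # commit to InnoDB first: data is visible
    if not ack_ok:            # master crashes before the ACK arrives
        # failover: the new master (old slave) never had the row,
        # yet other clients may already have read it on the old master
        return "phantom read possible", visible
    return "committed", visible

def after_sync(ack_ok):
    visible = False           # commit deferred until the ACK
    if not ack_ok:
        # failover: the row is lost, but nobody ever saw it
        return "lost but never seen", visible
    visible = True            # ACK received: now commit and expose
    return "committed", visible

print(after_commit(ack_ok=False))  # ('phantom read possible', True)
print(after_sync(ack_ok=False))    # ('lost but never seen', False)
```

Both modes can lose the in-flight transaction on a crash; only AFTER_SYNC guarantees that a lost transaction was never observable, which is what "lossless" means here.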
External HA: MHA and Orchestrator
Since MySQL itself doesn’t handle failover, external tools emerged.
MHA (Master High Availability):
- Monitor process watches Master via heartbeat
- Master dies → MHA compares all Slaves’ Binlog positions
- Promotes the most up-to-date Slave
- Redirects VIP (Virtual IP) to new Master
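The promotion step can be sketched as follows (a simplified illustration of the idea, not MHA's actual data model or field names):

```python
# MHA-style candidate selection: on master failure, promote the slave
# with the most advanced binlog position (newest file, highest offset).
def pick_new_master(slaves):
    return max(slaves, key=lambda s: (s["binlog_file"], s["binlog_pos"]))

slaves = [
    {"host": "db2", "binlog_file": "mysql-bin.000042", "binlog_pos": 1024},
    {"host": "db3", "binlog_file": "mysql-bin.000042", "binlog_pos": 4096},
    {"host": "db4", "binlog_file": "mysql-bin.000041", "binlog_pos": 9999},
]
print(pick_new_master(slaves)["host"])  # db3: newest file wins over offset
```

Real MHA also tries to copy missing binlog events from the dead master and equalize relay logs among slaves before promotion; the selection rule above is only the core idea.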
| Problem | Single Node | Master-Slave | Semi-Sync | MHA |
|---|---|---|---|---|
| Single point of failure (SPOF) | Yes | Partial | Partial | Solved |
| Storage limited to single disk | Yes | Solved | Solved | Solved |
| Write throughput bottleneck | Yes | Yes | Yes | Yes |
| No automatic failover | N/A | Yes | Yes | Partial |
| Replication lag | N/A | Yes | Reduced | Yes |
| Data consistency risk | N/A | Yes | Better | Yes |
Why “Partial” for Auto Failover?
- MHA provides automatic failover
- But it relies on external monitoring
- Network partitions can cause split-brain scenarios
The Split-Brain Problem:
Network partition creates two “Masters”:
- MHA thinks Master is dead (just network issue)
- Promotes Slave B to new Master
- Old Master A recovers, still accepting writes
- Two Masters, divergent data, disaster
Mitigation:
- STONITH (Shoot The Other Node In The Head)
- Before promoting new Master, physically power off old Master via IPMI/PDU
- Dead nodes can’t write data
MySQL Group Replication (MGR): The Official Answer
MySQL 5.7.17 introduced Group Replication — built-in clustering based on Paxos consensus.
How It Works:
- Write arrives at any node (in multi-primary mode)
- Node broadcasts transaction to group
- Majority must agree before commit
- Conflicting transactions are automatically rolled back
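The "majority must agree" rule is just a quorum check, sketched here as a toy function (real MGR runs a Paxos-based group communication protocol, not a simple vote count):

```python
# Quorum sketch: a transaction commits only if a strict majority of
# group members acknowledge it. With 3 nodes, 2 ACKs suffice; losing
# one node keeps the cluster writable, losing two does not.
def group_commit(votes):
    ayes = sum(votes)                  # True counts as 1
    return ayes * 2 > len(votes)       # strict majority required

print(group_commit([True, True, False]))         # True: 2 of 3
print(group_commit([True, False, False]))        # False: no quorum
print(group_commit([True, True, False, False]))  # False: a tie is not majority
```

The last case is why MGR clusters use odd node counts: with 4 nodes you still need 3 ACKs, so the fourth node adds cost without adding fault tolerance.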
| Problem | Single Node | Master-Slave | Semi-Sync | MHA | MGR |
|---|---|---|---|---|---|
| Single point of failure (SPOF) | Yes | Partial | Partial | Solved | Solved |
| Storage limited to single disk | Yes | Solved | Solved | Solved | Solved |
| Write throughput bottleneck | Yes | Yes | Yes | Yes | Yes |
| No automatic failover | N/A | Yes | Yes | Partial | Solved |
| Replication lag | N/A | Yes | Reduced | Yes | Minimal |
| Data consistency risk | N/A | Yes | Better | Yes | Strong |
Single-Primary vs Multi-Primary:
| Mode | Writes | Conflicts | Use Case |
|---|---|---|---|
| Single-Primary | One node only | None | Most production systems |
| Multi-Primary | Any node | Auto-detected, loser rolls back | Specific geo-distributed cases |
Warning:
- Multi-Primary sounds cool but has pitfalls
- Auto-increment conflicts (need offset configuration)
- High conflict rate = poor performance
- Most teams use Single-Primary even with MGR
The Physics Problem: Latency is Law
Here’s where many architects fail: you cannot beat the speed of light.
MGR’s Write Penalty
Every MGR write requires:
- Broadcast to all nodes
- Wait for majority ACK
- Then commit
In a LAN (same datacenter):
- +0.5ms to +2ms per write
- Acceptable
Across regions:
- Singapore ↔ New York: ~200ms RTT
- Every write waits 200ms for consensus
- TPS drops to single digits
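The "single digits" claim is back-of-envelope arithmetic: with serialized commits, per-connection throughput is bounded by one consensus round-trip per write.

```python
# Upper bound on per-connection TPS when every commit must wait one
# consensus round-trip (ignores disk, CPU, and pipelining).
def max_tps_per_connection(rtt_seconds):
    return 1.0 / rtt_seconds

print(max_tps_per_connection(0.002))  # same-DC, 2 ms RTT   -> 500.0 TPS
print(max_tps_per_connection(0.200))  # SG <-> NY, 200 ms   -> 5.0 TPS
```

Aggregate throughput can be higher with many parallel connections and group commit, but per-transaction latency never drops below the RTT, which is what users feel.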
The Correct Multi-Region Architecture
Don’t do this:
- MGR nodes spread across continents
Do this instead:
- Same-city multi-AZ: MGR cluster within one region (AZ latency < 2ms)
- Cross-region async replication: Separate MGR clusters connected by async replication
- Accept eventual consistency for cross-region reads
Scaling: The Hardest Part
Read Scaling: Easy Mode
Adding read capacity is straightforward:
- Backup existing Slave (XtraBackup)
- Restore to new server
- Run `CHANGE MASTER TO` to point it at the Master
- Add to load balancer
Write Scaling: Hard Mode (Sharding)
When single-Master can’t handle write load, you need sharding.
The Pain Points:
- Cross-shard JOINs: Impossible at database level, must handle in application
- Distributed transactions: XA protocol is slow; most use eventual consistency
- Global unique IDs: Auto-increment breaks; need Snowflake or similar
- Resharding: Moving data between shards is operational nightmare
Sharding Strategies:
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Range | user_id 1-1M → Shard1 | Simple, range queries | Hotspots on recent data |
| Hash | user_id % 4 → Shard N | Even distribution | Range queries impossible |
| Directory | Lookup table | Flexible | Extra hop, SPOF risk |
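Range and hash routing are each a few lines of code; the trade-off in the table falls directly out of them (shard counts and boundaries below are made up for illustration):

```python
# Range vs hash shard routing for a 4-shard layout.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]  # upper bound per shard

def range_shard(user_id):
    # neighbors land on the same shard: range scans are cheap,
    # but the newest IDs all hammer the last shard (hotspot)
    for shard, bound in enumerate(RANGE_BOUNDS):
        if user_id <= bound:
            return shard
    return len(RANGE_BOUNDS)

def hash_shard(user_id, n_shards=4):
    # even spread, but a range query must fan out to every shard
    return user_id % n_shards

print(range_shard(1_500_000))  # 1: within the second range
print(hash_shard(1_500_000))   # 0: 1_500_000 % 4
```

A directory scheme replaces both functions with a lookup table, trading an extra hop (and a component that must itself be highly available) for the freedom to move individual keys.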
Online Migration: The Double-Write Pattern
Zero-downtime data migration:
- Enable double-write: Application writes to both old and new location
- Backfill historical data: Background job copies old data
- Verify consistency: Compare checksums
- Switch reads: Gradually move read traffic
- Disable old writes: Final cutover
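The first three steps can be sketched with two dicts standing in for the old and new stores (checksum comparison is reduced to direct equality here):

```python
# Double-write migration sketch: mirror new writes, backfill history,
# verify, and only then switch reads.
old_db, new_db = {}, {}

def write(key, value, double_write=True):
    old_db[key] = value
    if double_write:
        new_db[key] = value            # step 1: mirror every new write

def backfill():
    for key, value in old_db.items():  # step 2: copy historical rows
        new_db.setdefault(key, value)  # never clobber a newer double-write

def verify():
    return old_db == new_db            # step 3: compare (checksums in reality)

old_db["legacy"] = "row"               # pre-existing data
write("fresh", "row")                  # arrives during migration
backfill()
print(verify())                        # True -> safe to start switching reads
```

The `setdefault` in `backfill` is the subtle part: a backfill job must never overwrite a row that the double-write path has already updated, or the migration silently loses recent writes.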
The Transaction Nightmare
Distributed Transactions: Choose Your Pain
When data spans multiple databases, ACID breaks down.
Option 1: XA/2PC (Two-Phase Commit)
- MySQL native support
- Strong consistency
- Terrible performance: Locks held across network round-trips
- Almost never used in high-throughput systems
Option 2: TCC (Try-Confirm-Cancel)
- Business logic split into three phases
- Better performance than XA
- Massive code complexity: Every operation needs Try/Confirm/Cancel implementations
Option 3: Saga + Event Sourcing
- Chain of local transactions
- Compensating transactions for rollback
- Eventual consistency: Not suitable for financial core
Option 4: Local Message Table
- Write business data + message in same local transaction
- Background job sends message to other services
- Best balance for most internet applications
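Option 4 can be sketched with SQLite standing in for the local database (table and column names are invented for the example):

```python
import sqlite3

# Local message table (outbox) sketch: the business row and the outbound
# message are written in ONE local transaction, so they either both exist
# or neither does; a background job later delivers pending messages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,"
             " sent INTEGER DEFAULT 0)")

def place_order(item):
    with conn:                                   # one atomic local transaction
        conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (f"order_created:{item}",))

def relay(send):
    # background job: deliver unsent messages, then mark them sent
    pending = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for mid, payload in pending:
        send(payload)                            # e.g. publish to a broker
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (mid,))

place_order("book")
delivered = []
relay(delivered.append)
print(delivered)  # ['order_created:book']
```

Delivery is at-least-once (a crash between `send` and the UPDATE redelivers), so downstream consumers must be idempotent; that is the price of avoiding XA.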
Read-Write Splitting: The Stale Read Trap
The Problem:
- User updates profile (writes to Master)
- User refreshes page (reads from Slave)
- Slave hasn’t replicated yet
- User sees old data, files bug report
Solutions:
| Approach | Implementation | Trade-off |
|---|---|---|
| Force Master read | After write, read from Master for N seconds | Reduces read scaling benefit |
| Causal consistency | Track write timestamp, route to caught-up Slave | Complex routing logic |
| Accept staleness | For non-critical reads (view counts, etc.) | Limited applicability |
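The "Force Master read" row can be sketched as a per-user sticky window (the one-second window and the routing function are illustrative assumptions, not a standard API):

```python
import time

# Read-your-writes routing: after a user writes, send that user's reads
# to the Master for a grace window, long enough for replicas to catch up
# in the common case.
STICKY_SECONDS = 1.0
last_write_at = {}                     # user_id -> time of last write

def record_write(user_id, now=None):
    last_write_at[user_id] = now if now is not None else time.monotonic()

def route_read(user_id, now=None):
    now = now if now is not None else time.monotonic()
    recent = now - last_write_at.get(user_id, float("-inf")) < STICKY_SECONDS
    return "master" if recent else "replica"

record_write("u1", now=100.0)
print(route_read("u1", now=100.5))  # master: inside the sticky window
print(route_read("u1", now=102.0))  # replica: replicas have caught up
print(route_read("u2", now=100.5))  # replica: u2 never wrote
```

Only writers pay the cost: everyone else keeps the full read-scaling benefit, which is why this is the most common compromise in practice.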
Production War Stories
The Big Transaction Disaster
Scenario:
- Batch job deletes 1 million expired records in single transaction
What happens:
- Binlog grows to gigabytes
- Slaves struggle to replay
- Replication lag spikes to hours
- MGR nodes get kicked out for being too far behind
- Cluster becomes read-only or crashes
Solution:
- Chunk large operations
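Chunking can be sketched with SQLite standing in for MySQL (the 1000-row batch size is a common starting point, not a magic number; in production you would also sleep between batches to let replicas drain):

```python
import sqlite3

# Chunked delete sketch: many small transactions instead of one
# million-row transaction, so each binlog event stays small and
# replicas (or MGR peers) can keep up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany("INSERT INTO events (expired) VALUES (?)", [(1,)] * 5000)

BATCH = 1000
deleted = 0
while True:
    with conn:                          # each batch is its own transaction
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events WHERE expired = 1 LIMIT ?)",
            (BATCH,))
    if cur.rowcount == 0:
        break                           # nothing left to delete
    deleted += cur.rowcount             # production: sleep briefly here

print(deleted)  # 5000, removed in five small transactions
```

Each batch holds locks only briefly and produces a bounded amount of binlog, which is exactly what keeps replicas from falling hours behind.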
The Failover That Wasn’t
Scenario:
- MHA triggers failover, but old Master wasn’t actually dead
Timeline:
- 00:00 - Network glitch, MHA loses heartbeat
- 00:01 - MHA promotes Slave B
- 00:02 - Network recovers, old Master A still running
- 00:02-00:10 - Both A and B accepting writes
- 00:10 - DBA notices, panic ensues
- Next 2 days - Manual data reconciliation
Prevention:
- STONITH (power off old Master before promotion)
- Fencing (network isolation scripts)
- Set `super_read_only` on potential Masters
Architecture Decision Guide
When to Use What
| Scenario | Recommendation |
|---|---|
| Startup, < 10k QPS | Single node + daily backup |
| Growth, need HA | Master-Slave + Semi-sync + ProxySQL |
| Enterprise, zero data loss | MGR Single-Primary |
| Global users | Regional MGR clusters + async replication |
| Massive write load | Sharding (but exhaust vertical scaling first) |
The Golden Rules
- Don’t shard until you must: Vertical scaling (bigger machine) is always simpler
- Don’t chase strong consistency globally: Physics wins; accept regional consistency
- Don’t trust auto-failover blindly: Always have runbooks for manual intervention
- Don’t forget the application: Most “database problems” are actually query problems
Summary
MySQL clustering is fundamentally about trade-offs:
- Async replication: Fast but may lose data
- Semi-sync: Balanced but can degrade
- MGR: Safe but slower and complex
- Sharding: Scales writes but breaks SQL semantics
The right architecture depends on your actual requirements, not theoretical ideals. Most systems are fine with Master-Slave + Semi-sync. Only upgrade complexity when you have real problems to solve.