[Cluster] - 3. MySQL Cluster
Overview
- MySQL is one of the most widely deployed relational databases in the world.
- As business grows, a single MySQL instance becomes the bottleneck — both in capacity and reliability.
- This blog traces the evolution of MySQL high availability: from simple replication to consensus-based clustering.
Evolution History: The Long Road to High Availability
Single Node Era
The starting point. One MySQL instance handles everything.
| Problem | Single Node |
|---|---|
| Single point of failure (SPOF) | Yes |
| Storage limited to single disk | Yes |
| Write throughput bottleneck | Yes |
| No automatic failover | N/A |
| Replication lag | N/A |
| Data consistency risk | N/A |
Master-Slave Replication: The First Step
The obvious solution: add replicas.
How It Works:
- Master writes to Binlog (binary log)
- Slave’s IO Thread pulls Binlog from Master
- Slave writes to local Relay Log
- Slave’s SQL Thread replays Relay Log
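The four steps above can be sketched as a toy Python model (the class and method names are illustrative, not MySQL internals):

```python
# Toy model of master-slave replication: the master appends events to a
# binlog, the slave's IO thread copies them into a relay log, and the
# SQL thread replays them against local data.
from collections import deque

class Master:
    def __init__(self):
        self.binlog = []                  # durable, ordered event stream
        self.data = {}

    def execute(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))  # step 1: write to binlog

class Slave:
    def __init__(self):
        self.relay_log = deque()          # local staging area
        self.pos = 0                      # binlog position already fetched
        self.data = {}

    def io_thread(self, master):
        # step 2: pull new binlog events from the master
        for event in master.binlog[self.pos:]:
            self.relay_log.append(event)  # step 3: write to relay log
        self.pos = len(master.binlog)

    def sql_thread(self):
        # step 4: replay relay log events against local data
        while self.relay_log:
            key, value = self.relay_log.popleft()
            self.data[key] = value

master, slave = Master(), Slave()
master.execute("user:1", "alice")
slave.io_thread(master)   # until this runs, the slave is stale: that is lag
slave.sql_thread()
print(slave.data)         # {'user:1': 'alice'}
```

The gap between `master.execute` and the slave's two threads running is exactly the replication-lag window described below.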
| Problem | Single Node | Master-Slave |
|---|---|---|
| Single point of failure (SPOF) | Yes | Partial |
| Storage limited to single disk | Yes | Solved |
| Write throughput bottleneck | Yes | Yes |
| No automatic failover | N/A | Yes |
| Replication lag | N/A | Yes |
| Data consistency risk | N/A | Yes |
Why “Partial” for SPOF?
- Data is now on multiple machines
- But if Master dies, someone must manually promote a Slave
- During this window, writes are blocked
The Lag Problem:
- Replication is asynchronous by default
- User writes data, then immediately reads
- Read goes to a Slave that hasn’t caught up yet
- User sees stale data
Semi-Synchronous Replication: The Compromise
MySQL 5.5 introduced semi-synchronous replication:
- Master writes to Binlog
- Master waits for at least one Slave to acknowledge receipt
- Only then does Master return success to client
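The three-step commit path can be sketched as a toy model (single-threaded and greatly simplified; `FakeSlave` is an invented stand-in, not the real protocol):

```python
# Semi-sync sketch: after writing to its binlog, the master returns
# success only once at least one slave has acknowledged receipt.
class FakeSlave:
    def __init__(self, alive=True):
        self.alive = alive
        self.relay_log = []

    def receive(self, event):
        if self.alive:
            self.relay_log.append(event)   # slave persists to relay log
        return self.alive                  # ACK only if reachable

def semi_sync_commit(event, slaves):
    # 1. master writes the event to its binlog (left implicit here)
    # 2. wait until at least one slave acknowledges receipt
    for slave in slaves:
        if slave.receive(event):
            return "OK"                    # 3. now safe to answer the client
    raise TimeoutError("no ACK: master would degrade to async mode")

print(semi_sync_commit("tx1", [FakeSlave(alive=False), FakeSlave()]))  # OK
```

Note the guarantee: "OK" reaches the client only after the event exists on at least two machines.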
| Problem | Single Node | Master-Slave | Semi-Sync |
|---|---|---|---|
| Single point of failure (SPOF) | Yes | Partial | Partial |
| Storage limited to single disk | Yes | Solved | Solved |
| Write throughput bottleneck | Yes | Yes | Yes |
| No automatic failover | N/A | Yes | Yes |
| Replication lag | N/A | Yes | Reduced |
| Data consistency risk | N/A | Yes | Better |
Replication Mode Comparison:
| Mode | Data Safety | Performance |
|---|---|---|
| Async | Risk of data loss | Fast |
| Semi-sync | At least one copy guaranteed | Slower (1 RTT) |
| Full Sync | All replicas confirm | Slowest |
The Critical Question: What Happens During the Wait?
When Master waits for Slave ACK, what about concurrent transactions?
Lock Protection:
- The waiting transaction holds row locks
- Any other transaction attempting to modify the same rows blocks
- Blocked until the first transaction completes (ACK received or timeout)
Timeout, Degradation, and Recovery
The Problem:
- Slave dies or network breaks
- Master cannot receive ACK
- Without timeout, Master hangs forever
Timeout Mechanism (`rpl_semi_sync_master_timeout`, default 10s):
- Master waits for configured timeout
- No ACK received → Master degrades to Async mode
- Transaction commits locally, returns success to client
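The degradation rule can be sketched as a toy state machine (illustrative only; the timeout mirrors `rpl_semi_sync_master_timeout` and the boolean mirrors `Rpl_semi_sync_master_status` flipping from ON to OFF):

```python
# Degrade-to-async sketch: if no slave ACK arrives within the timeout,
# the master stops waiting and commits unprotected from then on.
class SemiSyncMaster:
    def __init__(self, timeout=10.0):          # default 10s, as in MySQL
        self.timeout = timeout
        self.semi_sync_on = True

    def commit(self, ack_latency):
        """ack_latency: seconds until slave ACK, or None if it never comes."""
        if self.semi_sync_on:
            if ack_latency is None or ack_latency > self.timeout:
                self.semi_sync_on = False      # degrade: stop waiting for ACKs
        return "OK"                            # commit succeeds either way

m = SemiSyncMaster()
m.commit(ack_latency=0.002)   # healthy: ACK well within the timeout
print(m.semi_sync_on)         # True
m.commit(ack_latency=None)    # slave dead: timeout fires
print(m.semi_sync_on)         # False: async mode, later writes unprotected
```

The crucial property is that `commit` never fails the client: availability is preserved at the cost of durability guarantees.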
Async Mode Behavior:
- Master no longer waits for Slave acknowledgment
| Aspect | Semi-Sync Mode | Async Mode (Degraded) |
|---|---|---|
| Wait for Slave? | Yes (ACK required) | No (fire and forget) |
| Write Latency | +1 RTT (network) | Local disk speed only |
| Data Location | Master + at least 1 Slave | Master only |
| If Master Crashes | No data loss (RPO=0) | Data loss (RPO>0) |
The Unprotected Period:
- After degradation, all subsequent transactions (B, C, D…) commit immediately without Slave confirmation
- Data exists only on Master
- If Master crashes during this period, unsynced data is permanently lost
Automatic Recovery:
When Slave reconnects:
- Slave’s IO Thread connects to Master
- Master detects `Rpl_semi_sync_master_clients` increase from 0 to 1
- Master switches `Rpl_semi_sync_master_status` from OFF to ON
- Slave requests binlog from last known position
- Master sends accumulated backlog
- Next new transaction: Master waits for ACK again — protection resumes
Catch-Up Latency:
- During backlog transfer, new transactions may experience higher latency
- Network bandwidth is consumed by historical data replication
AFTER_COMMIT vs AFTER_SYNC: From “Semi-Safe” to “Lossless”
MySQL 5.5 introduced the mechanism, but MySQL 5.7 gave us the correctness.
AFTER_COMMIT (MySQL 5.5-5.6 default) — The Flawed Way:
- Binlog → Commit to InnoDB → Wait for ACK → Return to client
- Problem: Data visible before Slave confirms
Crash Scenario Analysis:
- Master crashes while waiting for ACK
- Slave: Never received the data
- Master: Already committed locally — other clients can see this data
- After failover: Data disappears from the new Master (Slave)
- Result: Phantom Data
AFTER_SYNC (MySQL 5.7+ default) — The Lossless Way:
- Controlled by `rpl_semi_sync_master_wait_point = AFTER_SYNC`
- Binlog → Send to Slave → Wait for ACK → Commit to InnoDB → Return to client
- Data still invisible until ACK received (not committed)
Crash Scenario Analysis:
- Master crashes while waiting for ACK
- Slave: Maybe received it, maybe not
- Master: Not committed — other clients cannot see this data
- After failover: Data missing, but no one ever saw it
- Result: Consistency preserved — Lossless
| Feature | AFTER_COMMIT (5.5) | AFTER_SYNC (5.7+) |
|---|---|---|
| Commit Timing | Before waiting for Slave | After Slave ACK received |
| Data Visibility | Visible during wait | Invisible until ACK |
| Failover Risk | Phantom data possible | Lossless — consistent |
| Verdict | “Semi-Safe” | “Truly Safe” |
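The crash behavior of the two wait points can be contrasted in a toy model (the return strings are descriptive labels, not MySQL output; `ack_ok` simulates whether the slave ACK ever arrives):

```python
# The only difference between the two wait points is whether the local
# InnoDB commit (which makes data visible) happens before or after the
# slave ACK. Simulate a master crash while waiting for the ACK.
def after_commit(ack_ok):
    visible = True            # commit to InnoDB first: data is visible
    if not ack_ok:            # master crashes before the ACK arrives
        # failover: the new master (old slave) never had the row,
        # yet other clients may already have read it on the old master
        return "phantom read possible", visible
    return "committed", visible

def after_sync(ack_ok):
    visible = False           # commit deferred until the ACK
    if not ack_ok:
        # failover: the row is lost, but nobody ever saw it
        return "lost but never seen", visible
    visible = True            # ACK received: now commit and expose
    return "committed", visible

print(after_commit(ack_ok=False))  # ('phantom read possible', True)
print(after_sync(ack_ok=False))    # ('lost but never seen', False)
```

Both modes can lose the in-flight transaction on a crash; only AFTER_SYNC guarantees that a lost transaction was never observable, which is what "lossless" means here.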
External HA: MHA and Orchestrator
Since MySQL itself doesn’t handle failover, external tools emerged.
MHA (Master High Availability):
- Monitor process watches Master via heartbeat
- Master dies → MHA compares all Slaves’ Binlog positions
- Promotes the most up-to-date Slave
- Redirects VIP (Virtual IP) to new Master
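The promotion step can be sketched as follows (a simplified illustration of the idea, not MHA's actual data model or field names):

```python
# MHA-style candidate selection: on master failure, promote the slave
# with the most advanced binlog position (newest file, highest offset).
def pick_new_master(slaves):
    return max(slaves, key=lambda s: (s["binlog_file"], s["binlog_pos"]))

slaves = [
    {"host": "db2", "binlog_file": "mysql-bin.000042", "binlog_pos": 1024},
    {"host": "db3", "binlog_file": "mysql-bin.000042", "binlog_pos": 4096},
    {"host": "db4", "binlog_file": "mysql-bin.000041", "binlog_pos": 9999},
]
print(pick_new_master(slaves)["host"])  # db3: newest file wins over offset
```

Real MHA also tries to copy missing binlog events from the dead master and equalize relay logs among slaves before promotion; the selection rule above is only the core idea.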
| Problem | Single Node | Master-Slave | Semi-Sync | MHA |
|---|---|---|---|---|
| Single point of failure (SPOF) | Yes | Partial | Partial | Solved |
| Storage limited to single disk | Yes | Solved | Solved | Solved |
| Write throughput bottleneck | Yes | Yes | Yes | Yes |
| No automatic failover | N/A | Yes | Yes | Partial |
| Replication lag | N/A | Yes | Reduced | Yes |
| Data consistency risk | N/A | Yes | Better | Yes |
Why “Partial” for Auto Failover?
- MHA provides automatic failover
- But it relies on external monitoring
- Network partitions can cause split-brain scenarios
The Split-Brain Problem:
Network partition creates two “Masters”:
- MHA thinks Master is dead (just network issue)
- Promotes Slave B to new Master
- Old Master A recovers, still accepting writes
- Two Masters, divergent data, disaster
Mitigation:
- STONITH (Shoot The Other Node In The Head)
- Before promoting new Master, physically power off old Master via IPMI/PDU
- Dead nodes can’t write data
MySQL Group Replication (MGR): The Official Answer
MySQL 5.7.17 introduced Group Replication — built-in clustering based on Paxos consensus.
How It Works:
- Write arrives at any node (in multi-primary mode)
- Node broadcasts transaction to group
- Majority must agree before commit
- Conflicting transactions are automatically rolled back
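The "majority must agree" rule is just a quorum check, sketched here as a toy function (real MGR runs a Paxos-based group communication protocol, not a simple vote count):

```python
# Quorum sketch: a transaction commits only if a strict majority of
# group members acknowledge it. With 3 nodes, 2 ACKs suffice; losing
# one node keeps the cluster writable, losing two does not.
def group_commit(votes):
    ayes = sum(votes)                  # True counts as 1
    return ayes * 2 > len(votes)       # strict majority required

print(group_commit([True, True, False]))         # True: 2 of 3
print(group_commit([True, False, False]))        # False: no quorum
print(group_commit([True, True, False, False]))  # False: a tie is not majority
```

The last case is why MGR clusters use odd node counts: with 4 nodes you still need 3 ACKs, so the fourth node adds cost without adding fault tolerance.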
| Problem | Single Node | Master-Slave | Semi-Sync | MHA | MGR |
|---|---|---|---|---|---|
| Single point of failure (SPOF) | Yes | Partial | Partial | Solved | Solved |
| Storage limited to single disk | Yes | Solved | Solved | Solved | Solved |
| Write throughput bottleneck | Yes | Yes | Yes | Yes | Yes |
| No automatic failover | N/A | Yes | Yes | Partial | Solved |
| Replication lag | N/A | Yes | Reduced | Yes | Minimal |
| Data consistency risk | N/A | Yes | Better | Yes | Strong |
Single-Primary vs Multi-Primary:
| Mode | Writes | Conflicts | Use Case |
|---|---|---|---|
| Single-Primary | One node only | None | Most production systems |
| Multi-Primary | Any node | Auto-detected, loser rolls back | Specific geo-distributed cases |
Warning:
- Multi-Primary sounds cool but has pitfalls
- Auto-increment conflicts (need offset configuration)
- High conflict rate = poor performance
- Most teams use Single-Primary even with MGR
The Physics Problem: Latency is Law
Here’s where many architects fail: you cannot beat the speed of light.
MGR’s Write Penalty
Every MGR write requires:
- Broadcast to all nodes
- Wait for majority ACK
- Then commit
In a LAN (same datacenter):
- +0.5ms to +2ms per write
- Acceptable
Across regions:
- Singapore ↔ New York: ~200ms RTT
- Every write waits 200ms for consensus
- TPS drops to single digits
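The "single digits" claim is back-of-envelope arithmetic: with serialized commits, per-connection throughput is bounded by one consensus round-trip per write.

```python
# Upper bound on per-connection TPS when every commit must wait one
# consensus round-trip (ignores disk, CPU, and pipelining).
def max_tps_per_connection(rtt_seconds):
    return 1.0 / rtt_seconds

print(max_tps_per_connection(0.002))  # same-DC, 2 ms RTT   -> 500.0 TPS
print(max_tps_per_connection(0.200))  # SG <-> NY, 200 ms   -> 5.0 TPS
```

Aggregate throughput can be higher with many parallel connections and group commit, but per-transaction latency never drops below the RTT, which is what users feel.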
The Correct Multi-Region Architecture
Don’t do this:
- MGR nodes spread across continents
Do this instead:
- Same-city multi-AZ: MGR cluster within one region (AZ latency < 2ms)
- Cross-region async replication: Separate MGR clusters connected by async replication
- Accept eventual consistency for cross-region reads
Scaling: The Hardest Part
Read Scaling: Easy Mode
Adding read capacity is straightforward:
- Backup existing Slave (XtraBackup)
- Restore to new server
- Run `CHANGE MASTER TO` to point it at the Master
- Add to load balancer
Write Scaling: Hard Mode (Sharding)
When single-Master can’t handle write load, you need sharding.
The Pain Points:
- Cross-shard JOINs: Impossible at database level, must handle in application
- Distributed transactions: XA protocol is slow; most use eventual consistency
- Global unique IDs: Auto-increment breaks; need Snowflake or similar
- Resharding: Moving data between shards is operational nightmare
Sharding Strategies:
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Range | user_id 1-1M → Shard1 | Simple, range queries | Hotspots on recent data |
| Hash | user_id % 4 → Shard N | Even distribution | Range queries impossible |
| Directory | Lookup table | Flexible | Extra hop, SPOF risk |
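Range and hash routing are each a few lines of code; the trade-off in the table falls directly out of them (shard counts and boundaries below are made up for illustration):

```python
# Range vs hash shard routing for a 4-shard layout.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]  # upper bound per shard

def range_shard(user_id):
    # neighbors land on the same shard: range scans are cheap,
    # but the newest IDs all hammer the last shard (hotspot)
    for shard, bound in enumerate(RANGE_BOUNDS):
        if user_id <= bound:
            return shard
    return len(RANGE_BOUNDS)

def hash_shard(user_id, n_shards=4):
    # even spread, but a range query must fan out to every shard
    return user_id % n_shards

print(range_shard(1_500_000))  # 1: within the second range
print(hash_shard(1_500_000))   # 0: 1_500_000 % 4
```

A directory scheme replaces both functions with a lookup table, trading an extra hop (and a component that must itself be highly available) for the freedom to move individual keys.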
Online Migration: The Double-Write Pattern
Zero-downtime data migration:
- Enable double-write: Application writes to both old and new location
- Backfill historical data: Background job copies old data
- Verify consistency: Compare checksums
- Switch reads: Gradually move read traffic
- Disable old writes: Final cutover
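The first three steps can be sketched with two dicts standing in for the old and new stores (checksum comparison is reduced to direct equality here):

```python
# Double-write migration sketch: mirror new writes, backfill history,
# verify, and only then switch reads.
old_db, new_db = {}, {}

def write(key, value, double_write=True):
    old_db[key] = value
    if double_write:
        new_db[key] = value            # step 1: mirror every new write

def backfill():
    for key, value in old_db.items():  # step 2: copy historical rows
        new_db.setdefault(key, value)  # never clobber a newer double-write

def verify():
    return old_db == new_db            # step 3: compare (checksums in reality)

old_db["legacy"] = "row"               # pre-existing data
write("fresh", "row")                  # arrives during migration
backfill()
print(verify())                        # True -> safe to start switching reads
```

The `setdefault` in `backfill` is the subtle part: a backfill job must never overwrite a row that the double-write path has already updated, or the migration silently loses recent writes.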
The Transaction Nightmare
Distributed Transactions: Choose Your Pain
When data spans multiple databases, ACID breaks down.
Option 1: XA/2PC (Two-Phase Commit)
- MySQL native support
- Strong consistency
- Terrible performance: Locks held across network round-trips
- Almost never used in high-throughput systems
Option 2: TCC (Try-Confirm-Cancel)
- Business logic split into three phases
- Better performance than XA
- Massive code complexity: Every operation needs Try/Confirm/Cancel implementations
Option 3: Saga + Event Sourcing
- Chain of local transactions
- Compensating transactions for rollback
- Eventual consistency: Not suitable for financial core
Option 4: Local Message Table
- Write business data + message in same local transaction
- Background job sends message to other services
- Best balance for most internet applications
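Option 4 can be sketched with SQLite standing in for the local database (table and column names are invented for the example):

```python
import sqlite3

# Local message table (outbox) sketch: the business row and the outbound
# message are written in ONE local transaction, so they either both exist
# or neither does; a background job later delivers pending messages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT,"
             " sent INTEGER DEFAULT 0)")

def place_order(item):
    with conn:                                   # one atomic local transaction
        conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (f"order_created:{item}",))

def relay(send):
    # background job: deliver unsent messages, then mark them sent
    pending = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for mid, payload in pending:
        send(payload)                            # e.g. publish to a broker
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (mid,))

place_order("book")
delivered = []
relay(delivered.append)
print(delivered)  # ['order_created:book']
```

Delivery is at-least-once (a crash between `send` and the UPDATE redelivers), so downstream consumers must be idempotent; that is the price of avoiding XA.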
Read-Write Splitting: The Stale Read Trap
The Problem:
- User updates profile (writes to Master)
- User refreshes page (reads from Slave)
- Slave hasn’t replicated yet
- User sees old data, files bug report
Solutions:
| Approach | Implementation | Trade-off |
|---|---|---|
| Force Master read | After write, read from Master for N seconds | Reduces read scaling benefit |
| Causal consistency | Track write timestamp, route to caught-up Slave | Complex routing logic |
| Accept staleness | For non-critical reads (view counts, etc.) | Limited applicability |
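The "Force Master read" row can be sketched as a per-user sticky window (the one-second window and the routing function are illustrative assumptions, not a standard API):

```python
import time

# Read-your-writes routing: after a user writes, send that user's reads
# to the Master for a grace window, long enough for replicas to catch up
# in the common case.
STICKY_SECONDS = 1.0
last_write_at = {}                     # user_id -> time of last write

def record_write(user_id, now=None):
    last_write_at[user_id] = now if now is not None else time.monotonic()

def route_read(user_id, now=None):
    now = now if now is not None else time.monotonic()
    recent = now - last_write_at.get(user_id, float("-inf")) < STICKY_SECONDS
    return "master" if recent else "replica"

record_write("u1", now=100.0)
print(route_read("u1", now=100.5))  # master: inside the sticky window
print(route_read("u1", now=102.0))  # replica: replicas have caught up
print(route_read("u2", now=100.5))  # replica: u2 never wrote
```

Only writers pay the cost: everyone else keeps the full read-scaling benefit, which is why this is the most common compromise in practice.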
Production War Stories
The Big Transaction Disaster
Scenario:
- Batch job deletes 1 million expired records in single transaction
What happens:
- Binlog grows to gigabytes
- Slaves struggle to replay
- Replication lag spikes to hours
- MGR nodes get kicked out for being too far behind
- Cluster becomes read-only or crashes
Solution:
- Chunk large operations
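Chunking can be sketched with SQLite standing in for MySQL (the 1000-row batch size is a common starting point, not a magic number; in production you would also sleep between batches to let replicas drain):

```python
import sqlite3

# Chunked delete sketch: many small transactions instead of one
# million-row transaction, so each binlog event stays small and
# replicas (or MGR peers) can keep up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany("INSERT INTO events (expired) VALUES (?)", [(1,)] * 5000)

BATCH = 1000
deleted = 0
while True:
    with conn:                          # each batch is its own transaction
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events WHERE expired = 1 LIMIT ?)",
            (BATCH,))
    if cur.rowcount == 0:
        break                           # nothing left to delete
    deleted += cur.rowcount             # production: sleep briefly here

print(deleted)  # 5000, removed in five small transactions
```

Each batch holds locks only briefly and produces a bounded amount of binlog, which is exactly what keeps replicas from falling hours behind.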
The Failover That Wasn’t
Scenario:
- MHA triggers failover, but old Master wasn’t actually dead
Timeline:
- 00:00 - Network glitch, MHA loses heartbeat
- 00:01 - MHA promotes Slave B
- 00:02 - Network recovers, old Master A still running
- 00:02-00:10 - Both A and B accepting writes
- 00:10 - DBA notices, panic ensues
- Next 2 days - Manual data reconciliation
Prevention:
- STONITH (power off old Master before promotion)
- Fencing (network isolation scripts)
- Set `super_read_only` on potential Masters
Architecture Decision Guide
When to Use What
| Scenario | Recommendation |
|---|---|
| Startup, < 10k QPS | Single node + daily backup |
| Growth, need HA | Master-Slave + Semi-sync + ProxySQL |
| Enterprise, zero data loss | MGR Single-Primary |
| Global users | Regional MGR clusters + async replication |
| Massive write load | Sharding (but exhaust vertical scaling first) |
The Golden Rules
- Don’t shard until you must: Vertical scaling (bigger machine) is always simpler
- Don’t chase strong consistency globally: Physics wins; accept regional consistency
- Don’t trust auto-failover blindly: Always have runbooks for manual intervention
- Don’t forget the application: Most “database problems” are actually query problems
Summary
MySQL clustering is fundamentally about trade-offs:
- Async replication: Fast but may lose data
- Semi-sync: Balanced but can degrade
- MGR: Safe but slower and complex
- Sharding: Scales writes but breaks SQL semantics
The right architecture depends on your actual requirements, not theoretical ideals. Most systems are fine with Master-Slave + Semi-sync. Only upgrade complexity when you have real problems to solve.