Failover & Recovery Testing

Now the payoff: we kill the primary and watch the cluster heal itself with no human intervention.

Redis Sentinel failover flow: primary dies, Sentinels detect it, quorum agrees, a replica is promoted, clients re-resolve, the old primary rejoins as a replica

Before: note the current primary

redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
# 1) "10.100.100.101"   (redis-sv01)
# 2) "6379"

Seed a key so we can prove data survives the failover:

redis-cli -h 10.100.100.101 -a ChangeMe_RedisPass --no-auth-warning set ha:canary before-failover
# OK

Trigger an unplanned failover

Simulate a real server death: hard-stop the primary. On a hypervisor that's a force-stop; on a cloud VM, a power-off. (Stopping the service with sudo systemctl stop redis-server is a graceful shutdown rather than a crash, but the Sentinels still lose contact with the primary, so it works just as well for this test.)

What happens, in order:

1. Each Sentinel stops getting replies from redis-sv01.
2. After down-after-milliseconds (5 s) without a valid reply, each Sentinel marks it subjectively down (SDOWN).
3. Once a quorum (2 of 3) of Sentinels agree, the primary is marked objectively down (ODOWN).
4. The Sentinels elect a leader (majority, 2 of 3), which runs the failover: it picks the best replica (by replica-priority, then by the most advanced replication offset), promotes it with REPLICAOF NO ONE, and reconfigures the other replica to follow the new primary.
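
The detection rules above can be sketched as a toy model. This is not Sentinel's implementation, which exchanges votes over the network and tracks epochs; the class and function names are hypothetical, and the defaults (5000 ms, quorum 2, three Sentinels) are taken from this setup.

```python
from dataclasses import dataclass

# Toy model only: the names below are hypothetical and exist for illustration.


@dataclass
class SentinelView:
    name: str
    last_valid_reply_ms: int  # when this Sentinel last heard from the primary


def subjectively_down(view, now_ms, down_after_ms=5000):
    """One Sentinel's local opinion: silent for longer than down-after-milliseconds."""
    return now_ms - view.last_valid_reply_ms > down_after_ms


def objectively_down(views, now_ms, quorum=2, down_after_ms=5000):
    """Quorum reached: at least `quorum` Sentinels hold the subjective opinion."""
    votes = sum(subjectively_down(v, now_ms, down_after_ms) for v in views)
    return votes >= quorum


def votes_needed_for_leader(total_sentinels=3, quorum=2):
    """The failover leader needs a majority of all Sentinels, and at least `quorum`."""
    return max(total_sentinels // 2 + 1, quorum)
```

With three Sentinels and a quorum of 2, two votes are enough both to mark the primary objectively down and to authorize the leader, which is why this layout tolerates the loss of one Sentinel.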

After: confirm automatic recovery

Once the 5 s detection window has passed, the election and promotion usually take only a few more seconds. Ask Sentinel again:

redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
# 1) "10.100.100.103"   (redis-sv03 — promoted)
# 2) "6379"
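
Rather than re-running the command by hand, the check can be polled and timed. The following is a minimal sketch, assuming redis-cli is on the PATH and a Sentinel listens on local port 26379 as above; the function names, the 60 s timeout, and the 0.5 s interval are illustrative choices, not part of Redis.

```python
import subprocess
import time


def sentinel_primary(name="mymaster", port=26379):
    """Ask the local Sentinel for the primary; returns (ip, port) or None."""
    # When its output is piped, redis-cli prints one raw value per line.
    out = subprocess.run(
        ["redis-cli", "-p", str(port), "sentinel", "get-master-addr-by-name", name],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if len(out) < 2:
        return None  # Sentinel does not know this primary name
    return out[0], int(out[1])


def wait_for_new_primary(old_addr, get_primary=sentinel_primary,
                         timeout_s=60.0, interval_s=0.5,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until Sentinel reports an address other than old_addr.

    Returns (new_addr, seconds_waited); raises TimeoutError otherwise.
    """
    start = clock()
    while clock() - start < timeout_s:
        addr = get_primary()
        if addr is not None and addr != old_addr:
            return addr, clock() - start
        sleep(interval_s)
    raise TimeoutError(f"primary still {old_addr} after {timeout_s} s")
```

Calling wait_for_new_primary(("10.100.100.101", 6379)) just before stopping redis-sv01 gives a rough failover duration, measured from the start of polling rather than from the exact moment of the outage.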

The canary key is present on the new primary, and it accepts writes:

redis-cli -h 10.100.100.103 -a ChangeMe_RedisPass --no-auth-warning get ha:canary
# -> "before-failover"
redis-cli -h 10.100.100.103 -a ChangeMe_RedisPass --no-auth-warning set ha:after written-to-new-primary
# -> OK

Check the new primary's replication view:

redis-cli -h 10.100.100.103 -a ChangeMe_RedisPass --no-auth-warning info replication

role:master
connected_slaves:1
slave0:ip=10.100.100.102,port=6379,state=online,offset=20838,lag=0
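
For a scripted check, the INFO output can be parsed instead of read by eye. This is a small sketch that assumes the key:value layout shown above; the helper names are hypothetical, and expecting one online replica reflects the moment while redis-sv01 is still down.

```python
def parse_info(text):
    """Turn INFO output into a dict, skipping blank lines and '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def replica_entries(fields):
    """Expand slave0, slave1, ... lines into dicts of their comma-separated parts."""
    replicas = []
    for key, value in fields.items():
        # Match slaveN only, not keys such as slave_repl_offset.
        if key.startswith("slave") and key[5:].isdigit():
            replicas.append(dict(item.split("=", 1) for item in value.split(",")))
    return replicas


def is_healthy_primary(fields, expected_replicas=1):
    """True if the node reports role:master with enough replicas in state=online."""
    online = [r for r in replica_entries(fields) if r.get("state") == "online"]
    return fields.get("role") == "master" and len(online) >= expected_replicas
```

Feeding it the output of the info replication command above should report a healthy primary with 10.100.100.102 as its only online replica.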

Any Sentinel-aware client that resolved the primary through the Sentinels the whole time followed the switch automatically.
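
What "followed the switch" means on the client side can be sketched as a retry loop that asks for the primary again after every failed write. This is only an illustration of the pattern: resolve_primary and write are hypothetical callables supplied by the caller, and real Sentinel-aware client libraries (for example redis-py's redis.sentinel.Sentinel with master_for) handle this internally.

```python
import time


def write_with_reresolve(resolve_primary, write, attempts=5, backoff_s=1.0,
                         sleep=time.sleep):
    """Re-resolve the primary before each attempt and retry on connection errors.

    resolve_primary() returns the current primary address as reported by Sentinel;
    write(addr) performs the write and raises ConnectionError or TimeoutError
    if that node is unreachable. Both are assumptions made for this sketch.
    """
    last_error = None
    for _ in range(attempts):
        addr = resolve_primary()  # never cache the address across failures
        try:
            return addr, write(addr)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            sleep(backoff_s)  # give the Sentinels time to finish the failover
    raise ConnectionError(f"no writable primary after {attempts} attempts") from last_error
```

The key design point is that the address is looked up again on every attempt, so the first attempt after the promotion lands on the new primary. Writes issued during the detection window still fail and are retried, so the application does see a short pause.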

Recovery: the old primary rejoins

Start redis-sv01 again. It boots believing it is still a master, but the Sentinels soon reconfigure it: they rewrite its replicaof to point at the new primary, and it rejoins as a replica:

redis-cli -h 10.100.100.101 -a ChangeMe_RedisPass --no-auth-warning info replication

role:slave
master_host:10.100.100.103
master_link_status:up

Three members again (one primary, two replicas), and the key written during the outage (ha:after) is present everywhere. That is a self-healing Redis cluster: automatic failover and automatic recovery, with the client none the wiser.

Optional: a planned switchover

There is no first-class "switchover" verb like some databases have, but you can trigger a controlled failover for maintenance — it promotes a replica immediately, without waiting for a timeout:

redis-cli -p 26379 sentinel failover mymaster

Use it to move the primary role off a node you need to patch or reboot.