Production Hardening & Troubleshooting

The build above is functionally HA. Before production, tighten these.

Persistence — survive a full restart

Replication protects against one node dying; persistence protects against everything restarting at once. Redis offers two mechanisms, and you can run both:

# /etc/redis/redis.conf
# AOF: log every write for the best durability
appendonly yes
# fsync once a second (a good balance of safety and speed)
appendfsync everysec
# RDB: snapshot if at least 1 key changed in 900 seconds
save 900 1

AOF replays the logged writes on restart; RDB gives a compact point-in-time snapshot. With appendfsync everysec, a crash typically costs about one second of writes.
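To make "replays writes on restart" concrete, here is a toy append-only log in Python. It is only a sketch of the idea: the `TinyStore` class, its tab-separated log format, and the file name are invented for illustration and are not how Redis stores its AOF.

```python
import os
import tempfile


class TinyStore:
    """Toy key-value store with an append-only log (not Redis's real AOF format)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()  # rebuild in-memory state from the log, if one exists
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                # Toy format: keys and values must not contain tabs or newlines.
                key, _, value = line.rstrip("\n").partition("\t")
                self.data[key] = value  # later entries overwrite earlier ones

    def set(self, key, value, fsync=True):
        self.data[key] = value
        self._log.write(f"{key}\t{value}\n")
        self._log.flush()
        if fsync:
            # fsync on every write; an everysec-style policy would batch this call
            os.fsync(self._log.fileno())

    def close(self):
        self._log.close()


# Demo: write, stop, start again, and recover the state by replaying the log.
demo_path = os.path.join(tempfile.mkdtemp(), "toy.aof")
store = TinyStore(demo_path)
store.set("user:1", "alice")
store.set("user:1", "bob")
store.close()

recovered = TinyStore(demo_path)
print(recovered.data)  # {'user:1': 'bob'}
recovered.close()
```

Skipping the fsync on each write is what makes everysec faster, and also what opens the roughly one-second loss window.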

Don't let a minority primary accept writes

During a network partition, an isolated old primary can keep accepting writes, and those writes are discarded when it later rejoins as a replica of the newly promoted primary. Refuse writes unless replicas are attached:

# /etc/redis/redis.conf
min-replicas-to-write 1
min-replicas-max-lag 10

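Now a primary that has not heard from at least one replica within the last 10 seconds stops accepting writes. This trades a little availability for a bounded loss window: an isolated primary can strand roughly min-replicas-max-lag seconds of writes at most, rather than an unlimited amount.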
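The rule is easy to state as a function. The sketch below is a simplified model of the check, not Redis source code: `accepts_writes` and its `replica_ack_ages` argument (seconds since each replica's last acknowledgement) are hypothetical names, and the defaults mirror the two settings above.

```python
def accepts_writes(replica_ack_ages, min_replicas_to_write=1, min_replicas_max_lag=10):
    """Return True if the primary should keep accepting writes.

    replica_ack_ages: seconds since each connected replica last acknowledged
    the replication stream (hypothetical input for this model).
    """
    # Setting either option to 0 turns the safeguard off.
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True
    # A replica counts as "good" if it acknowledged recently enough.
    good = sum(1 for age in replica_ack_ages if age <= min_replicas_max_lag)
    return good >= min_replicas_to_write


print(accepts_writes([1.2, 0.8]))   # healthy: two fresh replicas
print(accepts_writes([14.0]))       # only replica is stale
print(accepts_writes([]))           # isolated primary, no replicas
```

Under this model an isolated primary refuses writes as soon as its last replica's acknowledgement is more than 10 seconds old, which is why the setting limits how many writes a partition can strand.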

Memory limits

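Redis holds everything in RAM. Set a ceiling and an eviction policy so it is far less likely to be OOM-killed, and leave headroom below physical RAM, because maxmemory does not cover every allocation (replication buffers and fork-time copy-on-write during snapshots, for example):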

maxmemory 1gb
# noeviction for a datastore, or allkeys-lru for a pure cache
maxmemory-policy noeviction

noeviction (errors on write when full) suits a datastore; allkeys-lru suits a cache.
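The difference between the two policies shows up the moment the limit is hit. This Python sketch models it under loud simplifications: the hypothetical `TinyCache` counts keys instead of bytes and uses exact LRU, whereas Redis measures memory and approximates LRU by sampling.

```python
from collections import OrderedDict


class OutOfMemory(Exception):
    """Stands in for the error a full noeviction instance returns on writes."""


class TinyCache:
    """Toy model of maxmemory-policy; the limit is a key count, not bytes."""

    def __init__(self, max_keys, policy="noeviction"):
        if policy not in ("noeviction", "allkeys-lru"):
            raise ValueError(f"unsupported policy: {policy}")
        self.max_keys = max_keys
        self.policy = policy
        self.data = OrderedDict()  # least recently used key first

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        is_new = key not in self.data
        if is_new and len(self.data) >= self.max_keys:
            if self.policy == "noeviction":
                raise OutOfMemory("limit reached, write refused")
            self.data.popitem(last=False)  # evict the least recently used key
        self.data[key] = value
        self.data.move_to_end(key)


cache = TinyCache(2, policy="allkeys-lru")
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")          # "a" is now more recent than "b"
cache.set("c", 3)       # evicts "b"
print(list(cache.data))  # ['a', 'c']
```

With noeviction the same third write raises instead of silently dropping "b", which is the behaviour you want when Redis is the system of record.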

Security

Strong requirepass/masterauth: Redis answers AUTH attempts fast enough that an attacker can test hundreds of thousands of guesses per second, so use a long random secret.
Network isolation: keep 6379/26379 on a private subnet (we used UFW). Never expose Redis to the internet.
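One way to produce such a secret is Python's standard `secrets` module. This is a minimal sketch; the helper name `make_redis_secret` and the 48-byte default are choices made for this example, not anything Redis requires.

```python
import secrets


def make_redis_secret(num_bytes=48):
    """Return a URL-safe random string built from num_bytes of OS randomness."""
    # token_urlsafe emits only letters, digits, '-' and '_', so the result
    # needs no quoting or escaping inside redis.conf or sentinel.conf.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(make_redis_secret())
```

Use the same value for requirepass and masterauth on every node so that any node can be promoted or demoted without an authentication mismatch.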

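Restrict dangerous commands in untrusted environments. Sentinel itself needs CONFIG (it issues CONFIG REWRITE when it reconfigures a node during failover), so on Sentinel-managed nodes rename CONFIG to an unguessable name of your own instead of disabling it, and tell Sentinel that name. The <unguessable-name> below is a placeholder:

rename-command FLUSHALL ""
rename-command CONFIG "<unguessable-name>"
# sentinel.conf
sentinel rename-command mymaster CONFIG <unguessable-name>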

TLS: Redis 6+ (when built with TLS support) can serve TLS on a dedicated port, for example tls-port 6380 together with port 0 to turn off the plaintext listener, and Sentinel supports it too. Use it whenever traffic leaves a trusted network.

`announce-ip` — for NAT / containers

If nodes sit behind NAT or in containers where the IP they bind isn't the IP others should reach, tell Redis and Sentinel what to advertise — otherwise discovery hands clients an unreachable address:

# redis.conf
replica-announce-ip 10.100.100.102
# sentinel.conf
sentinel announce-ip 10.100.100.102

Troubleshooting

Symptom	Likely cause / fix
`master_link_status:down` on a replica	Wrong `masterauth`, firewall on `6379`, or primary unreachable
Failover never triggers	Fewer Sentinels up than `quorum`/majority; check `num-other-sentinels` in the output of `SENTINEL master mymaster`
Clients keep hitting the old primary	Client isn't Sentinel-aware, or a proxy/VIP is masking discovery
Sentinel can't authenticate	Missing `sentinel auth-pass mymaster <pass>`
Split-brain after a partition	Add `min-replicas-to-write`; ensure an odd Sentinel count
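
The first row of the table can be checked mechanically. The sketch below parses the text that `redis-cli INFO replication` prints and flags the two most common problems. The function names are hypothetical and the sample text is illustrative rather than captured from a real node; only the field names (`role`, `master_link_status`, `connected_slaves`) are real INFO fields.

```python
def parse_info(text):
    """Parse the key:value lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def diagnose_replication(info_text):
    """Return a list of human-readable problems found in INFO replication."""
    info = parse_info(info_text)
    problems = []
    role = info.get("role")
    # INFO still reports replicas as role:slave.
    if role == "slave" and info.get("master_link_status") != "up":
        problems.append(
            "replica link is down: check masterauth, the firewall on 6379, "
            "and whether the primary is reachable"
        )
    if role == "master" and int(info.get("connected_slaves", "0")) == 0:
        problems.append("primary has no connected replicas")
    return problems


# Illustrative input, not real output from a server.
sample = "# Replication\r\nrole:slave\r\nmaster_link_status:down\r\n"
for problem in diagnose_replication(sample):
    print(problem)
```

Feed it the output of `redis-cli INFO replication` from each node in turn; an empty list means neither condition was found, not that the node is healthy in every respect.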

What you built

A three-node Redis deployment that detects primary failure by quorum, promotes a replica automatically, redirects clients through Sentinel, and re-absorbs the recovered node as a replica — no proxy, no virtual IP, no manual intervention. That is Redis high availability.