Production Hardening & Troubleshooting
The build above is functionally HA. Before production, tighten these.
Persistence — survive a full restart
Replication protects against one node dying; persistence protects against everything restarting at once. Redis offers two mechanisms, and you can run both:
# /etc/redis/redis.conf
# Comments go on their own line; redis.conf does not accept trailing comments.
# AOF: logs every write, best durability
appendonly yes
# fsync once a second (good balance)
appendfsync everysec
# RDB snapshots as a fast-restart baseline
save 900 1
AOF replays writes on restart; RDB gives a compact point-in-time snapshot. With appendfsync everysec, a crash typically costs about one second of writes.
Don't let a minority primary accept writes
After a network partition, an isolated old primary could keep taking writes that are lost when the cluster fails over without it. Refuse writes unless replicas are attached:
# /etc/redis/redis.conf
min-replicas-to-write 1
min-replicas-max-lag 10
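Now a primary that cannot see at least one replica that has acknowledged within the last 10 seconds stops accepting writes. This trades a little availability for a bounded loss window: an isolated primary can lose roughly min-replicas-max-lag seconds of writes at most, rather than an unbounded amount.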
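The rule those two directives express can be sketched in a few lines of Python. This is only a model of the decision, not Redis code: the ReplicaState class, the replica names, and the lag values are hypothetical, and lag here means seconds since the replica's last acknowledgement.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    """Hypothetical view of one replica as the primary sees it."""
    name: str
    lag_seconds: int  # seconds since the replica's last acknowledgement


def primary_accepts_writes(replicas, min_replicas=1, max_lag=10):
    """Model of min-replicas-to-write / min-replicas-max-lag.

    Writes are allowed only if at least `min_replicas` replicas
    have acknowledged within the last `max_lag` seconds.
    """
    good = [r for r in replicas if r.lag_seconds <= max_lag]
    return len(good) >= min_replicas


# Normal operation: both replicas acknowledge every second or so.
healthy = [ReplicaState("replica-1", 1), ReplicaState("replica-2", 2)]

# Partition: the old primary has heard from nobody for a while.
partitioned = [ReplicaState("replica-1", 45), ReplicaState("replica-2", 60)]

print(primary_accepts_writes(healthy))      # True
print(primary_accepts_writes(partitioned))  # False
```

With the values from the config above (1 and 10), the isolated primary starts refusing writes once no replica has acknowledged for more than 10 seconds.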
Memory limits
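Redis holds everything in RAM. Set a ceiling and an eviction policy so it degrades predictably instead of being OOM-killed, and leave headroom below physical RAM for replication buffers and fork-time copy-on-write: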
maxmemory 1gb
# noeviction for a datastore, or allkeys-lru for a pure cache
maxmemory-policy noeviction
noeviction (errors on write when full) suits a datastore; allkeys-lru suits a cache.
Security
- Strong requirepass/masterauth: Redis can test hundreds of thousands of guesses per second, so use a long random secret.
- Network isolation: keep 6379/26379 on a private subnet (we used UFW). Never expose Redis to the internet.
- Restrict dangerous commands in untrusted environments:
  rename-command FLUSHALL ""
  rename-command CONFIG ""
  On nodes managed by Sentinel, do not blank out CONFIG: Sentinel sends CONFIG REWRITE when it reconfigures a node. Rename it to a hard-to-guess name instead and declare that name in sentinel.conf with sentinel rename-command.
- TLS: Redis 6+ supports TLS on a separate port (tls-port 6380, port 0), and Sentinel supports it too. Use it whenever traffic leaves a trusted network.
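One way to produce a long random secret for requirepass, masterauth, and sentinel auth-pass is Python's standard secrets module. This is a minimal sketch: the helper name generate_redis_secret and the 48-byte length are choices made for illustration, not requirements from Redis.

```python
import secrets


def generate_redis_secret(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from `num_bytes` of entropy.

    The alphabet is A-Z, a-z, 0-9, '-' and '_', so the value contains
    no spaces or quotes and can be pasted into a config file as-is.
    """
    return secrets.token_urlsafe(num_bytes)


password = generate_redis_secret()

# The same secret goes in all three places so replicas and Sentinels
# can authenticate against whichever node is currently primary.
print(f"requirepass {password}")
print(f"masterauth {password}")
print(f"sentinel auth-pass mymaster {password}")
```

The output differs on every run. Paste the first two lines into redis.conf on every node and the third into sentinel.conf, then keep the secret out of version control.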
announce-ip — for NAT / containers
If nodes sit behind NAT or in containers where the IP they bind isn't the IP others should reach, tell Redis and Sentinel what to advertise — otherwise discovery hands clients an unreachable address:
# redis.conf
replica-announce-ip 10.100.100.102
# sentinel.conf
sentinel announce-ip 10.100.100.102
Troubleshooting
| Symptom | Likely cause |
|---|---|
| master_link_status:down on a replica | Wrong masterauth, firewall on 6379, or primary unreachable |
| Failover never triggers | Fewer Sentinels up than quorum/majority — check num-other-sentinels |
| Clients keep hitting the old primary | Client isn't Sentinel-aware, or a proxy/VIP is masking discovery |
| Sentinel can't authenticate | Missing sentinel auth-pass mymaster <pass> |
| Split-brain after a partition | min-replicas-to-write not set, or an even Sentinel count; set it and run an odd number of Sentinels |
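The first row of the table can be checked from a script by reading the output of redis-cli INFO replication on a replica. The sketch below parses that text. The SAMPLE string and its IP address are made up for illustration, and the helper names are hypothetical; the field names are the ones Redis reports on a replica.

```python
def parse_info(text):
    """Turn `INFO` output (key:value lines) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and section headers like "# Replication"
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def replica_link_problem(info):
    """Return a short description if a replica's link is down, else None."""
    if info.get("role") != "slave":
        return None  # this node is not a replica, nothing to check
    if info.get("master_link_status") == "up":
        return None
    host = info.get("master_host", "?")
    port = info.get("master_port", "?")
    down_for = info.get("master_link_down_since_seconds", "unknown")
    return f"replica cannot reach primary {host}:{port} (link down for {down_for}s)"


# Made-up sample of what a replica might report during an outage.
SAMPLE = """# Replication
role:slave
master_host:10.100.100.101
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_link_down_since_seconds:42
"""

print(replica_link_problem(parse_info(SAMPLE)))
# replica cannot reach primary 10.100.100.101:6379 (link down for 42s)
```

If this reports a problem, work through the causes in the table: confirm masterauth matches the primary's requirepass, then check that port 6379 is reachable from the replica.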
What you built
A three-node Redis deployment that detects primary failure by quorum, promotes a replica automatically, redirects clients through Sentinel, and re-absorbs the recovered node as a replica — no proxy, no virtual IP, no manual intervention. That is Redis high availability.