Production Hardening & Troubleshooting

The build above is functionally HA. Before production, tighten these.

Persistence — survive a full restart

Replication protects against one node dying; persistence protects against everything restarting at once. Redis offers two mechanisms, and you can run both:

# /etc/redis/redis.conf (comments must sit on their own line; trailing comments are parsed as arguments)
# AOF: logs every write, best durability
appendonly yes
# fsync once a second (good balance)
appendfsync everysec
# RDB snapshots as a fast-restart baseline
save 900 1

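On restart, Redis replays the AOF to rebuild the dataset (when both mechanisms are enabled, the AOF is the one loaded, because it is the more complete record). RDB gives a compact point-in-time snapshot for backups and fast restarts. With everysec, a crash typically loses at most about one second of writes.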
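One way to confirm that both mechanisms are active after a restart is to parse the text returned by redis-cli INFO persistence. The following is a minimal sketch rather than part of the build above: the helper names parse_info and persistence_problems are hypothetical, and SAMPLE is illustrative text shaped like real output, not captured from a server. The field names (aof_enabled, aof_last_write_status, rdb_last_bgsave_status) are standard INFO fields.

```python
def parse_info(text: str) -> dict[str, str]:
    """Turn 'key:value' lines from Redis INFO output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as '# Persistence'.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def persistence_problems(info: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if info.get("aof_enabled") != "1":
        problems.append("AOF is disabled (appendonly no)")
    if info.get("aof_last_write_status", "ok") != "ok":
        problems.append("last AOF write failed")
    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        problems.append("last RDB snapshot failed")
    return problems


# Illustrative sample, shaped like `redis-cli INFO persistence` output.
SAMPLE = """# Persistence
aof_enabled:1
aof_last_write_status:ok
rdb_last_bgsave_status:ok
"""

if __name__ == "__main__":
    print(persistence_problems(parse_info(SAMPLE)) or "persistence looks healthy")
```

In practice the input would come from something like redis-cli -a <pass> INFO persistence, run on each node after a restart.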

Don't let a minority primary accept writes

During a network partition, an isolated old primary can keep accepting writes, and those writes are discarded when it later rejoins as a replica of the newly promoted primary. Refuse writes unless replicas are attached:

# /etc/redis/redis.conf
min-replicas-to-write 1
min-replicas-max-lag 10

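Now a primary that has not heard from at least one replica within the last 10 seconds stops accepting writes. This trades a little availability for a bounded loss window: writes accepted before the primary notices the partition can still be lost, but it can no longer accept writes indefinitely while cut off.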
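The rule these two settings express can be modelled in a few lines. This is an illustrative sketch of the decision, not Redis source code: accepts_writes is a hypothetical name, and replica_lags stands for the number of seconds since each connected replica last acknowledged the primary (the lag value shown per replica in INFO replication).

```python
def accepts_writes(replica_lags, min_replicas_to_write=1, min_replicas_max_lag=10):
    """Return True if a primary with these replica lags would accept writes.

    A replica counts as "good" when its last acknowledgement is at most
    min_replicas_max_lag seconds old. Writes are accepted only when the
    number of good replicas reaches min_replicas_to_write.
    """
    good = sum(1 for lag in replica_lags if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


if __name__ == "__main__":
    print(accepts_writes([1, 2]))   # healthy: two fresh replicas
    print(accepts_writes([]))       # isolated primary: no replicas attached
    print(accepts_writes([30]))     # one replica, but it is too far behind
```

With the values from the config above, an isolated primary keeps accepting writes for up to 10 seconds after its last replica acknowledgement, which is the loss window this setting bounds.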

Memory limits

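Redis holds everything in RAM. Set a ceiling and an eviction policy so that Redis enforces its own limit before the kernel's OOM killer does, and leave headroom below physical RAM, because maxmemory does not cover memory fragmentation or the copy-on-write memory used while forking for RDB snapshots and AOF rewrites: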

maxmemory 1gb
# noeviction for a datastore; use allkeys-lru for a pure cache
maxmemory-policy noeviction

noeviction (errors on write when full) suits a datastore; allkeys-lru suits a cache.
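When choosing the ceiling, note that Redis size suffixes are not all powers of two: 1gb means 1024^3 bytes, while 1g means 10^9 bytes. The sketch below converts such values to bytes and reports what share of the host's RAM a given maxmemory would claim. The function names and the 4gb host size are assumptions for illustration only.

```python
# Multipliers as documented at the top of the stock redis.conf.
UNITS = {
    "": 1, "b": 1,
    "k": 1000, "kb": 1024,
    "m": 1000**2, "mb": 1024**2,
    "g": 1000**3, "gb": 1024**3,
}


def parse_memory(value: str) -> int:
    """Convert a redis.conf size such as '1gb' or '512mb' to bytes."""
    text = value.strip().lower()
    digits = text.rstrip("kmgb")      # leading numeric part
    unit = text[len(digits):]         # trailing unit suffix, possibly empty
    if not digits.isdigit() or unit not in UNITS:
        raise ValueError(f"unrecognised memory size: {value!r}")
    return int(digits) * UNITS[unit]


def share_of_ram(maxmemory: str, host_ram: str) -> float:
    """Fraction of host RAM that maxmemory would occupy."""
    return parse_memory(maxmemory) / parse_memory(host_ram)


if __name__ == "__main__":
    # Hypothetical host with 4gb of RAM and the maxmemory used above.
    print(f"{share_of_ram('1gb', '4gb'):.0%} of RAM reserved for the dataset")
```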

Security

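  • Strong requirepass/masterauth — Redis answers AUTH attempts very quickly, so an attacker can test a very large number of guesses per second; use a long random secret, and set the same value for requirepass and masterauth on every node so any of them can be promoted.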
  • Network isolation — keep 6379/26379 on a private subnet (we used UFW). Never expose Redis to the internet.
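  • Restrict dangerous commands in untrusted environments. Sentinel itself sends CONFIG while reconfiguring nodes during failover, so rename CONFIG instead of disabling it: the first two lines go in redis.conf, the third in sentinel.conf (CONFIG_b840fc is a placeholder; pick your own unguessable name):
    rename-command FLUSHALL ""
    rename-command CONFIG CONFIG_b840fc
    sentinel rename-command mymaster CONFIG CONFIG_b840fc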
    
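  • TLS — Redis 6+ supports TLS when built with TLS support. Set tls-port 6380 to open a TLS listener and port 0 to close the plaintext one; replication traffic additionally needs tls-replication yes. Sentinel supports TLS too. Use it whenever traffic leaves a trusted network.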
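For the "long random secret" in the first item, a cryptographically secure generator is preferable to anything typed by hand. The following is a small sketch using Python's standard secrets module; the function name generate_redis_secret and the 48-byte default are illustrative choices, not a Redis requirement.

```python
import secrets


def generate_redis_secret(num_bytes: int = 48) -> str:
    """Return a URL-safe random string built from num_bytes of entropy.

    The output contains only letters, digits, '-' and '_', so it needs
    no quoting or escaping inside redis.conf or sentinel.conf.
    """
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    secret = generate_redis_secret()
    # The same value goes into all three directives.
    print(f"requirepass {secret}")
    print(f"masterauth {secret}")
    print(f"sentinel auth-pass mymaster {secret}")
```

The first two printed lines belong in redis.conf on every node and the third in sentinel.conf on every Sentinel.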

announce-ip — for NAT / containers

If nodes sit behind NAT or in containers where the IP they bind isn't the IP others should reach, tell Redis and Sentinel what to advertise — otherwise discovery hands clients an unreachable address:

# redis.conf
replica-announce-ip 10.100.100.102
# sentinel.conf
sentinel announce-ip 10.100.100.102

Troubleshooting

Symptom | Likely cause
master_link_status:down on a replica | Wrong masterauth, firewall blocking 6379, or primary unreachable
Failover never triggers | Fewer Sentinels up than the quorum or the majority; check num-other-sentinels
Clients keep hitting the old primary | Client isn't Sentinel-aware, or a proxy/VIP is masking discovery
Sentinel can't authenticate | Missing sentinel auth-pass mymaster <pass> in sentinel.conf
Split-brain after a partition | min-replicas-to-write not set, or an even Sentinel count; set min-replicas-to-write and run an odd number of Sentinels
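The "failover never triggers" case comes down to two separate thresholds: enough Sentinels must agree that the primary is down (the quorum), and a majority of all known Sentinels must be reachable to authorize the failover. The sketch below encodes that check; failover_blocker is a hypothetical helper, and known would be num-other-sentinels plus one for the Sentinel being queried.

```python
from typing import Optional


def failover_blocker(known: int, reachable: int, quorum: int) -> Optional[str]:
    """Return None if a failover can proceed, otherwise the reason it cannot.

    known     -- total Sentinels monitoring the primary (num-other-sentinels + 1)
    reachable -- Sentinels currently able to talk to each other
    quorum    -- the quorum value from the 'sentinel monitor' line
    """
    majority = known // 2 + 1
    if reachable < quorum:
        return f"only {reachable} Sentinels reachable, quorum is {quorum}"
    if reachable < majority:
        return f"only {reachable} of {known} Sentinels reachable, majority is {majority}"
    return None


if __name__ == "__main__":
    print(failover_blocker(known=3, reachable=2, quorum=2))  # can proceed
    print(failover_blocker(known=3, reachable=1, quorum=2))  # blocked by quorum
    print(failover_blocker(known=5, reachable=2, quorum=2))  # blocked by majority
```

The third case shows why a low quorum alone is not enough: two of five Sentinels can agree the primary is down, yet they still cannot elect a leader to carry out the failover.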

What you built

A three-node Redis deployment that detects primary failure by quorum, promotes a replica automatically, redirects clients through Sentinel, and re-absorbs the recovered node as a replica — no proxy, no virtual IP, no manual intervention. That is Redis high availability.