Production Hardening & Troubleshooting

The cluster we built is complete and self-healing, but a few choices were made to keep the lab to four machines. Here's what to change before relying on it in production.

Harden the topology

  • Separate the etcd cluster. We co-located etcd on the database nodes. In production, run etcd as its own 3-node cluster on small, dedicated machines so a loaded or slow database can never jeopardise the consensus layer. Patroni just points at the external etcd endpoints — nothing else changes.
  • Make HAProxy highly available. A single HAProxy is a single point of failure. Run two HAProxy nodes with keepalived sharing a floating virtual IP (VIP), so clients connect to the VIP and it moves to the surviving proxy.
  • Add connection pooling. Put PgBouncer in front of (or behind) HAProxy. Each PostgreSQL connection is served by its own backend process, so connections are expensive; a pooler dramatically improves behaviour under many short-lived clients.
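
Once etcd lives on its own machines, it helps to confirm the external endpoints are healthy before pointing Patroni at them. The sketch below builds the endpoint list and shows, in a comment, how it could be passed to etcdctl. The host names etcd-01 to etcd-03 and the helper name etcd_endpoints are assumptions for illustration; 2379 is etcd's default client port.

```shell
# Build a comma-separated etcd endpoint list from a URL scheme and host names.
# The function body runs in a subshell, so its variables stay local.
etcd_endpoints() (
  scheme=$1; shift
  out=""
  for host in "$@"; do
    # Prefix a comma only when the list already has an entry.
    out="${out:+$out,}${scheme}://${host}:2379"
  done
  printf '%s\n' "$out"
)

# Example (hypothetical host names; with https you would also pass the
# certificate options your etcd setup requires):
#   etcdctl --endpoints="$(etcd_endpoints https etcd-01 etcd-02 etcd-03)" endpoint health
```

The same list is what Patroni's etcd section would point at, so building it in one place keeps the health check and the configuration in step.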

Durability and security

  • Tune the failover/lag thresholds. ttl, loop_wait, and maximum_lag_on_failover trade detection speed against stability. The ttl and loop_wait values used here (30 and 10 seconds) are sane starting points.
  • Consider synchronous replication for zero-data-loss failover. Set Patroni's synchronous_mode: true; you can require a quorum of standbys to confirm each commit. This costs write latency — enable it only if your durability requirements demand it.
  • Encrypt everything. Enable TLS on PostgreSQL (client and replication) and on etcd (client + peer). In this guide etcd ran over plain HTTP on a trusted subnet — fine for a lab, not for production.
  • Lock down the firewall to specific peer IPs (as we did) rather than whole subnets, and keep Patroni's REST API reachable only from HAProxy and admins.
  • Back up. HA is not backup. Keep base backups + WAL archiving (e.g. pgBackRest) so you can do point-in-time recovery.
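
When tuning ttl and loop_wait, Patroni's documentation asks that ttl be at least loop_wait plus twice retry_timeout. retry_timeout is a third Patroni setting not discussed above (it defaults to 10 seconds). A minimal sketch of that check, with patroni_timing_ok as a hypothetical helper name:

```shell
# Succeeds (exit 0) when ttl >= loop_wait + 2 * retry_timeout.
# Arguments are whole seconds: ttl, loop_wait, retry_timeout.
patroni_timing_ok() (
  ttl=$1
  loop_wait=$2
  retry_timeout=$3
  [ "$ttl" -ge $(( loop_wait + 2 * retry_timeout )) ]
)

# Example: read the live values first, then test a candidate combination.
#   sudo patronictl -c /etc/patroni/patroni.yml show-config
#   patroni_timing_ok 30 10 10 && echo "timing ok" || echo "raise ttl or lower the others"
```

With the 30/10 values used here and the default retry_timeout of 10, the rule holds exactly (30 = 10 + 2 × 10), so lowering ttl alone to speed up failover detection would break it.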

Troubleshooting

Patroni won't start: "Can not find suitable configuration of distributed configuration store"
The etcd client library is missing. Install it with sudo apt-get install -y python3-etcd, then restart Patroni.

etcd service hangs on first start
With ETCD_INITIAL_CLUSTER_STATE=new, each member waits for quorum before it reports ready. Start all three at about the same time; if one was started alone and timed out, run systemctl restart etcd on every node together.

A replica won't come up / stuck in creating replica
Check that it can reach the leader on port 5432 (firewall) and that the replicator password in patroni.yml matches on all nodes. patronictl reinit pg-cluster <member> forces a clean re-clone.

Old primary won't rejoin after failover
pg_rewind needs wal_log_hints: "on" (or data checksums). If it still fails, run patronictl reinit on the node to re-clone it from the new leader.
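
To see whether pg_rewind has what it needs, both settings can be read from a running node. The sketch below only encodes the either/or rule; the helper name rewind_prereqs_met is hypothetical, and the psql invocation in the comment assumes a local postgres superuser. Patroni must also be allowed to use pg_rewind (postgresql.use_pg_rewind), which this sketch does not check.

```shell
# Succeeds (exit 0) when at least one pg_rewind prerequisite is enabled.
# Arguments: the values of wal_log_hints and data_checksums ("on" or "off").
rewind_prereqs_met() (
  wal_log_hints=$1
  data_checksums=$2
  [ "$wal_log_hints" = "on" ] || [ "$data_checksums" = "on" ]
)

# Example, run on a database node:
#   hints=$(sudo -u postgres psql -Atc 'SHOW wal_log_hints')
#   sums=$(sudo -u postgres psql -Atc 'SHOW data_checksums')
#   rewind_prereqs_met "$hints" "$sums" || echo "pg_rewind unavailable: expect a full re-clone"
```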

HAProxy shows every backend DOWN
The health check targets Patroni on 8008, not PostgreSQL. Confirm check port 8008 is set and that UFW allows 8008 from the HAProxy node. Test by hand: curl -s -o /dev/null -w "%{http_code}" http://pg-sv01:8008/primary.
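
The manual curl test can be repeated across all nodes. Patroni's /primary endpoint answers 200 only on the current leader and 503 otherwise, and curl prints 000 when it cannot connect at all. A small sketch under those assumptions; the function names are hypothetical, and any host name other than pg-sv01 is a placeholder for your own:

```shell
# Map the HTTP status from Patroni's /primary endpoint to a short verdict.
patroni_primary_verdict() (
  case "$1" in
    200) echo "primary" ;;
    503) echo "not primary" ;;
    000) echo "unreachable" ;;          # curl could not connect (firewall, service down)
    *)   echo "unexpected ($1)" ;;
  esac
)

# Probe each host given as an argument, the same way the HAProxy check does.
check_patroni_nodes() (
  for host in "$@"; do
    code=$(curl -s -o /dev/null -m 3 -w '%{http_code}' "http://${host}:8008/primary")
    printf '%s: %s\n' "$host" "$(patroni_primary_verdict "$code")"
  done
)

# Example (run from the HAProxy node; pg-sv02 and pg-sv03 are assumed names):
#   check_patroni_nodes pg-sv01 pg-sv02 pg-sv03
```

Run it from the HAProxy node so the firewall path is the one the real health check uses. If every node reports "unreachable", look at UFW and the Patroni service; if every node reports "not primary", the cluster has no leader and patronictl list is the next stop.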

Check the logs

journalctl -u patroni -f          # Patroni / PostgreSQL events
journalctl -u etcd -f             # DCS
sudo patronictl -c /etc/patroni/patroni.yml list   # current truth

You now have a production-shaped blueprint for PostgreSQL high availability: streaming replication managed by Patroni, quorum from etcd, and transparent routing through HAProxy — with automatic failover and self-healing recovery proven end to end.