Failover & Recovery Testing
Now the payoff: we kill the primary and watch the cluster heal itself with no human intervention.

Before: note the current primary
PGPASSWORD=ChangeMe_Postgres psql -h pg-haproxy -p 5432 -U postgres \
-c "SELECT inet_server_addr();"
# -> 10.100.100.104 (pg-sv01)
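The same answer can be read from Patroni itself. The sketch below asks each node's Patroni REST API for /primary, the endpoint the HAProxy health check relies on, and prints the first node that answers 200. It assumes the REST API listens on Patroni's default port 8008 on every node and that curl is installed. The function names are illustrative, not part of Patroni.

```shell
# Return the HTTP status code of a node's Patroni /primary endpoint.
# Assumption: the REST API listens on port 8008 (Patroni's default).
http_status() {
  curl -s -o /dev/null -m 3 -w '%{http_code}' "http://$1:8008/primary"
}

# Print the first node that answers 200 on /primary; return 1 if none does.
find_primary() {
  for node in "$@"; do
    if [ "$(http_status "$node")" = "200" ]; then
      echo "$node"
      return 0
    fi
  done
  return 1
}

# Usage (node addresses from this guide):
# find_primary 10.100.100.104 10.100.100.105 10.100.100.106
```

Running it before and after the failover gives a second, HAProxy-independent view of which node holds the leader lock.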
Trigger an unplanned failover
Simulate a real server death — hard-power-off pg-sv01 (pull the plug, not a graceful shutdown). On a hypervisor that's a force-stop; on a cloud VM, a stop/power-off.
What happens, in order:
- pg-sv01 stops renewing its leader lock in etcd.
- After the ttl (30 s) the lock expires.
- The surviving Patronis hold quorum (2 of 3) and run an election.
- A replica is promoted to primary and PostgreSQL advances to a new timeline.
- HAProxy's health checks see the old node fail /primary and the new node start answering 200, and it repoints the write port automatically.
Because etcd is co-located here, killing one DB node also removes one etcd member — but 2 of 3 is still a quorum, so the DCS keeps working.
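The "2 of 3" arithmetic generalises: a majority of n members is floor(n/2) + 1. A minimal sketch of that calculation follows; the function names are illustrative and not part of etcd or Patroni.

```shell
# Majority quorum for an n-member etcd cluster: floor(n/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority still remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With three members the cluster tolerates exactly one failure, which is why this test takes down a single node.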
After: confirm automatic recovery
Within roughly 30–45 seconds:
# From a surviving node:
sudo patronictl -c /etc/patroni/patroni.yml list
+ Cluster: pg-cluster -----+---------+-----------+----+
| Member  | Host           | Role    | State     | TL |
+---------+----------------+---------+-----------+----+
| pg-sv02 | 10.100.100.105 | Leader  | running   |  2 |  <- promoted, new timeline
| pg-sv03 | 10.100.100.106 | Replica | streaming |  2 |
+---------+----------------+---------+-----------+----+
The write endpoint now points at the new primary — and it is fully writable:
PGPASSWORD=ChangeMe_Postgres psql -h pg-haproxy -p 5432 -U postgres -c \
"CREATE TABLE IF NOT EXISTS failover_test(id serial, ts timestamptz default now());
INSERT INTO failover_test DEFAULT VALUES RETURNING *;"
# -> succeeds against 10.100.100.105
Your application, pointed at pg-haproxy:5432 the whole time, never needed to change anything.
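One caveat is worth checking from the client side: connections that were open when pg-sv01 died are dropped, so the application still has to reconnect and retry until the new primary is reachable. A minimal sketch of a retry helper, with a hypothetical name and parameters, that reruns a probe command until it succeeds and reports which attempt got through:

```shell
# retry_until_ok MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0, then prints the attempt number that succeeded.
# Returns 1 without printing anything if every attempt fails.
retry_until_ok() {
  max=$1; delay=$2; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "$attempt"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage during the failover test (same endpoint and credentials as above):
# export PGPASSWORD=ChangeMe_Postgres
# retry_until_ok 60 1 psql -h pg-haproxy -p 5432 -U postgres -c "SELECT 1;"
```

With a one-second delay, the printed attempt number is a rough measure of how many seconds the write endpoint was unavailable.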
Recovery: the old primary rejoins
Power pg-sv01 back on. Patroni starts automatically (it's enabled) and notices a newer timeline exists, so it rejoins as a replica using pg_rewind instead of a slow full re-clone (this works because we set wal_log_hints: "on"):
sudo patronictl -c /etc/patroni/patroni.yml list
+ Cluster: pg-cluster -----+---------+-----------+----+
| Member  | Host           | Role    | State     | TL |
+---------+----------------+---------+-----------+----+
| pg-sv01 | 10.100.100.104 | Replica | streaming |  2 |  <- rejoined via pg_rewind
| pg-sv02 | 10.100.100.105 | Leader  | running   |  2 |
| pg-sv03 | 10.100.100.106 | Replica | streaming |  2 |
+---------+----------------+---------+-----------+----+
Three members again, all on timeline 2, zero lag, and the row you inserted is present on every node. That is a self-healing PostgreSQL cluster: automatic failover and automatic recovery.
Optional: planned switchover
For maintenance (e.g. patching the primary), do a controlled role change with no data loss:
sudo patronictl -c /etc/patroni/patroni.yml switchover
It asks which member to promote and switches cleanly.
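For scripted maintenance the prompts can be answered on the command line. The sketch below assumes a Patroni version whose patronictl accepts --leader, --candidate and --force (older releases named the first flag --master), and that pg-sv02 is the current leader as in the listing above. The wrapper name and the DRY_RUN variable are hypothetical.

```shell
# switchover_to LEADER CANDIDATE
# Demotes LEADER and promotes CANDIDATE without interactive prompts.
# With DRY_RUN=1 the command is printed instead of executed.
switchover_to() {
  set -- sudo patronictl -c /etc/patroni/patroni.yml switchover \
    --leader "$1" --candidate "$2" --force
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: print the command first, then run it for real.
# DRY_RUN=1 switchover_to pg-sv02 pg-sv01
# switchover_to pg-sv02 pg-sv01
```

Printing the command before running it is a cheap guard against promoting the wrong member; check `patronictl switchover --help` on your installation to confirm the flag names.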