Lessons learned & gotchas

  • MTU is the one that will haunt you. A vSwitch with VXLAN encapsulation adds overhead to every frame. Leave the VM/bridge MTU at the default 1500 and you get the classic, maddening symptom: ping works, but SSH hangs and TLS half-loads — small packets pass, large ones silently vanish. Lower the MTU to leave room for the encapsulation header and it all comes back. If "small things work, big things don't," suspect MTU before anything else.
  • Pin Corosync to the private vSwitch explicitly. Don't let it auto-pick and end up on the public interface — that's how a public-traffic spike turns into a cluster falling over.
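  • Three nodes is not an accident. Quorum needs a strict majority of votes, and an odd node count means a partition can never split the votes evenly. A two-node cluster needs a tie-breaker (a qdevice): without one, losing either node costs the survivor its quorum, and an even split leaves nobody able to answer "who's in charge?"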
  • Respect the per-node MAC ceiling. Know the number before you promise yourself a hundred VMs per host; it may be your true ceiling long before CPU or RAM is.
  • Rehearse a node failure in daylight. Pull a node on purpose, watch quorum and migration behave, and fix what doesn't — before you're relying on HA during a real 3am outage.
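
To make the MTU point concrete, here is a minimal sketch of the arithmetic, assuming a 1500-byte underlay and standard VXLAN header sizes. The function names are made up for illustration, and if your provider documents a specific MTU for its vSwitch, that number wins over this calculation.

```python
# Sketch only: standard header sizes for VXLAN over UDP.
# The 1500-byte underlay MTU is an assumption about the physical network.
INNER_ETHERNET = 14  # the encapsulated frame's own Ethernet header
OUTER_IPV4 = 20
OUTER_IPV6 = 40
OUTER_UDP = 8
VXLAN_HEADER = 8


def inner_mtu(underlay_mtu: int = 1500, outer_ipv6: bool = False) -> int:
    """Largest IP packet a VM can send without exceeding the underlay MTU."""
    outer_ip = OUTER_IPV6 if outer_ipv6 else OUTER_IPV4
    overhead = outer_ip + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


def ping_payload(mtu: int) -> int:
    """ICMP payload size that exactly fills an IPv4 packet of `mtu` bytes."""
    return mtu - 20 - 8  # minus IPv4 header and ICMP header


if __name__ == "__main__":
    mtu = inner_mtu()
    print(mtu)  # 1450
    # Linux iputils: -M do forbids fragmentation, -s sets the payload size.
    print(f"ping -M do -s {ping_payload(mtu)} <peer>")
```

If the printed ping succeeds and the same command with a payload one byte larger fails, the MTU you set matches what the path actually carries.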

Lesson in one line: cluster pain is almost always networking pain — MTU, heartbeat isolation, and quorum. Get those three right and the cluster is boring (which is the goal).
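
The quorum rule is easy to check with a few lines of arithmetic. This is a sketch of the plain majority model only, with one vote per node and hypothetical function names; it does not model Corosync's optional quorum settings.

```python
def votes_needed(total_votes: int) -> int:
    """Strict majority: more than half of all votes."""
    return total_votes // 2 + 1


def failures_tolerated(total_votes: int) -> int:
    """How many votes can disappear while the rest still hold a majority."""
    return total_votes - votes_needed(total_votes)


if __name__ == "__main__":
    for votes in (2, 3, 4, 5):
        print(
            f"{votes} votes: need {votes_needed(votes)}, "
            f"survive {failures_tolerated(votes)} failure(s)"
        )
```

Two votes tolerate zero failures, which is why a two-node cluster needs the qdevice as a third vote. Four votes tolerate no more failures than three, so the even node buys no extra resilience.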