We recently had a cloud-down issue caused by a simultaneous hard power failure on every node of the cloud. This took all three percona-cluster units offline; they all powered back up within a minute of each other. When we found them, all three nodes were running in non-primary mode and denying connections from the OpenStack services.
All three units noted this within 10 seconds of each other:
180108 16:25:31 [Note] WSREP: Setting initial position to 72d2bae9-5df8-11e6-bb62-cb546a1bb47f:930203610
Each then continued with the output captured at https://pastebin.ubuntu.com/26418976/, including:
180108 16:26:01 [Warning] WSREP: no nodes coming from prim view, prim not possible
and
180108 16:26:12 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1292: Failed to open channel 'juju_cluster' at 'gcomm://10.28.2.244,10.28.2.194,10.28.2.226': -110 (Connection timed out)
Both of these look like important pieces of this issue.
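For context, the "initial position" each node logs at startup is the cluster UUID and last committed seqno, which mysqld normally reads from grastate.dat in the datadir. Assuming the default datadir, the file on these nodes would have looked roughly like this (uuid and seqno taken from the log line above, other fields illustrative):

# GALERA saved state
version: 2.1
uuid:    72d2bae9-5df8-11e6-bb62-cb546a1bb47f
seqno:   930203610

After an unclean shutdown the seqno is often left at -1 and has to be recovered, but since all three nodes reported the same uuid:seqno they evidently agreed on where they had stopped.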
What I'm wondering is: if all 3 nodes of a percona-cluster determine that they are starting from the same initial position, why doesn't the cluster elect a primary component automatically?
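For what it's worth, Galera 3.6 and later ship a pc.recovery provider option (on by default) that saves the last primary-component membership to gvwstate.dat and re-forms the primary component automatically once every member of that view reconnects; the "no nodes coming from prim view" warning above suggests that mechanism didn't (or couldn't) kick in here. I haven't verified whether the Galera build in this PXC release supports it, but purely for illustration the option would be set via my.cnf along these lines (a sketch only; it would clobber any wsrep_provider_options the charm already sets):

[mysqld]
wsrep_provider_options = "pc.recovery=true"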
This resulted in extended cloud downtime, and manual recovery of many of the nova/neutron services was necessary after manually re-forming the percona-cluster.
Percona can't auto-recover from an all-node power failure; manual intervention is needed.
https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/
Scenario 6 in the above link explains recovery from a full-cluster failure (all nodes down); a rough sketch of that procedure is below.
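A minimal sketch, assuming the default datadir and the stock PXC init scripts (exact service and command names may differ between releases):

# 1. Find the most advanced node. After a hard crash grastate.dat often shows
#    seqno: -1, in which case recover the position first:
sudo mysqld_safe --wsrep-recover     # then grep the error log for "Recovered position: <uuid>:<seqno>"
cat /var/lib/mysql/grastate.dat      # if the file survived intact

# 2. Bootstrap the most advanced node as a new primary component.
sudo service mysql bootstrap-pxc     # if mysqld is stopped (equivalent to starting with --wsrep-new-cluster)
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'"   # if mysqld is still running but stuck in non-primary

# 3. Start mysql normally on the remaining nodes; they rejoin via IST/SST.
sudo service mysql start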