percona-cluster with all nodes down doesn't properly start up without intervention

Bug #1744393 reported by Drew Freiberger
This bug affects 4 people
Affects: OpenStack Percona Cluster Charm
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

We recently had a cloud-down incident caused by a hard power failure on all nodes of the cloud at the same time. This took all three percona-cluster units offline, and they all powered back up within a minute of each other. When found, all nodes were running in non-primary mode and denying connections from OpenStack services.

All three units noted this within 10 seconds of each other:

180108 16:25:31 [Note] WSREP: Setting initial position to 72d2bae9-5df8-11e6-bb62-cb546a1bb47f:930203610

This was followed by the log output captured at https://pastebin.ubuntu.com/26418976/, including:

180108 16:26:01 [Warning] WSREP: no nodes coming from prim view, prim not possible

and

180108 16:26:12 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
  at gcomm/src/pc.cpp:connect():141
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1292: Failed to open channel 'juju_cluster' at 'gcomm://10.28.2.244,10.28.2.194,10.28.2.226': -110 (Connection timed out)

Both messages seem to be important pieces of this issue.

What I'm wondering is: if all three nodes of a percona-cluster determine that they are starting from the same initial position, why doesn't the cluster elect a primary component automatically?

This resulted in extended cloud downtime, and manual recovery of many of the nova/neutron services was necessary after manually re-forming the percona cluster.
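
For reference, a unit in this state reports wsrep_cluster_status as non-Primary. A minimal check, sketched in Python around the mysql client (authentication details are deployment-specific and assumed to be handled by a defaults file):

    import subprocess

    # A node refusing connections for this reason reports
    # wsrep_cluster_status = non-Primary instead of Primary.
    out = subprocess.run(
        ["mysql", "-e", "SHOW STATUS LIKE 'wsrep_cluster_status'"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout)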

Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Percona can't automatically recover from an all-node power failure; manual intervention is needed.

https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/

Scenario 6 in the above link explains the recovery procedure when all nodes have gone down.
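
The manual decision comes down to finding the unit with the most recent committed transaction and bootstrapping that one. A rough sketch of reading a node's grastate.dat for comparison across units (the datadir path is an assumption based on Percona defaults and may differ on charm-deployed units):

    # Sketch: parse grastate.dat so seqno values can be compared across units.
    GRASTATE = "/var/lib/percona-xtradb-cluster/grastate.dat"  # path is an assumption

    def read_grastate(path=GRASTATE):
        state = {}
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                key, _, value = line.partition(":")
                state[key.strip()] = value.strip()
        return state

    state = read_grastate()
    # The unit with the highest seqno (for the same cluster uuid) is the safest
    # bootstrap candidate. A seqno of -1 means an unclean shutdown; the real
    # position must first be recovered, e.g. with mysqld_safe --wsrep-recover.
    print("uuid :", state.get("uuid"))
    print("seqno:", state.get("seqno"))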

Changed in charm-percona-cluster:
status: New → Invalid
Changed in charm-percona-cluster:
status: Invalid → Triaged
importance: Undecided → Wishlist
assignee: nobody → Aymen Frikha (aym-frikha)
milestone: none → 18.08
Changed in charm-percona-cluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/597969

James Page (james-page)
Changed in charm-percona-cluster:
milestone: 18.08 → 18.11
James Page (james-page)
Changed in charm-percona-cluster:
milestone: 18.11 → 19.04
David Ames (thedac)
Changed in charm-percona-cluster:
milestone: 19.04 → 19.07
Changed in charm-percona-cluster:
assignee: Aymen Frikha (aym-frikha) → nobody
status: In Progress → Confirmed
David Ames (thedac)
Changed in charm-percona-cluster:
assignee: nobody → David Ames (thedac)
importance: Wishlist → High
status: Confirmed → In Progress
tags: added: reboot-fail
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/670163

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.opendev.org/670163
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=b97a0971c22f129b71674a22a65080a65c96af76
Submitter: Zuul
Branch: master

commit b97a0971c22f129b71674a22a65080a65c96af76
Author: David Ames <email address hidden>
Date: Wed Jul 10 12:01:06 2019 -0700

    Bootstrap action after a cold boot

    After a cold boot, percona-cluster will require administrative
    intervention. One node will need to bootstrap per upstream
    Percona Cluster documentation:
    https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/

    This change adds an action to bootstrap a single node. On the other
    nodes systemd will be attempting to start percona. Once the bootstrapped
    node is up the others will join automatically.

    Change-Id: Id9a860edc343ee5dbd7fc8c5ce3b4420ec6e134e
    Partial-Bug: #1744393
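
In practice the new action is driven from the Juju client; a minimal sketch (the unit name and Juju 2.x syntax are assumptions; the bootstrap-pxc action name is taken from the follow-up change below):

    import subprocess

    # Bootstrap exactly one unit after the cold boot; the other units, which
    # systemd keeps trying to start, should join once this one is up.
    subprocess.run(
        ["juju", "run-action", "percona-cluster/0", "bootstrap-pxc", "--wait"],
        check=True,
    )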

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670675

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.opendev.org/670675
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=b8c2213dfbd4ae417be95a8ce1b1c973eee9e55c
Submitter: Zuul
Branch: master

commit b8c2213dfbd4ae417be95a8ce1b1c973eee9e55c
Author: David Ames <email address hidden>
Date: Fri Jul 12 16:16:46 2019 -0700

    Notify bootstrapped action

    It turns out a subsequent required step after a cold boot bootstrap is
    notifying the cluster of the new bootstrap UUID.

    The notify-bootstrapped action should be run on a different node than
    the one which ran the bootstrap-pxc action.

    This action will ensure the cluster converges on the correct bootstrap
    UUID.

    A subsequent patch stacked on this one will include tests for the new
    cold boot actions.

    Change-Id: Idee12d5f7e28498c5ab6ccb9605f751c6427ac30
    Partial-Bug: #1744393
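
Following on from the sketch above, this second step is run from a different unit than the one that was bootstrapped (unit name again an assumption):

    import subprocess

    # Run notify-bootstrapped on a *different* unit than the one that ran
    # bootstrap-pxc, so the cluster converges on the new bootstrap UUID.
    subprocess.run(
        ["juju", "run-action", "percona-cluster/1", "notify-bootstrapped", "--wait"],
        check=True,
    )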

David Ames (thedac)
Changed in charm-percona-cluster:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-percona-cluster:
milestone: 19.10 → 20.01
tags: added: cold-start
James Page (james-page)
Changed in charm-percona-cluster:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-percona-cluster:
milestone: 20.05 → 20.08
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Since the action patch has landed but this LP is being kept open, I'll clear the current owner, since it doesn't actually have one for the remaining tasks.

Changed in charm-percona-cluster:
assignee: David Ames (thedac) → nobody
status: In Progress → New
James Page (james-page)
Changed in charm-percona-cluster:
milestone: 20.08 → none
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Setting to triaged; it has actually had work done on it, and is blocked on xenial packages at the moment. This bug may eventually time out on xenial.

Changed in charm-percona-cluster:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-percona-cluster (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-percona-cluster/+/597969
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
