We recently had a cloud-down issue caused by a simultaneous hard power failure on every node of the cloud. This took all three percona-cluster units offline; they all powered back up within a minute of each other. When we found them, all three nodes were running in non-primary mode and denying connections from the OpenStack services.
All three units noted this within 10 seconds of each other:
180108 16:25:31 [Note] WSREP: Setting initial position to 72d2bae9-5df8-11e6-bb62-cb546a1bb47f:930203610
Each then continued with the output captured at https://pastebin.ubuntu.com/26418976/, including:
180108 16:26:01 [Warning] WSREP: no nodes coming from prim view, prim not possible
and
180108 16:26:12 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():141
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1292: Failed to open channel 'juju_cluster' at 'gcomm://10.28.2.244,10.28.2.194,10.28.2.226': -110 (Connection timed out)
Both of these look like important pieces of this issue.
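For context, the "initial position" each node logs at startup is the cluster UUID and last committed seqno, which mysqld normally reads from grastate.dat in the datadir. Assuming the default datadir, the file on these nodes would have looked roughly like this (uuid and seqno taken from the log line above, other fields illustrative):

# GALERA saved state
version: 2.1
uuid:    72d2bae9-5df8-11e6-bb62-cb546a1bb47f
seqno:   930203610

After an unclean shutdown the seqno is often left at -1 and has to be recovered, but since all three nodes reported the same uuid:seqno they evidently agreed on where they had stopped.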
What I'm wondering is: if all 3 nodes of a percona-cluster determine that they are starting from the same initial position, why doesn't the cluster elect a primary component automatically?
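For what it's worth, Galera 3.6 and later ship a pc.recovery provider option (on by default) that saves the last primary-component membership to gvwstate.dat and re-forms the primary component automatically once every member of that view reconnects; the "no nodes coming from prim view" warning above suggests that mechanism didn't (or couldn't) kick in here. I haven't verified whether the Galera build in this PXC release supports it, but purely for illustration the option would be set via my.cnf along these lines (a sketch only; it would clobber any wsrep_provider_options the charm already sets):

[mysqld]
wsrep_provider_options = "pc.recovery=true"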
This resulted in extended cloud downtime, and manual recovery of many of the nova/neutron services was necessary after manually re-forming the percona-cluster.
Percona can't auto-recover from an all-node power failure; manual intervention is needed.
https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/
Scenario 6 in the above link explains recovery from a full-cluster failure (all nodes down); a rough sketch of that procedure is below.
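A minimal sketch, assuming the default datadir and the stock PXC init scripts (exact service and command names may differ between releases):

# 1. Find the most advanced node. After a hard crash grastate.dat often shows
#    seqno: -1, in which case recover the position first:
sudo mysqld_safe --wsrep-recover     # then grep the error log for "Recovered position: <uuid>:<seqno>"
cat /var/lib/mysql/grastate.dat      # if the file survived intact

# 2. Bootstrap the most advanced node as a new primary component.
sudo service mysql bootstrap-pxc     # if mysqld is stopped (equivalent to starting with --wsrep-new-cluster)
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'"   # if mysqld is still running but stuck in non-primary

# 3. Start mysql normally on the remaining nodes; they rejoin via IST/SST.
sudo service mysql start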