Malformed 3 unit cluster (percona-cluster)

Bug #1655417 reported by Andreas Hasenack
This bug affects 2 people
Affects                                    Status        Importance  Assigned to
Landscape Server                           Invalid       Undecided   Unassigned
OpenStack Percona Cluster Charm            Fix Released  High        Unassigned
percona-cluster (Juju Charms Collection)   Invalid       High        Unassigned

Bug Description

Using percona-cluster r247 from the charm store, on xenial

What happened initially was that keystone/0 failed in the shared-db-relation-changed hook (keystone-hook-error.txt):

2017-01-10 16:32:26 INFO shared-db-relation-changed keystoneauth1.exceptions.http.InternalServerError: An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-13812631-7c60-42b3-b38c-5ee360aa8ac3)

Investigation showed that mysql/2 was blocked and that there was no mysql running on that unit:
mysql/2 blocked idle 1/lxd/4 10.2.103.76 Unit is not in sync
  hacluster-mysql/2 active idle 10.2.103.76 Unit is ready and clustered

Running "show global status" in mysql on the leader unit (mysql/0) confirmed that the cluster had only two members (wsrep.txt):
| wsrep_cluster_size | 2 |
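
For reference, the same check can be run from any unit (the SHOW GLOBAL STATUS statement is standard Galera; the prompt hostname is illustrative, and the root password file path is an assumption about where the charm stores it):

root@juju-eb69cd-0-lxd-N:~# mysql -uroot -p"$(cat /var/lib/mysql/mysql.passwd)" \
    -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"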

crm status on mysql/2 was oblivious to the failure (crm_status.txt):
root@juju-eb69cd-1-lxd-4:~# crm status
Last updated: Tue Jan 10 17:12:38 2017 Last change: Tue Jan 10 16:09:10 2017 by hacluster via crmd on juju-eb69cd-3-lxd-0
Stack: corosync
Current DC: juju-eb69cd-1-lxd-4 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ juju-eb69cd-1-lxd-4 juju-eb69cd-3-lxd-0 juju-eb69cd-4-lxd-4 ]

Full list of resources:

 Resource Group: grp_percona_cluster
     res_mysql_vip (ocf::heartbeat:IPaddr2): Started juju-eb69cd-3-lxd-0
 Clone Set: cl_mysql_monitor [res_mysql_monitor]
     Started: [ juju-eb69cd-1-lxd-4 juju-eb69cd-3-lxd-0 juju-eb69cd-4-lxd-4 ]

systemctl status confirmed mysql had exited (systemctl_status.txt):
root@juju-eb69cd-1-lxd-4:/var/log/mysql# systemctl status mysql
● mysql.service - LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon
   Loaded: loaded (/etc/init.d/mysql; bad; vendor preset: enabled)
   Active: active (exited) since Tue 2017-01-10 16:08:03 UTC; 1h 40min ago
     Docs: man:systemd-sysv-generator(8)

Jan 10 16:07:39 juju-eb69cd-1-lxd-4 systemd[1]: Stopped LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.
Jan 10 16:07:39 juju-eb69cd-1-lxd-4 systemd[1]: Starting LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon...
Jan 10 16:07:39 juju-eb69cd-1-lxd-4 mysql[21801]: * Starting MySQL (Percona XtraDB Cluster) database server mysqld
Jan 10 16:07:42 juju-eb69cd-1-lxd-4 mysql[21801]: * State transfer in progress, setting sleep higher mysqld
Jan 10 16:08:03 juju-eb69cd-1-lxd-4 mysql[21801]: ...done.
Jan 10 16:08:03 juju-eb69cd-1-lxd-4 systemd[1]: Started LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.
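
On a unit in this state, the process table is the quickest cross-check of what systemd reports (pgrep here is a generic suggestion, not taken from the logs):

root@juju-eb69cd-1-lxd-4:~# pgrep -a mysqld || echo "no mysqld process running"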

I did a "service mysql start", to no effect. Then I did a "service mysql stop" followed by a "service mysql start", and that fixed things. The cluster now has 3 units, and even juju status agreed the next time update-status ran.
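
For the record, the recovery sequence reconstructed from the above:

root@juju-eb69cd-1-lxd-4:~# service mysql start   # no effect
root@juju-eb69cd-1-lxd-4:~# service mysql stop
root@juju-eb69cd-1-lxd-4:~# service mysql start   # cluster back to 3 members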

I'm attaching the files mentioned above, and also a tarball called mysql-2.tar.bz2 which has /var/log from mysql/2 before we tried to fix it.

Tags: landscape
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/var/log/* from mysql/2 before the attempts to fix it via service restarts.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

crm status ran on mysql/2

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

systemctl status

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

wsrep data from "show global status" on the leader (mysql/0)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The original hook error

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/var/log/mysql/error.log right when issuing the "service mysql start" that fixed things

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju mysql/2 unit log after the mysql restart that fixed things

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

wsrep data on the master after the restart on mysql/2 that fixed things

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

An all-machines.log from this deployment, containing logs from all units.

Revision history for this message
David Ames (thedac) wrote :

Two problems were found:

1) systemd was unaware of the true state of the mysqld daemon;
   a stop followed by a start restored the unit.
2) corosync, despite spewing log entries about mysql being down, did not remove the mysql/2 unit from the cluster.

Doing some loop testing to try and re-create the problem.
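
On problem (1): mysql here is an LSB init script, so systemd only tracks the script itself and keeps reporting the unit as active regardless of whether mysqld survived. The two relevant fields can be pulled directly (systemctl show with -p is standard systemd; the output shown is what the status above implies):

root@juju-eb69cd-1-lxd-4:~# systemctl show mysql -p ActiveState -p SubState
ActiveState=active
SubState=exited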

Changed in percona-cluster (Juju Charms Collection):
status: New → Triaged
importance: Undecided → High
milestone: none → 17.01
assignee: nobody → David Ames (thedac)
Revision history for this message
Narinder Gupta (narindergupta) wrote :
Changed in landscape:
milestone: none → 17.01
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/432502

Changed in percona-cluster (Juju Charms Collection):
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/432502
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=58d3641dee06a553ed4408b47313b18ca1761d2d
Submitter: Jenkins
Branch: master

commit 58d3641dee06a553ed4408b47313b18ca1761d2d
Author: David Ames <email address hidden>
Date: Tue Feb 7 15:10:11 2017 -0800

    Wait until clustered before running client hooks

    Percona cluster takes some time to fully cluster. The charm was
    previously running shared-db-relation-changed hooks whenever they
    were queued even if the cluster was not yet complete. This may lead
    to split brain scenarios or unexpected behavior.

    This change confirms the entire cluster is ready before running
    client shared-db-relation-changed hooks.

    min-cluster-size can now be used to attempt to guarantee the cluster
    is ready with the expected number of nodes. If min-cluster-size is
    not set the charm will still determine based on the information
    available if all the cluster nodes are ready. Single node
    deployments are still possible.

    Partial-Bug: #1655417
    Change-Id: Ie9deb266a9682e86f3a9cbc1103b655b13a8295e
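
For operators, the guard described in this commit can be made explicit (a usage sketch: min-cluster-size is the option named in the commit above, "mysql" is the application name in this deployment, and the syntax assumes Juju 2.x):

juju config mysql min-cluster-size=3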

James Page (james-page)
Changed in charm-percona-cluster:
assignee: nobody → David Ames (thedac)
importance: Undecided → High
status: New → In Progress
Changed in percona-cluster (Juju Charms Collection):
status: In Progress → Invalid
Revision history for this message
Chad Smith (chad.smith) wrote : Re: Malformed 3 unit cluster

Updated worker multiplier to 1.0 and haven't seen this issue since.
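
(The comment doesn't say where the multiplier was set; assuming it refers to the OpenStack charms' worker-multiplier option under Juju 2.x, it would look something like:)

juju config keystone worker-multiplier=1.0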

Changed in landscape:
status: New → Invalid
summary: - Malformed 3 unit cluster
+ Malformed 3 unit cluster (percona-cluster)
Ryan Beisner (1chb1n)
Changed in charm-percona-cluster:
status: In Progress → Incomplete
Revision history for this message
David Ames (thedac) wrote :
Changed in charm-percona-cluster:
status: Incomplete → Fix Released
assignee: David Ames (thedac) → nobody
Changed in percona-cluster (Juju Charms Collection):
assignee: David Ames (thedac) → nobody