Malformed 3 unit cluster (percona-cluster)

Bug #1655417 reported by Andreas Hasenack
This bug affects 2 people
Affects                                    Status        Importance  Assigned to
Landscape Server                           Invalid       Undecided   Unassigned
OpenStack Percona Cluster Charm            Fix Released  High        Unassigned
percona-cluster (Juju Charms Collection)   Invalid       High        Unassigned

Bug Description

Using percona-cluster r247 from the charm store, on xenial

What happened initially was that keystone/0 failed in the shared-db-relation-changed hook (keystone-hook-error.txt):

2017-01-10 16:32:26 INFO shared-db-relation-changed keystoneauth1.exceptions.http.InternalServerError: An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-13812631-7c60-42b3-b38c-5ee360aa8ac3)

Investigation showed that mysql/2 was blocked and that there was no mysql running on that unit:
mysql/2 blocked idle 1/lxd/4 10.2.103.76 Unit is not in sync
  hacluster-mysql/2 active idle 10.2.103.76 Unit is ready and clustered

Running "show global status" in mysql on the leader unit (mysql/0) confirmed that the cluster had only two members (wsrep.txt):
| wsrep_cluster_size | 2 |
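
For reference, the same check can be run from any unit (the SHOW GLOBAL STATUS statement is standard Galera; the prompt hostname is illustrative, and the root password file path is an assumption about where the charm stores it):

root@juju-eb69cd-0-lxd-N:~# mysql -uroot -p"$(cat /var/lib/mysql/mysql.passwd)" \
    -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"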

crm status on mysql/2 was oblivious to the failure (crm_status.txt):
root@juju-eb69cd-1-lxd-4:~# crm status
Last updated: Tue Jan 10 17:12:38 2017 Last change: Tue Jan 10 16:09:10 2017 by hacluster via crmd on juju-eb69cd-3-lxd-0
Stack: corosync
Current DC: juju-eb69cd-1-lxd-4 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ juju-eb69cd-1-lxd-4 juju-eb69cd-3-lxd-0 juju-eb69cd-4-lxd-4 ]

Full list of resources:

 Resource Group: grp_percona_cluster
     res_mysql_vip (ocf::heartbeat:IPaddr2): Started juju-eb69cd-3-lxd-0
 Clone Set: cl_mysql_monitor [res_mysql_monitor]
     Started: [ juju-eb69cd-1-lxd-4 juju-eb69cd-3-lxd-0 juju-eb69cd-4-lxd-4 ]

systemctl status confirmed mysql had exited (systemctl_status.txt):
root@juju-eb69cd-1-lxd-4:/var/log/mysql# systemctl status mysql
● mysql.service - LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon
   Loaded: loaded (/etc/init.d/mysql; bad; vendor preset: enabled)
   Active: active (exited) since Tue 2017-01-10 16:08:03 UTC; 1h 40min ago
     Docs: man:systemd-sysv-generator(8)

Jan 10 16:07:39 juju-eb69cd-1-lxd-4 systemd[1]: Stopped LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.
Jan 10 16:07:39 juju-eb69cd-1-lxd-4 systemd[1]: Starting LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon...
Jan 10 16:07:39 juju-eb69cd-1-lxd-4 mysql[21801]: * Starting MySQL (Percona XtraDB Cluster) database server mysqld
Jan 10 16:07:42 juju-eb69cd-1-lxd-4 mysql[21801]: * State transfer in progress, setting sleep higher mysqld
Jan 10 16:08:03 juju-eb69cd-1-lxd-4 mysql[21801]: ...done.
Jan 10 16:08:03 juju-eb69cd-1-lxd-4 systemd[1]: Started LSB: Start and stop the mysql (Percona XtraDB Cluster) daemon.
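
On a unit in this state, the process table is the quickest cross-check of what systemd reports (pgrep here is a generic suggestion, not taken from the logs):

root@juju-eb69cd-1-lxd-4:~# pgrep -a mysqld || echo "no mysqld process running"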

I did a "service mysql start", to no effect. Then I did a "service mysql stop" followed by a "service mysql start", and that fixed things. The cluster now has 3 units, and even juju status agreed the next time update-status ran.
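
For the record, the recovery sequence reconstructed from the above:

root@juju-eb69cd-1-lxd-4:~# service mysql start   # no effect
root@juju-eb69cd-1-lxd-4:~# service mysql stop
root@juju-eb69cd-1-lxd-4:~# service mysql start   # cluster back to 3 members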

I'm attaching the files mentioned above, and also a tarball called mysql-2.tar.bz2 which has /var/log from mysql/2 before we tried to fix it.

Tags: landscape
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/var/log/* from mysql/2 before the attempts to fix it via service restarts.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

crm status ran on mysql/2

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

systemctl status

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

wsrep data from "show global status" on the leader (mysql/0)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The original hook error

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/var/log/mysql/error.log right when issuing the "service mysql start" that fixed things

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju mysql/2 unit log after the mysql restart that fixed things

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

wsrep data on the master after the restart on mysql/2 that fixed things

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

An all-machines.log from this deployment, containing logs from all units.

Revision history for this message
David Ames (thedac) wrote :

Two problems were found:

1) systemd was unaware of the true state of the mysqld daemon;
   a stop followed by a start restored the unit.
2) corosync, despite spewing log entries about mysql being down, did not remove the mysql/2 unit from the cluster.

Doing some loop testing to try and re-create the problem.
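
On problem (1): mysql here is an LSB init script, so systemd only tracks the script itself and keeps reporting the unit as active regardless of whether mysqld survived. The two relevant fields can be pulled directly (systemctl show with -p is standard systemd; the output shown is what the status above implies):

root@juju-eb69cd-1-lxd-4:~# systemctl show mysql -p ActiveState -p SubState
ActiveState=active
SubState=exited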

Changed in percona-cluster (Juju Charms Collection):
status: New → Triaged
importance: Undecided → High
milestone: none → 17.01
assignee: nobody → David Ames (thedac)
Revision history for this message
Narinder Gupta (narindergupta) wrote :
Changed in landscape:
milestone: none → 17.01
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/432502

Changed in percona-cluster (Juju Charms Collection):
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/432502
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=58d3641dee06a553ed4408b47313b18ca1761d2d
Submitter: Jenkins
Branch: master

commit 58d3641dee06a553ed4408b47313b18ca1761d2d
Author: David Ames <email address hidden>
Date: Tue Feb 7 15:10:11 2017 -0800

    Wait until clustered before running client hooks

    Percona cluster takes some time to fully cluster. The charm was
    previously running shared-db-relation-changed hooks whenever they
    were queued even if the cluster was not yet complete. This may lead
    to split brain scenarios or unexpected behavior.

    This change confirms the entire cluster is ready before running
    client shared-db-relation-changed hooks.

    min-cluster-size can now be used to attempt to guarantee the cluster
    is ready with the expected number of nodes. If min-cluster-size is
    not set the charm will still determine based on the information
    available if all the cluster nodes are ready. Single node
    deployments are still possible.

    Partial-Bug: #1655417
    Change-Id: Ie9deb266a9682e86f3a9cbc1103b655b13a8295e
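
For operators, the guard described in this commit can be made explicit (a usage sketch: min-cluster-size is the option named in the commit above, "mysql" is the application name in this deployment, and the syntax assumes Juju 2.x):

juju config mysql min-cluster-size=3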

James Page (james-page)
Changed in charm-percona-cluster:
assignee: nobody → David Ames (thedac)
importance: Undecided → High
status: New → In Progress
Changed in percona-cluster (Juju Charms Collection):
status: In Progress → Invalid
Revision history for this message
Chad Smith (chad.smith) wrote : Re: Malformed 3 unit cluster

Updated worker multiplier to 1.0 and haven't seen this issue since.
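
(The comment doesn't say where the multiplier was set; assuming it refers to the OpenStack charms' worker-multiplier option under Juju 2.x, it would look something like:)

juju config keystone worker-multiplier=1.0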

Changed in landscape:
status: New → Invalid
summary: - Malformed 3 unit cluster
+ Malformed 3 unit cluster (percona-cluster)
Ryan Beisner (1chb1n)
Changed in charm-percona-cluster:
status: In Progress → Incomplete
Revision history for this message
David Ames (thedac) wrote :
Changed in charm-percona-cluster:
status: Incomplete → Fix Released
assignee: David Ames (thedac) → nobody
Changed in percona-cluster (Juju Charms Collection):
assignee: David Ames (thedac) → nobody