fuel 5.1.1 - Galera Cluster node failure and Cloud down.

Bug #1428133 reported by Vasilios Tzanoudakis
Affects | Status | Importance | Assigned to
Fuel for OpenStack | Invalid | High | Fuel Library (Deprecated)
5.0.x | Invalid | High | Fuel Library (Deprecated)
5.1.x | Invalid | High | MOS Maintenance
6.0.x | Invalid | High | MOS Maintenance
6.1.x | Invalid | High | Fuel Library (Deprecated)

Bug Description

Environment:

3 Controllers HA + Ceilometer
Ubuntu OS
Ceph for ALL
GRE

ISO used: fuel-community-5.1.1-92-2014-11-08_05-26-47.iso

Here are the logs from mysql : http://paste.openstack.org/show/187667/

The problem in the cloud started on Mar 4 00:54:55.

At that time the L3 agent wasn't responding and I lost L3 connectivity for all instances.

Pacemaker timed out after several retries while trying to bring MySQL up again.

What I did as a quick fix:
1. Restarted the node whose MySQL server instance was down.
2. Restarted the L3 agent resource from crm.
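
For reference, a minimal sketch of the crm commands such a quick fix typically involves; the resource name p_neutron-l3-agent is an assumption and should be verified against the actual resource list on the controllers:

# List the Pacemaker resources and their state to spot the failed ones:
crm status
# Clear the failure history for the L3 agent resource (name assumed):
crm resource cleanup p_neutron-l3-agent
# Restart it by stopping and starting the resource:
crm resource stop p_neutron-l3-agent
crm resource start p_neutron-l3-agent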

Is this some kind of Galera/MySQL bug?

thank you

Tags: galera mysql
affects: mos → fuel
tags: added: galera
tags: added: mysql
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :
Revision history for this message
Stanislav Makar (smakar) wrote :

Could you please upload a diagnostic snapshot?

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 5.1.2
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

QA team, please confirm whether this bug can be reproduced on other releases.

Changed in fuel:
importance: Undecided → High
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

I will make the snapshot now and upload it shortly.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

The root cause of the issue is not clear from the description. The links refer to a version that is not used in Fuel. As far as I understood, the initial cause was network issues: Galera lost quorum, so it decided to perform an IST or SST. It looks like another network issue occurred during that time, so the xtrabackup process got stuck in a bad state. To resolve such network issues, there is a manual procedure for restoring the MySQL cluster.

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

We ran the backup job on all three nodes at the SAME EXACT TIME, so this may be the reason for the subsequent abnormalities you may see.

The xtrabackup version installed is:
xtrabackup version 2.1.4 for Percona Server 5.1.70 unknown-linux-gnu (x86_64) (revision id: undefined)

and InnoDB Backup Utility v1.5.1-xtrabackup

Here are the exact backup commands from cron:

mkdir -p /srv/xtrabackup_archive/differential/`date +"%Y"`/`date +"%m"`/`date +"%d"` && innobackupex --stream=xbstream --user=root --incremental --incremental-basedir=/srv/xtrabackup_checkpoints /tmp | gzip - > /srv/xtrabackup_archive/differential/`date +"%Y"`/`date +"%m"`/`date +"%d"`/`hostname`.gr1.etherland.eu_`date +"%F_%R"`_differntial.xbstream.gz

mkdir -p /srv/xtrabackup_archive/full/`date +"%Y"`/`date +"%m"`/`date +"%d"` && innobackupex --stream=xbstream --user=root --extra-lsndir=/srv/xtrabackup_checkpoints /tmp | gzip - > /srv/xtrabackup_archive/full/`date +"%Y"`/`date +"%m"`/`date +"%d"`/`hostname`.gr1.etherland.eu_`date +"%F_%R"`_full.xbstream.gz

mkdir -p /srv/xtrabackup_archive/incremental/`date +"%Y"`/`date +"%m"`/`date +"%d"` && innobackupex --stream=xbstream --user=root --extra-lsndir=/srv/xtrabackup_checkpoints --incremental --incremental-basedir=/srv/xtrabackup_checkpoints /tmp | gzip - > /srv/xtrabackup_archive/incremental/`date +"%Y"`/`date +"%m"`/`date +"%d"`/`hostname`.gr1.etherland.lan_`date +"%F_%R"`_incremental.xbstream.gz
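
If running the backup on all three controllers at exactly the same time is indeed what destabilized Galera, one way to avoid the overlap would be to stagger the cron schedules per node. A minimal sketch with arbitrary example times, where xtrabackup_full.sh is a hypothetical wrapper around the innobackupex command above:

# node-1 crontab: run the full backup at 01:00 on Sunday
0 1 * * 0 /usr/local/bin/xtrabackup_full.sh
# node-2 crontab: the same job, offset by two hours
0 3 * * 0 /usr/local/bin/xtrabackup_full.sh
# node-3 crontab: the same job, offset by four hours
0 5 * * 0 /usr/local/bin/xtrabackup_full.sh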

The snapshot is being uploaded. I will let you know.

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

The issue struck again today, a few hours ago. Here is the latest diagnostic snapshot:

https://www.dropbox.com/s/5wuhnr318srdddy/fuel-snapshot-2015-04-03_23-40-51.tgz?dl=0

This is a very serious bug and it struck again within a few days. The MySQL cluster went down again, and we lost connectivity with all instances as well as the services that depend on the MySQL cluster.

Please advise. This is a default Fuel 5.1.1 installation with no other changes.

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Please note that again there was no connectivity issue between the controller nodes.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

The MySQL cluster is controlled by Pacemaker. When you introduce a backup script, Pacemaker needs to know how to handle such cases. I would suggest backporting the OCF script, as there were some changes to handle these cases:

http://docs.mirantis.com/openstack/fuel/fuel-6.0/operations.html#howto-backport-galera-pacemaker-ocf-script

Also, I would recommend reading

http://docs.mirantis.com/openstack/fuel/fuel-6.0/operations.html#openstack-database-backup-and-restore-with-percona-xtrabackup
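
As an aside, one general way to keep Pacemaker from reacting to load or stalls while a backup runs is to temporarily unmanage the Galera resource for the duration of the job. This is only a sketch of that idea, not the procedure from the documents above; note that unmanaging clone_p_mysql affects the resource cluster-wide:

# Before the backup starts, tell Pacemaker to stop managing the Galera resource:
crm resource unmanage clone_p_mysql
# ... run the innobackupex job here ...
# When the backup has finished, hand control back to Pacemaker:
crm resource manage clone_p_mysql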

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Unfortunately, the bug occurred while the backup was disabled on all nodes!

Please advise

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Vasilios

To recover the database now, you need to find the node that is the best candidate (the one containing the latest data) and shut Galera down everywhere. Then log in to that node, change wsrep.cnf to disable the Galera plugin, start the MySQL server standalone, and run mysqldump to dump everything wherever you want. Next, start the Galera server only on this node by banning the clone_p_mysql Pacemaker resource on all the other nodes; this node will then start as the Galera master. Finally, unban all the other nodes and they will start syncing from this node. For more information please ask for help in the #fuel-dev IRC channel.
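
A minimal command-level sketch of the procedure described above, assuming an Ubuntu controller where the Galera settings live in /etc/mysql/conf.d/wsrep.cnf and the other controllers are named node-2 and node-3 (all paths and node names are assumptions, and the exact ban/clear syntax depends on the Pacemaker version):

# 1. On each controller, check which node holds the latest data (highest seqno):
cat /var/lib/mysql/grastate.dat
# 2. Stop the Galera resource cluster-wide:
crm resource stop clone_p_mysql
# 3. On the best candidate, disable the wsrep provider, start MySQL standalone and dump the data:
sed -i 's/^wsrep_provider=.*/wsrep_provider=none/' /etc/mysql/conf.d/wsrep.cnf
service mysql start
mysqldump --all-databases > /root/galera-recovery-dump.sql
service mysql stop
# (restore the original wsrep_provider line before handing MySQL back to Pacemaker)
# 4. Ban the resource on the other controllers so only this node starts, as the Galera master:
crm_resource --ban --resource clone_p_mysql --node node-2
crm_resource --ban --resource clone_p_mysql --node node-3
crm resource start clone_p_mysql
# 5. Once the master is up, clear the bans and the other nodes will sync from it:
crm_resource --clear --resource clone_p_mysql --node node-2
crm_resource --clear --resource clone_p_mysql --node node-3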

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Dear Vladimir,

Thank you for your reply and your valuable guidance on the recovery procedure.
The main purpose of this bug report is to find out why the Galera cluster went down a second time within a few days.

At first I thought it was the backup script that caused the problem, and I disabled it just to be sure. But then, after some days, the issue came up again and the MySQL cluster went down again.

So I am trying to figure out what caused this issue so that it never happens again.

thank you

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Closing as Invalid for 5.1.1-updates and 6.0-updates, as there is no evidence this issue is relevant to 5.1.1 and 6.0.
