fuel 5.1.1 - Galera Cluster node failure and Cloud down.

Bug #1428133 reported by Vasilios Tzanoudakis
Affects | Status | Importance | Assigned to
Fuel for OpenStack | Invalid | High | Fuel Library (Deprecated)
5.0.x | Invalid | High | Fuel Library (Deprecated)
5.1.x | Invalid | High | MOS Maintenance
6.0.x | Invalid | High | MOS Maintenance
6.1.x | Invalid | High | Fuel Library (Deprecated)

Bug Description

Environment:

3 Controllers HA + Ceilometer
Ubuntu OS
Ceph for ALL
GRE

ISO used: fuel-community-5.1.1-92-2014-11-08_05-26-47.iso

Here are the logs from mysql : http://paste.openstack.org/show/187667/

The problem in the cloud started on Mar 4 00:54:55.

At that time the L3 agent wasn't responding and I lost L3 connectivity for all instances.

Pacemaker timed out after several retries while trying to bring MySQL up again.

What I did as a quick fix:
1. Restarted the node whose MySQL server instance was down.
2. Restarted the L3 agent resource from crm.
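
For reference, a minimal sketch of the crm commands such a quick fix typically involves; the resource name p_neutron-l3-agent is an assumption and should be verified against the actual resource list on the controllers:

# List the Pacemaker resources and their state to spot the failed ones:
crm status
# Clear the failure history for the L3 agent resource (name assumed):
crm resource cleanup p_neutron-l3-agent
# Restart it by stopping and starting the resource:
crm resource stop p_neutron-l3-agent
crm resource start p_neutron-l3-agent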

Is this some kind of Galera/MySQL bug?

thank you

Tags: galera mysql
affects: mos → fuel
tags: added: galera
tags: added: mysql
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :
Revision history for this message
Stanislav Makar (smakar) wrote :

Could you please upload a diagnostic snapshot?

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 5.1.2
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

QA team, please confirm whether this bug can be reproduced on other releases.

Changed in fuel:
importance: Undecided → High
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

I will make the snapshot now and upload it shortly.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

The root cause of the issue is not clear from the description. The links refer to a version that is not used in Fuel. As far as I understood, the initial cause was network issues: Galera lost quorum, so it decided to perform an IST or SST. It looks like another network issue occurred during that time, so the xtrabackup process got stuck in a bad state. To resolve such network issues, there is a manual procedure for restoring the MySQL cluster.

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

We ran the backup job on all three nodes at the SAME EXACT TIME, so this may be the reason for the subsequent abnormalities you may see.

The xtrabackup version installed is:
xtrabackup version 2.1.4 for Percona Server 5.1.70 unknown-linux-gnu (x86_64) (revision id: undefined)

and InnoDB Backup Utility v1.5.1-xtrabackup

Here are the exact backup commands from cron:

mkdir -p /srv/xtrabackup_archive/differential/`date +"%Y"`/`date +"%m"`/`date +"%d"` && innobackupex --stream=xbstream --user=root --incremental --incremental-basedir=/srv/xtrabackup_checkpoints /tmp | gzip - > /srv/xtrabackup_archive/differential/`date +"%Y"`/`date +"%m"`/`date +"%d"`/`hostname`.gr1.etherland.eu_`date +"%F_%R"`_differntial.xbstream.gz

mkdir -p /srv/xtrabackup_archive/full/`date +"%Y"`/`date +"%m"`/`date +"%d"` && innobackupex --stream=xbstream --user=root --extra-lsndir=/srv/xtrabackup_checkpoints /tmp | gzip - > /srv/xtrabackup_archive/full/`date +"%Y"`/`date +"%m"`/`date +"%d"`/`hostname`.gr1.etherland.eu_`date +"%F_%R"`_full.xbstream.gz

mkdir -p /srv/xtrabackup_archive/incremental/`date +"%Y"`/`date +"%m"`/`date +"%d"` && innobackupex --stream=xbstream --user=root --extra-lsndir=/srv/xtrabackup_checkpoints --incremental --incremental-basedir=/srv/xtrabackup_checkpoints /tmp | gzip - > /srv/xtrabackup_archive/incremental/`date +"%Y"`/`date +"%m"`/`date +"%d"`/`hostname`.gr1.etherland.lan_`date +"%F_%R"`_incremental.xbstream.gz
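
If running the backup on all three controllers at exactly the same time is indeed what destabilized Galera, one way to avoid the overlap would be to stagger the cron schedules per node. A minimal sketch with arbitrary example times, where xtrabackup_full.sh is a hypothetical wrapper around the innobackupex command above:

# node-1 crontab: run the full backup at 01:00 on Sunday
0 1 * * 0 /usr/local/bin/xtrabackup_full.sh
# node-2 crontab: the same job, offset by two hours
0 3 * * 0 /usr/local/bin/xtrabackup_full.sh
# node-3 crontab: the same job, offset by four hours
0 5 * * 0 /usr/local/bin/xtrabackup_full.sh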

The snapshot is being uploaded. I will let you know.

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

The issue struck again today, a few hours ago. Here is the latest diagnostic snapshot:

https://www.dropbox.com/s/5wuhnr318srdddy/fuel-snapshot-2015-04-03_23-40-51.tgz?dl=0

This is a very serious bug and it struck again within a few days. The MySQL cluster went down again, and we lost connectivity with all instances as well as the services that depend on the MySQL cluster.

Please advise. This is a default Fuel 5.1.1 installation with no other changes.

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Please note that again there was no connectivity issue between the controller nodes.

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

The MySQL cluster is controlled by Pacemaker. When you introduce a backup script, Pacemaker needs to know how to handle such cases. I would suggest backporting the OCF script, as there were some changes to handle these cases:

http://docs.mirantis.com/openstack/fuel/fuel-6.0/operations.html#howto-backport-galera-pacemaker-ocf-script

Also, I would recommend reading

http://docs.mirantis.com/openstack/fuel/fuel-6.0/operations.html#openstack-database-backup-and-restore-with-percona-xtrabackup
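
As an aside, one general way to keep Pacemaker from reacting to load or stalls while a backup runs is to temporarily unmanage the Galera resource for the duration of the job. This is only a sketch of that idea, not the procedure from the documents above; note that unmanaging clone_p_mysql affects the resource cluster-wide:

# Before the backup starts, tell Pacemaker to stop managing the Galera resource:
crm resource unmanage clone_p_mysql
# ... run the innobackupex job here ...
# When the backup has finished, hand control back to Pacemaker:
crm resource manage clone_p_mysql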

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Unfortunately, the bug occurred while the backup was disabled on all nodes!

Please advise

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Vasilios

To recover the database now, you need to find the node that is the best candidate (the one containing the latest data) and shut Galera down everywhere. Then log in to that node, change wsrep.cnf to disable the Galera plugin, start the MySQL server standalone, and run mysqldump to dump everything wherever you want. Next, start the Galera server only on this node by banning the clone_p_mysql Pacemaker resource on all the other nodes; this node will then start as the Galera master. Finally, unban all the other nodes and they will start syncing from this node. For more information please ask for help in the #fuel-dev IRC channel.
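
A minimal command-level sketch of the procedure described above, assuming an Ubuntu controller where the Galera settings live in /etc/mysql/conf.d/wsrep.cnf and the other controllers are named node-2 and node-3 (all paths and node names are assumptions, and the exact ban/clear syntax depends on the Pacemaker version):

# 1. On each controller, check which node holds the latest data (highest seqno):
cat /var/lib/mysql/grastate.dat
# 2. Stop the Galera resource cluster-wide:
crm resource stop clone_p_mysql
# 3. On the best candidate, disable the wsrep provider, start MySQL standalone and dump the data:
sed -i 's/^wsrep_provider=.*/wsrep_provider=none/' /etc/mysql/conf.d/wsrep.cnf
service mysql start
mysqldump --all-databases > /root/galera-recovery-dump.sql
service mysql stop
# (restore the original wsrep_provider line before handing MySQL back to Pacemaker)
# 4. Ban the resource on the other controllers so only this node starts, as the Galera master:
crm_resource --ban --resource clone_p_mysql --node node-2
crm_resource --ban --resource clone_p_mysql --node node-3
crm resource start clone_p_mysql
# 5. Once the master is up, clear the bans and the other nodes will sync from it:
crm_resource --clear --resource clone_p_mysql --node node-2
crm_resource --clear --resource clone_p_mysql --node node-3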

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Dear Vladimir,

Thank you for your reply and your valuable guidance on the recovery procedure.
The main purpose of this bug report is to find out why the Galera cluster went down a second time within a few days.

At first I thought it was the backup script that caused the problem, and I disabled it just to be sure. But then, after some days, the issue came up again and the MySQL cluster went down again.

So I am trying to figure out what caused this issue so that it never happens again.

thank you

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Closing as Invalid for 5.1.1-updates and 6.0-updates, as there is no evidence this issue is relevant to 5.1.1 and 6.0.
