Primary node leaving the cluster causes other nodes to crash

Bug #1323412 reported by Fernando Laudares Camargos
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC; status tracked in 5.6)

  Series   Status         Importance   Assigned to
  5.5      Fix Released   High         Unassigned
  5.6      Fix Released   High         Unassigned

Bug Description

The environment is a cluster composed of three nodes: db1, db2 and db3. db2 was acting as the primary node when it warned of a gap in the state sequence and killed some local connections:

2014-05-26 14:42:25 27548 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-05-26 14:42:25 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.42941S), skipping check
2014-05-26 14:42:27 27548 [Note] WSREP: killing local connection: 134143377
2014-05-26 14:42:46 27548 [Note] WSREP: killing local connection: 134129242
2014-05-26 14:42:46 27548 [Note] WSREP: killing local connection: 134143482
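
As an aside on the state transfer the node was heading for: whether a rejoining node can use the much cheaper IST instead of a full SST depends on the donor still holding the missing write-sets in its gcache ring buffer. A minimal sketch, assuming the default 128M gcache is the limiting factor (the report does not say):

# my.cnf -- illustrative value only, not from this cluster:
wsrep_provider_options="gcache.size=1G"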

It got overloaded a few more times; during one of these episodes the gcomm background thread stalled for around 20 seconds:

2014-05-26 14:42:49 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT21.1142S), skipping check
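
The timers that govern suspicion and eviction are tunable through wsrep_provider_options. As an illustration only (these values are assumptions, not the reporter's settings), loosening the EVS timeouts gives a briefly stalled node more headroom before the group gives up on it, at the cost of slower detection of real failures:

# my.cnf on each node -- illustrative values, not from this cluster:
[mysqld]
wsrep_provider_options="evs.suspect_timeout=PT10S; evs.inactive_timeout=PT30S; evs.install_timeout=PT15S"

The "last inactive check" warning itself just means the gcomm thread ran its periodic liveness pass much later than evs.inactive_check_period expects -- a symptom of the stall rather than a cause.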

Because of that, the node dropped from the group while it was requesting an SST, and soon after it decided to abort, given that the SST was not possible:

2014-05-26 14:43:06 27548 [ERROR] WSREP: Requesting state transfer failed: -125(Operation canceled)
2014-05-26 14:43:06 27548 [ERROR] WSREP: State transfer request failed unrecoverably: 125 (Operation canceled). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2014-05-26 14:43:06 27548 [Note] WSREP: Closing send monitor...
2014-05-26 14:43:06 27548 [Note] WSREP: Closed send monitor.
2014-05-26 14:43:06 27548 [Note] WSREP: gcomm: terminating thread
2014-05-26 14:43:06 27548 [Note] WSREP: gcomm: joining thread
2014-05-26 14:43:06 27548 [Note] WSREP: gcomm: closing backend
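
Since the error states that a restart is required, bringing db2 back is a manual step; assuming the stock init-script layout of this RHEL6 package (a guess, not stated in the report), the node rejoins and requests state transfer again with:

# on db2, after the abort:
service mysql restart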

While aborting, it tried to leave the cluster gracefully by closing the gcomm connection, which caused some message exchange between the nodes and triggered a bug on db1 and db3, effectively crashing both nodes:

db1:
2014-05-26 14:43:21 2979 [Warning] WSREP: evs::proto(9bdb737e-df4a-11e3-87c9-eab020c42bd0, GATHER, view_id(REG,9bdb737e-df4a-11e3-87c9-eab020c42bd0,174)) install timer expired
2014-05-26 14:43:21 2979 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap::value(i).leave_message() == 0: (FATAL)

db3:
2014-05-26 14:43:21 30104 [Warning] WSREP: evs::proto(c7a2a117-daa4-11e3-8b73-863bb950f40a, GATHER, view_id(REG,9bdb737e-df4a-11e3-87c9-eab020c42bd0,174)) install timer expired
2014-05-26 14:43:21 30104 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap::value(i).leave_message() == 0: (FATAL)
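
To make the failure mode concrete, here is a simplified C++ sketch of the kind of invariant check behind that message. It is a reconstruction for illustration, not the actual gcomm source; the types and function below are invented. The point is that the membership re-validation throws a fatal exception instead of handling a member whose leave message arrived during the view change:

// Simplified illustration, not the actual gcomm source. When the EVS
// install timer expires in GATHER state, the known-node map is
// re-validated; a member still present in the map is expected to have
// no pending leave message. db2's graceful leave during the membership
// change breaks that expectation.
#include <map>
#include <stdexcept>

struct LeaveMessage { };

struct Node
{
    const LeaveMessage* leave_msg_;            // non-null once a leave was announced
    const LeaveMessage* leave_message() const { return leave_msg_; }
};

typedef std::map<int, Node> NodeMap;           // keyed by node UUID in the real code

void on_install_timer_expired(const NodeMap& known)
{
    for (NodeMap::const_iterator i = known.begin(); i != known.end(); ++i)
    {
        // The failing assertion from the log: rather than tolerating a
        // member that left mid-transition, the check throws, and the
        // backend treats the exception as unrecoverable ("backend must
        // be restarted") -- which is what crashed db1 and db3.
        if (i->second.leave_message() != 0)
        {
            throw std::logic_error("NodeMap::value(i).leave_message() == 0: (FATAL)");
        }
    }
}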

This happened with Percona-XtraDB-Cluster-galera-3-3.5-1.216.rhel6.x86_64, built with Galera 25.3.4.

I've also opened a bug on Galera's GitHub as suggested by Teemu: https://github.com/codership/galera/issues/41

Tags: i42454
Shahriyar Rzayev (rzayev-sehriyar) wrote:

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1005
