Primary node leaving the cluster causes other nodes to crash
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Status tracked in 5.6 | | |
5.5 | Fix Released | High | Unassigned |
5.6 | Fix Released | High | Unassigned |
Bug Description
The environment is a cluster of three nodes: db1, db2 and db3. db2 was acting as the primary node when it warned about a gap in the state sequence and killed some connections:
2014-05-26 14:42:25 27548 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-05-26 14:42:25 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.42941S), skipping check
2014-05-26 14:42:27 27548 [Note] WSREP: killing local connection: 134143377
2014-05-26 14:42:46 27548 [Note] WSREP: killing local connection: 134129242
2014-05-26 14:42:46 27548 [Note] WSREP: killing local connection: 134143482
It became overloaded a few more times; in one of them the gcomm background thread stalled for around 20 seconds:
2014-05-26 14:42:49 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT21.1142S), skipping check
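To spot stalls like this one in a longer error log, the "last inactive check" warnings can be filtered by the reported duration. The following is a minimal sketch, not part of the original report: the function name `find_stalls` and the `min_seconds` threshold are hypothetical, and it assumes the duration is always printed in whole and fractional seconds (`PT<number>S`), as in the lines quoted above.

```python
import re

# Matches the warning format quoted in this report, e.g.
# "... WSREP: last inactive check more than PT1.5S ago (PT21.1142S), skipping check"
# Assumption: durations are expressed in seconds only (PT<number>S).
STALL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"last inactive check more than PT(?P<limit>[\d.]+)S ago "
    r"\(PT(?P<actual>[\d.]+)S\)"
)

def find_stalls(lines, min_seconds=0.0):
    """Return (timestamp, stall_seconds) for each matching warning
    whose reported delay is at least min_seconds."""
    stalls = []
    for line in lines:
        match = STALL_RE.search(line)
        if match and float(match.group("actual")) >= min_seconds:
            stalls.append((match.group("ts"), float(match.group("actual"))))
    return stalls

# Log lines copied from this report.
sample = [
    "2014-05-26 14:42:25 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.42941S), skipping check",
    "2014-05-26 14:42:27 27548 [Note] WSREP: killing local connection: 134143377",
    "2014-05-26 14:42:49 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT21.1142S), skipping check",
]

# Report only stalls of 10 seconds or longer.
print(find_stalls(sample, min_seconds=10))
```

With `min_seconds=10` only the roughly 21-second stall is reported; with the default threshold both warnings are returned.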
Because of that, the node was dropped from the group while it was requesting an SST, and soon afterwards it decided to abort, given that the SST was no longer possible:
2014-05-26 14:43:06 27548 [ERROR] WSREP: Requesting state transfer failed: -125(Operation canceled)
2014-05-26 14:43:06 27548 [ERROR] WSREP: State transfer request failed unrecoverably: 125 (Operation canceled). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2014-05-26 14:43:06 27548 [Note] WSREP: Closing send monitor...
2014-05-26 14:43:06 27548 [Note] WSREP: Closed send monitor.
2014-05-26 14:43:06 27548 [Note] WSREP: gcomm: terminating thread
2014-05-26 14:43:06 27548 [Note] WSREP: gcomm: joining thread
2014-05-26 14:43:06 27548 [Note] WSREP: gcomm: closing backend
While aborting, it tried to leave the cluster gracefully by closing the gcomm connection, which caused some message exchange between the nodes and triggered a bug on db1 and db3, effectively crashing both nodes:
db1:
2014-05-26 14:43:21 2979 [Warning] WSREP: evs::proto(
2014-05-26 14:43:21 2979 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap:
db3:
2014-05-26 14:43:21 30104 [Warning] WSREP: evs::proto(
2014-05-26 14:43:21 30104 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap:
This happened in Percona-
I've also opened a bug on Galera's Github as suggested by Teemu: https:/
tags: | added: i42454 |
Changed in percona-xtradb-cluster: | |
status: | New → Confirmed |
Fix available in https://github.com/codership/galera/commit/ed9b4bb58c76e669ef0a96f02dccee77691800eb