More research: there is a very subtle difference between the logs on the VM that is exhibiting the stuck Joining problem and the logs on the VM that is syncing successfully:
-----
Good: Joining becomes Synced
-----
2014-10-16 03:15:56 3437 [Note] WSREP: 0.0 (192.168.122.166): State transfer from 1.0 (192.168.122.184) complete.
2014-10-16 03:15:56 3437 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1)
2014-10-16 03:15:56 3437 [Note] WSREP: Member 0.0 (192.168.122.166) synced with group.
2014-10-16 03:15:56 3437 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1)
2014-10-16 03:15:56 3437 [Note] WSREP: Synchronized with group, ready for connections
-----
Bad: Stuck in Joining
-----
2014-10-16 07:41:27 13120 [Note] WSREP: 1.0 (10.177.192.249): State transfer from 0.0 (10.177.193.113) complete.
2014-10-16 07:41:27 13120 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1184793)
2014-10-16 07:41:27 13120 [Note] WSREP: Member 1.0 (10.177.192.249) synced with group.
2014-10-16 07:41:27 13120 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1184793)
-----
The last line of the good log, 'Synchronized with group, ready for connections', comes from the application callback sql/wsrep_mysqld.cc:wsrep_synced_cb(void* app_ctx).
In the case of the stuck sync, Galera is setting its state from JOINED to SYNCED but is not issuing the callback to the application.
I have traced this to the Galera function gcs/src/gcs.cpp:gcs_recv_thread(void *arg).
At this point: https://github.com/percona/galera/blob/3.x/gcs/src/gcs.cpp#L1218
The state change is being recorded by the Galera engine. I suspect that something in the gcs_act_rcvd structure is preventing the subsequent callback from occurring in certain circumstances.
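For anyone who wants to check their own error log for the same symptom, here is a minimal sketch in Python. It is not part of Galera or MySQL; the function name and the regular expressions are my own assumptions, based only on the log format shown in the excerpts above. It reports every 'Shifting JOINED -> SYNCED' line that is never followed by 'Synchronized with group, ready for connections' from the same process ID.

```python
import re

# Patterns assume the log format shown in the excerpts above:
# "<date> <time> <pid> [Note] WSREP: <message>"
SHIFT_RE = re.compile(r"WSREP: Shifting JOINED -> SYNCED")
READY_RE = re.compile(r"WSREP: Synchronized with group, ready for connections")
PID_RE = re.compile(r"^\S+ \S+ (\d+) \[")


def find_missing_synced_callbacks(lines):
    """Return the SYNCED shift lines that were never confirmed.

    A shift is "confirmed" when the same process later logs the
    'ready for connections' message written by the synced callback.
    (Hypothetical helper for illustration only.)
    """
    pending = {}  # pid -> shift line still waiting for the callback message
    missing = []
    for line in lines:
        match = PID_RE.match(line)
        pid = match.group(1) if match else None
        if SHIFT_RE.search(line):
            # A second shift before any confirmation means the earlier
            # one never produced the callback message.
            if pid in pending:
                missing.append(pending[pid])
            pending[pid] = line.rstrip("\n")
        elif READY_RE.search(line):
            pending.pop(pid, None)
    # Anything still pending at end of log was never confirmed.
    missing.extend(pending.values())
    return missing
```

Passing it an open file object for the mysqld error log should work, since it only iterates over lines. Note that this only detects the symptom in the log; it says nothing about why gcs_recv_thread skips the callback.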