Comment 10 for bug 1373796

David Bennett (dbpercona) wrote :

More research: there is a very subtle difference between the logs of the VM that exhibits the stuck Joining problem and those of the VM that syncs successfully:

-----
Good: Joining becomes Syncing
-----
2014-10-16 03:15:56 3437 [Note] WSREP: 0.0 (192.168.122.166): State transfer from 1.0 (192.168.122.184) complete.
2014-10-16 03:15:56 3437 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1)
2014-10-16 03:15:56 3437 [Note] WSREP: Member 0.0 (192.168.122.166) synced with group.
2014-10-16 03:15:56 3437 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1)
2014-10-16 03:15:56 3437 [Note] WSREP: Synchronized with group, ready for connections

-----
Bad: Stuck in Joining
-----
2014-10-16 07:41:27 13120 [Note] WSREP: 1.0 (10.177.192.249): State transfer from 0.0 (10.177.193.113) complete.
2014-10-16 07:41:27 13120 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1184793)
2014-10-16 07:41:27 13120 [Note] WSREP: Member 1.0 (10.177.192.249) synced with group.
2014-10-16 07:41:27 13120 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1184793)
-----

The last line of the good log, 'Synchronized with group, ready for connections', comes from the application callback sql/wsrep_mysqld.cc:wsrep_synced_cb(void* app_ctx).

In the case of the stuck sync, Galera is setting its state from JOINED to SYNCED but is not issuing the callback to the application.
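
To check other nodes' error logs for the same symptom, something like the following rough sketch might help. It only looks for a 'Shifting JOINED -> SYNCED' line that is not followed shortly afterwards by the 'Synchronized with group, ready for connections' message. The function name and the window of 5 lines are my own assumptions, not anything from Galera or the server; on a busy node other messages may interleave, so the window may need adjusting.

```python
# Substrings taken from the log excerpts above.
SHIFT_MSG = "WSREP: Shifting JOINED -> SYNCED"
READY_MSG = "WSREP: Synchronized with group, ready for connections"


def find_missing_synced_callbacks(lines, window=5):
    """Return 0-based indexes of 'JOINED -> SYNCED' lines that are not
    followed by the 'ready for connections' message within `window` lines.

    `window` is an arbitrary assumption; tune it for noisy logs.
    """
    lines = list(lines)
    missing = []
    for i, line in enumerate(lines):
        if SHIFT_MSG in line:
            following = lines[i + 1 : i + 1 + window]
            if not any(READY_MSG in nxt for nxt in following):
                missing.append(i)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_synced.py /path/to/error.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for idx in find_missing_synced_callbacks(fh):
            print(f"line {idx + 1}: SYNCED without ready message")
```

An empty result does not prove the node is healthy; it only means this particular pattern was not found in the log.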

I have traced this to the Galera method gcs/src/gcs.cpp:gcs_recv_thread(void *arg).

At this point: https://github.com/percona/galera/blob/3.x/gcs/src/gcs.cpp#L1218

The state change is being recorded by the Galera engine. I suspect that something in the gcs_act_rcvd structure is preventing the subsequent callback from occurring in certain circumstances.