Bug #1190787 “Issue with signals and wsrep_sst_* scripts” : Bugs : Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-06-13

Changed in percona-xtradb-cluster:
milestone:	none → 5.5.31-25

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-06-14:

#1

To add, what happens is as follows:

mysqld on the joiner spawns wsrep_sst_xtrabackup, which in turn spawns
netcat, xbcrypt and xbstream.

Now when xbcrypt dies, the whole setup hangs.

This may be a different bug altogether though.

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-06-14:

#2

I see the issue mentioned in the bug description on the donor: when the joiner hangs, if I SIGQUIT the donor, it quits with:

=================
130614 6:22:49 InnoDB: Shutdown completed; log sequence number 9618707
130614 6:22:49 [ERROR] Plugin 'InnoDB' has ref_count=1 after shutdown.
130614 6:22:49 [Note] /pxc/bin/mysqld: Shutdown complete

Error in my_thread_global_end(): 1 threads didn't exit
==================

It leaves behind:

mysql 2453 0.0 0.2 147764 17832 pts/21 S 06:17 0:00 perl /usr/bin/innobackupex --galera-info --stream=xbstream --defaults-file=/pxc/etc/my.cnf.local --socket=/pxc/datadir/pxc.sock --user=root --password=test --encrypt=AES256 --encrypt-key=6F3AD9F428143F133FD7D50D77D91EA4 /tmp
mysql 2464 0.0 0.0 269048 5764 pts/21 Sl 06:17 0:00 xtrabackup_55 --defaults-file=/pxc/etc/my.cnf.local --defaults-group=mysqld --backup --suspend-at-end --target-dir=/tmp --tmpdir=/tmp --encrypt=AES256 --encrypt-key=6F3AD9F428143F133FD7D50D77D91EA4 --encrypt-threads=1 --stream=xbstream

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-06-14:

#3

After kill -9 on innobackupex, it displays

WSREP_SST: [ERROR] innobackupex finished with error: 137. Check /pxc/datadir//innobackup.backup.log (20130614 06:25:37.905)

on the terminal, along with this:

xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 20971520
xtrabackup: using O_DIRECT
130614 6:17:57 InnoDB: Warning: allocated tablespace 10, old maximum was 0
>> log scanned up to (9618553)
[01] Encrypting and streaming ./ibdata1
^Gxtrabackup_55: Error writing file 'UNOPENED' (Errcode: 32)
encrypt: write to the destination file failed.
xb_stream_write_data() failed.
>> log scanned up to (9618553)
>> log scanned up to (9618553)
>> log scanned up to (9618553)
>> log scanned up to (9618553)
>> log scanned up to (9618553)
>> log scanned up to (9618553)
>> log scanned up to (9618553)
>> log scanned up to (9618553)
>> log scanned up to (9618553)

This probably is an xtrabackup bug.

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-08-06

Changed in percona-xtradb-cluster:
importance:	Undecided → Low

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-08-28

Changed in percona-xtradb-cluster:
milestone:	5.5.33-23.7.6 → future-5.5

Anthony Somerset (anthonysomerset) wrote on 2013-09-26:

#4

I can confirm similar issues. For me, downgrading percona-xtrabackup to percona-xtrabackup-20 on Debian wheezy solved it, and SST now works as it should.

Amol (ajkedar) wrote on 2013-09-30:

#5

Hi, this bug affects us in production and we would like to know when it will be fixed.
Or is it already fixed in 5.5.33?

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-10-22:

#6

@Anthony,

If downgrading xtrabackup helped, then it may be a different issue, since this bug is not related to xtrabackup per se. Please report it separately with logs. Also, which version of xtrabackup did you downgrade from?

@Amol,

Again, we need details on this. In general, this should affect you only if SST gets stuck due to a bug in SST/xtrabackup.

PXB also has the 'xtrabackup alive after innobackupex death' bug fixed now.

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-10-22:

#7

The 'bug fixed' mentioned above is https://bugs.launchpad.net/percona-xtrabackup/+bug/1135441, which was fixed and released in 2.1.5.

Przemek (pmalkowski) wrote on 2014-09-08:

#8

I can confirm this still happens in the latest PXC versions. For example, when we forget to allow the SST TCP port (4444), the joiner does not clean up its processes after a failed SST attempt.

percona33 mysql> select @@version,@@version_comment;
+--------------------+---------------------------------------------------------------------------------------------------+
| @@version          | @@version_comment                                                                                 |
+--------------------+---------------------------------------------------------------------------------------------------+
| 5.6.20-68.0-56-log | Percona XtraDB Cluster (GPL), Release rel68.0, Revision 888, WSREP version 25.7, wsrep_25.7.r4126 |
+--------------------+---------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

[root@percona33 ~]# iptables -I INPUT -p tcp --dport 4444 -j REJECT

[root@percona33 ~]# service mysql stop
Shutting down MySQL (Percona XtraDB Cluster).... SUCCESS! 
[root@percona33 ~]# rm -f /var/lib/mysql/grastate.dat

[root@percona33 ~]# service mysql start
Starting MySQL (Percona XtraDB Cluster)...State transfer in progress, setting sleep higher
. ERROR! The server quit without updating PID file (/var/lib/mysql/percona33.pid).
 ERROR! MySQL (Percona XtraDB Cluster) server startup failed!

-- in the error log on joiner:
2014-09-08 11:01:09 16425 [Note] WSREP: New cluster view: global state: c3b203a1-3435-11e4-aa44-9605577e3230:0, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3
2014-09-08 11:01:09 16425 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-09-08 11:01:09 16425 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.4.40' --auth 'root:' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '16425' --binlog 'percona33-bin' '
WSREP_SST: [INFO] Streaming with xbstream (20140908 11:01:09.931)
WSREP_SST: [INFO] Using socat as streamer (20140908 11:01:09.933)
WSREP_SST: [INFO] Evaluating timeout 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140908 11:01:10.679)
2014-09-08 11:01:10 16425 [Note] WSREP: Prepared SST request: xtrabackup-v2|192.168.4.40:4444/xtrabackup_sst
2014-09-08 11:01:10 16425 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-08 11:01:10 16425 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-08 11:01:10 16425 [Note] WSREP: Service thread queue flushed.
2014-09-08 11:01:10 16425 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2014-09-08 11:01:10 16425 [Note] WSREP: Service thread queue flushed.
2014-09-08 11:01:10 16425 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (c3b203a1-3435-11e4-aa44-9605577e3230): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable.
2014-09-08 11:01:10 16425 [Note] WSREP: Member 0.0 (percona33) requested state transfer from '*any*'. Selected 1.0 (percona22)(SYNCED) as donor.
2014-09-08 11:01:10 16425 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 0)
2014-09-08 11:01:10 16425 [Note] WSREP: Requesting state transfer: success, donor: 1
2014-09-08 11:01:11 16425 [Warning] WSREP: 1.0 (percona22): State transfer to 0.0 (percona33) failed: -32 (Broken pipe)
2014-09-08 11:01:11 16425 [ERROR] WSREP: gcs/src/gcs_group.cpp:int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():722: Will never receive state. Need to abort.
2014-09-08 11:01:11 16425 [Note] WSREP: gcomm: terminating thread
2014-09-08 11:01:11 16425 [Note] WSREP: gcomm: joining thread
2014-09-08 11:01:11 16425 [Note] WSREP: gcomm: closing backend
2014-09-08 11:01:11 16425 [Note] WSREP: gcomm: closed
2014-09-08 11:01:11 16425 [Note] WSREP: /usr/sbin/mysqld: Terminated.
140908 11:01:11 mysqld_safe mysqld from pid file /var/lib/mysql/percona33.pid ended

[root@percona33 ~]# ps fax|tail -4
16436 pts/1    S      0:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 192.168.4.40 --auth root: --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --parent 16425 --binlog percona33-bin
16651 pts/1    S      0:00  \_ timeout 100 socat -u TCP-LISTEN:4444,reuseaddr stdio
16653 pts/1    S      0:00  |   \_ socat -u TCP-LISTEN:4444,reuseaddr stdio
16652 pts/1    S      0:00  \_ xbstream -x

[root@percona33 ~]# kill 16436

[root@percona33 ~]# ps fax|tail -4
16436 pts/1    S      0:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role joiner --address 192.168.4.40 --auth root: --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --parent 16425 --binlog percona33-bin
16651 pts/1    S      0:00  \_ timeout 100 socat -u TCP-LISTEN:4444,reuseaddr stdio
16653 pts/1    S      0:00  |   \_ socat -u TCP-LISTEN:4444,reuseaddr stdio
16652 pts/1    S      0:00  \_ xbstream -x

-- Also, if we allow the TCP port and restart, it fails again because the port is already in use:

[root@percona33 ~]# iptables -F
[root@percona33 ~]# service mysql restart
Shutting down MySQL (Percona XtraDB Cluster) ERROR! MySQL (Percona XtraDB Cluster) PID file could not be found!
 ERROR! MySQL (Percona XtraDB Cluster) is not running, but lock file (/var/lock/subsys/mysql) exists
Stale sst_in_progress file in datadir
Starting MySQL (Percona XtraDB Cluster)State transfer in progress, setting sleep higher
.. ERROR! The server quit without updating PID file (/var/lib/mysql/percona33.pid).
 ERROR! MySQL (Percona XtraDB Cluster) server startup failed!
 ERROR! Failed to restart server.

2014-09-08 11:05:39 17839 [Note] WSREP: Requesting state transfer: success, donor: 1
WSREP_SST: [INFO] Evaluating timeout 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140908 11:05:39.062)
2014/09/08 11:05:39 socat[18063] E bind(3, {AF=2 0.0.0.0:4444}, 16): Address already in use
WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 1 0 (20140908 11:05:39.070)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20140908 11:05:39.072)
WSREP_SST: [INFO] Removing the sst_in_progress file (20140908 11:05:39.074)
2014-09-08 11:05:39 17839 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.4.40' --auth 'root:' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '17839' --binlog 'percona33-bin' : 32 (Broken pipe)
2014-09-08 11:05:39 17839 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2014-09-08 11:05:39 17839 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2014-09-08 11:05:39 17839 [ERROR] Aborting

-- Only kill -9 works for cleaning stalled SST processes.
-- The same issue applies to PXC 5.5.39.
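
One general shell technique that bears on the `kill 16436` step above, offered as a sketch under stated assumptions rather than as the fix applied to the SST scripts: a shell runs a trap handler only after the foreground command it is waiting on has finished, but a trapped signal interrupts the `wait` builtin immediately. Running the long-lived command in the background and waiting on it lets a SIGTERM handler run at once. The function name `run_killable` is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch; run_killable is an illustrative name, not a function
# from the wsrep_sst_* scripts.
# run_killable CMD [ARGS...]: runs CMD in the background and waits for it.
# A trapped SIGTERM interrupts the wait builtin, so the handler runs at once
# rather than after CMD has finished.
run_killable() {
    child=
    # On TERM: forward the signal to the child (if it was started), then
    # exit the current shell with 143 (128 + SIGTERM).
    trap '[ -n "$child" ] && kill "$child" 2>/dev/null; exit 143' TERM
    "$@" &
    child=$!
    wait "$child"
}
```

For example, `run_killable timeout 100 socat -u TCP-LISTEN:4444,reuseaddr stdio` would stop as soon as the script receives SIGTERM. The handler signals only the direct child; terminating a whole pipeline would need a process-group kill, which this sketch does not attempt.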

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2015-01-31:

#9

This has been addressed in recent fixes; marking as Fix Committed.

Alex Yurchenko (ayurchen) wrote on 2015-02-01:

#10

As has been noted, posix_spawn does not seem to provide a way to pass a signal from parent to child. The alternative fix in https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1382797 seems to be non-standard and Linux-specific.

This leaves us with the SST script having to watch the parent process status via the supplied PID and take appropriate action when the parent dies.
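
A minimal sketch of that idea, assuming the script and its parent run as the same user (so `kill -0` is permitted) and that polling is acceptable. The function names `parent_alive` and `watch_parent` are hypothetical and are not taken from the wsrep_sst_* scripts; the PID would be the one passed via `--parent`.

```shell
#!/bin/sh
# Hypothetical helpers; not part of any wsrep_sst_* script.

# parent_alive PID: succeeds while PID exists and we are allowed to signal it.
parent_alive() {
    # kill -0 delivers no signal; it only performs the existence and
    # permission check.
    kill -0 "$1" 2>/dev/null
}

# watch_parent PID INTERVAL CMD [ARGS...]: polls every INTERVAL seconds until
# PID is gone, then runs CMD once (for example, a cleanup routine).
watch_parent() {
    watched_pid=$1
    interval=$2
    shift 2
    while parent_alive "$watched_pid"; do
        sleep "$interval"
    done
    "$@"
}
```

A script could start `watch_parent "$parent_pid" 1 cleanup_fn &` early on, where `parent_pid` and `cleanup_fn` are placeholders for the script's own variable and cleanup routine. Polling by PID cannot tell a recycled PID from the original parent, so the interval should be short.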

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2015-02-01:

#11

@Alex,

It is possible for us to add an ifdef for Linux in the fix for lp:1382797, so that it works as expected on Linux and as before elsewhere.

Alexey Kopytov (akopytov) wrote on 2015-02-01:

#12

Shouldn't this bug be marked as a duplicate of bug #1382797?

Alex Yurchenko (ayurchen) wrote on 2015-02-01:

#13

Raghu,

I think it is a question first for Alexey (akopytov). But as far as I'm concerned:
1) as I understand it, that will be a very big ifdef
2) this will make the requirements for SST scripts differ between Linux and other platforms.
The latter consideration kind of kills the idea for me.

And yes, this and lp:1382797 must be duplicates.

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2015-02-02:

#14

Marked as duplicate as requested.

But yes, even if we add the ifdef, some bits need to be in the non-Linux code as well, setsid being one of them.

Affects	Status	Importance	Assigned to	Milestone
MySQL patches by Codership	New	Undecided	Unassigned
  5.5	Won't Fix	Undecided	Unassigned
  5.6	Won't Fix	Undecided	Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
  5.5	Confirmed	Low	Unassigned	future-5.5
  5.6	Fix Committed	Undecided	Unassigned	future-5.6
