rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing /etc/hosts entry)

Bug #1592200 reported by Andreas Hasenack
34
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Landscape Server
New
Undecided
Unassigned
16.06
Fix Released
High
Andreas Hasenack
rabbitmq-server (Juju Charms Collection)
Fix Released
High
James Page

Bug Description

cs:trusty/rabbitmq-server-47
landscape reference: https://ci.lscape.net/job/landscape-system-tests/2762/

We had several autopilot runs fail with a juju run of /bin/true timeout (1h is our timeout). We do this run right after we related the hacluster services, and before we relate all the principals.

Debugging shows that in one of the 3 rabbit units "rabbitmq_ctl stop" right after the /etc/hosts file was changed didn't really work, and the subsequent start then just doesn't work.

This is the story we were able to reconstruct from the logs (all attached):

rabbitmq-server/1 stuck:

      rabbitmq-server/0:
        agent-state: started
        agent-status: {current: idle, since: '12 Jun 2016 12:27:06Z', version: 1.25.5}
        agent-version: 1.25.5
        machine: 5/lxc/2
        open-ports: [5672/tcp]
        public-address: 10.96.4.112
        workload-status: {current: active, message: Unit is ready and clustered, since: '12
            Jun 2016 12:27:04Z'}
      rabbitmq-server/1:
        agent-state: started
        agent-status: {current: executing, message: running cluster-relation-changed
            hook, since: '12 Jun 2016 11:07:29Z', version: 1.25.5}
        agent-version: 1.25.5
        machine: 2/lxc/0
        open-ports: [5672/tcp]
        public-address: 10.96.8.38
        workload-status: {current: maintenance, message: Starting rabbitmq applilcation,
          since: '12 Jun 2016 11:09:23Z'}
      rabbitmq-server/2:
        agent-state: started
        agent-status: {current: idle, since: '12 Jun 2016 12:25:06Z', version: 1.25.5}
        agent-version: 1.25.5
        machine: 1/lxc/0
        open-ports: [5672/tcp]
        public-address: 10.96.3.15
        workload-status: {current: active, message: Unit is ready and clustered, since: '12
            Jun 2016 12:25:04Z'}

rabbitmq on that unit confirmed started at 17:07:36:
2016-06-12 11:07:29 INFO cluster-relation-changed * Stopping message broker rabbitmq-server
2016-06-12 11:07:32 INFO cluster-relation-changed ...done.
2016-06-12 11:07:33 INFO cluster-relation-changed * Restarting message broker rabbitmq-server
2016-06-12 11:07:35 INFO cluster-relation-changed ...done.
2016-06-12 11:07:35 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'wait', '/var/run/rabbitmq/pid']
2016-06-12 11:07:35 INFO cluster-relation-changed Waiting for 'rabbit@juju-machine-2-lxc-0' ...
2016-06-12 11:07:35 INFO cluster-relation-changed pid is 11028 ...
2016-06-12 11:07:36 INFO cluster-relation-changed ...done.

hosts file updated at 11:07:37:
2016-06-12 11:07:37 INFO juju-log cluster:4: Updating hosts file with: {u'10.96.4.112': u'juju-machine-5-lxc-2'} (current: ['127.0.0.1 localhost\n', '\n', '# The following lines are desirable for IPv6 capable hosts\n', '::1 ip6-localhost ip6-loopback\n', 'fe00::0 ip6-localnet\n', 'ff00::0 ip6-mcastprefix\n', 'ff02::1 ip6-allnodes\n', 'ff02::2 ip6-allrouters\n', 'ff02::3 ip6-allhosts\n'])

Here it gets interesting.

2016-06-12 11:08:48 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'stop_app']
2016-06-12 11:08:49 INFO cluster-relation-changed Stopping node 'rabbit@juju-machine-2-lxc-0' ...
2016-06-12 11:08:57 INFO cluster-relation-changed ...done.
2016-06-12 11:09:23 DEBUG juju-log cluster:4: Running ['/usr/sbin/rabbitmqctl', 'start_app']
2016-06-12 11:09:23 INFO cluster-relation-changed Starting node 'rabbit@juju-machine-2-lxc-0' ...

Comparing the timestamps with the actual rabbit logs, we can see that when start was issued at 11:09:23, rabbit was already failing hard, probably as a result of the stop action:
=INFO REPORT==== 12-Jun-2016::11:07:35 ===
Server startup complete; 0 plugins started.

=INFO REPORT==== 12-Jun-2016::11:08:49 ===
Stopping RabbitMQ

=INFO REPORT==== 12-Jun-2016::11:08:49 ===
stopped TCP Listener on [::]:5672

=WARNING REPORT==== 12-Jun-2016::11:09:11 ===
global: 'rabbit@juju-machine-2-lxc-0' failed to connect to 'rabbit@juju-machine-1-lxc-0'

=INFO REPORT==== 12-Jun-2016::11:09:13 ===
Clustering with ['rabbit@juju-machine-5-lxc-2'] as disc node

=ERROR REPORT==== 12-Jun-2016::11:09:13 ===
Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0']

=ERROR REPORT==== 12-Jun-2016::11:09:13 ===
Mnesia('rabbit@juju-machine-2-lxc-0'): ** ERROR ** Mnesia on 'rabbit@juju-machine-2-lxc-0' could not connect to node(s) ['rabbit@juju-machine-1-lxc-0']
(... repeats about half a million times ...)
$ grep "could not connect to node" <email address hidden> |wc -l
438068

ps output of 2/lxc/0 (rabbitmq-server/1, also attached, don't worry about formatting here in the bug). You can see the start attempt, and the already-running rabbit, as well as the stuck juju run:

root 1012 0.0 0.2 478564 40384 ? Ssl 11:03 0:00 /var/lib/juju/tools/unit-rabbitmq-server-1/jujud unit --data-dir /var/lib/juju --unit-name rabbitmq-server/1 --debug
root 11373 0.0 0.1 92432 23328 ? S 11:07 0:01 \_ /usr/bin/python /var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/cluster-relation-changed
root 11737 0.0 0.0 4448 1688 ? S 11:09 0:00 \_ /bin/sh /usr/sbin/rabbitmqctl start_app
root 11745 0.0 0.0 61680 3280 ? S 11:09 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app"
rabbitmq 11746 0.0 0.0 4448 684 ? Ss 11:09 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app"
rabbitmq 11747 0.0 0.2 457144 48444 ? Sl 11:09 0:00 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -hidden -sname rabbitmqctl11747 -boot start_clean -s rabbit_control_main -nodename rabbit@juju-machine-2-lxc-0 -extra start_app
rabbitmq 11787 0.0 0.0 7464 972 ? Ss 11:09 0:00 \_ inet_gethost 4
rabbitmq 11788 0.0 0.0 13784 1744 ? S 11:09 0:00 \_ inet_gethost 4
rabbitmq 7984 0.0 0.0 7500 1452 ? S 11:06 0:00 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
rabbitmq 11009 0.0 0.0 4448 748 ? S 11:07 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 11028 17.8 0.4 2258272 71584 ? Sl 11:07 12:02 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../ebin -noshell -noinput -s rabbit boot -sname rabbit@juju-machine-2-lxc-0 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.2.4/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@juju-machine-2-lxc-0-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@juju-machine-2-lxc-0"
rabbitmq 11173 1.4 0.0 7464 880 ? Ss 11:07 0:58 \_ inet_gethost 4
rabbitmq 11174 4.0 0.0 13784 1920 ? R 11:07 2:45 \_ inet_gethost 4
ubuntu 11919 0.0 0.0 11116 2664 ? Ss 11:12 0:00 /bin/bash -s
ubuntu 11920 0.0 0.2 255080 35404 ? Sl 11:12 0:00 \_ juju-run --no-context /bin/true

Tags: landscape
Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The process listing of each unit can be found in var/log/ps-fauxww.txt

Revision history for this message
Andreas Hasenack (ahasenack) wrote :
description: updated
tags: removed: kanban-cross-team
tags: added: cdo-qa-blocker
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We downgraded the rabbitmq-server charm to cs:trusty/rabbitmq-server-45 and haven't seen this failure anymore in several runs.

David Britton (dpb)
no longer affects: landscape/16.05
James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
assignee: nobody → James Page (james-page)
status: New → In Progress
milestone: none → 16.07
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote :

cs:~james-page/xenial/rabbitmq-server-bug1590085

for testing...

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Hm, do you have a trusty version? We hit this doing mitaka deploys on trusty.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/338865
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=0ded449e8d32428baf7992cfc6999f82e9996c1a
Submitter: Jenkins
Branch: master

commit 0ded449e8d32428baf7992cfc6999f82e9996c1a
Author: James Page <email address hidden>
Date: Thu Jul 7 11:48:52 2016 +0100

    Ensure rabbit hosts are resolvable

    The update_hosts_file function requires a full map of entries to
    write into /etc/hosts; the previous use of this function tried
    to write individual host entries only.

    Ensure that a full map of self + peers is written at appropriate
    points in a units lifecycle to ensure self and peer hostname
    resolvability.

    (also fixed a typo in a status message)

    Change-Id: If40b31a55641861a1e03985f318e012b545fcc25
    Closes-Bug: 1592200
    Closes-Bug: 1590085

Changed in rabbitmq-server (Juju Charms Collection):
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/16.04)

Fix proposed to branch: stable/16.04
Review: https://review.openstack.org/339597

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/16.04)

Reviewed: https://review.openstack.org/339597
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=c200f4fd13a141f79aef34c426c1e7e2633766ce
Submitter: Jenkins
Branch: stable/16.04

commit c200f4fd13a141f79aef34c426c1e7e2633766ce
Author: James Page <email address hidden>
Date: Thu Jul 7 11:48:52 2016 +0100

    Ensure rabbit hosts are resolvable

    The update_hosts_file function requires a full map of entries to
    write into /etc/hosts; the previous use of this function tried
    to write individual host entries only.

    Ensure that a full map of self + peers is written at appropriate
    points in a units lifecycle to ensure self and peer hostname
    resolvability.

    (also fixed a typo in a status message)

    Change-Id: If40b31a55641861a1e03985f318e012b545fcc25
    Closes-Bug: 1592200
    Closes-Bug: 1590085
    (cherry picked from commit 0ded449e8d32428baf7992cfc6999f82e9996c1a)

James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
status: Fix Committed → Fix Released
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Re: rabbitmq_ctl stop didn't work and subsequent start gets stuck
Download full text (3.3 KiB)

I still saw this with cs:trusty/rabbitmq-server-48

juju status was showing this about rabbit:
rabbitmq-server/0 active idle 1.25.5 2/lxc/0 5672/tcp 10.96.6.63 Unit is ready and clustered
rabbitmq-server/1 maintenance executing 1.25.5 1/lxc/2 5672/tcp 10.96.8.95 Starting RabbitMQ
rabbitmq-server/2 active idle 1.25.5 5/lxc/0 5672/tcp 10.96.9.186 Unit is ready and clustered

and the logs on rabbitmq-server/1 were full of:
Mnesia('rabbit@juju-machine-1-lxc-2'): ** ERROR ** Mnesia on 'rabbit@juju-machine-1-lxc-2' could not connect to node(s) ['rabbit@juju-machine-5-lxc-0']

/etc/hosts on rabbitmq-server/1 had:
# cat /etc/hosts
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
10.96.8.95 juju-machine-1-lxc-2
10.96.6.63 juju-machine-2-lxc-0

Process listing was:
root 922 0.0 0.1 511780 22576 ? Ssl 16:07 0:00 /var/lib/juju/tools/machine-1-lxc-2/jujud machine --data-dir /var/lib/juju --machine-id 1/lxc/2 --debug
root 1025 0.0 0.1 487008 22728 ? Ssl 16:20 0:00 /var/lib/juju/tools/unit-rabbitmq-server-1/jujud unit --data-dir /var/lib/juju --unit-name rabbitmq-server/1 --debug
root 10406 0.0 0.1 92464 20136 ? S 16:26 0:02 \_ /usr/bin/python /var/lib/juju/agents/unit-rabbitmq-server-1/charm/hooks/cluster-relation-changed
root 11222 0.0 0.0 4440 668 ? S 16:27 0:00 \_ /bin/sh /usr/sbin/rabbitmqctl start_app
root 11230 0.0 0.0 61672 1676 ? S 16:27 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app"
rabbitmq 11231 0.0 0.0 4440 644 ? Ss 16:27 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "start_app"
rabbitmq 11232 0.0 0.1 370876 24084 ? Sl 16:27 0:00 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitm
rabbitmq 11264 0.0 0.0 7456 428 ? Ss 16:27 0:00 \_ inet_gethost 4
rabbitmq 11265 0.0 0.0 9548 632 ? S 16:27 0:00 \_ inet_gethost 4
rabbitmq 8002 0.0 0.0 7492 320 ? S 16:24 0:00 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
rabbitmq 10803 0.0 0.0 4440 648 ? S 16:26 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 10834 19.2 0.3 2174496 55720 ? Sl 16:26 9:16 \_ /usr/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /us
rabbitmq 10943 1.6 0.0 7456 428 ? Ss 16:26 0:46 \_ inet_gethost 4
rabbitmq 10944 5.2 0.0 13776 784 ? S 16:26 ...

Read more...

Changed in rabbitmq-server (Juju Charms Collection):
status: Fix Released → Confirmed
Liam Young (gnuoy)
Changed in rabbitmq-server (Juju Charms Collection):
milestone: 16.07 → 16.10
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Something odd, in that the hosts file only has records for 2 of the 3 units:

10.96.8.95 juju-machine-1-lxc-2
10.96.6.63 juju-machine-2-lxc-0

Notably missing:
juju-machine-5-lxc-0

Ursula Junque (ursinha)
summary: - rabbitmq_ctl stop didn't work and subsequent start gets stuck
+ rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing
+ /etc/hosts/ entry)
summary: rabbitmq_ctl stop didn't work and subsequent start gets stuck (missing
- /etc/hosts/ entry)
+ /etc/hosts entry)
James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
status: Confirmed → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/373095

Changed in rabbitmq-server (Juju Charms Collection):
status: Triaged → In Progress
Revision history for this message
James Page (james-page) wrote :

I've pushed the WIP to:

  cs:~james-page/rabbitmq-server-bug1592200

it would be great if the landscape team could use that charm for a few test runs - I'd like to get confidence that the change is good before we backport to stable as well.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Done, we should have some data by tomorrow.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Since the change to cs:~james-page/trusty/rabbitmq-server-bug1592200-0 for rabbit only, we have had 10 consecutive successes so far with the -next charms. Before we would have at most 3 consecutive green runs before a failure because of this bug. We are not exercising the stable charms + this branch, though.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/373095
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=af00c66eba0f94ebd40f4287785a26a5a0ec469b
Submitter: Jenkins
Branch: master

commit af00c66eba0f94ebd40f4287785a26a5a0ec469b
Author: James Page <email address hidden>
Date: Tue Sep 20 10:26:04 2016 +0100

    Revert changes relating to management of /etc/hosts

    Switching to active management by the charm of /etc/hosts
    resulted in a number of race conditions during cluster
    formation as a nasty side effect.

    Revert changes to the charm related to this switch, as charm
    versions prior to these changes are far more reliable to
    deploy.

    Further work is required to support MAAS 2.0 with Juju 2.0.

    This reverts commit 0ded449e8d32428baf7992cfc6999f82e9996c1a.
    This reverts commit 7ffdcd204e6550cf5dcad57ed07ea302c1eedcce.

    Closes-Bug: 1592200

    Change-Id: Ib2d6714ac1c4b32542cea02c24f75d7ffbde8282

Changed in rabbitmq-server (Juju Charms Collection):
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/16.07)

Fix proposed to branch: stable/16.07
Review: https://review.openstack.org/375460

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/16.07)

Reviewed: https://review.openstack.org/375460
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=daa4214acf1d967bf0bb58f6a79e21bad58369da
Submitter: Jenkins
Branch: stable/16.07

commit daa4214acf1d967bf0bb58f6a79e21bad58369da
Author: Ryan Beisner <email address hidden>
Date: Thu Sep 8 22:09:52 2016 +0000

    Revert changes relating to management of /etc/hosts

    Switching to active management by the charm of /etc/hosts
    resulted in a number of race conditions during cluster
    formation as a nasty side effect.

    Revert changes to the charm related to this switch, as charm
    versions prior to these changes are far more reliable to
    deploy.

    Further work is required to support MAAS 2.0 with Juju 2.0.

    This reverts commit 0ded449e8d32428baf7992cfc6999f82e9996c1a.
    This reverts commit 7ffdcd204e6550cf5dcad57ed07ea302c1eedcce.

    Closes-Bug: 1592200

    Change-Id: Ib2d6714ac1c4b32542cea02c24f75d7ffbde8282
    (cherry picked from commit af00c66eba0f94ebd40f4287785a26a5a0ec469b)

James Page (james-page)
Changed in rabbitmq-server (Juju Charms Collection):
status: Fix Committed → Fix Released
Chris Gregan (cgregan)
tags: removed: cdo-qa-blocker
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.