rabbitmq-server takes a long time to process amqp-relation-changed

Bug #1660000 reported by Francis Ginther
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Autopilot Log Analyser
Fix Committed
High
Francis Ginther
Landscape Server
New
Undecided
Francis Ginther
rabbitmq-server (Juju Charms Collection)
Invalid
High
David Ames

Bug Description

I'm seeing this issue with the current next charm:
cs:~openstack-charmers-next/xenial/rabbitmq-server-247

Deployed by landscape autopilot (17.01~bzr10851+jenkins3406-2) using all -next charms for the entire deployment. This was seen in CI several times:
https://ci.lscape.net/job/landscape-system-tests-costack/641
https://ci.lscape.net/job/landscape-system-tests-costack/640
https://ci.lscape.net/job/landscape-system-tests-costack/639
https://ci.lscape.net/job/landscape-system-tests-costack/652

The attached logs are extracted from build 641.

It appears that the rabbitmq charm is taking a long time to process amqp-relation-joined hooks. According to the 'ps' output captured in the dump the most recent hook started at 23:45 and was still running when the output was recorded at 23:56:

[landscape-0-inner-logs/rabbitmq-server-0/var/log/ps-fauxww.txt]
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
ubuntu 10561 0.0 0.0 36232 3372 ? R 23:56 0:00 \_ ps fauxww
...
root 9567 0.0 0.0 19700 3264 ? Ss 22:47 0:00 bash /var/lib/juju/init/jujud-unit-rabbitmq-server-
0/exec-start.sh
root 9571 0.0 0.3 301872 58020 ? Sl 22:47 0:03 \_ /var/lib/juju/tools/unit-rabbitmq-server-0/juju
d unit --data-dir /var/lib/juju --unit-name rabbitmq-server/0 --debug
root 4421 1.7 0.3 132120 63264 ? S 23:45 0:10 \_ /usr/bin/python /var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/amqp-relation-changed
root 10398 0.0 0.0 4508 1672 ? S 23:56 0:00 \_ /bin/sh /usr/sbin/rabbitmqctl list_vhosts
root 10406 0.0 0.0 49344 3160 ? S 23:56 0:00 \_ su rabbitmq -s /bin/sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "list_vhosts"
rabbitmq 10415 0.0 0.0 4508 700 ? Ss 23:56 0:00 \_ sh -c /usr/lib/rabbitmq/bin/rabbitmqctl "list_vhosts"
rabbitmq 10417 0.0 0.1 377080 17612 ? Sl 23:56 0:00 \_ /usr/lib/erlang/erts-7.3/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.7/sbin/../ebin -noshell -noinput -hidden -boot start_clean -sasl errlog_type error -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@juju-ce6020-4-lxd-3" -s rabbit_control_main -nodename rabbit@juju-ce6020-4-lxd-3 -extra list_vhosts

The juju unit log for rabbitmq also show consistent processing of amqp relation hooks up until the dump is collected. And in this deployment, nova-cloud-compute is still waiting on it's messaging relation to be completed.

The autopilot deployment failed because the landscape-client subordinate running on the same unit as rabbitmq-server/0 never got a chance to complete and communicate back to landscape. It appears that juju was only ever running rabbitmq-server hooks (note: juju hooks running on a unit are serialized, so it's possible that the landscape-client hooks are queued up waiting for the rabbitmq-server hooks to finish).

I've attached the logs from the rabbitmq-server and nova-cloud-compute units.

Tags: landscape
Revision history for this message
Francis Ginther (fginther) wrote :
Changed in landscape:
milestone: none → 17.01
Revision history for this message
Francis Ginther (fginther) wrote :

I don't have the exact juju configuration used at the time of the failure, but the only option that landscape appears to be tweaking is "min-cluster-size: 3" and "nagios_context: region1". All the other options should be set to the default.

David Ames (thedac)
Changed in rabbitmq-server (Juju Charms Collection):
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 17.01
Revision history for this message
David Ames (thedac) wrote :

Francis,

The problem we were tring to solve was also a Landscape bug: https://bugs.launchpad.net/charms/+source/rabbitmq-server/+bug/1657245

This commit addresses bug#1657245
https://github.com/openstack/charm-rabbitmq-server/commit/2472e1ca9fbd68cdd50640eadd071177ddc5968e

The approach I took was to hold all amqp-relation executions until the cluster is complete. Once the cluster is complete the leader node runs amqp-relatoin(s) to all the clients and then subsequently the non-leader nodes do the same. This takes some time.

Are other charms failing with this longer set up time or is this a test design issue? Can the landscape-client subordinate charm handle this longer set up time?

Changed in rabbitmq-server (Juju Charms Collection):
status: Triaged → Incomplete
Revision history for this message
David Ames (thedac) wrote :

Francis,

One more thing from the logs on rabbitmq-server/0

The landscape client fails because the charm has not been downloaded to disk:

2017-01-26 22:56:23 ERROR juju.worker.dependency engine.go:547 "metric-collect" manifold worker returned unexpected error: failed to read charm from: /var/lib/juju/agents/unit-landscape-client-32/charm: stat /var/lib/juju/agents/unit-landscape-client-32/charm: no such file or directory

That is unrelated to rabbitmq-server. That seems like a juju bug.

Revision history for this message
Francis Ginther (fginther) wrote :
Download full text (3.3 KiB)

The 'slowness' appears to be related to the login service. When rabbitmq-server starts processing amqp-relation-hook events, rabbitmqctl responds in 1 sec or less:

2017-01-26 23:00:08 DEBUG juju-log amqp:64: Running ['/usr/sbin/rabbitmqctl', 'set_permissions', '-p', u'openstack', u'ceilometer', '.*', '.*', '.*']
2017-01-26 23:00:09 INFO amqp-relation-changed Setting permissions for user "ceilometer" in vhost "openstack" ...

But then, they start taking ~20-30 seconds:

2017-01-26 23:01:37 DEBUG juju-log amqp:48: Running ['/usr/sbin/rabbitmqctl', 'set_permissions', '-p', u'openstack', u'ceilometer', '.*', '.*', '.*']
2017-01-26 23:02:03 INFO amqp-relation-changed Setting permissions for user "ceilometer" in vhost "openstack" ...

At the same time, we start seeing login service activation messages in /var/log/syslog:

Jan 26 23:01:37 juju-ce6020-4-lxd-3 dbus[292]: [system] Activating via systemd: service name='org.freedesktop.login1
' unit='dbus-org.freedesktop.login1.service'
Jan 26 23:01:37 juju-ce6020-4-lxd-3 systemd[1]: Started Login Service.
Jan 26 23:02:03 juju-ce6020-4-lxd-3 dbus[292]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jan 26 23:02:03 juju-ce6020-4-lxd-3 systemd[1]: Started Login Service.
Jan 26 23:02:39 juju-ce6020-4-lxd-3 dbus[292]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jan 26 23:02:39 juju-ce6020-4-lxd-3 systemd[1]: Started Login Service.
Jan 26 23:03:05 juju-ce6020-4-lxd-3 dbus[292]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jan 26 23:03:05 juju-ce6020-4-lxd-3 systemd[1]: Started Login Service.

And it looks like this impairs rabbitmq's ability to switch users (from /var/log/auth.log):

Jan 26 23:01:37 juju-ce6020-4-lxd-3 su[14693]: Successful su for rabbitmq by root
Jan 26 23:01:37 juju-ce6020-4-lxd-3 su[14693]: + ??? root:rabbitmq
Jan 26 23:01:37 juju-ce6020-4-lxd-3 su[14693]: pam_unix(su:session): session opened for user rabbitmq by (uid=0)
Jan 26 23:02:02 juju-ce6020-4-lxd-3 dbus[292]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jan 26 23:02:02 juju-ce6020-4-lxd-3 su[14693]: pam_systemd(su:session): Failed to create session: Failed to activate service 'org.freedesktop.login1': timed out
Jan 26 23:02:03 juju-ce6020-4-lxd-3 su[14693]: pam_unix(su:session): session closed for user rabbitmq
Jan 26 23:02:03 juju-ce6020-4-lxd-3 su[14822]: Successful su for rabbitmq by root
Jan 26 23:02:03 juju-ce6020-4-lxd-3 su[14822]: + ??? root:rabbitmq
Jan 26 23:02:03 juju-ce6020-4-lxd-3 su[14822]: pam_unix(su:session): session opened for user rabbitmq by (uid=0)
Jan 26 23:02:28 juju-ce6020-4-lxd-3 su[14822]: pam_systemd(su:session): Failed to create session: Connection timed out
Jan 26 23:02:28 juju-ce6020-4-lxd-3 dbus[292]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jan 26 23:02:28 juju-ce6020-4-lxd-3 su[14822]: pam_unix(su:session): session closed for user rabbitmq

The login service issues appear to be causing the slowness in the rabbitmq-server charm. A succes...

Read more...

Revision history for this message
Francis Ginther (fginther) wrote :

Found lp:1591411 as a possible root cause. The patched version of dbus, 1.10.6-1ubuntu3.3, is being applied via a package upgrade during cloud-init. However, it appears that dbus is not being restarted, so we are still hitting the original bug. Will continue to monitor for this once that version of the package shows up in the image itself.

Changed in landscape:
assignee: nobody → Francis Ginther (fginther)
Changed in autopilot-log-analyser:
assignee: nobody → Francis Ginther (fginther)
status: New → In Progress
importance: Undecided → High
summary: - rabbitmq-server takes a long time to process amqp-relation-joined
+ rabbitmq-server takes a long time to process amqp-relation-changed
Revision history for this message
Adam Collard (adam-collard) wrote :

@thedac re: comment #4

>The landscape client fails because the charm has not been downloaded to disk:
>
> 2017-01-26 22:56:23 ERROR juju.worker.dependency engine.go:547 "metric-collect" manifold worker returned unexpected error: failed to read charm from: /var/lib/juju/agents/unit-landscape-client-32/charm: stat /var/lib/juju/agents/unit-landscape-client-32/charm: no such file or directory
>
> That is unrelated to rabbitmq-server. That seems like a juju bug.

That's just noise in Juju logs, it's #1656188

Revision history for this message
Adam Collard (adam-collard) wrote :
Changed in autopilot-log-analyser:
status: In Progress → Fix Committed
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
Revision history for this message
Francis Ginther (fginther) wrote :

@thedac, dbus 1.10.6-1ubuntu3.3 has now landed in the xenial lxd images and I no longer see this bug or any of the symptoms. I think this is good to close now.

Changed in rabbitmq-server (Juju Charms Collection):
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.