OpenStack RabbitMQ Server Charm

Upgrade 16.07 -> 16.10 breaks on node name

Bug #1710247 reported by Peter Sabaini on 2017-08-11

This bug affects 2 people

Affects                          Status   Importance  Assigned to  Milestone
OpenStack RabbitMQ Server Charm  Triaged  Medium      Unassigned

Bug Description

When upgrading rabbitmq-server from 16.07 to 16.10, I get an error in wait_app().

Due to the changes introduced in Change-Id: I105eb2684e61a553a52c5a944e8c562945e2a6eb (cf. Bug #1584902) the nodename of a rabbitmq node is expected to equal socket.gethostname().

However, the units' reverse DNS lookup resolves to a different name, and that is the name they are known by in the cluster.

Hostnames and DNS resolution:

$ juju run --unit rabbitmq-server/3 'hostname ; unit-get private-address ; dig +short -x $( unit-get private-address )'
...
juju-machine-1-lxc-14
10.76.12.252
10-76-12-252.maas.

Rabbit nodes are known by the second, MAAS-generated name (the reverse DNS name):

$ u=rabbitmq-server/3;r=cluster; juju run --unit $u "relation-ids $r| xargs -I_@ sh -c 'relation-list -r _@|xargs -I_U sh -c \"relation-get -r _@ - _U |sed s,^,_U:, 2>&1\"'" | grep clustered
rabbitmq-server/4:clustered: 10-76-12-236
rabbitmq-server/5:clustered: 10-76-12-245
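
To make the mismatch concrete, here is a minimal sketch (not the charm's code) that compares the two naming schemes and shows which pid file each one implies. The helper names are hypothetical, and the pid file directory is an assumption taken from the rabbitmqctl wait example in the comments on this bug.

```python
# Assumption: pid files live in this directory, as in the
# "rabbitmqctl wait" example quoted in the comments on this bug.
MNESIA_DIR = "/var/lib/rabbitmq/mnesia"


def short_name(hostname):
    """Return the first DNS label, dropping any trailing dot and domain."""
    return hostname.rstrip(".").split(".")[0]


def pid_file_for(hostname):
    """Hypothetical helper: pid file path implied by a given host name."""
    return "{}/rabbit@{}.pid".format(MNESIA_DIR, short_name(hostname))


def names_disagree(local_hostname, reverse_dns_name):
    """True when the two naming schemes yield different node names."""
    return short_name(local_hostname) != short_name(reverse_dns_name)


if __name__ == "__main__":
    # Values taken from the juju run output above.
    local = "juju-machine-1-lxc-14"
    reverse = "10-76-12-252.maas."
    print(names_disagree(local, reverse))
    print(pid_file_for(local))
    print(pid_file_for(reverse))
```

With the values from rabbitmq-server/3, the two paths differ, so a wait on the hostname-derived path cannot find the pid file written under the MAAS-generated name.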

Because of this, when running upgrade-charm the wait_app function expects the pid file in the wrong place:

Reading package lists...
Waiting for 'rabbit@10-76-12-252' ...
pid is 13134 ...
Error: process_not_running
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/upgrade-charm", line 709, in <module>
    rabbit.assess_status(rabbit.ConfigRenderer(rabbit.CONFIG_FILES))
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 809, in assess_status
    assess_status_func(configs)()
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 833, in _assess_status_func
    services=services(), ports=None)
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1178, in _determine_os_workload_status
    state, message, lambda: charm_func(configs))
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1306, in _ows_check_charm_func
    charm_state, charm_message = charm_func_with_configs()
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1178, in <lambda>
    state, message, lambda: charm_func(configs))
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 744, in assess_cluster_status
    ret = wait_app()
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 361, in wait_app
    raise ex
subprocess.CalledProcessError: Command '['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']' returned non-zero exit status 2
2017-08-11 10:54:39 ERROR juju.worker.uniter.operation runhook.go:107 hook "upgrade-charm" failed: exit status 1

Other functions that depend on the cluster name being equal to socket.gethostname() will likely fail too, e.g. is_leader().

Juju: 1.25.10


Peter Sabaini (peter-sabaini) on 2017-08-14

description: updated

Drew Freiberger (afreiberger) wrote on 2017-08-21:

I found that the charm upgrade would complete (and other hooks that call the status check) after creating a symlink /<email address hidden> to <email address hidden> files on all rabbit units. Obviously not a scalable solution.

How can we best handle this corner case, where there are multiple reverse DNS entries, in a repeatable manner both before and after 16.10? I also checked the 17.02 code, and skipping a revision won't help with this issue. It seems odd to look up the hostname to build a pid filename instead of checking config files or rabbitmqctl command output. For instance, the rabbitmqctl wait <pidfile> command shows "Waiting for 'rabbit@ip-ad-dr-es'" in the log file (and when run manually), as you can see in Peter's log.

# rabbitmqctl wait /var/lib/rabbitmq/mnesia/rabbit\@10-76-13-12.pid
Waiting for 'rabbit@10-76-13-12' ...
pid is 16537 ...
(exit code 0)

From what I can tell following the code:

- in 16.07 wait_app uses get_local_nodename() to determine PID filename
   which in turn calls get_host_ip(unit_get('private-address')) which in turn calls
   get_node_hostname that either uses get_hostname(ip_addr) (coming from
   charmhelpers.contrib.openstack.utils) or falls back to socket.gethostname()
- charmhelpers.contrib.openstack.utils.get_hostname calls
   charmhelpers.contrib.network.ip.get_hostname which in turn either
   runs dns.reversename.from_address(address) or falls back to
   socket.gethostbyaddr(address)[0]
- Note that, per the reference from lp:1710247 to lp:1484902, this is
   intentional for maas2 support.
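
The 16.07 lookup order described in the list above can be sketched roughly as follows. This is an illustration only, not the charm or charmhelpers code: the function name resolve_node_hostname and its injectable parameters are hypothetical, and the dnspython lookup is replaced here by the standard-library socket.gethostbyaddr fallback so the example has no third-party dependency.

```python
import socket


def resolve_node_hostname(ip_addr,
                          reverse_lookup=socket.gethostbyaddr,
                          local_hostname=socket.gethostname):
    """Prefer the reverse-DNS name of ip_addr, else the local host name.

    The two callables are parameters only so the behaviour can be
    exercised without real DNS (an assumption of this sketch).
    """
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        fqdn = reverse_lookup(ip_addr)[0]
    except (socket.herror, socket.gaierror, OSError):
        fqdn = None
    if fqdn:
        # Keep only the short name, e.g. '10-76-12-252.maas.' -> '10-76-12-252'.
        return fqdn.rstrip(".").split(".")[0]
    return local_hostname()
```

Under this ordering a unit with a working reverse DNS entry gets the MAAS-generated name, while code that calls socket.gethostname() directly gets the Juju machine name, which is the disagreement seen in this bug.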

Perhaps upgrade-charm should check the return code when the wait on the hostname-derived pid file fails, use the command output to find the previous pid file name, and then run a name-change routine to reconfigure the server and the cluster relationships.
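
A minimal sketch of the first half of that idea, assuming the "Waiting for '<node>' ..." line format seen in the logs in this report and the pid file directory from the manual rabbitmqctl wait example. The function names are hypothetical and this is not existing charm code; the name-change routine itself is not covered.

```python
import re

# Assumption: rabbitmqctl wait prints a line of this form, as in the
# logs quoted in this bug report.
NODE_RE = re.compile(r"Waiting for '(?P<node>[^'@]+@[^']+)'")


def node_from_wait_output(output):
    """Return the node name reported by rabbitmqctl wait, or None."""
    match = NODE_RE.search(output)
    return match.group("node") if match else None


def pid_file_for_node(node, mnesia_dir="/var/lib/rabbitmq/mnesia"):
    """Build the pid file path for a node name such as 'rabbit@10-76-12-252'."""
    return "{}/{}.pid".format(mnesia_dir, node)


if __name__ == "__main__":
    # Output captured from the failing upgrade-charm hook in the description.
    failed_output = (
        "Waiting for 'rabbit@10-76-12-252' ...\n"
        "pid is 13134 ...\n"
        "Error: process_not_running\n"
    )
    node = node_from_wait_output(failed_output)
    if node is not None:
        print(pid_file_for_node(node))
```

A hook could retry the wait against the path derived this way before giving up, which would avoid the per-unit symlink workaround.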

Chris MacNaughton (chris.macnaughton) on 2017-10-04

Changed in charm-rabbitmq-server:
status:	New → Triaged
importance:	Undecided → Medium

Alex Kavanagh (ajkavanagh) on 2019-11-08

tags: added: charm-upgrade
