Upgrade 16.07 -> 16.10 breaks on node name

Bug #1710247 reported by Peter Sabaini
This bug affects 2 people
Affects                          Status   Importance  Assigned to  Milestone
OpenStack RabbitMQ Server Charm  Triaged  Medium      Unassigned

Bug Description

When upgrading rabbitmq-server from 16.07 to 16.10 I'm getting an error in wait_app()

Due to the changes introduced in Change-Id: I105eb2684e61a553a52c5a944e8c562945e2a6eb (cf. Bug #1584902) the nodename of a rabbitmq node is expected to equal socket.gethostname().

However, the units' reverse DNS lookup resolves to a different name, and in the cluster they are known by that name.

Hostnames and DNS resolution:

$ juju run --unit rabbitmq-server/3 'hostname ; unit-get private-address ; dig +short -x $( unit-get private-address )'
...
juju-machine-1-lxc-14
10.76.12.252
10-76-12-252.maas.

Rabbit nodes are known by the second, MAAS-generated name:

$ u=rabbitmq-server/3;r=cluster; juju run --unit $u "relation-ids $r| xargs -I_@ sh -c 'relation-list -r _@|xargs -I_U sh -c \"relation-get -r _@ - _U |sed s,^,_U:, 2>&1\"'" | grep clustered
rabbitmq-server/4:clustered: 10-76-12-236
rabbitmq-server/5:clustered: 10-76-12-245

When running upgrade-charm, the wait_app function expects the pid file in the wrong place because of this:

Reading package lists...
Waiting for 'rabbit@10-76-12-252' ...
pid is 13134 ...
Error: process_not_running
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/upgrade-charm", line 709, in <module>
    rabbit.assess_status(rabbit.ConfigRenderer(rabbit.CONFIG_FILES))
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 809, in assess_status
    assess_status_func(configs)()
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 833, in _assess_status_func
    services=services(), ports=None)
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1178, in _determine_os_workload_status
    state, message, lambda: charm_func(configs))
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1306, in _ows_check_charm_func
    charm_state, charm_message = charm_func_with_configs()
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1178, in <lambda>
    state, message, lambda: charm_func(configs))
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 744, in assess_cluster_status
    ret = wait_app()
  File "/var/lib/juju/agents/unit-rabbitmq-server-3/charm/hooks/rabbit_utils.py", line 361, in wait_app
    raise ex
subprocess.CalledProcessError: Command '['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']' returned non-zero exit status 2
2017-08-11 10:54:39 ERROR juju.worker.uniter.operation runhook.go:107 hook "upgrade-charm" failed: exit status 1

Other functions that depend on the node name being equal to socket.gethostname() will likely fail too, e.g. is_leader().

Juju: 1.25.10

description: updated
Drew Freiberger (afreiberger) wrote :

I found that the charm upgrade (and other hooks that call the status check) would complete after creating a symlink /<email address hidden> to the <email address hidden> files on all rabbit units. Obviously this is not a scalable solution.

How can we best handle this corner case, where there are multiple reverse DNS entries, in a repeatable manner before and after 16.10? I also checked the 17.02 code, and skipping a revision won't help with this issue. It seems odd to look up the hostname to build a pid filename instead of checking config files or rabbitmqctl command output. For instance, the rabbitmqctl wait <pidfile> command shows "Waiting for 'rabbit@ip-ad-dr-es'" in the log file (and when run manually), as you can see in Peter's log.

# rabbitmqctl wait /var/lib/rabbitmq/mnesia/rabbit\@10-76-13-12.pid
Waiting for 'rabbit@10-76-13-12' ...
pid is 16537 ...
(exit code 0)
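
As a rough illustration of reading the node name from what RabbitMQ itself reports, rather than from the hostname, the sketch below parses the "Waiting for '...'" line and the pid file name shown above. Both helper names are hypothetical and are not part of the charm; the sketch only assumes the output and path formats seen in this report.

```python
import os
import re

# Matches the line printed by `rabbitmqctl wait`, e.g.
#   Waiting for 'rabbit@10-76-13-12' ...
WAITING_RE = re.compile(r"Waiting for '([^']+)'")


def nodename_from_wait_output(output):
    """Return the node name from `rabbitmqctl wait` output, or None.

    Hypothetical helper: assumes the output format shown in this report.
    """
    match = WAITING_RE.search(output)
    return match.group(1) if match else None


def nodename_from_pid_path(pid_path):
    """Return the node name encoded in a pid file path, or None.

    Hypothetical helper: assumes pid files are named '<nodename>.pid'.
    """
    base = os.path.basename(pid_path)
    if not base.endswith(".pid"):
        return None
    return base[: -len(".pid")]
```

Either value could then be compared with the name the charm derives from the hostname to detect the mismatch before calling wait.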

From what I can tell following the code:

 - in 16.07 wait_app uses get_local_nodename() to determine PID filename
   which in turn calls get_host_ip(unit_get('private-address')) which in turn calls
   get_node_hostname that either uses get_hostname(ip_addr) (coming from
   charmhelpers.contrib.openstack.utils) or falls back to socket.gethostname()
 - charmhelpers.contrib.openstack.utils.get_hostname calls
   charmhelpers.contrib.network.ip.get_hostname which in turn either
   runs dns.reversename.from_address(address) or falls back to
   socket.gethostbyaddr(address)[0]
 - Note from lp:1710247's reference to lp:1584902 that this is intentional
   for maas2 support.
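
To make the difference concrete, here is a minimal sketch of the two naming strategies as I understand them from the above. It is not the charm's actual code: the function names, the injectable lookup parameters, and the stripping of the domain part are assumptions made for illustration.

```python
import socket


def short_name(name):
    """'10-76-12-252.maas.' -> '10-76-12-252' (drop trailing dot and domain)."""
    return name.rstrip(".").split(".")[0]


def nodename_16_07(ip_addr, reverse_lookup=None, local_hostname=socket.gethostname):
    """Sketch of the 16.07 behaviour: prefer the reverse-DNS name of the
    unit's private address, fall back to the local hostname.

    reverse_lookup and local_hostname are injectable (an assumption of this
    sketch) so the logic can be exercised without real DNS.
    """
    lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    try:
        name = lookup(ip_addr)
    except OSError:  # socket.herror and socket.gaierror are subclasses
        name = None
    host = short_name(name) if name else local_hostname()
    return "rabbit@" + host


def nodename_16_10(local_hostname=socket.gethostname):
    """Sketch of the 16.10 expectation: node name equals the local hostname."""
    return "rabbit@" + local_hostname()
```

With the values from Peter's unit (reverse DNS 10-76-12-252.maas., hostname juju-machine-1-lxc-14) the two functions return different node names, which is the mismatch that wait_app trips over.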

Perhaps in upgrade-charm, if the wait on the hostname-derived pid file fails, the return code should be checked and the command output used to find the previous pid file name. A name-change routine could then reconfigure the server and cluster relationships.
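
A possible shape for the first half of that idea, sketched under assumptions: find_pid_file is a hypothetical helper, not something in the charm, and it assumes pid files live in /var/lib/rabbitmq/mnesia and are named rabbit@<name>.pid as in the output above. It does not attempt the rename or re-clustering step.

```python
import glob
import os


def find_pid_file(expected_nodename, mnesia_dir="/var/lib/rabbitmq/mnesia"):
    """Return the pid file to wait on, or None if it cannot be determined.

    Tries the pid file for the expected node name first. If that is missing,
    falls back to the only 'rabbit@*.pid' file in mnesia_dir, which would be
    the one left by a node started under its previous name.
    """
    expected = os.path.join(mnesia_dir, expected_nodename + ".pid")
    if os.path.exists(expected):
        return expected
    candidates = sorted(glob.glob(os.path.join(mnesia_dir, "rabbit@*.pid")))
    if len(candidates) == 1:
        return candidates[0]
    # Zero or several candidates: refuse to guess.
    return None
```

The returned path could be passed to the same rabbitmqctl wait call that appears in the traceback, and a None result reported as a blocked status instead of raising.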

Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → Medium
tags: added: charm-upgrade