host definition not created when multiple principals are hulk smashed

Bug #1531487 reported by Andreas Hasenack
This bug affects 1 person
Affects                         Status         Importance   Assigned to   Milestone
Landscape Server                Fix Released   Undecided    Chad Smith
nrpe (Juju Charms Collection)   New            Undecided    Unassigned

Bug Description

I have a case where a machine has two principal services that are hulk smashed and related to the nrpe subordinate: ceph-osd and nova-compute.

When this happens, nrpe does not create a host definition for each principal, only for one of them:

root@node-5:/var/lib/nagios/export# ll /var/lib/nagios/export/
total 24
drwxr-xr-x 2 root root 4096 Jan 5 20:26 ./
drwxr-xr-x 3 nagios nagios 4096 Jan 5 20:14 ../
-r--r--r-- 1 root root 264 Jan 5 20:26 host__region1-ceph-osd-0.cfg
-rw-r--r-- 1 root root 458 Jan 5 20:16 service__region1-ceph-osd-0_check_ceph-osd.cfg
-rw-r--r-- 1 root root 476 Jan 5 20:19 service__region1-nova-compute-1_check_libvirt-bin.cfg
-rw-r--r-- 1 root root 478 Jan 5 20:19 service__region1-nova-compute-1_check_nova-compute.cfg

It's missing a definition for region1-nova-compute-1. As a result, nagios on the nagios master fails to start:
(...)
Processing object config file '/etc/nagios3/cloud.d/service__region1-neutron-api-0_check_haproxy_servers.cfg'...
Error: Could not find any host matching 'region1-nova-compute-1' (config file '/etc/nagios3/cloud.d/service__region1-nova-compute-1_check_libvirt-bin.cfg', starting on line 5)
Error: Could not expand hostgroups and/or hosts specified in service (config file '/etc/nagios3/cloud.d/service__region1-nova-compute-1_check_libvirt-bin.cfg', starting on line 5)
   Error processing object config files!
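The gap in the export directory can be spotted from the file names alone. The Python sketch below is not part of the nrpe charm. It assumes the `host__<name>.cfg` and `service__<name>_check_<check>.cfg` naming seen in the listing above, and the function name `hosts_missing_definitions` is hypothetical.

```python
import re

# Assumed naming scheme, taken from the directory listing in this report.
HOST_RE = re.compile(r"^host__(?P<host>.+)\.cfg$")
SERVICE_RE = re.compile(r"^service__(?P<host>.+?)_check_.+\.cfg$")


def hosts_missing_definitions(filenames):
    """Return host names used by service files that have no host__ file."""
    defined = set()
    referenced = set()
    for name in filenames:
        match = HOST_RE.match(name)
        if match:
            defined.add(match.group("host"))
            continue
        match = SERVICE_RE.match(name)
        if match:
            referenced.add(match.group("host"))
    return sorted(referenced - defined)


listing = [
    "host__region1-ceph-osd-0.cfg",
    "service__region1-ceph-osd-0_check_ceph-osd.cfg",
    "service__region1-nova-compute-1_check_libvirt-bin.cfg",
    "service__region1-nova-compute-1_check_nova-compute.cfg",
]
print(hosts_missing_definitions(listing))  # ['region1-nova-compute-1']
```

Running a check like this against /var/lib/nagios/export before the files reach the master would flag the missing host ahead of the nagios restart.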

Full juju status attached.

node-5 is 10.245.202.13

Andreas Hasenack (ahasenack) wrote :
Andreas Hasenack (ahasenack) wrote :

juju get nrpe

description: updated
Andreas Hasenack (ahasenack) wrote :
description: updated
tags: added: bug-squad kanban
description: updated
description: updated
Chad Smith (chad.smith)
tags: removed: kanban
Andreas Hasenack (ahasenack) wrote :

I'll try a deployment with nagios_hostname_type=host set in the nrpe charm.

Andreas Hasenack (ahasenack) wrote :

This didn't work. I got these files in /var/lib/nagios/export in the neutron-api/0 unit, for example:
host__juju-machine-1-lxc-1.cfg
service__region1-neutron-api-0_check_apache2.cfg
service__region1-neutron-api-0_check_haproxy.cfg
service__region1-neutron-api-0_check_haproxy_queue.cfg
service__region1-neutron-api-0_check_haproxy_servers.cfg
service__region1-neutron-api-0_check_neutron-server.cfg

The host one has these contents:
define host {
    address 10.245.201.104
    host_name juju-machine-1-lxc-1
    use server
    hostgroups machines,
}

The service ones, though, don't use that host_name. For example, service__region1-neutron-api-0_check_neutron-server.cfg:
define service {
    use active-service
    host_name region1-neutron-api-0
    service_description region1-neutron-api-0[neutron-server] process check {neutron-api/0}
    check_command check_nrpe!check_neutron-server
    servicegroups region1
}

So service startup is still failing on the master:
Processing object config file '/etc/nagios3/cloud.d/service__region1-neutron-api-0_check_haproxy_servers.cfg'...
Error: Could not find any host matching 'region1-neutron-api-0' (config file '/etc/nagios3/cloud.d/service__region1-neutron-api-0_check_haproxy_servers.cfg', starting on line 5)
Error: Could not expand hostgroups and/or hosts specified in service (config file '/etc/nagios3/cloud.d/service__region1-neutron-api-0_check_haproxy_servers.cfg', starting on line 5)
   Error processing object config files!
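The host_name mismatch described in this comment can also be confirmed from the file contents rather than the file names. The following is a minimal Python sketch, not a Nagios tool. It assumes one directive per line and a closing brace on its own line, as in the definitions quoted above, and the function names are hypothetical.

```python
def parse_definitions(text):
    """Return a list of (object_type, directives) for each define block."""
    definitions = []
    current_type = None
    directives = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if current_type is None:
            if line.startswith("define ") and line.endswith("{"):
                current_type = line[len("define "):-1].strip()
                directives = {}
        elif line == "}":
            # Only a bare brace closes a block; braces inside values
            # such as "{neutron-api/0}" are left alone.
            definitions.append((current_type, directives))
            current_type = None
        elif line:
            key, _, value = line.partition(" ")
            directives[key] = value.strip()
    return definitions


def unmatched_service_hosts(text):
    """Return host_name values used by services but defined by no host."""
    definitions = parse_definitions(text)
    hosts = {d["host_name"] for t, d in definitions
             if t == "host" and "host_name" in d}
    used = {d["host_name"] for t, d in definitions
            if t == "service" and "host_name" in d}
    return sorted(used - hosts)


config = """
define host {
    address 10.245.201.104
    host_name juju-machine-1-lxc-1
    use server
    hostgroups machines,
}
define service {
    use active-service
    host_name region1-neutron-api-0
    service_description region1-neutron-api-0[neutron-server] process check {neutron-api/0}
    check_command check_nrpe!check_neutron-server
    servicegroups region1
}
"""
print(unmatched_service_hosts(config))  # ['region1-neutron-api-0']
```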

Chad Smith (chad.smith)
tags: added: osa-nagios
removed: bug-squad
Chad Smith (chad.smith)
Changed in nrpe (Juju Charms Collection):
assignee: nobody → Chad Smith (chad.smith)
status: New → In Progress
Chad Smith (chad.smith) wrote :

I've got a lead on a fix for this. NRPE needs to surface a couple of values in the nrpe-external-master relation. The principal charms that use charmhelpers are already watching for nrpe-external-master:nagios_hostname and nagios_host_context; we just need to provide them from nrpe.
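As a rough illustration of the idea in this comment, the Python sketch below shows how a principal charm might pick the host_name for its service files once nrpe publishes those two relation values. It is not the charmhelpers implementation: the function `resolve_host_name`, the plain dict standing in for relation data, and the fallback order are all assumptions.

```python
def resolve_host_name(relation_data, local_unit):
    """Pick the host_name a principal would write into its service files.

    relation_data is a plain dict standing in for the values read from
    the nrpe-external-master relation (an assumption for this sketch).
    """
    hostname = relation_data.get("nagios_hostname")
    if hostname:
        # nrpe told us which host definition to attach to.
        return hostname
    # Otherwise fall back to a unit-based name, optionally prefixed
    # with the host context.
    unit = local_unit.replace("/", "-")
    context = relation_data.get("nagios_host_context")
    return f"{context}-{unit}" if context else unit


# Without nagios_hostname the service file uses the unit-based name.
print(resolve_host_name({"nagios_host_context": "region1"}, "neutron-api/0"))

# With nagios_hostname provided, it matches the host definition instead.
print(resolve_host_name(
    {"nagios_host_context": "region1",
     "nagios_hostname": "juju-machine-1-lxc-1"},
    "neutron-api/0",
))
```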

Chad Smith (chad.smith) wrote :

My mistake, I was actually working toward a fix for lp:1532281.

Landscape doesn't have to worry about this issue with hulk smashing, because we can juju set nrpe nagios_hostname_type="host" instead of "unit".

This way all hulk-smashed units will add service files that reference the same host_name instead of the service-specific unit name.

Once lp:1532281 is fixed, service__ files will properly match the hostname instead of the unit name and all is good on our side.
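To make the effect of the hostname type concrete, here is a small Python sketch of the two naming schemes. It is an assumption-laden illustration, not the nrpe charm's code: the function `host_name_for` is hypothetical, the "unit" format is inferred from names such as region1-nova-compute-1 in this report, and node-5 is taken from the shell prompt in the description.

```python
def host_name_for(hostname_type, machine_hostname, host_context, unit_name):
    """Return the Nagios host_name under the given hostname type."""
    if hostname_type == "host":
        # One name per machine, shared by every principal on it.
        return machine_hostname
    if hostname_type == "unit":
        # One name per unit, e.g. region1-nova-compute-1.
        return f"{host_context}-{unit_name.replace('/', '-')}"
    raise ValueError(f"unsupported hostname type: {hostname_type!r}")


principals = ["ceph-osd/0", "nova-compute/1"]  # hulk smashed onto node-5

by_unit = sorted({host_name_for("unit", "node-5", "region1", u)
                  for u in principals})
by_host = sorted({host_name_for("host", "node-5", "region1", u)
                  for u in principals})

print(by_unit)  # ['region1-ceph-osd-0', 'region1-nova-compute-1']
print(by_host)  # ['node-5']
```

With "unit", each principal needs its own host definition, which is where the missing file bites. With "host", a single host definition covers every principal on the machine.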

Changed in nrpe (Juju Charms Collection):
status: In Progress → New
assignee: Chad Smith (chad.smith) → nobody
Chad Smith (chad.smith)
Changed in landscape:
assignee: nobody → Chad Smith (chad.smith)
status: New → In Progress
tags: added: kanban
tags: removed: kanban
Changed in landscape:
milestone: none → 16.02
status: In Progress → Fix Committed
Changed in landscape:
status: Fix Committed → Fix Released