Node deployment breaks Nagios LMA LDAP auth

Bug #1632792 reported by Scott Machtmes
This bug affects 1 person
Affects             Status     Importance  Assigned to                  Milestone
Fuel for OpenStack  Won't Fix  Undecided   LMA-Toolchain Fuel Plugins
8.0.x               Won't Fix  High        LMA-Toolchain Fuel Plugins
Mitaka              Won't Fix  High        LMA-Toolchain Fuel Plugins
Newton              Won't Fix  High        LMA-Toolchain Fuel Plugins
StackLight          Confirmed  Undecided   LMA-Toolchain Fuel Plugins

Bug Description

Env = Fuel 8.0

Using Fuel to deploy a new compute node breaks the LMA Nagios server. After a successful node deployment, the Nagios URL returns "500 Internal Server Error: Internal Server Error". We determined that the deployment broke the Nagios LDAP authentication.

The nagios_error.log shows an LDAP error:

[Mon Sep 26 23:22:43.532222 2016] [authnz_ldap:info] [pid 11861] [client 10.15.96.12:46894] AH01695: auth_ldap authenticate: user jjania authentication failed; URI / [LDAP: ldap_simple_bind() failed][Can't contact LDAP server]
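
When scanning a large nagios_error.log for this failure, a small parser can help pull out the affected user and the LDAP reason. The following is a minimal sketch only: the regular expression is modelled on the single sample line above, and the names `LDAP_FAILURE` and `parse_ldap_failure` are hypothetical helpers, not part of Nagios, Apache, or the plugin.

```python
import re

# Pattern modelled on the one sample line from nagios_error.log (assumption:
# other authnz_ldap failure lines follow the same layout).
LDAP_FAILURE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] \[authnz_ldap:(?P<level>\w+)\] "
    r"\[pid (?P<pid>\d+)\] \[client (?P<client>[^\]]+)\] "
    r"(?P<code>AH\d+): auth_ldap authenticate: user (?P<user>\S+) "
    r"authentication failed; URI (?P<uri>\S+) "
    r"\[(?P<stage>[^\]]*)\]\[(?P<reason>[^\]]*)\]"
)


def parse_ldap_failure(line):
    """Return a dict of fields for an LDAP auth failure line, or None."""
    match = LDAP_FAILURE.search(line)
    return match.groupdict() if match else None


sample = (
    "[Mon Sep 26 23:22:43.532222 2016] [authnz_ldap:info] [pid 11861] "
    "[client 10.15.96.12:46894] AH01695: auth_ldap authenticate: user jjania "
    "authentication failed; URI / [LDAP: ldap_simple_bind() failed]"
    "[Can't contact LDAP server]"
)
fields = parse_ldap_failure(sample)
print(fields["user"], "->", fields["reason"])
```

A reason of "Can't contact LDAP server" points at connectivity from the Apache/Nagios side rather than at bad credentials.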

The puppet.log shows:

var/log/puppet.log:2016-09-26 18:53:59 +0000 /Stage[main]/Lma_logging_analytics::Elasticsearch/Elasticsearch::Instance[es-01]/File[/opt/es-data/elasticsearch_data/es-01] (err): Failed to generate additional resources using 'eval_generate': No such file or directory - /opt/es-data/elasticsearch_data/es-01/lma/nodes/0/indices/log-2016.09.26/4/index/_1b3s.fdt
var/log/puppet.log:/usr/lib/ruby/vendor_ruby/puppet/util.rb:496:in `exit_on_fail'
var/log/puppet.log:2016-09-26 19:16:44 +0000 Puppet (debug): Failed to load library 'ldap' for feature 'ldap'

The following are changes that were done to get it working on site after the deployment:

1) The Nagios VIP is now managed by a separate script, ns_IPaddr2-nagios, which includes the manual steps that were required to enable LDAP connectivity for Apache/Nagios. With these steps in the script, failover works without an administrator's help.

2) The LDAP server address was added to /etc/hosts to enable name resolution (it did not work from the Nagios network namespace). This might break again on each new compute deployment if /etc/hosts is rewritten.

3) Changes in /etc/apache2-nagios/conf.d/25-nagios-ui.conf

AuthLDAPURL line was changed to:

AuthLDAPURL "ldap://<br_monitoring_ip>:389/cn=Users,dc=ourcompany,dc=com?sAMAccountName?sub?(&(objectCategory=Person)(memberOf=cn=ourgroup,ou=Groups,dc=ourcompany,dc=com))"
  Where br_monitoring_ip is specific to each monitoring node.
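
For readers unfamiliar with the AuthLDAPURL layout (host and port, then base DN, attribute, scope and filter separated by "?"), the sketch below builds the URL for a given node and splits it back into its parts. It is illustrative only: `build_auth_ldap_url` and `split_auth_ldap_url` are hypothetical helpers, the address 192.0.2.10 is a documentation placeholder for <br_monitoring_ip>, and the DN and filter values are simply copied from the line above.

```python
def build_auth_ldap_url(br_monitoring_ip):
    """Assemble the AuthLDAPURL value for one monitoring node."""
    base_dn = "cn=Users,dc=ourcompany,dc=com"
    ldap_filter = (
        "(&(objectCategory=Person)"
        "(memberOf=cn=ourgroup,ou=Groups,dc=ourcompany,dc=com))"
    )
    return "ldap://{}:389/{}?sAMAccountName?sub?{}".format(
        br_monitoring_ip, base_dn, ldap_filter
    )


def split_auth_ldap_url(url):
    """Split an AuthLDAPURL of the form scheme://host:port/basedn?attr?scope?filter."""
    scheme, _, rest = url.partition("://")
    hostport, _, tail = rest.partition("/")
    host, _, port = hostport.partition(":")
    # At most three "?" separators: base DN, attribute, scope, filter.
    fields = tail.split("?", 3)
    fields += [""] * (4 - len(fields))
    base_dn, attribute, scope, ldap_filter = fields
    return {
        "scheme": scheme,
        "host": host,
        "port": int(port) if port else 389,  # 389 is the standard LDAP port
        "base_dn": base_dn,
        "attribute": attribute,
        "scope": scope,
        "filter": ldap_filter,
    }


# 192.0.2.10 is a placeholder, not an address from the affected environment.
parts = split_auth_ldap_url(build_auth_ldap_url("192.0.2.10"))
for key, value in parts.items():
    print(key, "=", value)
```

Because the host part differs on every monitoring node, any tool that regenerates 25-nagios-ui.conf would need to substitute the node's own address.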

Changed in fuel:
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
milestone: none → 8.0-updates
Scott Machtmes (smachtmes) wrote :

Hi folks, can anyone provide any update or ETA for next steps on this LMA/nagios customer reported bug? Thanks

Denis Meltsaykin (dmeltsaykin) wrote :

Did you make those customizations (ns_IPaddr2-nagios/25-nagios-ui.conf) before adding a new node or after?

Scott Machtmes (smachtmes) wrote :

Initially a node was added which then led to the failure. These changes were made after the deployment to get things working again.

Changed in lma-toolchain:
status: New → Confirmed
Simon Pasquier (simon-pasquier) wrote :

I'd like to have details about the configuration. In particular:
- do you use network templates to create a dedicated monitoring network?
- is the Nagios node also the Elasticsearch/Kibana and/or InfluxDB/Grafana?

The issue might be related to this bug: https://bugs.launchpad.net/lma-toolchain/+bug/1583994

Changed in lma-toolchain:
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
Scott Machtmes (smachtmes) wrote :

Simon, I wasn't involved in the deploy/issue directly. I've sent a message to the customer with your questions.

Scott Machtmes (smachtmes) wrote :

Yes, a network template was used for the deployment.
They have 3 HA stand-alone monitoring nodes. Nagios should also be on those infrastructure nodes.

I'm attaching the network template for reference also.

Simon Pasquier (simon-pasquier) wrote :

Sorry for the delayed response... IIUC the root cause of the problem is that the nagios VIP is located on a dedicated network (br-monitoring) and with network templates, it isn't possible to assign a default gateway to this network. Also if the nagios VIP were on the management network, it would be fixed by the upcoming 1.0 version of the plugin [1].

Back to the problem described here: the fix is non-trivial and unfortunately won't be included in 1.0. We'll reconsider it for the release coming after 1.0.

[1] https://bugs.launchpad.net/lma-toolchain/+bug/1583994
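
One way to read the root cause above: without a default gateway on the dedicated network, only addresses inside that network's own subnet are reachable from the Nagios VIP. The sketch below illustrates that check under loudly stated assumptions: the addresses and prefix are documentation placeholders, `reachable_without_gateway` is a hypothetical helper, and static routes or other interfaces are ignored.

```python
import ipaddress


def reachable_without_gateway(target_ip, local_cidr):
    """True if target_ip is on-link for an interface configured as local_cidr."""
    network = ipaddress.ip_interface(local_cidr).network
    return ipaddress.ip_address(target_ip) in network


# Hypothetical addressing: the VIP sits at 192.0.2.10/24 on br-monitoring.
vip = "192.0.2.10/24"
print(reachable_without_gateway("192.0.2.40", vip))    # LDAP server in the same subnet
print(reachable_without_gateway("198.51.100.7", vip))  # LDAP server elsewhere, needs a route
```

If the second case matches the deployment, LDAP binds from the VIP would fail with a connectivity error regardless of the Apache configuration.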

Patrick Petit (patrick-michel-petit) wrote :

To add more context on top of what Simon said: in SL 1.0, the public network is the default network for the SL UIs (Nagios web, Grafana and Kibana), for which a default gateway can be defined. The deployer will then have the option to attach the SL services (Elasticsearch, InfluxDB and Nagios server) to either the management network or a dedicated network using network templates. Therefore, there will be two independent VIPs: one for the UIs and one for the backend servers. Using that configuration should solve the networking issues highlighted above.

Denis Meltsaykin (dmeltsaykin) wrote :

Won't fix since nobody works on it.

Changed in fuel:
status: New → Won't Fix