Contrail 3.2.5: contrail-alarm-gen is in failed state in one control node

Bug #1728284 reported by Deepak Jeyaraman
Affects            Status                   Importance  Assigned to          Milestone
Juniper Openstack  Status tracked in Trunk
  R3.2             Fix Committed            High        Sundaresan Rajangam
  R4.0             Fix Committed            High        Sundaresan Rajangam
  R4.1             Fix Committed            High        Sundaresan Rajangam
  Trunk            Fix Committed            High        Sundaresan Rajangam

Bug Description

Installed Contrail 3.2.5 on an HA setup with 3 config/control nodes and 2 compute nodes, and noticed that contrail-alarm-gen is in the failed state on the 2nd control node:

root@ccra-16:~# contrail-status -d
== Contrail Control ==
supervisor-control: active
contrail-control active pid 39149, uptime 3 days, 0:26:14
contrail-control-nodemgr active pid 39148, uptime 3 days, 0:26:14
contrail-dns active pid 39150, uptime 3 days, 0:26:14
contrail-named active pid 39695, uptime 3 days, 0:26:12

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen failed Oct 26 04:37 PM <<<<<<<<<<<<<<<<<<
contrail-analytics-api initializing (UvePartitions:UVE-Aggregation[Partitions:0] connection down)pid 42120, uptime 2 days, 22:35:40
contrail-analytics-nodemgr active pid 44024, uptime 2 days, 22:27:10
contrail-collector active pid 5597, uptime 3 days, 0:24:50
contrail-query-engine active pid 5598, uptime 3 days, 0:24:50
contrail-snmp-collector active pid 5595, uptime 3 days, 0:24:50
contrail-topology active pid 5596, uptime 3 days, 0:24:50

== Contrail Config ==
supervisor-config: active
contrail-api:0 active pid 40745, uptime 3 days, 0:22:30
contrail-config-nodemgr active pid 40736, uptime 3 days, 0:22:30
contrail-device-manager active pid 40748, uptime 3 days, 0:22:30
contrail-discovery active pid 40742, uptime 3 days, 0:22:30
contrail-schema active pid 40751, uptime 3 days, 0:22:30
contrail-svc-monitor active pid 40754, uptime 3 days, 0:22:30
ifmap active pid 40739, uptime 3 days, 0:22:30

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active pid 19281, uptime 3 days, 0:23:37
contrail-webui-middleware active pid 19282, uptime 3 days, 0:23:37

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active pid 18935, uptime 3 days, 0:47:50
kafka active pid 11007, uptime 3 days, 0:28:56

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active pid 16468, uptime 1 day, 3:00:07

========Run time service failures=============
/var/crashes/core.contrail-collec.2283.ccra-16.1508956971
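
For triage, the status listing above can be reduced to just the services that are not active. The following is a minimal sketch, not part of Contrail: the function name non_active_services is hypothetical, and it assumes each service line has the form "<name> <state> ..." as in the output above.

```python
def non_active_services(status_text):
    """Return (service, state) pairs for lines whose state is not 'active'.

    Assumes the line layout seen in the `contrail-status -d` output above:
    the service name first, then its state, then optional details.
    """
    problems = []
    for line in status_text.splitlines():
        line = line.strip()
        # Skip blank lines, "== Section ==" banners and core-file paths.
        if not line or line.startswith("=") or line.startswith("/"):
            continue
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        # Supervisor lines look like "supervisor-control: active".
        name, state = parts[0].rstrip(":"), parts[1]
        if state != "active":
            problems.append((name, state))
    return problems
```

Applied to the output above, this would report contrail-alarm-gen (failed) and contrail-analytics-api (initializing).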

contrail-alarm-gen log:

10/26/2017 04:37:02 PM [contrail-alarm-gen]: Stopped http server
10/26/2017 04:37:02 PM [contrail-alarm-gen]: Session Event: TCP Connection Closed
10/26/2017 04:37:02 PM [contrail-alarm-gen]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = ccra-16 process_status = [ << module_id = contrail-alarm-gen instance_id = 0 state = Non-Functional connection_infos = [ << type = Redis-UVE name = 97.0.0.16:6379 server_addrs = [ ] status = Up >>, << type = Redis-UVE name = 97.0.0.14:6379 server_addrs = [ ] status = Up >>, << type = Collector name = server_addrs = [ 97.0.0.14:8086, ] status = Down description = Established to Idle on EvStop >>, << type = Database name = RabbitMQ server_addrs = [ 97.0.0.17:5672, 97.0.0.16:5672, 97.0.0.14:5672, ] status = Up description = >>, << type = Redis-UVE name = 97.0.0.17:6379 server_addrs = [ ] status = Up >>, << type = Zookeeper name = Zookeeper server_addrs = [ 97.0.0.17:2181, 97.0.0.16:2181, 97.0.0.14:2181, ] status = Up description = >>, << type = Discovery name = ApiServer server_addrs = [ 97.0.0.40:5998, ] status = Up description = Subscribe Response >>, << type = ApiServer name = Config server_addrs = [ 97.0.0.14:9100, ] status = Up description = >>, << type = Discovery name = Collector server_addrs = [ 97.0.0.40:5998, ] status = Up description = Subscribe Response >>, << type = Discovery name = AlarmGenerator server_addrs = [ 97.0.0.40:5998, ] status = Up description = Subscribe Response >>, << type = KafkaPub name = KafkaTopic server_addrs = [ ] status = Up >>, << type = Redis-UVE name = AggregateRedis server_addrs = [ ] status = Up >>, ] description = Collector connection down >>, ] >>
10/26/2017 04:37:02 PM [contrail-alarm-gen]: AlarmGen killing 1 of 4

======

ZooKeeper is up:

root@ccra-16:~# vim /var/log/contrail/contrail-analytics-api.log
root@ccra-16:~# netstat -anp | grep :2181
tcp 0 0 97.0.0.16:44947 97.0.0.14:2181 ESTABLISHED 40745/python
tcp 0 0 97.0.0.16:36978 97.0.0.16:2181 ESTABLISHED 40748/python
tcp 0 0 97.0.0.16:44904 97.0.0.14:2181 ESTABLISHED 40754/python
tcp 0 0 97.0.0.16:43254 97.0.0.14:2181 ESTABLISHED 26953/python
tcp 0 0 97.0.0.16:37487 97.0.0.14:2181 ESTABLISHED 5596/python
tcp 0 0 97.0.0.16:56627 97.0.0.16:2181 ESTABLISHED 40751/python
tcp 0 0 97.0.0.16:45021 97.0.0.14:2181 ESTABLISHED 5595/python
tcp 0 0 97.0.0.16:48389 97.0.0.16:2181 ESTABLISHED 45245/python
tcp6 0 0 :::2181 :::* LISTEN 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.14:60365 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.16:36978 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.14:34532 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.17:39574 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.16:41098 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.14:54138 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.16:48389 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:2181 97.0.0.16:56627 ESTABLISHED 24778/java
tcp6 0 0 97.0.0.16:41098 97.0.0.16:2181 ESTABLISHED 11007/java
tcp6 0 0 97.0.0.16:2181 97.0.0.17:39536 ESTABLISHED 24778/java

=====

Setup:

10.102.28.138, 10.102.28.116, 10.102.28.139 (all config nodes)
ccra-13, ccra-12 are compute nodes.

root@ccra-17:~# contrail-version
Package                                Version                        Build-ID | Repo | Package Name
-------------------------------------- ------------------------------ ----------------------------------
contrail-analytics                     3.2.5.0-51                     51
contrail-config                        3.2.5.0-51                     51
contrail-config-openstack              3.2.5.0-51                     51
contrail-control                       3.2.5.0-51                     51
contrail-database-common               3.2.5.0-51                     51
contrail-dns                           3.2.5.0-51                     51
contrail-docs                          3.2.5.0-51                     51
contrail-f5                            3.2.5.0-51                     51
contrail-fabric-utils                  3.2.5.0-51                     51

Tags: analytics dt
information type: Proprietary → Public
Jeba Paulaiyan (jebap)
tags: added: analytics
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/37030
Submitter: Sundaresan Rajangam (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.1

Review in progress for https://review.opencontrail.org/37031
Submitter: Sundaresan Rajangam (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/37032
Submitter: Sundaresan Rajangam (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/37033
Submitter: Sundaresan Rajangam (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/37031
Committed: http://github.com/Juniper/contrail-controller/commit/8cafc49a398577e08a58aa6e9cc195af36cce859
Submitter: Zuul (<email address hidden>)
Branch: R4.1

commit 8cafc49a398577e08a58aa6e9cc195af36cce859
Author: Sundaresan Rajangam <email address hidden>
Date: Tue Oct 31 14:51:03 2017 -0700

Fix exit code in contrail-alarm-gen

If contrail-alarm-gen exits with code 0, then supervisor doesn't restart
the service. Hence, set the exit code to 2.

Change-Id: Ic77d7ada5b22db1e3061d6fa04505bb7ab31d3d5
Closes-Bug: #1728284
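
The reasoning in the commit message can be illustrated with a small model of the restart decision. This is a sketch under stated assumptions, not the code from the change: it assumes supervisor manages the program with autorestart=unexpected and that only exit code 0 is listed in exitcodes (the actual Contrail supervisor configuration is not shown in this report), and the helper names are hypothetical.

```python
# Assumption: only exit code 0 is configured as an "expected" exit code.
EXPECTED_EXIT_CODES = frozenset({0})


def supervisor_would_restart(exit_code, expected=EXPECTED_EXIT_CODES):
    """Model of autorestart=unexpected: restart only if the code is unexpected."""
    return exit_code not in expected


def exit_code_for_shutdown(fatal):
    """Hypothetical helper: pick the exit status for the process.

    A fatal shutdown returns 2 so that supervisor sees an unexpected exit;
    a deliberate, clean stop returns 0.
    """
    return 2 if fatal else 0


# Usage in a service's entry point (illustrative only):
#     import sys
#     sys.exit(exit_code_for_shutdown(fatal=True))
```

Under these assumptions, a process that exits with 0 after a fatal condition is treated as having stopped on purpose and stays down, which would match the state seen in the contrail-status output, while an exit with 2 triggers a restart.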

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/37033
Committed: http://github.com/Juniper/contrail-controller/commit/1906b68cdec2fc604ac85153a4a031f9dd6aa0ef
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit 1906b68cdec2fc604ac85153a4a031f9dd6aa0ef
Author: Sundaresan Rajangam <email address hidden>
Date: Tue Oct 31 14:51:03 2017 -0700

Fix exit code in contrail-alarm-gen

If contrail-alarm-gen exits with code 0, then supervisor doesn't restart
the service. Hence, set the exit code to 2.

Change-Id: Ic77d7ada5b22db1e3061d6fa04505bb7ab31d3d5
Closes-Bug: #1728284

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/37032
Committed: http://github.com/Juniper/contrail-controller/commit/4a2dba6aeee41edfd8f04a99804203cb29e7937d
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 4a2dba6aeee41edfd8f04a99804203cb29e7937d
Author: Sundaresan Rajangam <email address hidden>
Date: Tue Oct 31 14:51:03 2017 -0700

Fix exit code in contrail-alarm-gen

If contrail-alarm-gen exits with code 0, then supervisor doesn't restart
the service. Hence, set the exit code to 2.

Change-Id: Ic77d7ada5b22db1e3061d6fa04505bb7ab31d3d5
Closes-Bug: #1728284

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/37030
Committed: http://github.com/Juniper/contrail-controller/commit/e5cda869aabc1f4e385bca78a1a4ae53a97f03d1
Submitter: Zuul (<email address hidden>)
Branch: master

commit e5cda869aabc1f4e385bca78a1a4ae53a97f03d1
Author: Sundaresan Rajangam <email address hidden>
Date: Tue Oct 31 14:51:03 2017 -0700

Fix exit code in contrail-alarm-gen

If contrail-alarm-gen exits with code 0, then supervisor doesn't restart
the service. Hence, set the exit code to 2.

Change-Id: Ic77d7ada5b22db1e3061d6fa04505bb7ab31d3d5
Closes-Bug: #1728284

tags: added: dt