[R4.1.1.0-10-newton] Alarm test cases fail intermittently because not all services come up

Bug #1769117 reported by alok kumar
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)

  Series   Status    Importance   Assigned to
  R4.1     Invalid   High         alok kumar
  R5.0     Invalid   High         alok kumar
  Trunk    Invalid   High         alok kumar

Bug Description

Alarm test cases fail intermittently in sanity because not all services come up.
However, the test cases pass on rerun and the failure is not easily reproducible; we need to debug it from both the script and the feature perspective to root-cause the issue.

Failed cases due to this are:

AnalyticsTestSanity.test_analytics_node_process_status_alarms
AnalyticsTestSanity.test_cfgm_node_process_status_alarms
AnalyticsTestSanity.test_control_node_process_status_alarms
AnalyticsTestSanity.test_db_node_process_status_alarms
AnalyticsTestSanity.test_vrouter_process_status_alarms
TestBasicVMVN0.test_process_restart_in_policy_between_vns

Revision history for this message
alok kumar (kalok) wrote :

It is failing on another setup too, and the collector is now not coming up on nodec7 even when started manually.
I have locked the setup so the analytics team can debug it further.

root@nodec7(analytics):/var/log/contrail# contrail-status
== Contrail Analytics ==
contrail-collector: inactive
contrail-analytics-api: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-query-engine: active
contrail-alarm-gen: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-snmp-collector: active
contrail-topology: active
contrail-analytics-nodemgr: active

root@nodec8(analytics):/# contrail-status
== Contrail Analytics ==
contrail-collector: active
contrail-analytics-api: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-query-engine: active
contrail-alarm-gen: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-snmp-collector: active
contrail-topology: active
contrail-analytics-nodemgr: active

root@nodec57(analytics):/# contrail-status
== Contrail Analytics ==
contrail-collector: active
contrail-analytics-api: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-query-engine: active
contrail-alarm-gen: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-snmp-collector: active
contrail-topology: active
contrail-analytics-nodemgr: active
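The alarm tests essentially assert that every contrail service reports "active". A minimal sketch of such a check, assuming the `service: state` output format shown above; an inline sample (taken from the nodec7 output) stands in here for a live `contrail-status` run:

```shell
# List services whose state is not "active", given contrail-status-style
# "service: state" lines. In practice, pipe `contrail-status` into the
# same awk filter instead of the inline sample below.
awk -F': ' '/^contrail-/ && $2 !~ /^active/ {print $1, "->", $2}' <<'EOF'
contrail-collector: inactive
contrail-analytics-api: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-query-engine: active
contrail-alarm-gen: initializing (Redis-UVE:192.168.192.6:6381[None] connection down)
contrail-topology: active
EOF
```

This flags contrail-collector, contrail-analytics-api, and contrail-alarm-gen from the sample, matching the failing state reported above.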

Test cases affected:
AnalyticsTestSanity.test_analytics_node_process_status_alarms
AnalyticsTestSanity.test_cfgm_node_process_status_alarms
AnalyticsTestSanity.test_control_node_process_status_alarms
AnalyticsTestSanity.test_db_node_process_status_alarms
AnalyticsTestSanity.test_vrouter_process_status_alarms
TestBasicVMVN0.test_process_restart_in_policy_between_vns
TestXmpptests.test_precedence_xmpp_auth
TestXmpptests.test_undo_xmpp_auth

sanity report: http://10.204.216.50/Docs/logs/4.1.1.0-10_jenkins-SMLite_ubuntu-14-04_mitaka_Openstack_HA_Sanity-510_1525436506.82/junit-noframes.html

setup details:

Config Nodes : [u'nodec7', u'nodec8', u'nodec57']
Control Nodes : [u'nodec7', u'nodec8', u'nodec57']
Compute Nodes : [u'nodei1', u'nodei2', u'nodei3']
Openstack Node : [u'nodec7', u'nodec8', u'nodec57']
WebUI Node : [u'nodec7', u'nodec8', u'nodec57']
Analytics Nodes : [u'nodec7', u'nodec8', u'nodec57']
Database Nodes : [u'nodec7', u'nodec8', u'nodec57']
Physical Devices : [u'hooper', u"'hooper'"]
LB Nodes : [u'nodeg36']

tags: removed: automation
Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

systemctl restart/start seems to have been stuck for some reason.
Killing these processes and then starting the contrail-collector service worked:
root@nodec7(analytics):/# ps -ef | grep collector
root 2501 2263 0 21:48 ? 00:00:00 systemctl restart contrail-collector.service
root 2741 2558 0 21:49 ? 00:00:00 grep --color=auto collector
contrail 3416 1 0 12:55 ? 00:02:30 /usr/bin/contrail-collector
contrail 3464 1 0 12:55 ? 00:00:26 /usr/bin/python /usr/bin/contrail-snmp-collector
root 12627 0 0 15:07 ? 00:00:00 systemctl start contrail-collector.service

My bad, I didn't realize the contrail-collector service (PID 3416) was already running when I restarted the collector service. Please collect a gcore of the collector service when you hit this issue again.
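The diagnosis above can be sketched as follows: find the stuck `systemctl start/restart` invocations from a `ps` snapshot, then capture a core of the running collector. The awk filter and the core file path are illustrative assumptions, not part of this report; an inline snapshot (taken from the ps output above) stands in for live `ps -ef` output:

```shell
# Print PIDs of stuck systemctl invocations for contrail-collector.
# In practice, pipe `ps -ef` into the same awk filter.
awk '/systemctl (start|restart) contrail-collector/ {print $2}' <<'EOF'
root      2501  2263  0 21:48 ?  00:00:00 systemctl restart contrail-collector.service
contrail  3416     1  0 12:55 ?  00:02:30 /usr/bin/contrail-collector
root     12627     0  0 15:07 ?  00:00:00 systemctl start contrail-collector.service
EOF
# After killing the stuck invocations, a core of the still-running collector
# could be captured with gdb's gcore (output path is an example):
#   gcore -o /var/crashes/contrail-collector "$(pidof contrail-collector)"
```

This prints 2501 and 12627 for the sample, i.e. the two wedged systemctl processes, while skipping the collector daemon itself.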

Revision history for this message
alok kumar (kalok) wrote :

I have not seen this issue again in sanity; I will reopen the bug if it is reproduced again.
