[bvt] deployment fails because rabbitmq-server does not start after 1800 seconds

Bug #1589464 reported by Peter Razumovsky
Affects                      Status    Importance  Assigned to          Milestone
Fuel for OpenStack           Invalid   High        Dmitry Mescheryakov
Fuel for OpenStack (Mitaka)  Opinion   High        Dmitry Mescheryakov

Bug Description

Detailed bug description:
 https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.bvt_2/271/console failed with traceback:

======================================================================
FAIL: Deploy ceph HA with RadosGW for objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 121, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/tests/test_ceph.py", line 504, in ceph_rados_gw
    self.fuel_web.deploy_cluster_wait(cluster_id)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 458, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 443, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 494, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 501, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 385, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 906, in deploy_cluster_wait
    self.assert_task_success(task, interval=interval, timeout=timeout)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/__init__.py", line 59, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 330, in assert_task_success
    task["name"], task['status'], 'ready', _message(task)
AssertionError: Task 'deploy' has incorrect status. error != ready, 'Deployment has failed. All nodes are finished. Failed tasks: Task[primary-rabbitmq/1], Task[memcached/1], Task[openrc-delete/1], Task[apache/1], Task[keystone-db/1] Stopping the deployment process!'

---------------------------------------
The syslog file contains the following:
---------------------------------------
<43>Jun 5 19:03:58 node-1 rsyslogd: cannot create '/var/lib/haproxy/dev/log': No such file or directory

... after this, similar failures repeat on each node from node-1 to node-3 ...

<27>Jun 5 19:07:12 node-1 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Jun 5 19:07:12 node-1 ocf-ns_IPaddr2: ERROR: Error: an inet prefix is expected rather than "undef".

...

<129>Jun 5 19:07:54 node-1 haproxy[19410]: Backup Server mysqld/node-3 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jun 5 19:07:54 node-1 haproxy[19410]: proxy mysqld has no server available!

...

<129>Jun 5 19:08:25 node-1 haproxy[28595]: Server keystone-2/node-3 is DOWN, reason: Layer7 wrong status, code: 404, info: "Not Found", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jun 5 19:08:25 node-1 haproxy[28595]: proxy keystone-2 has no server available!

...

<129>Jun 5 19:08:56 node-1 haproxy[3867]: Server nova-novncproxy/node-1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jun 5 19:08:56 node-1 haproxy[3867]: proxy nova-novncproxy has no server available!

...

<27>Jun 5 19:09:10 node-1 ocf-mysql-wss: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
<27>Jun 5 19:09:15 node-1 ocf-ns_conntrackd: ERROR: Setup problem: couldn't find command: conntrackd
<27>Jun 5 19:09:25 node-1 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Jun 5 19:09:35 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Jun 5 19:09:37 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running
<27>Jun 5 19:10:12 node-1 multipathd: uevent trigger error  (repeated 9 times)
<27>Jun 5 19:12:28 node-1 ocf-ns_conntrackd: ERROR: Setup problem: couldn't find command: conntrackd
<27>Jun 5 19:12:50 node-1 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:16910, this is a split-brain!
<27>Jun 5 19:12:50 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: 10d04ca9-2b51-11e6-a2b6-b6e55b120563:13, which was not expected
<27>Jun 5 19:13:11 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<129>Jun 5 19:13:12 node-1 haproxy[3867]: Server mysqld/node-1 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 7ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<27>Jun 5 19:13:13 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running

description: updated
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Oleksiy Molchanov (omolchanov)
importance: Undecided → High
status: New → In Progress
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-library
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check is performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data for the bug you are describing:

actual result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Alex Schultz (alex-schultz) wrote : Re: Deployment has failed. All nodes are finished. Failed tasks: Task[primary-rabbitmq/1], Task[memcached/1], Task[openrc-delete/1], Task[apache/1], Task[keystone-db/1] Stopping the deployment process!

RabbitMQ failed to start; it timed out after 1800 seconds.

2016-06-05 19:44:04 +0000 /Stage[main]/Rabbitmq::Service/Service[rabbitmq-server]/ensure (err): change from stopped to running failed: Execution timeout after 1800 seconds!
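
For anyone reproducing this locally, the wait that times out here amounts to polling the service until it reports itself running. Below is a minimal sketch of such a check, not Puppet's actual implementation, assuming rabbitmqctl is on the PATH and using the same 1800-second budget as the log above:

import os
import subprocess
import time

def wait_for_rabbitmq(timeout=1800, interval=10):
    """Poll 'rabbitmqctl status' until it succeeds or the timeout expires."""
    deadline = time.time() + timeout
    with open(os.devnull, 'wb') as devnull:
        while time.time() < deadline:
            # rabbitmqctl exits non-zero while the node is down or still starting
            if subprocess.call(['rabbitmqctl', 'status'],
                               stdout=devnull, stderr=devnull) == 0:
                return True
            time.sleep(interval)
    return False

if __name__ == '__main__':
    if not wait_for_rabbitmq():
        raise SystemExit('rabbitmq-server did not start within the timeout')

On a healthy controller this returns within a few seconds; in the failed run above it would spin for the full 1800 seconds, matching the Puppet execution timeout.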

Revision history for this message
Alex Schultz (alex-schultz) wrote :
summary: - Deployment has failed. All nodes are finished. Failed tasks:
- Task[primary-rabbitmq/1], Task[memcached/1], Task[openrc-delete/1],
- Task[apache/1], Task[keystone-db/1] Stopping the deployment process!
+ [bvt] deployment fails because rabbitmq-server does not start after 1800
+ seconds
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/newton
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

The logs don't contain any useful information.

Changed in fuel:
status: Confirmed → Incomplete
Andrey Maximov (maximov)
tags: added: move-to-mu
Revision history for this message
Dmitry Klenov (dklenov) wrote :

This has been in the Incomplete state for too long. Closing. @Peter, feel free to reopen once the issue is reproduced again.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Paarhati Ozkasgarli (ozkasgarli) wrote :

This bug is still reproducible.

Unfortunately I did not keep the log file, but my deployment failed because rabbitmq-server did not start.

Then I SSHed to the node that produced the log and tried to start rabbitmq-server with the "service rabbitmq-server start" command.

/var/log/rabbitmq/startup_log contained the following:
Could not resolve hostname messaging-test-controller-01

Then I checked the /etc/hosts file:

# HEADER: This file was autogenerated at 2016-07-24 12:12:19 +0000
# HEADER: by puppet. While it can still be managed manually, it
# HEADER: is definitely not recommended.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.0.5 test-cinder-01.skyatlas.com node-15
192.168.0.3 messaging-test-compute-01.skyatlas.com messaging-node-14
192.168.0.6 messaging-test-compute-02.skyatlas.com messaging-node-18
192.168.0.4 messaging-test-compute-03.skyatlas.com messaging-node-19
192.168.0.4 test-compute-03.skyatlas.com node-19 test-compute-03
192.168.0.8 messaging-test-controller-03.skyatlas.com
192.168.0.7 test-controller-01.skyatlas.com node-16 test-controller-01
192.168.0.9 test-controller-02.skyatlas.com node-17 test-controller-02
192.168.0.9 messaging-test-controller-02.skyatlas.com
192.168.0.3 test-compute-01.skyatlas.com node-14 test-compute-01
192.168.0.8 test-controller-03.skyatlas.com node-20 test-controller-03
192.168.0.7 messaging-test-controller-01.skyatlas.com
192.168.0.5 messaging-test-cinder-01.skyatlas.com
192.168.0.6 test-compute-02.skyatlas.com node-18 test-compute-02.skyatlas

Revision history for this message
Paarhati Ozkasgarli (ozkasgarli) wrote :

--This is the rest of comment #6--

As you can see, I changed all the node names and hostnames, but there is no entry in /etc/hosts for messaging-test-controller-01.

I think there is a simple error in the Puppet script responsible for generating the /etc/hosts files.

After resetting the deployment environment, I changed the hostnames of all nodes back to their defaults (node-1, node-2, etc.).

Changing the hostnames of the nodes fixed it for me, and the installation finished without any errors.
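
A quick way to confirm this failure mode before redeploying is to check that every node hostname RabbitMQ will use actually resolves. Here is a minimal diagnostic sketch, not part of Fuel itself; the hostnames listed are hypothetical examples patterned on the /etc/hosts dump above:

import socket

def find_unresolvable(hostnames):
    """Return the hostnames that do not resolve (e.g. missing from /etc/hosts)."""
    missing = []
    for name in hostnames:
        try:
            socket.gethostbyname(name)
        except socket.gaierror:
            missing.append(name)
    return missing

if __name__ == '__main__':
    # Hypothetical node names, modeled on the comment above
    nodes = ['messaging-test-controller-01',
             'messaging-test-controller-02',
             'messaging-test-controller-03']
    for name in find_unresolvable(nodes):
        print('unresolvable: %s' % name)

Running this on the affected controller would have flagged messaging-test-controller-01 immediately, since the generated /etc/hosts appears to contain only the FQDN variant of that name.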

Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Dmitry Mescheryakov (dmitrymex)