[bvt] deployment fails because rabbitmq-server does not start after 1800 seconds

Bug #1589464 reported by Peter Razumovsky
Affects                      Status    Importance  Assigned to          Milestone
Fuel for OpenStack           Invalid   High        Dmitry Mescheryakov
Fuel for OpenStack (Mitaka)  Opinion   High        Dmitry Mescheryakov

Bug Description

Detailed bug description:
 https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.bvt_2/271/console failed with traceback:

======================================================================
FAIL: Deploy ceph HA with RadosGW for objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 121, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/tests/test_ceph.py", line 504, in ceph_rados_gw
    self.fuel_web.deploy_cluster_wait(cluster_id)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 458, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 443, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 494, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 501, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 385, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 906, in deploy_cluster_wait
    self.assert_task_success(task, interval=interval, timeout=timeout)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/__init__.py", line 59, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 330, in assert_task_success
    task["name"], task['status'], 'ready', _message(task)
AssertionError: Task 'deploy' has incorrect status. error != ready, 'Deployment has failed. All nodes are finished. Failed tasks: Task[primary-rabbitmq/1], Task[memcached/1], Task[openrc-delete/1], Task[apache/1], Task[keystone-db/1] Stopping the deployment process!'

---------------------------------------
The syslog file contains the following:
---------------------------------------
<43>Jun 5 19:03:58 node-1 rsyslogd: cannot create '/var/lib/haproxy/dev/log': No such file or directory

... after this, similar failures repeat on each node from node-1 to node-3 ...

<27>Jun 5 19:07:12 node-1 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Jun 5 19:07:12 node-1 ocf-ns_IPaddr2: ERROR: Error: an inet prefix is expected rather than "undef".

...

<129>Jun 5 19:07:54 node-1 haproxy[19410]: Backup Server mysqld/node-3 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jun 5 19:07:54 node-1 haproxy[19410]: proxy mysqld has no server available!

...

<129>Jun 5 19:08:25 node-1 haproxy[28595]: Server keystone-2/node-3 is DOWN, reason: Layer7 wrong status, code: 404, info: "Not Found", check duration: 1ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jun 5 19:08:25 node-1 haproxy[28595]: proxy keystone-2 has no server available!

...

<129>Jun 5 19:08:56 node-1 haproxy[3867]: Server nova-novncproxy/node-1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Jun 5 19:08:56 node-1 haproxy[3867]: proxy nova-novncproxy has no server available!

...

<27>Jun 5 19:09:10 node-1 ocf-mysql-wss: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
<27>Jun 5 19:09:15 node-1 ocf-ns_conntrackd: ERROR: Setup problem: couldn't find command: conntrackd
<27>Jun 5 19:09:25 node-1 ocf-ns_IPaddr2: ERROR: exec of "undef" failed: No such file or directory
<27>Jun 5 19:09:35 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Jun 5 19:09:37 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running
<27>Jun 5 19:10:12 node-1 multipathd: uevent trigger error  (repeated 9 times)
<27>Jun 5 19:12:28 node-1 ocf-ns_conntrackd: ERROR: Setup problem: couldn't find command: conntrackd
<27>Jun 5 19:12:50 node-1 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:16910, this is a split-brain!
<27>Jun 5 19:12:50 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: 10d04ca9-2b51-11e6-a2b6-b6e55b120563:13, which was not expected
<27>Jun 5 19:13:11 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<129>Jun 5 19:13:12 node-1 haproxy[3867]: Server mysqld/node-1 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 7ms. 0 active and 2 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
<27>Jun 5 19:13:13 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running

description: updated
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Oleksiy Molchanov (omolchanov)
importance: Undecided → High
status: New → In Progress
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-library
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check is performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data for the bug you are describing:

actual result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Alex Schultz (alex-schultz) wrote : Re: Deployment has failed. All nodes are finished. Failed tasks: Task[primary-rabbitmq/1], Task[memcached/1], Task[openrc-delete/1], Task[apache/1], Task[keystone-db/1] Stopping the deployment process!

RabbitMQ failed to start; it timed out after 1800 seconds.

2016-06-05 19:44:04 +0000 /Stage[main]/Rabbitmq::Service/Service[rabbitmq-server]/ensure (err): change from stopped to running failed: Execution timeout after 1800 seconds!
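
For anyone reproducing this locally, the wait that times out here amounts to polling the service until it reports itself running. Below is a minimal sketch of such a check, not Puppet's actual implementation, assuming rabbitmqctl is on the PATH and using the same 1800-second budget as the log above:

import os
import subprocess
import time

def wait_for_rabbitmq(timeout=1800, interval=10):
    """Poll 'rabbitmqctl status' until it succeeds or the timeout expires."""
    deadline = time.time() + timeout
    with open(os.devnull, 'wb') as devnull:
        while time.time() < deadline:
            # rabbitmqctl exits non-zero while the node is down or still starting
            if subprocess.call(['rabbitmqctl', 'status'],
                               stdout=devnull, stderr=devnull) == 0:
                return True
            time.sleep(interval)
    return False

if __name__ == '__main__':
    if not wait_for_rabbitmq():
        raise SystemExit('rabbitmq-server did not start within the timeout')

On a healthy controller this returns within a few seconds; in the failed run above it would spin for the full 1800 seconds, matching the Puppet execution timeout.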

Revision history for this message
Alex Schultz (alex-schultz) wrote :
summary: - Deployment has failed. All nodes are finished. Failed tasks:
- Task[primary-rabbitmq/1], Task[memcached/1], Task[openrc-delete/1],
- Task[apache/1], Task[keystone-db/1] Stopping the deployment process!
+ [bvt] deployment fails because rabbitmq-server does not start after 1800
+ seconds
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/newton
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Sergii Golovatiuk (sgolovatiuk)
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

The logs don't contain any useful information.

Changed in fuel:
status: Confirmed → Incomplete
Andrey Maximov (maximov)
tags: added: move-to-mu
Revision history for this message
Dmitry Klenov (dklenov) wrote :

This has been in the Incomplete state for too long. Closing. @Peter, feel free to reopen once the issue is reproduced again.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Paarhati Ozkasgarli (ozkasgarli) wrote :

This bug is still reproducible.

Unfortunately I did not keep the log file, but my deployment failed because rabbitmq-server did not start.

Then I SSHed to the node that produced the log and tried to start rabbitmq-server with the "service rabbitmq-server start" command.

/var/log/rabbitmq/startup_log contained the following:
Could not resolve hostname messaging-test-controller-01

Then I checked the /etc/hosts file:

# HEADER: This file was autogenerated at 2016-07-24 12:12:19 +0000
# HEADER: by puppet. While it can still be managed manually, it
# HEADER: is definitely not recommended.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.0.5 test-cinder-01.skyatlas.com node-15
192.168.0.3 messaging-test-compute-01.skyatlas.com messaging-node-14
192.168.0.6 messaging-test-compute-02.skyatlas.com messaging-node-18
192.168.0.4 messaging-test-compute-03.skyatlas.com messaging-node-19
192.168.0.4 test-compute-03.skyatlas.com node-19 test-compute-03
192.168.0.8 messaging-test-controller-03.skyatlas.com
192.168.0.7 test-controller-01.skyatlas.com node-16 test-controller-01
192.168.0.9 test-controller-02.skyatlas.com node-17 test-controller-02
192.168.0.9 messaging-test-controller-02.skyatlas.com
192.168.0.3 test-compute-01.skyatlas.com node-14 test-compute-01
192.168.0.8 test-controller-03.skyatlas.com node-20 test-controller-03
192.168.0.7 messaging-test-controller-01.skyatlas.com
192.168.0.5 messaging-test-cinder-01.skyatlas.com
192.168.0.6 test-compute-02.skyatlas.com node-18 test-compute-02.skyatlas

Revision history for this message
Paarhati Ozkasgarli (ozkasgarli) wrote :

--This is the rest of comment #6--

As you can see, I changed all the node names and hostnames, but there is no entry in /etc/hosts for messaging-test-controller-01.

I think there is a simple error in the Puppet script responsible for generating the /etc/hosts files.

After resetting the deployment environment, I changed the hostnames of all nodes back to their defaults (node-1, node-2, etc.).

Changing the hostnames of the nodes fixed it for me, and the installation finished without any errors.
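
A quick way to confirm this failure mode before redeploying is to check that every node hostname RabbitMQ will use actually resolves. Here is a minimal diagnostic sketch, not part of Fuel itself; the hostnames listed are hypothetical examples patterned on the /etc/hosts dump above:

import socket

def find_unresolvable(hostnames):
    """Return the hostnames that do not resolve (e.g. missing from /etc/hosts)."""
    missing = []
    for name in hostnames:
        try:
            socket.gethostbyname(name)
        except socket.gaierror:
            missing.append(name)
    return missing

if __name__ == '__main__':
    # Hypothetical node names, modeled on the comment above
    nodes = ['messaging-test-controller-01',
             'messaging-test-controller-02',
             'messaging-test-controller-03']
    for name in find_unresolvable(nodes):
        print('unresolvable: %s' % name)

Running this on the affected controller would have flagged messaging-test-controller-01 immediately, since the generated /etc/hosts appears to contain only the FQDN variant of that name.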

Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Dmitry Mescheryakov (dmitrymex)