Fuel for OpenStack

ntpdate run before p_ntp pacemaker resource is up

Series 7.0.x
Bug #1572521

Bug #1572521 reported by Alexey Stepanov on 2016-04-20

This bug affects 4 people

	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Invalid	High	Fuel QA Team	Fuel for OpenStack 10.0
7.0.x	Fix Committed	High	Alexander Kurenyshev	Fuel for OpenStack 7.0-updates
Mitaka	Invalid	High	Fuel QA Team	Fuel for OpenStack 9.1

Bug Description

Steps to reproduce:
1. Create cluster
2. Add 3 nodes with controller and ceph OSD roles
3. Add 1 node with ceph OSD roles
4. Add 2 nodes with compute and ceph OSD roles
5. Deploy the cluster
6. Cold reboot all nodes in the cluster

Expected result:
The cluster is working, all pcs resources are up after nodes startup

Actual result:
The luster is broken. All resources are down.
Presumably the main problem is outage of resources on VM nodes. When we have add memory to the controller nodes mysql enabled performance mode and got all free memory again.
Such behavior doesn't appear on the baremetal lab with 64GB memory on controller nodes.

This leads to the test [1] error with ntpdate, because all resources are down including ntp one:

2016-04-20 03:50:42,371 - ERROR decorators.py:126 -- Traceback (most recent call last):
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/tests/tests_strength/test_restart.py", line 165, in ceph_ha_restart
    'slave-04']))
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/models/fuel_web_client.py", line 1860, in cold_restart_nodes
    self.environment.sync_time()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/models/environment.py", line 137, in sync_time
    new_time = sync_time(self.d_env, nodes_names, skip_sync)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/retry.py", line 27, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 45, in sync_time
    g_ntp.do_sync_time(g_ntp.other_ntps)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 134, in do_sync_time
    .format(self.report_not_synchronized(ntps)))
TimeoutError: Time on nodes was not set with 'ntpdate':
[(u'slave-04', ['Wed Apr 20 03:46:22 UTC 2016\n'])]

This error couldn't be fixed in the test with smart waiting for ntp resource. When tried to reproduce the cluster was broken even after one day waiting.

[1] https://github.com/openstack/fuel-qa/blob/391b1219ebfa5f3d136054931076cbd03644c6eb/fuelweb_test/tests/tests_strength/test_restart.py#L90

See original description

Tags:

Revision history for this message

Dmitry Klenov (dklenov) wrote on 2016-04-20:

Alexey, please attach snapshot if possible.

tags:

added: area-library

Revision history for this message

Alexey Stepanov (astepanov-m) wrote on 2016-04-20:

Snapshot Edit (73.4 MiB, application/octet-stream)

Revision history for this message

Valeriy Saharov (vsakharov) wrote on 2016-05-04:

Bug cannot be reproduced on the latest iso. Also, this error don't appear on the next and other builds (from 84 to current) of the 9.0.system_test.ubuntu.thread_3 job (https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.thread_3/).
This error appeared, if ceph-osd node don't reply more than 10 min.
In the ceph_health on the node-4 was warning about it:
"[10.109.20.4] out: HEALTH_WARN clock skew detected on mon.node-5; too many PGs per OSD (499 > max 300); Monitor clock skew detected".

tags:

added: tricky

Revision history for this message

Dmitry Pyzhov (dpyzhov) wrote on 2016-05-10:

Please reopen the bug if you reproduce it

no longer affects:	fuel/newton
Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → Valeriy Saharov (vsakharov)

Revision history for this message

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-05-16:

Bug reproduced again, compute node (slave-04 wasn't synchronized)

Revision history for this message

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-05-16:

fail_error_ceph_ha_restart-fuel-snapshot-2016-05-15_02-16-50.tar.xz Edit (65.1 MiB, application/octet-stream)

Changed in fuel:
status:	Invalid → New

Matthew Mosesohn (raytrac3r) on 2016-05-17

Changed in fuel:
status:	New → Confirmed

Sergey Shevorakov (sshevorakov) on 2016-05-18

tags:

added: swarm-blocker

Dmitry Klenov (dklenov) on 2016-05-19

Changed in fuel:
assignee:	Valeriy Saharov (vsakharov) → Michael Polenchuk (mpolenchuk)

Revision history for this message

Michael Polenchuk (mpolenchuk) wrote on 2016-05-20:

@sshevorakov, I wonder why this bug is swarm blocker?
I believe this ain't common usecase to shutdown all the controllers and then boot them back.
ETA: unknown

Revision history for this message

Michael Polenchuk (mpolenchuk) wrote on 2016-05-20:

Almost all of services in down state after "cold" shutdown -> boot.

Revision history for this message

Michael Polenchuk (mpolenchuk) wrote on 2016-05-20:

Please make sure the tests are properly waiting for services start.
Not reproduced by manual actions therefore this might be considered as "invalid" or re-assigned to qa-team in order to fix the tests.

Michael Polenchuk (mpolenchuk) on 2016-05-20

Changed in fuel:
assignee:	Michael Polenchuk (mpolenchuk) → Fuel QA Team (fuel-qa)

Dmitry Pyzhov (dpyzhov) on 2016-05-20

tags:

added: area-qa

Alexander Kurenyshev (akurenyshev) on 2016-05-23

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Alexander Kurenyshev (akurenyshev)

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-05-26:

#10

This bug is really hard to reproduce, will try to do that and wait for it on CI

Revision history for this message

Artem Hrechanychenko (agrechanichenko) wrote on 2016-05-27:

#11

reproduced:
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.thread_3/122/testReport/%28root%29/ceph_ha_restart/ceph_ha_restart/

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-06-01:

#12

Download full text (5.9 KiB)

So, as we see at the log, fuel-devops couldn't sync time on some node, and this node is not controller.
The difference between syncing time on controllers and other nodes is an NTP host. For controller it is some remote host and for other nodes it is the NTP server which runs on virtual namespace managed by Pacemaker at vip__vrouter resource.

I tried to sync time on compute:

root@node-5:~# ntpdate -d 10.109.1.8
1 Jun 10:31:29 ntpdate[32556]: ntpdate 4.2.6p5@1.2349-o Thu Feb 11 18:30:41 UTC 2016 (1)
Looking for host 10.109.1.8 and service ntp
host found : 10.109.1.8
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
10.109.1.8: Server dropped: no data
server 10.109.1.8, port 123
stratum 0, precision 0, leap 00, trust 000
refid [10.109.1.8], delay 0.00000, dispersion 64.00000
transmitted 4, in filter 4
reference time: 00000000.00000000 Mon, Jan 1 1900 0:00:00.000
originate timestamp: 00000000.00000000 Mon, Jan 1 1900 0:00:00.000
transmit timestamp: daf93807.6a5ce4c3 Wed, Jun 1 2016 10:31:35.415
filter delay: 0.00000 0.00000 0.00000 0.00000
0.00000 0.00000 0.00000 0.00000
filter offset: 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000
delay 0.00000, dispersion 64.00000
offset 0.000000

1 Jun 10:31:37 ntpdate[32556]: no server suitable for synchronization found

Then I tried to get know what's wrong with NTP in namespace, on controller:
root@node-6:~# ip netns exec vrouter netstat -nap|grep :123

So, there is no alive NTP instance

Pacemaker status:
root@node-6:~# pcs status
Cluster name:
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Wed Jun 1 11:46:43 2016 Last change: Tue May 31 12:27:14 2016 by root via cibadmin on node-2.test.domain.local
Stack: corosync
Current DC: node-6.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 46 resources configured

Online: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]

Full list of resources:

Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-6.test.domain.local
vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-6.test.domain.local
vip__public (ocf::fuel:ns_IPaddr2): Stopped
Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
Clone Set: clone_p_mysqld [p_mysqld]
     Stopped: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Started node-2.test.domain.local
sysinfo_node-6.test.domain.local (ocf::pacemaker:SysInfo): Started node-6.test.domain.local
Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-6.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ ...

I tried to sync time on compute:

root@node-5:~# ntpdate -d 10.109.1.8
 1 Jun 10:31:29 ntpdate[32556]: ntpdate 4.2.6p5@1.2349-o Thu Feb 11 18:30:41 UTC 2016 (1)
Looking for host 10.109.1.8 and service ntp
host found : 10.109.1.8
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
10.109.1.8: Server dropped: no data
server 10.109.1.8, port 123
stratum 0, precision 0, leap 00, trust 000
refid [10.109.1.8], delay 0.00000, dispersion 64.00000
transmitted 4, in filter 4
reference time:    00000000.00000000  Mon, Jan  1 1900  0:00:00.000
originate timestamp: 00000000.00000000  Mon, Jan  1 1900  0:00:00.000
transmit timestamp:  daf93807.6a5ce4c3  Wed, Jun  1 2016 10:31:35.415
filter delay:  0.00000  0.00000  0.00000  0.00000 
         0.00000  0.00000  0.00000  0.00000 
filter offset: 0.000000 0.000000 0.000000 0.000000
         0.000000 0.000000 0.000000 0.000000
delay 0.00000, dispersion 64.00000
offset 0.000000

1 Jun 10:31:37 ntpdate[32556]: no server suitable for synchronization found

Then I tried to get know what's wrong with NTP in namespace, on controller:
root@node-6:~# ip netns exec vrouter netstat -nap|grep :123

So, there is no alive NTP instance

Online: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]

Full list of resources:

Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
 vip__management	(ocf::fuel:ns_IPaddr2):	Started node-1.test.domain.local
 vip__vrouter_pub	(ocf::fuel:ns_IPaddr2):	Started node-6.test.domain.local
 vip__vrouter	(ocf::fuel:ns_IPaddr2):	Started node-6.test.domain.local
 vip__public	(ocf::fuel:ns_IPaddr2):	Stopped
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
 Clone Set: clone_p_mysqld [p_mysqld]
     Stopped: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
 sysinfo_node-2.test.domain.local	(ocf::pacemaker:SysInfo):	Started node-2.test.domain.local
 sysinfo_node-6.test.domain.local	(ocf::pacemaker:SysInfo):	Started node-6.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-6.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-2.test.domain.local node-6.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 sysinfo_node-1.test.domain.local	(ocf::pacemaker:SysInfo):	Stopped
 Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
     Started: [ node-2.test.domain.local node-6.test.domain.local ]
 Clone Set: clone_neutron-l3-agent [neutron-l3-agent]
     Started: [ node-6.test.domain.local ]
     Stopped: [ node-2.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Stopped: [ node-6.test.domain.local ]
 Clone Set: clone_neutron-metadata-agent [neutron-metadata-agent]
 Clone Set: clone_neutron-dhcp-agent [neutron-dhcp-agent]
 Clone Set: clone_p_ntp [p_ntp]
 Clone Set: clone_ping_vip__public [ping_vip__public]

Failed Actions:
* p_mysqld_start_0 on node-1.test.domain.local 'unknown error' (1): call=168, status=Timed Out, exitreason='none',
    last-rc-change='Tue May 31 13:02:47 2016', queued=0ms, exec=300002ms
* neutron-openvswitch-agent_monitor_20000 on node-6.test.domain.local 'unknown error' (1): call=93, status=Timed Out, exitreason='none',
    last-rc-change='Tue May 31 12:53:36 2016', queued=0ms, exec=0ms
* p_mysqld_monitor_60000 on node-6.test.domain.local 'unknown error' (1): call=41, status=complete, exitreason='none',
    last-rc-change='Tue May 31 12:54:24 2016', queued=0ms, exec=0ms
* neutron-l3-agent_monitor_20000 on node-6.test.domain.local 'unknown error' (1): call=100, status=Timed Out, exitreason='none',
    last-rc-change='Tue May 31 12:53:30 2016', queued=0ms, exec=0ms
* sysinfo_node-6.test.domain.local_monitor_15000 on node-6.test.domain.local 'unknown error' (1): call=47, status=Timed Out, exitreason='none',
    last-rc-change='Tue May 31 12:53:28 2016', queued=0ms, exec=0ms
* neutron-openvswitch-agent_monitor_20000 on node-2.test.domain.local 'unknown error' (1): call=96, status=Timed Out, exitreason='none',
    last-rc-change='Tue May 31 12:54:35 2016', queued=0ms, exec=0ms
* p_mysqld_start_0 on node-2.test.domain.local 'unknown error' (1): call=326, status=Timed Out, exitreason='none',
    last-rc-change='Wed Jun  1 11:37:52 2016', queued=0ms, exec=300004ms
* sysinfo_node-2.test.domain.local_monitor_15000 on node-2.test.domain.local 'unknown error' (1): call=47, status=Timed Out, exitreason='none',
    last-rc-change='Tue May 31 12:54:18 2016', queued=0ms, exec=0ms

PCSD Status:
  node-1.test.domain.local member (10.109.1.4): Offline
  node-2.test.domain.local member (10.109.1.3): Offline
  node-6.test.domain.local member (10.109.1.2): Offline

There are many failed actions.
We need to investigate what exactly prevent us to normally run NTP demon.
This is not QA bug, I've analyze all our code related to the problem and we couldn't fix it on our side.

Changed in fuel:
assignee:	Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)

Oleksiy Molchanov (omolchanov) on 2016-06-02

Changed in fuel:
milestone:	10.0 → 9.0
milestone:	9.0 → 10.0
assignee:	Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)

Sergey Shevorakov (sshevorakov) on 2016-06-02

tags:

added: swarm-fail
removed: swarm-blocker

Revision history for this message

Dmitry Pyzhov (dpyzhov) wrote on 2016-06-02:

#13

The bug appears only on several automated test cases with reverts of environment snapshots. We need live environment in order to investigate it deeper. We cannot commit to fix the bug by HCF and I propose to move it out from the release. Marking as incomplete until we have an environment.

Changed in fuel:
status:	Confirmed → Incomplete

Revision history for this message

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-06-06:

#14

This problem reproduces not only after snapshot revert but also after cluster restart.
Env is available during next 3-4 hours

Changed in fuel:
status:	Incomplete → Confirmed

Revision history for this message

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-06-06:

#15

fail_error_ceph_partitions_repetitive_cold_restart-fuel-snapshot-2016-06-06_09-20-51.tar.gz Edit (118.4 MiB, application/x-tar)

Dmitry Pyzhov (dpyzhov) on 2016-06-06

tags:

added: move-to-mu

Revision history for this message

Oleksiy Molchanov (omolchanov) wrote on 2016-06-07:

#16

The problem is that you are doing time_sync after small amount of time node is online. It works for controller nodes, but non-controller nodes are trying to reach ntp server that is managed by pacemaker. It fails because of ntp pacemaker resource is not started yet, it need more time.

I think you should sync time with master node for all nodes. Or wait for ntp resource is up, but I prefere first approach.

Changed in fuel:
assignee:	Oleksiy Molchanov (omolchanov) → Fuel QA Team (fuel-qa)

Revision history for this message

Sergey Shevorakov (sshevorakov) wrote on 2016-06-07:

#17

Moving back to 9.0 release since it is a swarm-blocker.

Alexander Kurenyshev (akurenyshev) on 2016-06-08

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Alexander Kurenyshev (akurenyshev)

Revision history for this message

Dmitry (dtsapikov) wrote on 2016-06-09:

#18

The problem was reproduced on 7.0
fail_error_ha_one_controller_backup_restore-fuel-snapshot-2016-06-08_17-34-06.tar.xz

Revision history for this message

Dmitry (dtsapikov) wrote on 2016-06-09:

#19

fail_error_ha_one_controller_backup_restore-fuel-snapshot-2016-06-08_17-34-06.tar.xz Edit (37.1 MiB, application/octet-stream)

Revision history for this message

Vadim Rovachev (vrovachev) wrote on 2016-06-09:

#20

Addition to comment #18, 7.0 and 8.0 swarm tests uses devops version 2.9.19, so this bug was reproduced on this version also

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2016-06-09: Related fix proposed to fuel-infra/jenkins-jobs (master)

#21

Related fix proposed to branch: master
Change author: Alexander Kurenyshev <email address hidden>
Review: https://review.fuel-infra.org/21894

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-06-09: Re: Time on nodes was not set with 'ntpdate'

#22

Guys, the problem is not in Devops and its revert mechanism.
The main problem is insufficient resources on nodes, cpu LA is 6, processes start to use swap.
As we decided with V. Kuklin, we have to increase memory for slave nodes. And we'll see if this be helpful.
Patch above do that staff.

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2016-06-09: Related fix merged to fuel-infra/jenkins-jobs (master)

#23

Reviewed: https://review.fuel-infra.org/21894
Submitter: Ivan Remizov <email address hidden>
Branch: master

Commit: 0492e2c99c879b3e1c76b587e794801980ccd668
Author: Alexander Kurenyshev <email address hidden>
Date: Thu Jun 9 12:41:28 2016

Increase slave node memory for destructive ceph

This patch increases memory for slave nodes for
all related jobes from 7, 8, 9 and 10 SWARM

Change-Id: Ibed77dd9e20264c24907e9e699dfa7a24a5e642c
Related-Bug: #1572521

Sergey Shevorakov (sshevorakov) on 2016-06-10

tags:

added: swarm-blocker
removed: swarm-fail

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-06-10: Re: Time on nodes was not set with 'ntpdate'

#24

Fix above has solved the problem. For now test is passed.

Changed in fuel:
status:	Confirmed → Fix Committed

Revision history for this message

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-06-13:

#25

Issue still reproduced, logs are attached

Changed in fuel:
status:	Fix Committed → Confirmed

Revision history for this message

Andrey Sledzinskiy (asledzinskiy) wrote on 2016-06-13:

#26

fail_error_ceph_ha_restart-fuel-snapshot-2016-06-13_05-40-09.tar.gz Edit (82.2 MiB, application/x-tar)

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-06-14:

#27

According to my comment #22 we've increased memory on all slaves. Test was passed on custom CI, but this wasn't helpful at swarm.
I think we should continue an investigating on developers side. And fix performance issues with slow pcs resources start after nodes cold restart.

Changed in fuel:
assignee:	Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)

Revision history for this message

Alexandr Kostrikov (akostrikov-mirantis) wrote on 2016-06-14:

#28

Seems that real problem which we should fix is:
```The problem is that you are doing time_sync after small amount of time node is online. It works for controller nodes, but non-controller nodes are trying to reach ntp server that is managed by pacemaker. It fails because of ntp pacemaker resource is not started yet, it need more time.

I think you should sync time with master node for all nodes. Or wait for ntp resource is up, but I prefere first approach.```

Revision history for this message

Maksim Malchuk (mmalchuk) wrote on 2016-06-14:

#29

Alexander Kurenyshev, please address the recommendations, this bug should be fixed in the tests.

Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → Alexander Kurenyshev (akurenyshev)
tags:	added: area-python removed: area-library

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-06-14:

#30

Maksim, please, say to me what exactly we should fix here?
1. We have a function which is used everywhere to sync time and it works perfect
2. I tried to do a smart wait for ntp resources are up and it took 60 minutes without success - resources didn't start, cluster was dead. Also I leave my env for a day and after that resources still were down.

This is a performance issue, and we couldn't avoid it in our tests.
V. Kuklin suggested to add another 1.5 GB of memory. Okay, let's see what will happen.

Revision history for this message

Maksim Malchuk (mmalchuk) wrote on 2016-06-14:

#31

@Alexander Kurenyshev:
1. it didn't work perfect because it broke the swarm, and also, I sometimes see the same issue on debug-jobs when trying to revert snapshots.
2. maybe smart wait will help, but as a failover, we need to sync the cluster with the master node.

I agree that this is the performance issue, but it reproduced only on tests with fuel-devops, so should be fixed in the tests.

Revision history for this message

Nastya Urlapova (aurlapova) wrote on 2016-06-15:

#32

SWARM on ISO #485 gave only one reproduce of this failure, tag swarm-blocker was removed.

tags:

removed: area-qa swarm-blocker

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2016-06-15: Related fix proposed to fuel-infra/jenkins-jobs (master)

#33

Related fix proposed to branch: master
Change author: Alexander Kurenyshev <email address hidden>
Review: https://review.fuel-infra.org/22098

Revision history for this message

Oleksiy Molchanov (omolchanov) wrote on 2016-06-16: Re: Time on nodes was not set with 'ntpdate'

#34

I had a talk with Dmitry Ilyin and made testings. So we have 2 approaches for fixing this bug:

1) Non-controller nodes should sync time with the same ntp-server as controllers
2) You should wait before pacemaker resource p_ntp is up and after that sync time

Personally I prefere 1-st.

Why can't it be fixed in library? In your case pacemaker needs a lot of time to bring up all resources in cluster and ntp resource is the last one in this list. We cannot move it to the top of the list because it has lots of dependencies, like p_vrouter, vip_mgmt and deployment by itself (this resource is created in the post-deployment, that is why it is in the bottom). Even if we could make it start first, you must have some kind of test to check whether it is up, synced and can provide time.

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2016-06-16: Related fix merged to fuel-infra/jenkins-jobs (master)

#35

Reviewed: https://review.fuel-infra.org/22098
Submitter: Lesya Novaselskaya <email address hidden>
Branch: master

Commit: 00e76af6307135d7b507c80a94264c036c717687
Author: Alexander Kurenyshev <email address hidden>
Date: Wed Jun 15 17:06:31 2016

Increase memory for tests with cold restart

Fix adds 1.5Gb memory to the tests which use
cold restart function

Change-Id: Ic05004c3d2a96d4e4f4178d3530112350c981430
Related-Bug:#1572521

Revision history for this message

Ksenia Svechnikova (kdemina) wrote on 2016-06-27: Re: Time on nodes was not set with 'ntpdate'

#36

2 tests failed with this issue on 10.0:

fuel-10.0-mitaka-19-2016-06-26
https://product-ci.infra.mirantis.net/view/10.0_swarm/job/10.0.system_test.ubuntu.services_reconfiguration_thread_1/74/testReport/junit/(root)/preservation_config_after_reset_and_preconfigured_deploy/preservation_config_after_reset_and_preconfigured_deploy/

https://product-ci.infra.mirantis.net/view/10.0_swarm/job/10.0.system_test.ubuntu.services_reconfiguration_thread_1/74/testReport/junit/(root)/reconfigure_nova_ephemeral_disk/reconfigure_nova_ephemeral_disk/

Dmitry Pyzhov (dpyzhov) on 2016-06-30

tags:

removed: area-python

Sergey Shevorakov (sshevorakov) on 2016-07-19

tags:

added: swarm-fail

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-07-20:

#37

@Maksim Malchuk, @Oleksiy Molchanov
Guys are you joking? Or you really think that this product fail could be fixed in tests?

1) Smart fail couldn't help us, because overall cluster resources are broken, and there is no way to wait for ntp daemon becomes ready. I've implemented it in tests, I tried to wait for that and I was waiting for 2 hours without success.
2) `Non-controller nodes should sync time with the same ntp-server as controllers` approach couldn't working by the same reason: the cluster IS BROKEN. There is no difference between servers for sync, because ntp is not running.

The main problem is mysql performance scheme. We add some amount of memory and it is quickly consumed by mysql. Other resources are in SWAP.
We have never faced this error on the baremetal lab with 64 GB of memory per controller.

@Ksenia Svechnikova
I couldn't get know what happened with that job [1] I need a live environment. Other, the latest fails were not related to this bug. There are problems with setup master node.

[1] https://product-ci.infra.mirantis.net/view/10.0_swarm/job/10.0.system_test.ubuntu.services_reconfiguration_thread_1/

Changed in fuel:
assignee:	Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)

Revision history for this message

Maksim Malchuk (mmalchuk) wrote on 2016-07-20:

#38

@Alexander Kurenyshev, you are kidding? Please reread all the comments, along with yours:
1. The bug appears only on several automated test cases with reverts of environment snapshots.
2. The bug never appears on standard deployment on any hardware/virtualization.
3. The bug never appears on the BVT or Smoke_Neutron tests.
According to the provided points, the issue should be fixed ONLY in the "several automated tests cases".
As for the place and method to fix, indeed, adding the timeouts in the tests might not help, so need to review the deployment options for each case, and "several automated tests cases" shouldn't share the same settings to deploy (for example comment #34 should be addressed).

Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → Alexander Kurenyshev (akurenyshev)

Alexander Kurenyshev (akurenyshev) on 2016-07-20

summary:

- Time on nodes was not set with 'ntpdate'
+ Pacemaker resources are failed to start after the nodes cold reboot

Alexander Kurenyshev (akurenyshev) on 2016-07-20

description:	updated
Changed in fuel:
assignee:	Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)

Revision history for this message

Oleksiy Molchanov (omolchanov) wrote on 2016-07-20: Re: Pacemaker resources are failed to start after the nodes cold reboot

#39

@Aleksandr, I want to comment under this:

---
2) `Non-controller nodes should sync time with the same ntp-server as controllers` approach couldn't working by the same reason: the cluster IS BROKEN. There is no difference between servers for sync, because ntp is not running.
---
For initial time sync we do not need any cluster to be working. In your test-case you are syncing controller nodes with master and non-controller nodes with ntp under pacemaker on controllers. Why don't you sync it with master too?

What I observed under my env where nodes have 3G and 1 CPU:

It took ~10 minutes to bring up ntp resource in pacemaker, after that I can easily run ntpdate pointing it to vrouter_vip.

http://paste.openstack.org/show/538844/

So, my suggestion is still the same:
>>> I think you should sync time with master node for all nodes. Or wait for ntp resource is up, but I prefere first approach.

summary:	- Pacemaker resources are failed to start after the nodes cold reboot + ntpdate run before p_ntp pacemaker resource is up
Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)

Revision history for this message

Dmitry Belyaninov (dbelyaninov) wrote on 2016-07-21:

#40

Similar error:

NTPD on nodes was not synchronized:
[(u'slave-01', 'Node: slave-01, ntpd peers: []', []), (u'slave-02', 'Node: slave-02, ntpd peers: []', []), (u'slave-03', 'Node: slave-03, ntpd peers: []', [])]

SWARMs:

https://mirantis.testrail.com/index.php?/runs/view/16451
https://mirantis.testrail.com/index.php?/tests/view/9752358

Revision history for this message

ElenaRossokhina (esolomina) wrote on 2016-07-22:

#41

I confirm the issue is reproduced again in swarm
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 134, in do_sync_time
.format(self.report_not_synchronized(ntps)))
TimeoutError: Time on nodes was not set with 'ntpdate':
[(u'slave-04', "Execution of the command 'ntpdate -p 4 -t 0.2 -bu 10.109.51.8\n' failed on the node slave-04 with exit_code = 1 : 22 Jul 04:22:18 ntpdate[8534]: no server suitable for synchronization found\n", ['Fri Jul 22 04:22:39 UTC 2016\n'])]
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_3/3/testReport/(root)/ceph_ha_restart/ceph_ha_restart/

Revision history for this message

Alex Schultz (alex-schultz) wrote on 2016-07-27:

#42

We might want to leverage ntp-wait as part of this process as it will wait for the ntpd to stabilze before continuing. The issue is that upon first startup, the ntp system has to stabilize before the servers become available. This is just how ntp works. We need to wait until the system has stabilized before we can expect the time to be set.

Revision history for this message

ElenaRossokhina (esolomina) wrote on 2016-08-01:

#43

Similar error (comment #40) on ci
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.services_ha.ceilometer/13/testReport/(root)/ceilometer_ha_multirole_add_mongo/
https://mirantis.testrail.com/index.php?/tests/view/10388427

Revision history for this message

Nastya Urlapova (aurlapova) wrote on 2016-08-04:

#44

Moved to Incomplete, because there was no this issue in a week.

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2016-08-24:

#45

Moved to invalid due to absent of reproduces.
Feel free to reopen with logs and snapshots if this bug appears.

Changed in fuel:
status:	Confirmed → Invalid

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.