[gates] OOM failures on CI

Bug #1623394 reported by Timur Nurlygayanov
This bug affects 7 people
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Fuel Sustaining
Mitaka              Fix Committed  High        Fuel Sustaining

Bug Description

The gates are broken. Example:

https://review.openstack.org/#/c/369427/3/

In the logs we can see the following error:

2016-09-14T06:26:54.030000+00:00 warning: [ 2687.871970] nova-api invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
2016-09-14T06:26:54.030367+00:00 err: [ 2687.872346] Out of memory: Kill process 411 (mysqld) score 45 or sacrifice child
2016-09-14T06:26:54.030367+00:00 err: [ 2687.872419] Killed process 411 (mysqld) total-vm:2547180kB, anon-rss:115368kB, file-rss:0kB
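
To check other job logs for the same pattern, the kernel messages above can be parsed mechanically. The following is only a minimal sketch, not part of the CI tooling: the helper name `find_oom_kills` and both regular expressions are assumptions written against the three log lines quoted above, and other kernel versions may format these messages differently.

```python
import re

# Patterns written against the log lines quoted in this bug (assumption:
# other kernels may word these messages differently).
INVOKED_RE = re.compile(r"(?P<trigger>\S+) invoked oom-killer")
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return one dict per 'Killed process' line.

    Each dict also records the most recent process that invoked the
    oom-killer before the kill (hypothetical helper, for illustration).
    """
    kills = []
    trigger = None
    for line in lines:
        invoked = INVOKED_RE.search(line)
        if invoked:
            trigger = invoked.group("trigger")
            continue
        killed = KILLED_RE.search(line)
        if killed:
            kills.append({
                "trigger": trigger,
                "pid": int(killed.group("pid")),
                "name": killed.group("name"),
                "total_vm_kb": int(killed.group("total_vm_kb")),
                "anon_rss_kb": int(killed.group("anon_rss_kb")),
            })
    return kills


# The three lines from the bug description, used as sample input.
SAMPLE = [
    "2016-09-14T06:26:54.030000+00:00 warning: [ 2687.871970] nova-api invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0",
    "2016-09-14T06:26:54.030367+00:00 err: [ 2687.872346] Out of memory: Kill process 411 (mysqld) score 45 or sacrifice child",
    "2016-09-14T06:26:54.030367+00:00 err: [ 2687.872419] Killed process 411 (mysqld) total-vm:2547180kB, anon-rss:115368kB, file-rss:0kB",
]

if __name__ == "__main__":
    for kill in find_oom_kills(SAMPLE):
        print(kill)
```

Run against the sample, this reports that nova-api triggered the oom-killer and that mysqld (PID 411) was the victim, which matches the reading of the log given here.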

The root of the issue:
We discussed the issue with Maksim Malchuk, and he suggested increasing the RAM on controller nodes in this job to make the gate more stable.

Note:
We need to review all gates where we run BVT/SWARM tests and make sure we increase the RAM size for all of them to avoid random false-negative failures.

Other tests that also failed due to the out-of-memory issue:

https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4628

https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/656/
https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.smoke_neutron/652/

https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4678/
https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4683/
https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4681/
https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4679/

https://ci.fuel-infra.org/job/master.puppet-openstack.fuel-library.pkgs.ubuntu.smoke_neutron/4695/

Tags: area-ci
description: updated
tags: added: area-ci
Changed in fuel:
assignee: nobody → Fuel CI (fuel-ci)
importance: Undecided → High
Nastya Urlapova (aurlapova) wrote :

We mustn't increase RAM for tests! The first step is to check that the gate uses the *23 version of MySQL; the second is to investigate the real cause of the failure.

Roman Vyalov (r0mikiam) wrote :

Please update the ISO on fuel-ci.

summary: - The gates are broken: need to increase the size of RAM, MySQL crashed
+ [gates] Cluster is not deployed: some nodes are in the Error state
Roman Vyalov (r0mikiam) wrote : Re: [gates] Cluster is not deployed: some nodes are in the Error state

For stable/mitaka, we downgraded the MySQL version yesterday, and now we are updating the ISO on fuel-ci.

But for master, you can discuss the memory on the hardware, etc., with the QA team.

Nastya Urlapova (aurlapova) wrote :

Roman, we should downgrade the package version in both master and stable.

Maksim Malchuk (mmalchuk) wrote :

Nastya, the *23 version is needed only for Mitaka, to solve split-brain issues. Master is affected by another problem: the oom-killer kills not only mysqld.

no longer affects: fuel/newton
Roman Vyalov (r0mikiam) wrote :

We cannot increase the RAM for tests! Please solve the problem with MySQL.
Also, FYI, the configuration for the env (RAM, etc.) is managed in the fuel-devops/qa code.

Changed in fuel:
assignee: Dmitry Kaigarodеsev (dkaiharodsev) → nobody
status: Confirmed → New
Changed in fuel:
assignee: nobody → Fuel QA Team (fuel-qa)
status: New → Confirmed
Nastya Urlapova (aurlapova) wrote :

What is the real reason for this behavior? I am really sorry, but from the description it is absolutely unclear.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Maksim Malchuk (mmalchuk)
status: Confirmed → Won't Fix
status: Won't Fix → Incomplete
Maksim Malchuk (mmalchuk) wrote :

OK, I will update the description and add a bunch of tests that failed because of out-of-memory kills.

description: updated
Changed in fuel:
status: Incomplete → Confirmed
assignee: Maksim Malchuk (mmalchuk) → Dmitry Kaigarodеsev (dkaiharodsev)
Georgy Kibardin (gkibardin) wrote :
Roman Vyalov (r0mikiam) wrote :

Please discuss increasing the RAM in the tests with the QA team (it is set in the fuel-qa/devops code)!

Changed in fuel:
assignee: Dmitry Kaigarodеsev (dkaiharodsev) → Maksim Malchuk (mmalchuk)
status: Confirmed → New
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/370174

Changed in fuel:
status: New → In Progress
Nikita Karpin (mkarpin) wrote : Re: [gates] Cluster is not deployed: some nodes are in the Error state
Maksim Malchuk (mmalchuk) wrote :

Nikita, this fix is not for one single job: the memory is increased for all slaves in the default configuration, for almost all tests.

Vasyl Saienko (vsaienko) wrote :

According to node-1 logs from: https://ci.fuel-infra.org/job/master.fuel-library.pkgs.ubuntu.smoke_neutron/7842/

The server has been swapping: in the atop logs we can see that, starting from 2016/09/19 11:52:31, all swap was used. https://paste.mirantis.net/show/2659/
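
For a quick check of swap exhaustion without atop, the `SwapTotal` and `SwapFree` fields of `/proc/meminfo` can be read directly. This is a rough sketch of that alternative, not what was used in this investigation: the function name `swap_usage` is hypothetical, and the sample numbers below are made up for illustration rather than taken from node-1.

```python
def swap_usage(meminfo_text):
    """Parse /proc/meminfo-style text.

    Returns (total_kb, used_kb, used_fraction). Hypothetical helper,
    shown only as an illustration.
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("SwapTotal", "SwapFree"):
            # Lines look like "SwapTotal:       1048572 kB".
            values[key] = int(rest.split()[0])
    total = values["SwapTotal"]
    used = total - values["SwapFree"]
    fraction = used / total if total else 0.0
    return total, used, fraction


# Illustrative values only (assumption), showing fully exhausted swap.
SAMPLE_MEMINFO = """\
MemTotal:        3082000 kB
MemFree:           61000 kB
SwapTotal:       1048572 kB
SwapFree:              0 kB
"""

if __name__ == "__main__":
    total_kb, used_kb, fraction = swap_usage(SAMPLE_MEMINFO)
    print("swap used: %d of %d kB (%.0f%%)" % (used_kb, total_kb, fraction * 100))
```

On a live node the same function could be called as `swap_usage(open("/proc/meminfo").read())`.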

Maksim Malchuk (mmalchuk) wrote :

Yes, Vasyl, using swap also affects I/O, which caused other issues with the MySQL services.

Maksim Malchuk (mmalchuk) wrote :
Roman Vyalov (r0mikiam) wrote :