nailgun agent with multipath stops working

Bug #1405265 reported by Andrey Grebennikov
This bug affects 4 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Critical
Assigned to: Fuel Python (Deprecated)

Bug Description

I have a Fuel 5.1 installation on CentOS.
The customer is using NetApp storage with iSCSI and multipath.
The very first time we try to create a volume from an image, Cinder attaches an iSCSI volume to the controller, copies the image into it and unmounts the volume. Sometimes members of the multipath device don't disappear from the OS, and this causes the nailgun agent to crash.

Ohai has a filesystem plugin that executes a set of "blkid" commands; if the volume is no longer connected to the host, those commands never finish.

I'd suggest removing that plugin from ohai's plugin directory. When I did this on a running environment, everything continued to work. AFAIK we don't use the information from this plugin for our purposes.
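
For reference, a minimal Ruby sketch (not part of nailgun-agent or ohai) of how one could confirm the symptom on an affected node by listing blkid processes stuck in uninterruptible sleep (D state). It only reads /proc, so it is Linux-specific and purely illustrative:

# List blkid processes in D state (uninterruptible sleep), the symptom
# described above when multipath members go stale.
Dir.glob("/proc/[0-9]*/stat").each do |stat_file|
  begin
    fields = File.read(stat_file).split
    pid, comm, state = fields[0], fields[1], fields[2]
    puts "#{pid} #{comm} state=#{state}" if comm == "(blkid)" && state == "D"
  rescue Errno::ENOENT
    next # the process exited between the glob and the read
  end
end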

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

@Andrey: Could you please provide more debug information?

Changed in fuel:
status: New → Incomplete
importance: Undecided → Medium
milestone: none → 6.1
Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

The agent works in the following way: it finds the server it needs to connect to, sleeps for a random period of time, executes ohai, parses the data from its output, sends the info to the server and exits. In my case the agent hangs on the ohai step. Ohai in its turn loads all the plugins it has; one of them is "filesystem.rb", which executes a set of commands collecting data about all filesystems present at that moment. The process "blkid -s TYPE" hangs, and the agent's process never finishes its job, so the node appears as "Offline" in the Fuel UI.
In order to resolve the current issue, I had to restart the multipathd process so that it releases all non-existent devices; after that it becomes possible to kill those blkid processes as well.
Once I removed the plugin from the ohai plugin directory, the agent started to operate properly.
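
To make the flow above easier to follow, here is a rough Ruby outline of the agent's run as described in this comment. The helper names and the server URL are illustrative stand-ins, not the real nailgun-agent code, and it assumes the ohai binary is installed:

require "json"

def discover_nailgun_server
  ENV.fetch("NAILGUN_URL", "http://10.20.0.2:8000")  # assumption: taken from env/config
end

def post_node_info(server, facts)
  puts "would POST #{facts.keys.size} top-level fact keys to #{server}"
end

def run_agent
  server = discover_nailgun_server  # 1. find the server to report to
  sleep(rand(0..59))                # 2. sleep for a random period of time
  raw = `ohai`                      # 3. run ohai -- the step that hangs when
                                    #    blkid blocks on a stale multipath member
  facts = JSON.parse(raw)           # 4. parse the collected facts (ohai prints JSON)
  post_node_info(server, facts)     # 5. send the info to the server and exit
end

run_agent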

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

@Andrey: Thank you for the quick update.

Changed in fuel:
assignee: nobody → Roman Prykhodchenko (romcheg)
status: Incomplete → Confirmed
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

With EMC + multipath we hit the same uninterruptible sleep (I/O wait) on blkid -s TYPE (and likewise on UUID/LABEL).

>> ohai/plugins/linux/filesystem.rb:
...
# Gather more filesystem types via libuuid, even devices that's aren't mounted
popen4("blkid -s TYPE") do |pid, stdin, stdout, stderr|
...
end

# Gather device UUIDs via libuuid
popen4("blkid -s UUID") do |pid, stdin, stdout, stderr|
...
end

# Gather device labels via libuuid
popen4("blkid -s LABEL") do |pid, stdin, stdout, stderr|
...
end

Commenting out the code above worked around the issue.
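
For comparison, a minimal sketch of an alternative to removing or commenting out the plugin: wrapping each blkid call in a hard timeout. This uses Ruby's stdlib Timeout and Open3 instead of ohai's popen4 helper, the method name is hypothetical, and it is not the fix that actually landed:

require "timeout"
require "open3"

# Sketch only: run blkid with a hard timeout so a stale multipath member
# cannot block the collector indefinitely. Note that a blkid child stuck
# in D state may still be left behind until multipathd releases the device.
def blkid_with_timeout(args, seconds = 10)
  Timeout.timeout(seconds) do
    stdout, _stderr, _status = Open3.capture3("blkid", *args)
    stdout
  end
rescue Timeout::Error
  ""  # report nothing instead of hanging the nailgun agent
end

puts blkid_with_timeout(["-s", "TYPE"])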

Changed in fuel:
milestone: 6.1 → 7.0
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :
Changed in fuel:
assignee: Roman Prykhodchenko (romcheg) → Fuel Python Team (fuel-python)
status: Confirmed → Won't Fix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
status: Won't Fix → Confirmed
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

We passed SCF in 8.0. Moving the bug to 9.0.

Dmitry Pyzhov (dpyzhov)
tags: added: module-volumes
Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Andrew Woodward (xarses) wrote :

What resolved this?

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Moved to 10.0 because I could not find the review, and Andrey G. did provide additional information.
We should recheck it in the new release.

Changed in fuel:
status: Fix Committed → Confirmed
milestone: 9.0 → 10.0
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Marking as Fix Committed. QA team, please verify that the issue is not reproducible any more.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

Still experiencing the same issue on MOS 9.1 + EMC + multipath.

Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

Is there any possibility of backporting this to previous versions of MOS? Customers are experiencing the same issue on MOS 7.0.

Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

Team, this bug seems to be critical, since there are about 30000 "blkid -s" processes consuming CPU and memory resources on a compute node.

Changed in fuel:
importance: Medium → Critical
milestone: 10.0 → 7.0-mu-7
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Anatolii, please don't change the original milestone of the bug; this makes reporting inefficient and adds a lot of confusion.

Changed in fuel:
milestone: 7.0-mu-7 → 10.x-updates
Revision history for this message
Denis Kostryukov (dkostryukov) wrote :

Denis, one more customer has the same issue on MOS 7.0.
He has more than 27000 "blkid -s TYPE" processes on his node.
Is it possible to create a backport for MOS 7.0?

tags: added: customer-found