nailgun agent with multipath stops working

Bug #1405265 reported by Andrey Grebennikov
This bug affects 4 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Critical
Assigned to: Fuel Python (Deprecated)

Bug Description

I have a Fuel 5.1 installation on CentOS.
The customer is using NetApp storage with iSCSI and multipath.
The very first time we try to create a volume from an image, Cinder attaches an iSCSI volume to the controller, copies the image into it and unmounts the volume. Sometimes members of the multipath device don't disappear from the OS, and this causes the nailgun agent to crash.

Ohai has a filesystem plugin that executes a set of "blkid" commands; if the volume is no longer connected to the host, those commands never finish.

I'd suggest removing that plugin from ohai's plugin directory. When I did this on a running environment, everything continued to work. AFAIK we don't use the information from this plugin for our purposes.
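
For reference, a minimal Ruby sketch (not part of nailgun-agent or ohai) of how one could confirm the symptom on an affected node by listing blkid processes stuck in uninterruptible sleep (D state). It only reads /proc, so it is Linux-specific and purely illustrative:

# List blkid processes in D state (uninterruptible sleep), the symptom
# described above when multipath members go stale.
Dir.glob("/proc/[0-9]*/stat").each do |stat_file|
  begin
    fields = File.read(stat_file).split
    pid, comm, state = fields[0], fields[1], fields[2]
    puts "#{pid} #{comm} state=#{state}" if comm == "(blkid)" && state == "D"
  rescue Errno::ENOENT
    next # the process exited between the glob and the read
  end
end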

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

@Andrey: Could you please provide more debug information?

Changed in fuel:
status: New → Incomplete
importance: Undecided → Medium
milestone: none → 6.1
Revision history for this message
Andrey Grebennikov (agrebennikov) wrote :

The agent works in the following way: it finds the server it needs to connect to, sleeps for a random period of time, executes ohai, parses the data from its output, sends the info to the server and exits. In my case the agent hangs on the ohai step. Ohai in its turn loads all the plugins it has; one of them is "filesystem.rb", which executes a set of commands collecting data about all filesystems present at that moment. The process "blkid -s TYPE" hangs, and the agent's process never finishes its job, so the node appears as "Offline" in the Fuel UI.
In order to resolve the current issue, I had to restart the multipathd process so that it releases all non-existent devices; after that it becomes possible to kill those blkid processes as well.
Once I removed the plugin from the ohai plugin directory, the agent started to operate properly.
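
To make the flow above easier to follow, here is a rough Ruby outline of the agent's run as described in this comment. The helper names and the server URL are illustrative stand-ins, not the real nailgun-agent code, and it assumes the ohai binary is installed:

require "json"

def discover_nailgun_server
  ENV.fetch("NAILGUN_URL", "http://10.20.0.2:8000")  # assumption: taken from env/config
end

def post_node_info(server, facts)
  puts "would POST #{facts.keys.size} top-level fact keys to #{server}"
end

def run_agent
  server = discover_nailgun_server  # 1. find the server to report to
  sleep(rand(0..59))                # 2. sleep for a random period of time
  raw = `ohai`                      # 3. run ohai -- the step that hangs when
                                    #    blkid blocks on a stale multipath member
  facts = JSON.parse(raw)           # 4. parse the collected facts (ohai prints JSON)
  post_node_info(server, facts)     # 5. send the info to the server and exit
end

run_agent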

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

@Andrey: Thank you for the quick update.

Changed in fuel:
assignee: nobody → Roman Prykhodchenko (romcheg)
status: Incomplete → Confirmed
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

With EMC + multipath we hit the same uninterruptible sleep (I/O wait) on blkid -s TYPE (and likewise on UUID/LABEL).

>> ohai/plugins/linux/filesystem.rb:
...
# Gather more filesystem types via libuuid, even devices that's aren't mounted
popen4("blkid -s TYPE") do |pid, stdin, stdout, stderr|
...
end

# Gather device UUIDs via libuuid
popen4("blkid -s UUID") do |pid, stdin, stdout, stderr|
...
end

# Gather device labels via libuuid
popen4("blkid -s LABEL") do |pid, stdin, stdout, stderr|
...
end

Commenting out the code above worked around the issue.
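
For comparison, a minimal sketch of an alternative to removing or commenting out the plugin: wrapping each blkid call in a hard timeout. This uses Ruby's stdlib Timeout and Open3 instead of ohai's popen4 helper, the method name is hypothetical, and it is not the fix that actually landed:

require "timeout"
require "open3"

# Sketch only: run blkid with a hard timeout so a stale multipath member
# cannot block the collector indefinitely. Note that a blkid child stuck
# in D state may still be left behind until multipathd releases the device.
def blkid_with_timeout(args, seconds = 10)
  Timeout.timeout(seconds) do
    stdout, _stderr, _status = Open3.capture3("blkid", *args)
    stdout
  end
rescue Timeout::Error
  ""  # report nothing instead of hanging the nailgun agent
end

puts blkid_with_timeout(["-s", "TYPE"])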

Changed in fuel:
milestone: 6.1 → 7.0
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :
Changed in fuel:
assignee: Roman Prykhodchenko (romcheg) → Fuel Python Team (fuel-python)
status: Confirmed → Won't Fix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
status: Won't Fix → Confirmed
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

We passed SCF in 8.0. Moving the bug to 9.0.

Dmitry Pyzhov (dpyzhov)
tags: added: module-volumes
Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Andrew Woodward (xarses) wrote :

What resolved this?

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Moved to 10.0 because I could not find the review, and Andrey G. did provide additional information.
We should recheck it in the new release.

Changed in fuel:
status: Fix Committed → Confirmed
milestone: 9.0 → 10.0
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Marking as Fix Committed. QA team, please verify that the issue is not reproducible any more.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

Still experiencing the same issue on MOS 9.1 + EMC + multipath.

Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

Is there any possibility of backporting this to previous versions of MOS? Customers are experiencing the same issue on MOS 7.0.

Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

Team, this bug seems to be critical, since there are about 30000 "blkid -s" processes consuming CPU and memory resources on a compute node.

Changed in fuel:
importance: Medium → Critical
milestone: 10.0 → 7.0-mu-7
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Anatolii, please don't change the original milestone of the bug; this makes reporting inefficient and adds a lot of confusion.

Changed in fuel:
milestone: 7.0-mu-7 → 10.x-updates
Revision history for this message
Denis Kostryukov (dkostryukov) wrote :

Denis, one more customer has the same issue on MOS 7.0.
He has more than 27000 "blkid -s TYPE" processes on his node.
Is it possible to create a backport for MOS 7.0?

tags: added: customer-found