First boot after provisioning fails: "The disk drive for /var/lib/nova is not ready or not present"

Bug #1373435 reported by Artem Panchenko
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Vladimir Kozhukalov

Bug Description

Sometimes re-deployment of a cluster on bare metal fails because provisioning times out:

2014-09-23 14:02:10 ERR [455] Timeout of provisioning is exceeded. Nodes not booted: ["12", "15"]

During the first boot after OS installation, some compute nodes hang because they can't find the LVM volume used for Nova and mount it (see attached screenshot). Unfortunately, I can't provide instructions for reproducing this issue - it's floating. In my case two compute nodes hung in that state and cluster deployment failed. Inspection of the installer logs showed two different but related issues during partitioning. On the 1st compute (node-12), creation of an LVM volume returned an error:

http://paste.openstack.org/show/114956/

but I was able to create it manually with the same command after booting from the drive. That error probably occurred because the installer attempted to create the new LV right after creating the new VG (0.09 sec later), while some metadata was still being written to the disk:

2014-09-23T12:36:50.605851+01:00 notice: vgcreate
...
2014-09-23T12:36:50.648281+01:00 notice: vgs -a --noheadings
2014-09-23T12:36:50.655497+01:00 notice: vm 1 0 0 wz--n- 405.81g 405.81g
...
2014-09-23T12:36:50.697785+01:00 notice: lvcreate
2014-09-23T12:36:50.722800+01:00 notice: device-mapper: create ioctl failed: Device or resource busy

Possibly such errors can be avoided by an additional 'sleep' of a few seconds between the vgcreate/lvcreate commands, or by adding the '-Z n' flag to the latter command, which disables zeroing of the volume's first sectors and thus avoids the write to the disk.
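The timing workaround could look something like the sketch below. This is illustrative only, not the actual Pmanager code; the retry() helper and its parameters are my own naming:

```shell
#!/bin/sh
# Sketch of a retry wrapper for the lvcreate step (hypothetical,
# not the actual Pmanager implementation).

# Retry a command up to $1 times, sleeping $2 seconds between attempts.
retry() {
    attempts=$1; delay=$2; shift 2
    i=1
    while ! "$@"; do
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# vgcreate vm /dev/sda6
# lvcreate may fail with "device-mapper: create ioctl failed: Device or
# resource busy" if run immediately after vgcreate, so retry it with a
# short pause (or pass '-Z n' to skip zeroing of the first sectors):
# retry 5 2 lvcreate -Z n -L 10G -n nova vm
```

The retry loop is the more defensive option, since '-Z n' only removes one of the writes that can race with the still-settling VG metadata.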

On the second compute (node-15), the installer wasn't able to create a new volume group:

http://paste.openstack.org/show/114965/

and the subsequent creation of the LV also failed:

http://paste.openstack.org/show/114966/

In my opinion, such an issue could be caused by erasing data on the drive with 'dd' (we do this during environment/node deletion and before partitioning during provisioning): the LVM metadata was removed, but the '/dev/vm' directory wasn't deleted. We can simulate this condition as follows:

root@node-12:~# vgs
  No volume groups found
root@node-12:~# mkdir /dev/vm
root@node-12:~# vgcreate -s 32m vm /dev/sda6
  /dev/vm: already exists in filesystem
  New volume group name "vm" is invalid
  Run `vgcreate --help' for more information.

I propose adding an 'rm -rf /dev/${VG_NAME}' command to the 'before vgcreate' stage in Pmanager to prevent such errors.
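A guarded version of that cleanup could look like this sketch. The function name and the optional second argument (used to make the path testable) are my own; the real change would live in Pmanager:

```shell
#!/bin/sh
# Sketch of the proposed 'before vgcreate' cleanup (hypothetical naming).
# A previous 'dd' wipe removes the LVM metadata but can leave the
# /dev/<vg> directory behind, which makes vgcreate reject the VG name
# with "already exists in filesystem".
cleanup_stale_vg_dir() {
    vg=$1
    devdir=${2:-/dev}
    # Only remove the directory if no VG of that name actually exists.
    if [ -d "$devdir/$vg" ] && ! vgs "$vg" >/dev/null 2>&1; then
        rm -rf "${devdir:?}/${vg:?}"
    fi
}

# cleanup_stale_vg_dir vm
# vgcreate -s 32m vm /dev/sda6
```

Checking 'vgs' first avoids deleting the device directory of a volume group that is still live.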

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Changed in fuel:
status: New → Triaged
assignee: Fuel Library Team (fuel-library) → Vladimir Kozhukalov (kozhukalov)
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/135255

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/135255
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5aaa3088ca0b65ddc58b0aaa45ee149700e3b85b
Submitter: Jenkins
Branch: master

commit 5aaa3088ca0b65ddc58b0aaa45ee149700e3b85b
Author: Vladimir Kozhukalov <email address hidden>
Date: Tue Nov 18 14:40:17 2014 +0300

    Added some hacks to ensure lv creating success

    * added vgremove right before vgcreate

    * inserted udevadm settle before vgremove and vgcreate.
      this is for avoiding udev race when there are
      unhandled events in udev queue

    Closes-Bug: 1373435
    Change-Id: Ic764bb1722ac45b432fc7c476f6e418a3715ac8a
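The sequence described in the commit can be sketched roughly as follows. The safe_vgcreate wrapper is my own illustration; the actual implementation is in the fuel-library commit linked above:

```shell
#!/bin/sh
# Rough sketch of the merged fix's command sequence (illustrative;
# the wrapper name is hypothetical).
safe_vgcreate() {
    vg=$1; shift
    # Drain the udev event queue so there are no unhandled events
    # racing with the VG operations.
    udevadm settle
    # Remove any leftover VG of the same name; ignore "not found".
    vgremove -f "$vg" 2>/dev/null || true
    udevadm settle
    vgcreate "$vg" "$@"
}

# Example: safe_vgcreate vm /dev/sda6
```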

Changed in fuel:
status: In Progress → Fix Committed