Evacuate fails - 'Invalid state of instance files on shared storage', using NFS

Bug #1617333 reported by Dmitry Goloshubov
Affects              Status    Importance  Assigned to      Milestone
Mirantis OpenStack   Invalid   High        Matthew Roark
  7.0.x              Invalid   High        MOS Maintenance
  8.0.x              Invalid   High        MOS Maintenance
  9.x                Invalid   High        Matthew Roark

Bug Description

Steps to reproduce:
- environment: MOS 7.0, 2 compute nodes; /var/lib/nova/instances is on shared NFS storage.
- shut down a compute node.
- try to evacuate instances from that compute node:
# nova host-evacuate --on-shared-storage --target_host <targetComputeNode> <failedComputeNode>

result:
'# nova-manage vm list' shows the instances in ERROR state and the host unchanged.

nova-compute.log:

2016-08-24 13:35:38.918 6148 INFO nova.compute.manager [req-2c224a10-7061-4e65-b538-130416ccf357 10b220abcf764371b0023567bd38d3ad cf2c57ebfaff44a9b5ec9c51164b5e2d - - -] [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] Rebuilding instance
2016-08-24 13:35:38.921 6148 DEBUG nova.virt.libvirt.driver [req-2c224a10-7061-4e65-b538-130416ccf357 10b220abcf764371b0023567bd38d3ad cf2c57ebfaff44a9b5ec9c51164b5e2d - - -] Checking instance files accessibility /var/lib/nova/instances/e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f instance_on_disk /usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py:6742
2016-08-24 13:35:38.923 6148 ERROR nova.compute.manager [req-2c224a10-7061-4e65-b538-130416ccf357 10b220abcf764371b0023567bd38d3ad cf2c57ebfaff44a9b5ec9c51164b5e2d - - -] [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] Setting instance vm_state to ERROR
2016-08-24 13:35:38.923 6148 TRACE nova.compute.manager [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] Traceback (most recent call last):
2016-08-24 13:35:38.923 6148 TRACE nova.compute.manager [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6489, in _error_out_instance_on_exception
2016-08-24 13:35:38.923 6148 TRACE nova.compute.manager [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] yield
2016-08-24 13:35:38.923 6148 TRACE nova.compute.manager [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3005, in rebuild_instance
2016-08-24 13:35:38.923 6148 TRACE nova.compute.manager [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] _("Invalid state of instance files on shared"
2016-08-24 13:35:38.923 6148 TRACE nova.compute.manager [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f] InvalidSharedStorage: Invalid state of instance files on shared storage
2016-08-24 13:35:38.923 6148 TRACE nova.compute.manager [instance: e4b9ccda-e364-4aba-8ebe-fd7e929e4c9f]
2016-08-24 13:35:39.145 6148 ERROR oslo_messaging.rpc.dispatcher [req-2c224a10-7061-4e65-b538-130416ccf357 10b220abcf764371b0023567bd38d3ad cf2c57ebfaff44a9b5ec9c51164b5e2d - - -] Exception during message handling: Invalid state of instance files on shared storage
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher executor_callback))
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher executor_callback)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6877, in rebuild_instance
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher preserve_ephemeral=preserve_ephemeral)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 453, in decorated_function
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 142, in inner
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher return func(*args, **kwargs)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 88, in wrapped
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher payload)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 71, in wrapped
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher return f(self, context, *args, **kw)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 333, in decorated_function
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher LOG.warning(msg, e, instance_uuid=instance_uuid)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 304, in decorated_function
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 383, in decorated_function
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 361, in decorated_function
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 349, in decorated_function
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3005, in rebuild_instance
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher _("Invalid state of instance files on shared"
2016-08-24 13:35:39.145 6148 TRACE oslo_messaging.rpc.dispatcher InvalidSharedStorage: Invalid state of instance files on shared storage

Revision history for this message
Dmitry Goloshubov (dgoloshubov) wrote :

It seems to be related to the following nova bug:
https://bugs.launchpad.net/nova/+bug/1340411

tags: added: area-nova
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Dmitry, please provide a diagnostic snapshot or (at the very least) a complete nova-compute log.

Revision history for this message
Matthew Roark (mroark) wrote :

@rpodolyaka: No diagnostic snapshot has been gathered, but we do have a nova-compute.log from the customer which I've attached. If a full snapshot is still required, please let us know.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Matthew,

I've checked the logs and see that the code fails here:

https://github.com/openstack/nova/blob/cd81512f87476b0bad29db379d3b92b22723cd2e/nova/compute/manager.py#L3001-L3005

which basically means this check returns False:

https://github.com/openstack/nova/blob/cd81512f87476b0bad29db379d3b92b22723cd2e/nova/virt/libvirt/driver.py#L6710-L6720

thus, the instance path is not accessible from the target compute node:

https://github.com/openstack/nova/blob/cd81512f87476b0bad29db379d3b92b22723cd2e/nova/virt/libvirt/driver.py#L6714

Are you sure the NFS share works properly? Is the NFS server in the same cluster, or is it external?

A diagnostic snapshot would probably help. So far it looks like a valid error due to unavailable shared instance storage.
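The accessibility check referenced above amounts to testing whether the instance directory can be reached (and written) from the target compute node. A minimal sketch of the idea, not the actual driver code — the real driver resolves the path from the instance object, while `instance_path` here is a hypothetical argument:

```python
import os

def instance_on_disk(instance_path):
    # Approximation of the check nova's libvirt driver performs during
    # evacuation with --on-shared-storage: the instance directory must
    # exist and be writable from the *target* host. On a healthy shared
    # NFS mount this is True for every compute node; if the mount is
    # missing or stale on the target, it returns False and nova raises
    # InvalidSharedStorage.
    return os.access(instance_path, os.W_OK)
```

If this returns False for `/var/lib/nova/instances/<uuid>` on the target node, the rebuild is aborted with exactly the error seen in the traceback.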

Revision history for this message
Matthew Roark (mroark) wrote :

@rpodolyaka: The customer was able to provide us with a partial snapshot, one compute and one controller node. https://drive.google.com/a/mirantis.com/file/d/0B1Llc51Lfvo6cjVFaW0tYUJVcEU/view?usp=sharing

Additionally, the instances in question are backed by a bootable volume. It's possible this is more related to: https://bugs.launchpad.net/nova/+bug/1522496

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Incomplete - moved to 9.2

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I tried this on vanilla 7.0 following the steps to reproduce - evacuation works as expected, the instance is started successfully on the remaining compute node:

2016-10-04 15:22:39.609 13462 INFO nova.compute.manager [req-7a775467-92aa-4234-bbb4-5d9b3f9e27f2 97f5e7829dbc48f69dbace55c3e2da5c b2c8e17931a045189a28917e2e0f0bda - - -] [instance: 86db1236-4d8d-43e1-8716-46c72d80e20e] Rebuilding instance
2016-10-04 15:22:39.610 13462 INFO nova.compute.manager [req-7a775467-92aa-4234-bbb4-5d9b3f9e27f2 97f5e7829dbc48f69dbace55c3e2da5c b2c8e17931a045189a28917e2e0f0bda - - -] disk on shared storage, recreating using existing disk

^ the attached snapshot does not contain nova-compute logs for 2016-08-24 when the error occurred, so it's not really useful.

As I said before, the failed check is fairly straightforward:

https://github.com/openstack/nova/blob/cd81512f87476b0bad29db379d3b92b22723cd2e/nova/virt/libvirt/driver.py#L6714

it will fail if the instance path is *not* accessible. Thus, I suggest double-checking that the NFS server works as expected and that the same mount point is used on *all* compute nodes.

If it's reproduced again, please attach the full logs, the instance UUID, and the output of "ls -la /var/lib/nova/instances" from the target host.
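That verification can also be scripted. A minimal diagnostic sketch (the helper names are illustrative, not part of nova): run it on each compute node and compare that the instances path exists, is writable, and resolves to the same mount point everywhere.

```python
import os

def find_mount_point(path):
    # Walk up from the path until we hit a mount point, so we can confirm
    # that /var/lib/nova/instances actually sits on the NFS mount and not
    # on the local root filesystem (e.g. because the mount failed at boot).
    path = os.path.abspath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path

def check_instance_storage(path):
    # Summarize what the evacuation code path cares about: the directory
    # must exist and be writable from this node.
    exists = os.path.exists(path)
    return {
        "exists": exists,
        "writable": os.access(path, os.W_OK),
        "mount_point": find_mount_point(path) if exists else None,
    }
```

If "mount_point" differs between nodes (or is "/"), the shared storage is not mounted consistently and evacuation with --on-shared-storage will fail exactly as reported.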

Feel free to ping us on #mos-nova if you want to have someone from the Nova team on the call to debug this live.
