Attempting to delete a baremetal server places server into ERROR state

Bug #1815799 reported by Lars Kellogg-Stedman
This bug affects 1 person
Affects                    Status         Importance  Assigned to    Milestone
OpenStack Compute (nova)   Fix Released   Medium      Unassigned
Rocky                      Fix Committed  Medium      Julia Kreger
Stein                      Fix Committed  Medium      Julia Kreger

Bug Description

When deleting a baremetal server, we see:

  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Traceback (most recent call last):
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2698, in do_terminate_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._delete_instance(context, instance, bdms)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] rv = f(*args, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2635, in _delete_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._shutdown_instance(context, instance, bdms)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2527, in _shutdown_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] requested_networks)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self.force_reraise()
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] six.reraise(self.type_, self.value, self.tb)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2514, in _shutdown_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] block_device_info)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1245, in destroy
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._cleanup_deploy(node, instance, network_info)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 461, in _cleanup_deploy
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._remove_instance_info_from_node(node, instance)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 396, in _remove_instance_info_from_node
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self.ironicclient.call('node.update', node.uuid, patch)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 170, in call
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] return self._multi_getattr(client, method)(*args, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 360, in update
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] params=params)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 232, in _update
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] resp, body = self.api.json_request(method, url, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 678, in json_request
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] resp = self._http_request(url, method, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 287, in wrapper
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] return func(self, url, method, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 660, in _http_request
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] error_json.get('debuginfo'), method, url)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated while a state transition is in progress. (HTTP 409)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a]

This places the server into an ERROR state:

  $ openstack server list
  +--------------------------------------+------------+--------+----------------------------+---------+------------+
  | ID                                   | Name       | Status | Networks                   | Image   | Flavor     |
  +--------------------------------------+------------+--------+----------------------------+---------+------------+
  | dcb4f055-cda4-4d61-ba8f-976645c4e92a | larstest2  | ERROR  |                            |         | baremetal  |
  +--------------------------------------+------------+--------+----------------------------+---------+------------+

A subsequent "openstack server delete" will successfully delete the server.

On the ironic side, we see the following message logged several times before Nova reports the conflict:

  2019-02-13 12:58:15.138 21 DEBUG wsme.api [req-b6f0a1a3-e2cc-4b13-a979-3676757b0972 7a9b1ac45e084e7cbeb9cb740ffe8d08 41ea8af8d00e46438c7be3b182bbb53f - default default] Client-side error: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated while a state transition is in progress. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222

This is with nova @ ad842aa and ironic @ 4404292.

Tags: ironic
tags: added: ironic
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not sure what nova can do about this besides waiting and retrying the delete of the ironic node before giving up. The libvirt driver has some retry logic for this:

https://github.com/openstack/nova/blob/63e5cba88af5cd121b498cc46358eef19f83b69f/nova/virt/libvirt/driver.py#L888
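
For illustration only, a minimal sketch of the kind of retry-on-conflict loop being suggested here (the helper name, attempt count, and interval are assumptions, not nova code):

  import time

  class Conflict(Exception):
      """Stand-in for the HTTP 409 Conflict raised by the ironic client."""

  def update_node_with_retry(update_fn, max_attempts=60, interval=2):
      # Hypothetical helper: keep retrying an ironic node update while the
      # node is busy with a state transition, then give up.
      for attempt in range(1, max_attempts + 1):
          try:
              return update_fn()
          except Conflict:
              if attempt == max_attempts:
                  raise
              time.sleep(interval)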

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

In this specific case a retry is not needed; Ironic has long since cleared the field itself. I'll put up a patch to handle this gracefully.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/643484

Changed in nova:
assignee: nobody → Julia Kreger (juliaashleykreger)
status: Confirmed → In Progress
Revision history for this message
Mark Goddard (mgoddard) wrote :

Also hitting this on the Rocky release. It reproduces readily; I'm not sure how we haven't seen it sooner.

While the error message mentions locking, I think this is a red herring - I added debug logging and there is no lock held. The update is rejected because a state transition is in progress while cleaning runs.
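
One way to confirm this from the ironic side (assuming the openstack baremetal CLI plugin is installed) is to check the node's reservation and provision_state while the delete is failing; reservation stays empty when no lock is held, while provision_state shows the cleaning transition:

  $ openstack baremetal node show a00696d5-32ba-475e-9528-59bf11cffea6 \
      -f value -c provision_state -c reservation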

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/659809

Changed in nova:
assignee: Julia Kreger (juliaashleykreger) → Mark Goddard (mgoddard)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Julia Kreger (<email address hidden>) on branch: master
Review: https://review.opendev.org/643484

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/659809
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7
Submitter: Zuul
Branch: master

commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
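
A rough sketch of the shape of this change follows (illustrative only, not the merged diff; the flag name and the print stand-in are assumptions):

  class IronicDriverSketch(object):
      """Illustrative only: skip the redundant node update on tear down."""

      def _remove_instance_info_from_node(self, node, instance):
          # In nova this PATCHes the ironic node; ironic returns HTTP 409
          # while the node is mid state transition (e.g. cleaning).
          print('PATCH node %s: clear instance_info' % node)

      def _cleanup_deploy(self, node, instance, network_info=None,
                          remove_instance_info=True):
          if remove_instance_info:
              self._remove_instance_info_from_node(node, instance)

      def destroy(self, node, instance, network_info=None):
          # Unprovisioning via ironic already clears instance_info, so the
          # tear-down path skips the update that would 409 during cleaning.
          self._cleanup_deploy(node, instance, network_info,
                               remove_instance_info=False)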

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/662479

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/662584

Matt Riedemann (mriedem)
Changed in nova:
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/662479
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=95ace7cf8daf4ff935ff6df877554fdfddef9e4a
Submitter: Zuul
Branch: stable/stein

commit 95ace7cf8daf4ff935ff6df877554fdfddef9e4a
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
    (cherry picked from commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/662584
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ed3f8f9e487c233fdf714edad57d126fe6b1cd7
Submitter: Zuul
Branch: stable/rocky

commit 1ed3f8f9e487c233fdf714edad57d126fe6b1cd7
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
    (cherry picked from commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7)
    (cherry picked from commit 95ace7cf8daf4ff935ff6df877554fdfddef9e4a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.1

This issue was fixed in the openstack/nova 18.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/883411

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/883411
Committed: https://opendev.org/openstack/nova/commit/6ebd8a56d1c79fc15f385f232ce0e22937b2fdac
Submitter: "Zuul (22348)"
Branch: master

commit 6ebd8a56d1c79fc15f385f232ce0e22937b2fdac
Author: Mark Goddard <email address hidden>
Date: Wed May 17 14:20:54 2023 +0100

    ironic: Fix ConflictException when deleting server

    When unprovisioning succeeds via Ironic, all operations in
    _cleanup_deploy have already been completed. Prior to this patch, we
    continued attempting all the clean-up steps, which eventually errored
    out with BadRequest, or similar, and we completed the delete.

    However, if cleaning has started, we hit a conflict exception
    instead, so we never reach the expected error above.

    Before the move to the SDK, which landed in Caracal, we would retry
    on conflict errors. You can tweak the config to keep retrying for
    the length of time cleaning usually takes in your environment.

    After this patch:
    Ieda5636a5e80ea4af25db2e90be241869759e30c

    We now hard fail with this error:

    openstack.exceptions.ConflictException:
    Failed to detach VIF xxx from bare metal node yyy
    ...
    Node yyy is locked by host zzz,
    please retry after the current operation is completed.

    This change simply skips calling the operations that
    will always error out, avoiding the need to wait for
    cleaning to complete before getting the expected
    error message.

    Closes-Bug: #2019977
    Related-Bug: #1815799
    Change-Id: I60971b58cf1f24bdb5d40f668e380ebfee2ac495
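
On releases that still retry conflict errors (pre-SDK, per the message above), the retry window is controlled by nova's [ironic] options; an illustrative nova.conf excerpt (the values are examples, not defaults):

  [ironic]
  # Defaults are 60 attempts at 2-second intervals; widen the window so it
  # covers the time cleaning usually takes in your environment.
  api_max_retries = 180
  api_retry_interval = 2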

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2024.2)

Related fix proposed to branch: stable/2024.2
Review: https://review.opendev.org/c/openstack/nova/+/940836

Mark Goddard (mgoddard)
Changed in nova:
assignee: Mark Goddard (mgoddard) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/2024.1)

Related fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/940846

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/2024.1)

Change abandoned by "suiong ng <email address hidden>" on branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/940846
Reason: will cherry pick again from 2024.2

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2024.2)

Reviewed: https://review.opendev.org/c/openstack/nova/+/940836
Committed: https://opendev.org/openstack/nova/commit/1b28f649feaf2c9929f15214814f8af950e5c19c
Submitter: "Zuul (22348)"
Branch: stable/2024.2

commit 1b28f649feaf2c9929f15214814f8af950e5c19c
Author: Mark Goddard <email address hidden>
Date: Wed May 17 14:20:54 2023 +0100

    ironic: Fix ConflictException when deleting server

    When unprovisioning succeeds via Ironic, all operations in
    _cleanup_deploy have already been completed. Prior to this patch, we
    continued attempting all the clean-up steps, which eventually errored
    out with BadRequest, or similar, and we completed the delete.

    However, if cleaning has started, we hit a conflict exception
    instead, so we never reach the expected error above.

    Before the move to the SDK, which landed in Caracal, we would retry
    on conflict errors. You can tweak the config to keep retrying for
    the length of time cleaning usually takes in your environment.

    After this patch:
    Ieda5636a5e80ea4af25db2e90be241869759e30c

    We now hard fail with this error:

    openstack.exceptions.ConflictException:
    Failed to detach VIF xxx from bare metal node yyy
    ...
    Node yyy is locked by host zzz,
    please retry after the current operation is completed.

    This change simply skips calling the operations that
    will always error out, avoiding the need to wait for
    cleaning to complete before getting the expected
    error message.

    Closes-Bug: #2019977
    Related-Bug: #1815799
    Change-Id: I60971b58cf1f24bdb5d40f668e380ebfee2ac495
    (cherry picked from commit 6ebd8a56d1c79fc15f385f232ce0e22937b2fdac)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/940846
Committed: https://opendev.org/openstack/nova/commit/d5bda4a795c9b0430467ee923627b570ecf3f92a
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit d5bda4a795c9b0430467ee923627b570ecf3f92a
Author: Mark Goddard <email address hidden>
Date: Wed May 17 14:20:54 2023 +0100

    ironic: Fix ConflictException when deleting server

    When unprovisioning succeeds via Ironic, all operations in
    _cleanup_deploy have already been completed. Prior to this patch, we
    continued attempting all the clean-up steps, which eventually errored
    out with BadRequest, or similar, and we completed the delete.

    However, if cleaning has started, we hit a conflict exception
    instead, so we never reach the expected error above.

    Before the move to the SDK, which landed in Caracal, we would retry
    on conflict errors. You can tweak the config to keep retrying for
    the length of time cleaning usually takes in your environment.

    After this patch:
    Ieda5636a5e80ea4af25db2e90be241869759e30c

    We now hard fail with this error:

    openstack.exceptions.ConflictException:
    Failed to detach VIF xxx from bare metal node yyy
    ...
    Node yyy is locked by host zzz,
    please retry after the current operation is completed.

    This change simply skips calling the operations that
    will always error out, avoiding the need to wait for
    cleaning to complete before getting the expected
    error message.

    Closes-Bug: #2019977
    Related-Bug: #1815799
    Change-Id: I60971b58cf1f24bdb5d40f668e380ebfee2ac495
    (cherry picked from commit 6ebd8a56d1c79fc15f385f232ce0e22937b2fdac)
    (cherry picked from commit 1b28f649feaf2c9929f15214814f8af950e5c19c)
