Quota is still in use after deleting ERROR instances

Bug #1670627 reported by huangtianhua
This bug affects 12 people
Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released  Critical    Matt Riedemann
Ocata                     Fix Released  Critical    Matt Riedemann

Bug Description

1. stop nova-compute
2. boot an instance
3. the instance goes into ERROR state
4. delete the instance
5. repeat steps 1-4 several times (actually I create the instance via Heat; by default Heat retries the create 5 times and deletes the errored instance before each retry)
6. after nova-compute comes back I can't boot an instance; the reason is "Quota exceeded for instances: Requested 1, but already used 10 of 10 instances"
7. but in fact, 'nova list' returns no instances
8. I found that the quota is still counted as in-use; see the 'quota_usages' table:

mysql> select * from quota_usages;
+---------------------+---------------------+------------+----+----------------------------------+-----------+--------+----------+---------------+---------+----------------------------------+
| created_at | updated_at | deleted_at | id | project_id | resource | in_use | reserved | until_refresh | deleted | user_id |
+---------------------+---------------------+------------+----+----------------------------------+-----------+--------+----------+---------------+---------+----------------------------------+
| 2017-03-07 06:26:08 | 2017-03-07 08:48:09 | NULL | 1 | 2b623ba1dddc476cbb7728a944d539c5 | instances | 10 | 0 | NULL | 0 | 8d57d7a267b54992b382a6607ecd700a |
| 2017-03-07 06:26:08 | 2017-03-07 08:48:09 | NULL | 2 | 2b623ba1dddc476cbb7728a944d539c5 | ram | 5120 | 0 | NULL | 0 | 8d57d7a267b54992b382a6607ecd700a |
| 2017-03-07 06:26:08 | 2017-03-07 08:48:09 | NULL | 3 | 2b623ba1dddc476cbb7728a944d539c5 | cores | 10 | 0 | NULL | 0 | 8d57d7a267b54992b382a6607ecd700a |
| 2017-03-07 09:17:37 | 2017-03-07 09:35:14 | NULL | 4 | 12bdc74d666d4f7687c0172a003f190d | instances | 13 | 0 | NULL | 0 | 98887477e65e43f383f8a9ec732a3eae |
| 2017-03-07 09:17:37 | 2017-03-07 09:35:14 | NULL | 5 | 12bdc74d666d4f7687c0172a003f190d | ram | 6656 | 0 | NULL | 0 | 98887477e65e43f383f8a9ec732a3eae |
| 2017-03-07 09:17:37 | 2017-03-07 09:35:14 | NULL | 6 | 12bdc74d666d4f7687c0172a003f190d | cores | 13 | 0 | NULL | 0 | 98887477e65e43f383f8a9ec732a3eae |
+---------------------+---------------------+------------+----+----------------------------------+-----------+--------+----------+---------------+---------+----------------------------------+

Changed in nova:
assignee: nobody → huangtianhua (huangtianhua)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/443003

Changed in nova:
status: New → In Progress
Revision history for this message
huangtianhua (huangtianhua) wrote :

I tried to fix this bug with https://review.openstack.org/#/c/443003/ . I cleaned up the related records and ran this test:
1. nova-compute is down
2. create 10 instances, they were in ERROR status
3. delete the 10 error instances
4. nova-compute is up
5. create an instance again

But I still got the error: "ERROR (Forbidden): Quota exceeded for instances: Requested 1, but already used 10 of 10 instances".

I found that the quota_usages records in the 'nova_cell0' database were updated (not the quota_usages records in the 'nova' database). It seems the current implementation checks quota against the 'nova' database when creating, but updates quota usage in 'nova_cell0' when deleting an errored instance.

Revision history for this message
Zhenyu Zheng (zhengzhenyu) wrote :

After some tests in my devstack deployment (there are now 3 DBs: nova, nova_api and nova_cell0), it seems that when booting, the quota is calculated and recorded in the nova DB; if the boot succeeds, there is no bug. But if it fails to boot, the instance ends up in cell0, and when we delete that instance the quota cannot be freed correctly in the nova DB because the DB connection has been switched to nova_cell0, so the usage in the nova DB still exists. I will study this further to see whether it is due to the devstack all-in-one deployment or a general bug.
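
For illustration, here is a toy model of that mismatch in plain Python (not nova code; the two dicts stand in for the 'nova' and 'nova_cell0' databases, and set_target_cell is a simplified stand-in for nova's context targeting):

    # Boot-time quota usage is recorded against the main 'nova' DB, but once
    # the context is retargeted at cell0, the delete-time decrement lands in
    # 'nova_cell0', so the main DB's usage never drops.
    nova_db = {"quota_usages": {"instances": 10}}
    cell0_db = {"quota_usages": {"instances": 0}}

    class Context:
        def __init__(self, db):
            self.db = db                 # which database this context writes to

    def set_target_cell(ctxt, cell_db):
        ctxt.db = cell_db                # mutates the context in place

    ctxt = Context(nova_db)              # boot-time usage was recorded here
    set_target_cell(ctxt, cell0_db)      # happens while looking the instance up
    ctxt.db["quota_usages"]["instances"] -= 1   # delete-time decrement

    print(nova_db["quota_usages"])       # {'instances': 10}  <- still 10: the bug
    print(cell0_db["quota_usages"])      # {'instances': -1}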

Matt Riedemann (mriedem)
tags: added: cells ocata-backport-potential quotas
Changed in nova:
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

So the instance gets put into the cell0 database and the _bury_in_cell0 method in conductor manager deletes the build request. Then when we delete the instance, the build request is gone but we find the instance via the instance mapping and know it's in cell0. The problem is we don't do any quota cleanup here:

https://github.com/openstack/nova/blob/master/nova/compute/api.py#L1799

Like we would have if we deleted the build request in the API (which is not the case for this bug):

https://github.com/openstack/nova/blob/master/nova/compute/api.py#L1729-L1756
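
For illustration, a minimal sketch of the cleanup that path was missing, modeled on what _delete_while_booting already does: reserve a negative delta sized by the instance's flavor and commit it. This is toy Python; decrement_usage and the dicts are hypothetical stand-ins, not nova's Quotas API:

    # The cell0 delete path skipped this decrement entirely, so quota_usages
    # kept its boot-time values. Simplified stand-in for reserve() + commit().
    def decrement_usage(usage, flavor):
        deltas = {"instances": -1,
                  "cores": -flavor["vcpus"],
                  "ram": -flavor["memory_mb"]}
        for resource, delta in deltas.items():
            usage[resource] += delta

    usage = {"instances": 10, "cores": 10, "ram": 5120}   # as in quota_usages
    decrement_usage(usage, {"vcpus": 1, "memory_mb": 512})
    print(usage)   # {'instances': 9, 'cores': 9, 'ram': 4608}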

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/443395

Changed in nova:
assignee: huangtianhua (huangtianhua) → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

Can you see if this fixes the issue?

https://review.openstack.org/#/c/443395/

Matt Riedemann (mriedem)
Changed in nova:
importance: High → Critical
Revision history for this message
Matt Riedemann (mriedem) wrote :

I think I found the issue.

When we lookup the instance object from compute.API.get() we eventually get here:

https://github.com/openstack/nova/blob/fa2b4a82648101826566da68dd56d204e269853f/nova/compute/api.py#L2294

That's when we've found the instance mapping and it points to cell0. At that point, we modify the context object and point it at cell0. Since the context is passed by reference back up to the REST API handler code, it's then passed back down into compute.API.delete().

So the reason that https://review.openstack.org/443003 and https://review.openstack.org/#/c/443395/ don't work to decrement quota in the main nova db is because the context is targeted at the cell0 DB.
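
For illustration, a minimal example of why the retargeting sticks (plain Python; get_instance and delete_instance are hypothetical stand-ins for the compute API methods): the context object is passed by reference, so a mutation inside the lookup changes what the later delete sees:

    class Context:
        cell = "main"

    def get_instance(ctxt):
        ctxt.cell = "cell0"      # like set_target_cell: mutates the caller's object

    def delete_instance(ctxt):
        print("quota decrement goes to:", ctxt.cell)

    ctxt = Context()
    get_instance(ctxt)           # the REST API layer looks the instance up
    delete_instance(ctxt)        # prints "cell0", not "main"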

Revision history for this message
Matt Riedemann (mriedem) wrote :

Can you provide the git hash you're using to recreate this? Are you testing against Pike or Ocata 15.0.0?

Revision history for this message
Zhenyu Zheng (zhengzhenyu) wrote :

@Matt, I recreated this using the latest master; the git hash is:

commit 8fe48af1625cd2deca496de81dd72573e78b3ef2
Merge: 713f17c bf697f5
Author: Jenkins <email address hidden>
Date: Tue Mar 7 00:14:19 2017 +0000

    Merge "lib/neutron: untangle metering configuration from legacy"

Revision history for this message
Zhenyu Zheng (zhengzhenyu) wrote :

It's a devstack env FYI

Revision history for this message
huangtianhua (huangtianhua) wrote :

I ran the test after commit 5cf6bbf374a....

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/443403

Revision history for this message
Matt Riedemann (mriedem) wrote :

Try with https://review.openstack.org/#/c/443403/ and the patch below it.

Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/443745

Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/443003
Reason: Let's drop this one, https://review.openstack.org/#/c/443395/ is more complete at this point.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/443745
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=500282d867a15b441068d9327ddcb9ae69f41b7d
Submitter: Jenkins
Branch: master

commit 500282d867a15b441068d9327ddcb9ae69f41b7d
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 9 11:40:33 2017 -0500

    Add regression test for bug 1670627

    This adds a functional regression test for bug 1670627.

    This is the recreate scenario. Patches that are proposed to
    fix the bug will build on top of this and change its assertions
    to know when it's properly fixed.

    Change-Id: I872a3fd5cfd3dd869f74cd3fd0aa5da411b1fec3
    Related-Bug: #1670627

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/443395
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=018068c4caac324643c7c6a4360fad855dd096eb
Submitter: Jenkins
Branch: master

commit 018068c4caac324643c7c6a4360fad855dd096eb
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 8 21:51:07 2017 -0500

    Decrement quota usage when deleting an instance in cell0

    When we fail to schedule an instance, e.g. there are no hosts
    available, conductor creates the instance in the cell0 database
    and deletes the build request. At this point quota usage
    has been incremented in the main 'nova' database.

    When the instance is deleted, the build request is already gone
    so _delete_while_booting returns False and we lookup the instance
    in cell0 and delete it from there, but that flow wasn't decrementing
    quota usage like _delete_while_booting was.

    This change adds the same quota usage decrement handling that
    _delete_while_booting performs.

    Change-Id: I4cb0169ce0de537804ab9129bc671d75ce5f7953
    Partial-Bug: #1670627

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/443403
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=edf51119fa59ff8a3337abb9107a06fa33d3c68f
Submitter: Jenkins
Branch: master

commit edf51119fa59ff8a3337abb9107a06fa33d3c68f
Author: Matt Riedemann <email address hidden>
Date: Sat Mar 11 17:59:43 2017 -0500

    Temporarily untarget context when deleting from cell0

    When deleting an instance we look it up in the _get_instance
    method and if it's in cell0 then the context is permanently
    targeted to that cell via the set_target_cell method.

    When we delete the instance in _delete we need to temporarily
    untarget the context when we decrement the quota usage otherwise
    the quota usage gets decremented in the cell0 database rather than
    the cell database. Once the instance is deleted then we
    re-apply the target cell on the context.

    Change-Id: I7de87dce216835729283bca69f0eff59a679b624
    Closes-Bug: #1670627
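
For illustration, a minimal sketch of the temporarily-untargeting pattern this change describes (toy Python, not nova's actual helpers; Context, main_db and cell0_db are hypothetical stand-ins):

    import contextlib

    class Context:
        def __init__(self, db):
            self.db = db

    main_db = {"quota_usages": {"instances": 10}}
    cell0_db = {"quota_usages": {"instances": 0}}

    @contextlib.contextmanager
    def untargeted(ctxt):
        saved, ctxt.db = ctxt.db, main_db   # point back at the main DB
        try:
            yield ctxt
        finally:
            ctxt.db = saved                 # re-target cell0 afterwards

    ctxt = Context(cell0_db)                # the context arrived targeted at cell0
    with untargeted(ctxt):
        ctxt.db["quota_usages"]["instances"] -= 1   # lands in the main DB
    print(main_db["quota_usages"])          # {'instances': 9}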

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/445235

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/445236

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/445235
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=73de7b365cddc5404290ce6a86fcae03ad22e975
Submitter: Jenkins
Branch: stable/ocata

commit 73de7b365cddc5404290ce6a86fcae03ad22e975
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 9 11:40:33 2017 -0500

    Add regression test for bug 1670627

    This adds a functional regression test for bug 1670627.

    This is the recreate scenario. Patches that are proposed to
    fix the bug will build on top of this and change its assertions
    to know when it's properly fixed.

    Change-Id: I872a3fd5cfd3dd869f74cd3fd0aa5da411b1fec3
    Related-Bug: #1670627
    (cherry picked from commit 500282d867a15b441068d9327ddcb9ae69f41b7d)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/445236
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f8c2df78f2ff8b58d72e55391b02ae8c6faa4bc1
Submitter: Jenkins
Branch: stable/ocata

commit f8c2df78f2ff8b58d72e55391b02ae8c6faa4bc1
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 8 21:51:07 2017 -0500

    Decrement quota usage when deleting an instance in cell0

    When we fail to schedule an instance, e.g. there are no hosts
    available, conductor creates the instance in the cell0 database
    and deletes the build request. At this point quota usage
    has been incremented in the main 'nova' database.

    When the instance is deleted, the build request is already gone
    so _delete_while_booting returns False and we lookup the instance
    in cell0 and delete it from there, but that flow wasn't decrementing
    quota usage like _delete_while_booting was.

    This change adds the same quota usage decrement handling that
    _delete_while_booting performs.

    NOTE(mriedem): This change also pulls in some things from
    I7de87dce216835729283bca69f0eff59a679b624 which is not being
    backported to Ocata since in Pike it solves a slightly different
    part of this quota usage issue. In Pike the cell mapping db_connection
    is actually stored on the context object when we get the instance
    from nova.compute.api.API.get(). So the fix in Pike is slightly
    different from Ocata. However, what we need to pull from that Pike
    change is:

    1. We need to target the cell that the instance lives in to get the
       flavor information when creating the quota reservation.

    2. We need to change the functional regression test to assert that
       the bug is fixed.

    The code and tests are adjusted to be a sort of mix between both
    changes in Pike without requiring a full backport of the 2nd
    part of the fix in Pike.

    Change-Id: I4cb0169ce0de537804ab9129bc671d75ce5f7953
    Partial-Bug: #1670627
    (cherry picked from commit 018068c4caac324643c7c6a4360fad855dd096eb)

Revision history for this message
krims0n (krims0n32) wrote :

I applied this patch; now when I delete an instance in the ERROR state and the delete fails for some reason, the instance count is still decremented. This does not seem right to me.

Revision history for this message
Matt Riedemann (mriedem) wrote :

@krims0n: are you on stable/ocata or master (pike)? When the instance.destroy() fails we do a rollback of the decremented quota, here:

https://review.openstack.org/#/c/445236/1/nova/compute/api.py@1816

That's on Ocata. Are you hitting something besides an InstanceNotFound? That would probably cause an issue and we wouldn't rollback the quota change.

Please report a new bug with details on what the failure was.

Revision history for this message
melanie witt (melwitt) wrote :

I think the problem is that we're doing the quotas.commit() immediately in both of our cells v2 local delete cases (_delete_while_booting and _delete). I think the quotas.rollback() won't have any effect if quotas.commit() already happened. They are supposed to be mutually exclusive -- either commit() or rollback() after the reserve().

Revision history for this message
Matt Riedemann (mriedem) wrote :

OK, now that I look closer at the pre-cellsv2 _delete code in the compute API, we handle decrementing quota usage during local delete like this:

1. Create the reservation for the usage decrement.
2. Attempt to delete the instance.
2.1 If the instance was deleted, we commit the reservation.
2.2 If the instance delete failed (like it's already deleted), then we rollback the reservation.

We don't do that in the _delete_while_booting or cell0 case in _delete; the correct ordering is sketched below.

We should probably have a new bug for this since this one is closed.
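
For illustration, a minimal sketch of that ordering (toy Python; Reservation is a hypothetical stand-in for nova's quota reservations, where commit() and rollback() are mutually exclusive after a reserve, as melanie notes above):

    class Reservation:
        def __init__(self, usage, deltas):
            self.usage, self.deltas = usage, deltas
            self.closed = False

        def commit(self):                    # apply the reserved deltas
            assert not self.closed
            for resource, delta in self.deltas.items():
                self.usage[resource] += delta
            self.closed = True

        def rollback(self):                  # discard without applying
            assert not self.closed
            self.closed = True

    def local_delete(usage, destroy):
        quotas = Reservation(usage, {"instances": -1})   # 1. reserve
        try:
            destroy()                                    # 2. try the delete
        except Exception:
            quotas.rollback()                            # 2.2 failed: rollback
            raise
        quotas.commit()                                  # 2.1 succeeded: commit

    usage = {"instances": 10}
    local_delete(usage, destroy=lambda: None)
    print(usage)   # {'instances': 9}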

Revision history for this message
krims0n (krims0n32) wrote :

@mriedem: Yes, I am on Ocata and hitting something besides an InstanceNotFound. I tested by configuring a non-existent Ceph pool in nova.conf, creating an instance (which of course errors), then deleting that instance. I guess stopping the nova-compute service would have the same effect.

Revision history for this message
Matt Riedemann (mriedem) wrote :

You'd still go down the same "local delete" path in the compute API code since the instance was never built on a compute host, so instance.host will be None; it will bypass the _delete_while_booting code and should get here:

https://github.com/openstack/nova/blob/stable/ocata/nova/compute/api.py#L1813

Unless instance.host was set, but we should unset that on a failed spawn in the compute:

https://github.com/openstack/nova/blob/stable/ocata/nova/compute/manager.py#L1795

If instance.host is set, then we should bypass all of that code and get to the point where the delete happens on the compute node, which is the legacy and normal delete flow.
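
For illustration, the branching described above as a toy sketch (plain Python; local_delete and cast_to_compute are hypothetical stand-ins for the two compute API paths):

    def delete(instance, local_delete, cast_to_compute):
        if instance.get("host"):
            cast_to_compute(instance)    # legacy/normal flow on the compute node
        else:
            local_delete(instance)       # never scheduled: the API must also
                                         # fix up quota usage itself

    delete({"host": None},
           local_delete=lambda inst: print("local delete"),
           cast_to_compute=lambda inst: print("cast to", inst["host"]))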

At this point I think we need recreate steps and logs from n-api, n-cond and n-cpu when this fails so I can see exactly where things are breaking down. Please open a new bug and provide those details.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I've created this bug for the issue reported in comment #23.

https://bugs.launchpad.net/nova/+bug/1678326

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.0.0b1

This issue was fixed in the openstack/nova 16.0.0.0b1 development milestone.

Revision history for this message
Shannon McFarland (shmcfarl) wrote :

Is there still work going on to get this fix into Ocata?

Revision history for this message
Matt Riedemann (mriedem) wrote :

No, this should be resolved for Ocata also; the same changes on master (Pike) were backported to stable/ocata and released.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Should be fixed in Ocata 15.0.1.

Revision history for this message
Deepa (dpaclt) wrote :

Hello Matt

We are running Ocata 15.0.2 and still seeing this issue. Even after deleting an errored instance, the quota usage is not updated to reflect it.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Without some more details or logs or debugging from the people reporting that this is still an issue on Ocata, even though fixes have been backported and released for Ocata, I can't really help here.

