boot device API blocks while waiting on the BMC

Bug #1427923 reported by aeva black
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Ironic | Triaged | Medium | bin Yu |

Bug Description

Ironic's REST API blocks while waiting on the BMC when getting or setting the boot device.

This is because BootDeviceController in ironic/api/controllers/v1/node.py calls the conductor for both PUT and GET requests, and the conductor's set_boot_device and get_boot_device methods call the driver's management interface directly.

I've said it before, but we should never have synchronous calls from the REST API to a BMC, because BMCs are slow and tend to break.

ConductorManager.set_boot_device should use a worker_thread(), and the API should probably return 202 instead of 204 for this.

ConductorManager.get_boot_device should be handled by a periodic task, and the result cached on the node somewhere, so the API request just fetches from the database. Or it should do something else that makes it asynchronous.
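A minimal sketch of the two proposals above, for illustration only. This is not Ironic code: FakeBMC, Conductor, api_put_boot_device and api_get_boot_device are hypothetical names, and a plain ThreadPoolExecutor stands in for whatever worker mechanism the conductor would actually use. The set path hands the BMC call to a worker and returns 202 right away; the get path only reads a cached value that a periodic task is assumed to refresh.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from http import HTTPStatus


class FakeBMC:
    """Hypothetical stand-in for a BMC; not an Ironic driver."""

    def __init__(self):
        self.boot_device = "disk"

    def set_boot_device(self, device):
        self.boot_device = device

    def get_boot_device(self):
        return self.boot_device


class Conductor:
    """Hypothetical conductor that never blocks its caller on the BMC."""

    def __init__(self, bmc):
        self._bmc = bmc
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._lock = threading.Lock()
        self._cached_boot_device = None  # filled in by workers and refreshes

    def set_boot_device(self, device):
        # Hand the BMC call to a worker thread and return the future at once.
        return self._pool.submit(self._do_set, device)

    def _do_set(self, device):
        self._bmc.set_boot_device(device)
        with self._lock:
            self._cached_boot_device = device

    def refresh_boot_device(self):
        # Assumed to be run by a periodic task, never by an API request.
        device = self._bmc.get_boot_device()
        with self._lock:
            self._cached_boot_device = device

    def cached_boot_device(self):
        with self._lock:
            return self._cached_boot_device


def api_put_boot_device(conductor, device):
    # 202 Accepted: the change was queued, not yet confirmed by the BMC.
    future = conductor.set_boot_device(device)
    return HTTPStatus.ACCEPTED, future


def api_get_boot_device(conductor):
    # Reads only the cached value; the BMC is not contacted here.
    return HTTPStatus.OK, conductor.cached_boot_device()
```

In this sketch a GET can return a stale value (or None before the first refresh), which is the trade-off of serving the request from a cache instead of the BMC.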

aeva black (tenbrae)
Changed in ironic:
status: New → Confirmed
importance: Undecided → Medium
Changed in ironic:
assignee: nobody → Zhenzan Zhou (zhenzan-zhou)
status: Confirmed → In Progress
Changed in ironic:
assignee: Zhenzan Zhou (zhenzan-zhou) → Devananda van der Veen (devananda)
Anusha (anusha-iiitm) wrote :

@Deva, Are you working on this bug? otherwise I could get started on it.

Dmitry Tantsur (divius) wrote :

Let me butt in with my thoughts. I assume that the problem is not that the API takes a long time to respond. There are two actual problems:
1. The API takes an *undefined* amount of time to respond if the BMC locks up
2. We block our messaging layer while we're waiting for the call to finish by using a sync AMQP call

So I don't think the solution is just to make the API asynchronous. It's a breaking change and one more step toward making our errors hard to track. We already do this with power state, but at least power state is an inherently long procedure. Setting the boot device should not be, unless it breaks or hangs. If you just make the API async, you'll break inspector. Let me show. Inspector currently does roughly this:

 set_boot_device(uuid, 'pxe')
 set_power_state(uuid, 'reboot')

If we allow the former to go async and silently fail, we need to make the latter fail as well. Otherwise inspector will report success, but the node won't actually even try inspection (e.g. if it was set to local boot).

So what about making this call async at the conductor level, but sync at the API level? I.e. the API waits for some kind of notification from the conductor, up to some (probably settable) timeout?

Dmitry Tantsur (divius) wrote :

Hi Deva! Are you working on this bug? Please make sure to update the bug status accordingly. Thanks!

aeva black (tenbrae)
Changed in ironic:
assignee: Devananda van der Veen (devananda) → nobody
Dmitry Tantsur (divius)
Changed in ironic:
status: In Progress → Confirmed
bin Yu (froyo-bin)
Changed in ironic:
assignee: nobody → bin Yu (froyo-bin)
bin Yu (froyo-bin) wrote :

Hi, for time-consuming operations that may fail, could we cache their results and use a configurable wait time to ensure the API has a maximum response time?

For operations like get/set_boot_device:

For get_boot_device:
How about we store the current boot device in the node's driver field? Every time we call set_boot_device, we update this value in the cache.

For set_boot_device:

Could we set a maximum wait time? If we do not get a success response from the hardware within that time, we return 202 to notify the user.
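A rough sketch of the bounded-wait idea for set_boot_device, under stated assumptions: set_boot_device_bounded, slow_bmc_call and max_wait are hypothetical names, not Ironic APIs or config options, and a ThreadPoolExecutor stands in for the conductor's worker. The request waits up to max_wait seconds for the BMC call; it returns 204 if the call finished in time and 202 if it is still running.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from http import HTTPStatus

# Shared pool standing in for conductor workers (an assumption of this sketch).
_pool = ThreadPoolExecutor(max_workers=4)


def slow_bmc_call(device):
    """Hypothetical BMC call that takes half a second to complete."""
    time.sleep(0.5)


def set_boot_device_bounded(bmc_call, device, max_wait):
    """Run bmc_call(device) in a worker and wait at most max_wait seconds."""
    future = _pool.submit(bmc_call, device)
    done, _not_done = wait([future], timeout=max_wait)
    if not done:
        # Still running: tell the caller the request was accepted only.
        return HTTPStatus.ACCEPTED
    # Finished in time: re-raise any error from the BMC call, else succeed.
    future.result()
    return HTTPStatus.NO_CONTENT
```

For example, set_boot_device_bounded(slow_bmc_call, "pxe", max_wait=0.05) returns 202 while the worker keeps running in the background. A 202 here leaves the final outcome unknown to the caller, so a client that goes on to reboot the node would still need some way to learn whether the change succeeded.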

Ruby Loo (rloo) wrote :

We discussed this in our weekly ironic meeting on 2016-09-19 [1]. We agreed that async is the way to go, although Dmitry had suggested async at the conductor level but sync at the API level. I was going to start a discussion on the ML about this, but decided the best/fastest way was to discuss it with Dmitry [2], and he is good with async. The CLI can have a --wait option. And of course, this needs a bump in microversion.

[1] starting at 17:40:17, http://eavesdrop.openstack.org/meetings/ironic/2016/ironic.2016-09-19-17.00.log.html
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-ironic/%23openstack-ironic.2016-10-12.log.html#t2016-10-12T14:54:04

Ruby Loo (rloo) wrote :

wrt questions in #4.

get_boot_device: we cannot/should not store the info in the node's driver_info field; that field is available for the user to specify driver-related information. Without thinking too hard about this, I think we should either add a new field, or store it in driver_internal_info (my preference).

for set_boot_device: I'd suggest taking a look at the other async calls we have, and see what config options we've added for handling them. We should do a similar thing here (if there seems to be some consistent way we're handling this).

Changed in ironic:
status: Confirmed → Triaged
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Sam Betts (<email address hidden>) on branch: master
Review: https://review.openstack.org/161046
Reason: Review >1.5 years old and is in merge conflict, so we are abandoning this for now. Feel free to reactivate the review by pressing the restore button if you want to rebase and continue this work.
