Redfish power driver reports Twisted errors on retries
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Committed | Medium | Jacopo Rota | 3.6.x
Bug Description
Describe the bug:
Feedback originally added to the various bug reports that introduced retry logic into the Redfish driver; point #3 is the main concern:
1. The try/retry counting is off:
"This is the try number 0 out of 6"
"This is the try number 4 out of 6", then "Maximum number of retries reached"
MAX_REQUEST_RETRIES = 5
Why isn't MAX_REQUEST_RETRIES used in the log entry? It's hard-coded to '6'.
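A minimal sketch of what the report is asking for (function and message names are assumed, not the actual MAAS code): count attempts 1-based and build the log message from the constant instead of hard-coding the total.

```python
# Hypothetical sketch -- not the actual MAAS implementation.
# Attempts are counted 1-based and the total comes from the
# constant, so "Attempt 0 of 6" can never appear.
MAX_REQUEST_RETRIES = 5

def attempt_message(attempt):
    # attempt runs from 1 to MAX_REQUEST_RETRIES inclusive
    return "Attempt %d of %d" % (attempt, MAX_REQUEST_RETRIES)

for attempt in range(1, MAX_REQUEST_RETRIES + 1):
    print(attempt_message(attempt))
```

With this shape, changing MAX_REQUEST_RETRIES automatically updates every log line.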
2. Why does the code retry on ANY error in redfish_request, not just on returning transitional power statuses?
It could be get_etag, get_node_id, set_pxe_boot, or any power control, not just power_query.
It could be a permissions error or any other type of error.
So why retry a permissions error six times? It's still going to fail every time and just wastes time.
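A sketch of the distinction the report is drawing (the status-code sets here are illustrative assumptions, not MAAS's actual policy): only plausibly transient failures are worth retrying, while a permissions error should fail fast because retrying can never succeed.

```python
# Hypothetical sketch -- illustrative classification, not MAAS code.
RETRYABLE_STATUS = {500, 502, 503, 504}  # transient server-side errors
FATAL_STATUS = {401, 403}                # permissions: retrying never helps

def should_retry(status_code):
    """Return True only for errors that may clear up on a later try."""
    if status_code in FATAL_STATUS:
        return False
    return status_code in RETRYABLE_STATUS
```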
3. It doesn't appear to report the error after the first instance, so it's hard to tell from the log what's actually going on:
maas.drivers.
maas.drivers.
What file? Who closed it? raise_error?
4. Spelling/grammar
"This is the try number..."
"Retring after %f seconds."
Steps to reproduce:
1. Reduce permissions on the BMC account used for power control so that it can't actually power on the machine or can't set PXE next boot.
2. Attempt a MAAS action that requires a power action, such as Commission, Deploy, or Release with disk erase.
Expected behavior (what should have happened?):
Power control error reported in the machine event log.
Actual behavior (what actually happened?):
twisted.
Must go to the region controller's maas.log to find the actual error, which is only reported the first time; only the result of the last try is reported to the machine event log.
MAAS version and installation type (deb, snap):
deb, versions 3.3.9 and 3.4.5
MAAS setup (HA, single node, multiple regions/racks):
single region controller, multiple rack controllers (two per data center, in HA pairs)
Host OS distro and version:
Ubuntu 22.04 (Jammy)
Additional context:
Example of a Commission failure where the machine actually powered on, but failed out with this retry logic:
Tue, 03 Dec. 2024 08:22:42 TFTP Request - grubx64.efi
Tue, 03 Dec. 2024 08:20:08 Failed to power on node - Power on for the node failed: Failed talking to node's BMC: [<twisted.
Tue, 03 Dec. 2024 08:20:08 Node changed status - From 'Commissioning' to 'Failed commissioning'
Tue, 03 Dec. 2024 08:20:08 Marking node failed - Power on for the node failed: Failed talking to node's BMC: [<twisted.
Tue, 03 Dec. 2024 08:19:16 Powering on
Tue, 03 Dec. 2024 08:19:16 Node - Started commissioning on '88TN0R3'.
Related branches
- MAAS Lander: Needs Fixing
- Anton Troyanov: Approve
Diff: 155 lines (+88/-10), 2 files modified:
- src/provisioningserver/drivers/power/redfish.py (+35/-9)
- src/provisioningserver/drivers/power/tests/test_redfish.py (+53/-1)
Changed in maas:
- status: New → Triaged
- importance: Undecided → Low
- milestone: none → 3.6.x

Changed in maas:
- status: In Progress → Fix Committed
As for:
> Why does the code retry on ANY error in redfish_request, not just on returning transitional power statuses?
Because it's a much more reliable solution. I do understand that for a 401 or 403 we can error out immediately, so that would be a good improvement to implement.
> It doesn't appear to send back the error after the first instance, so it's hard to tell from the log what's actually going on:
We can try to improve the logging.
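One possible shape for that logging improvement (a sketch under assumed names, not the actual fix): record the underlying exception on every failed attempt, not only the first, and re-raise the last error so the caller can surface it in the machine event log.

```python
import logging

log = logging.getLogger("maas.drivers.power.redfish")

# Hypothetical sketch -- names and structure assumed, not MAAS code.
def run_with_retries(request, max_retries=5):
    """Call request(), logging the exception on every failed attempt."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return request()
        except Exception as exc:
            last_error = exc
            # Every attempt's failure is logged, so the log shows
            # what actually went wrong, not just a final bare error.
            log.warning(
                "Redfish request failed (attempt %d of %d): %s",
                attempt, max_retries, exc,
            )
    raise last_error
```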