Redfish power driver reports twisted errors on retries

Bug #2091979 reported by Paul Jonason
Affects: MAAS
Status: Fix Committed
Importance: Medium
Assigned to: Jacopo Rota
Milestone: 3.6.0

Bug Description

Describe the bug:

This feedback was originally added to the various bug reports that introduced retry logic into the Redfish driver; point #3 below is the main concern:

1. The Try/Retry counting is off
"This is the try number 0 out of 6"
"This is the try number 4 out of 6", then "Maximum number of retries reached"
MAX_REQUEST_RETRIES = 5
Why isn't MAX_REQUEST_RETRIES used in the log entry? It's hard-coded to '6'. (See the sketch after this list.)

2. Why does the code retry on ANY error in redfish_request, not just on returning transitional power statuses?
The failing call could be get_etag, get_node_id, set_pxe_boot, or any power control call, not just power_query.
It could be a permissions error or any other type of error.
So why retry a permissions error 6 times? It's still going to fail every time and just wastes time.

3. It doesn't appear to send back the error after the first instance, so it's hard to tell from the log what's actually going on:

maas.drivers.power.redfish: [info] Power action failure: Redfish request failed with response status code: 403.. This is the try number 0 out of 6.
maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 1 out of 6.

What file? Who closed it? raise_error?

4. Spelling/grammar
"This is the try number..."
"Retring after %f seconds."

Steps to reproduce:
1. Reduce permissions on BMC account used for power control so that it can't actually power on or can't set PXE next boot
2. Attempt a MAAS action that requires a power action like Commission, Deploy, or Release with disk erase

Expected behavior (what should have happened?):
Power control error reported in machine event logs

Actual behavior (what actually happened?):
twisted.python.failure error reported in machine event log after 6 attempts at power control
Must go to the region controller's maas.log to find the actual error, which is only reported on the first try; only the result of the last try is reported to the machine event log.

MAAS version and installation type (deb, snap):
deb, versions 3.3.9 and 3.4.5

MAAS setup (HA, single node, multiple regions/racks):
single region controller, multiple rack controllers - two per data center in HA pairs

Host OS distro and version:
Ubuntu 22.04 (jammy)

Additional context:
Example of a Commission failure where the machine actually powered on, but failed out with this retry logic:

Tue, 03 Dec. 2024 08:22:42 TFTP Request - grubx64.efi
 Tue, 03 Dec. 2024 08:20:08 Failed to power on node - Power on for the node failed: Failed talking to node's BMC: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]
 Tue, 03 Dec. 2024 08:20:08 Node changed status - From 'Commissioning' to 'Failed commissioning'
 Tue, 03 Dec. 2024 08:20:08 Marking node failed - Power on for the node failed: Failed talking to node's BMC: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]
 Tue, 03 Dec. 2024 08:19:16 Powering on
 Tue, 03 Dec. 2024 08:19:16 Node - Started commissioning on '88TN0R3'.


Jacopo Rota (r00ta)
Changed in maas:
status: New → Triaged
importance: Undecided → Low
milestone: none → 3.6.x
Jacopo Rota (r00ta) wrote:

As for

> Why does the code retry on ANY error in redfish_request, not just on returning transitional power statuses?

Because it's a much more reliable solution. I do understand that for 401 or 403 we can error out immediately, so it's a good improvement to implement.
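
For illustration, such an early exit could look roughly like the sketch below, assuming the retry loop can see the HTTP status code of the failed request (the status set and helper name are placeholders, not driver code). The loop would then fail immediately when should_retry() returns False instead of burning through all attempts.

    # Hypothetical helper: statuses that will not succeed on retry.
    PERMANENT_STATUS_CODES = frozenset({401, 403})


    def should_retry(status_code):
        """Retry transient failures (e.g. 500, 503); give up on permanent ones."""
        return status_code not in PERMANENT_STATUS_CODES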

> It doesn't appear to send back the error after the first instance, so it's hard to tell from the log what's actually going on:

We can try to improve logging.

Jacopo Rota (r00ta) wrote:

Also, can you clarify how the retry logic is involved here?

Tue, 03 Dec. 2024 08:22:42 TFTP Request - grubx64.efi
 Tue, 03 Dec. 2024 08:20:08 Failed to power on node - Power on for the node failed: Failed talking to node's BMC: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]
 Tue, 03 Dec. 2024 08:20:08 Node changed status - From 'Commissioning' to 'Failed commissioning'
 Tue, 03 Dec. 2024 08:20:08 Marking node failed - Power on for the node failed: Failed talking to node's BMC: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]
 Tue, 03 Dec. 2024 08:19:16 Powering on
 Tue, 03 Dec. 2024 08:19:16 Node - Started commissioning on '88TN0R3'.

I don't see the log messages about the retries.

Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
milestone: 3.6.x → 3.6.0
importance: Low → Medium
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Paul Jonason (pjonason) wrote:

Sorry, the log entries in the original post are from the machine's event log, as shown in the UI. Here is the corresponding section from maas.log:

2024-12-03T08:19:16.730805+00:00 controller maas.node: [info] 88TN0R3: Status transition from READY to COMMISSIONING
2024-12-03T08:19:16.800154+00:00 controller maas.power: [info] Changing power state (on) of node: 88TN0R3 (834en4)
2024-12-03T08:19:16.800872+00:00 controller maas.node: [info] 88TN0R3: Commissioning started
2024-12-03T08:19:39.692941+00:00 controller maas.drivers.power.redfish: [info] Power action failure: Redfish request failed with response status code: 500.. This is the try number 0 out of 6.
2024-12-03T08:19:40.196146+00:00 controller maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 1 out of 6.
2024-12-03T08:19:41.699272+00:00 controller maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 2 out of 6.
2024-12-03T08:19:45.202266+00:00 controller maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 3 out of 6.
2024-12-03T08:19:52.705952+00:00 controller maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 4 out of 6.
2024-12-03T08:20:08.211344+00:00 controller maas.drivers.power.redfish: [error] Maximum number of retries reached. Giving up!
2024-12-03T08:20:08.211583+00:00 controller maas.power: [error] Error changing power state (on) of node: 88TN0R3 (834en4)
2024-12-03T08:20:08.225207+00:00 controller maas.node: [info] 88TN0R3: Status transition from COMMISSIONING to FAILED_COMMISSIONING
2024-12-03T08:20:08.247266+00:00 controller maas.node: [error] 88TN0R3: Marking node failed: Power on for the node failed: Failed talking to node's BMC: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]

So the only place the actual error is reported is here in maas.log on the first line ("Redfish request failed with response status code: 500"). This information is not recorded in the machine event log. The machine event log only records a scary, confusing generic error about I/O operation on a closed file.

A technician trying to troubleshoot the error may only have access to the UI, and might not have access to log in to the controller to peruse the maas.log. That might require elevation to the next level of engineer and a delay in finding a solution. They might also try troubleshooting twisted, which isn't really related to the underlying issue, and again will delay a solution.

Jacopo Rota (r00ta) wrote:

From memory, `[<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]` means that we started reading from the socket but the connection was dropped for some reason.

We should add some debug logs to better track the real problem and not the symptom.
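
As an illustration of what such logging could look like (not the driver's current code), an errback can log the failure's type, message, and traceback instead of its bare repr, so the underlying cause stays visible on every attempt:

    import logging

    from twisted.python.failure import Failure

    log = logging.getLogger(__name__)


    def log_power_failure(failure: Failure, attempt: int, max_retries: int) -> Failure:
        # Show the exception type and message up front...
        log.info(
            "Power action failure (attempt %d of %d): %s: %s",
            attempt, max_retries, failure.type.__name__, failure.getErrorMessage(),
        )
        # ...and keep the full traceback at debug level for troubleshooting.
        log.debug(failure.getTraceback())
        return failure  # propagate so the retry logic still sees the error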

Jacopo Rota (r00ta) wrote:

@pjonason is there any chance you could monitor what iLO is actually doing when these failures occur? I suspect it is restarting, so the responses we get are unpredictable.

I need to check whether there is a liveness probe endpoint that we can use to ensure the BMC is up and ready to accept requests. Otherwise the results might be unpredictable.
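
As a rough sketch of such a probe (endpoint choice, timeouts, and retry counts here are assumptions, not MAAS settings): the Redfish service root at /redfish/v1/ is normally served without authentication, so polling it until it answers is one way to check that the BMC is reachable before sending the real request.

    import ssl
    import time
    import urllib.error
    import urllib.request


    def wait_for_bmc(base_url, attempts=10, delay=3.0):
        """Return True once GET <base_url>/redfish/v1/ answers with 200."""
        # BMCs commonly use self-signed certificates, so verification is
        # skipped in this sketch; a real implementation should honour the
        # configured trust settings.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        for _ in range(attempts):
            try:
                with urllib.request.urlopen(
                    f"{base_url}/redfish/v1/", timeout=5, context=ctx
                ) as response:
                    if response.status == 200:
                        return True
            except (urllib.error.URLError, OSError):
                pass
            time.sleep(delay)
        return False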

Paul Jonason (pjonason) wrote:

FYI, what we mostly deal with is Dell iDRACs, though we have a smattering of different vendors in our data centers. Rarely would a BMC itself be restarting during a MAAS action in our environment, but we have seen iDRACs be slow to respond to commands at times, and we have increased timeouts in various files to account for it.

Our "typical" scenario when these events occur is a 409 code, when the BMC is either already sending the requested power command to the system, or it's already in the state requested (on or off), and "wires get crossed", as they say.

Jacopo Rota (r00ta) wrote (last edit):

Oh, I assumed you were using an HP Gen10 machine as that was the original root of the problems.

So do you confirm the logs you posted are from a Dell machine?

Paul Jonason (pjonason) wrote:

Yes, indeed, the logs I posted above are from a Dell machine. We have experienced similar behavior on an HPE Gen10, however. Here's a more recent example from an HPE:

2024-12-12T12:44:59-06:00 rack01 maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]. This is the try number 0 out of 6.
2024-12-12T12:45:07-06:00 rack02 maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 1 out of 6.
2024-12-12T12:45:08-06:00 rack02 maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 2 out of 6.
2024-12-12T12:45:12-06:00 rack02 maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 3 out of 6.
2024-12-12T12:45:19-06:00 rack02 maas.drivers.power.redfish: [info] Power action failure: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]. This is the try number 4 out of 6.
2024-12-12T12:45:35.513666-06:00 region01 maas.node: [error] 2M291603TW: Marking node failed: Power on for the node failed: Failed talking to node's BMC: [<twisted.python.failure.Failure builtins.ValueError: I/O operation on closed file.>]

Note that it switched rack controllers between try 0 and try 1 here, but this was not the case in the earlier logs I sent; those occurred in a MAAS instance with only one combined region+rack controller.

Jacopo Rota (r00ta) wrote:

I created another dedicated bug https://bugs.launchpad.net/maas/+bug/2092172
