Quick update: Gavin, Jason and I have been working to debug this. Through crude bisecting, we've isolated the problem (at least with release()) to the call to power_off_nodes() in NodeManager.stop_nodes().
More precisely, if call_power_command() in power_nodes() (src/maasserver/clusterrpc/power.py) returns early rather than calling the Power(Off|On) RPC command, all the nodes release just hunky dory. If the RPC call goes through, roughly half of the nodes remain Allocated. We've also seen PowerActionAlreadyInProgress errors in the log for some nodes, though we don't know yet whether those are spurious.
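For anyone following along, here's roughly the shape of the hack we used to bisect. This is a sketch, not a patch: the real signatures in src/maasserver/clusterrpc/power.py differ, and call_power_command here is a stand-in stub.

    # Sketch of the bisect hack described above. Names follow this
    # comment's description; the real code in
    # src/maasserver/clusterrpc/power.py looks different.

    def call_power_command(node, power_change):
        """Stand-in stub for the real RPC dispatch that issues
        the Power(Off|On) command to the node's cluster."""
        raise NotImplementedError("real version sends the RPC")

    SKIP_RPC = True  # flip to False to restore the normal path

    def power_nodes(nodes, power_change):
        if SKIP_RPC:
            # Early return: the Power(Off|On) RPC is never issued,
            # and every node releases cleanly.
            return
        # Normal path: with the RPC actually going out, roughly half
        # the nodes stay Allocated, and we see
        # PowerActionAlreadyInProgress errors for some of them.
        for node in nodes:
            call_power_command(node, power_change)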
Gavin and I are leaving to grab dinner now; we'll pick this up later.