Unreliable PXE-booting of Quanta S910-X31E

Bug #1690878 reported by Rod Smith
This bug affects 1 person
Affects: MAAS
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

In doing 17.04 regression testing of a Quanta STRATOS S910-X31E server, I'm having problems PXE-booting the server for enlistment, commissioning, and deployment. The symptom is that the server (which is configured to boot in EFI mode) successfully downloads GRUB, but GRUB seems to be unable to retrieve its configuration and follow-on files from the MAAS server computer, resulting in a "grub>" prompt on the Quanta server's display. This does not happen on *EVERY* boot, though; on about 1 in 10 attempts, the server *WILL* successfully PXE-boot from the MAAS server, at which point the requested operation (enlistment, commissioning, or deployment [with caveats]) will succeed. Note that on deployment, the system must PXE-boot twice -- once to install the OS and again to reboot into it. This second boot produces the same type of failure as the first one, so the deployment is likely to fail. I have gotten it to deploy by watching the screen and resetting when the "grub>" prompt appears; if I'm lucky, the system will boot with few enough failures to avoid a timeout in MAAS.
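
To put rough numbers on that, here is a minimal sketch, assuming each boot attempt is independent and succeeds at the roughly 1-in-10 rate observed above. The function names are illustrative, and the independence assumption is not something the logs confirm.

```python
def p_success_within(attempts: int, p: float = 0.1) -> float:
    """Probability that at least one of `attempts` independent boots succeeds.

    `p` is the assumed per-attempt success rate (about 1 in 10 in this report).
    """
    return 1.0 - (1.0 - p) ** attempts


def p_deploy_unattended(p: float = 0.1) -> float:
    """Probability that both PXE boots of a deployment succeed on the first try.

    Deployment needs two boots: one to install the OS, one to reboot into it.
    """
    return p * p


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:2d} attempts: {p_success_within(n):.3f}")
    print(f"unattended deployment: {p_deploy_unattended():.3f}")
```

Under those assumptions an unattended deployment (two boots, no resets) succeeds about 1 time in 100, and even ten attempts in a row leave roughly a 35% chance that none of them gets past GRUB.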

I have attempted several workarounds, none of which has helped:

* I've reconfigured the system to boot in BIOS/CSM/legacy
  mode rather than EFI mode. The server seems to be unable
  to PXE-boot at all in this mode, though; it goes straight
  to the firmware setup utility. Perhaps I'm missing a key
  setting, though.
* I've set the server's second (of two) network
  interfaces to enabled, but without PXE booting. This
  has no effect.
* I've set the server's second network interface to
  disabled. This does not help and, if done during
  commissioning, results in MAAS thinking the server
  has just one network device.
* I've tried quitting out of GRUB. When multiple network
  devices are enabled, this results in an attempt to PXE-
  boot from the second device, which produces the same
  failure as the first.

Although I discovered the problem with Ubuntu 17.04, I've tested with deployment of other versions (primarily 16.04), and that has not helped.

This server was successfully deployed and certified on November 3, 2016:

https://certification.canonical.com/certificates/1611-10128/

Thus, it appears that this problem is new.

Other servers I've been regression testing have not exhibited this problem, so I don't think it's a MAAS configuration option -- or if it is, it's something that's affecting just this one server (oil-moltres; MAC addresses 2c:60:0c:73:75:8c and 2c:60:0c:73:75:8d; BMC IP address 10.245.32.2). I'm attaching the /var/log/maas directory tree from the MAAS server. (Note that I'm taking that image during an ongoing deployment that has finally begun to work, but there were quite a few failures shortly before that.)

MAAS version information is:

$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                            Version                        Architecture Description
+++-===============================-==============================-============-==================================================
ii  maas                            2.1.4+bzr5591-0ubuntu1~16.04.1 all          "Metal as a Service" is a physical cloud and IPAM
ii  maas-cert-server                0.2.30-0~74~ubuntu16.04.1      all          Ubuntu certification support files for MAAS server
ii  maas-cli                        2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS client and command-line interface
un  maas-cluster-controller         <none>                         <none>       (no description available)
ii  maas-common                     2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS server common files
ii  maas-dhcp                       2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS DHCP server
ii  maas-dns                        2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS DNS server
ii  maas-proxy                      2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS Caching Proxy
ii  maas-rack-controller            2.1.4+bzr5591-0ubuntu1~16.04.1 all          Rack Controller for MAAS
ii  maas-region-api                 2.1.4+bzr5591-0ubuntu1~16.04.1 all          Region controller API service for MAAS
ii  maas-region-controller          2.1.4+bzr5591-0ubuntu1~16.04.1 all          Region Controller for MAAS
un  maas-region-controller-min      <none>                         <none>       (no description available)
un  python-django-maas              <none>                         <none>       (no description available)
un  python-maas-client              <none>                         <none>       (no description available)
un  python-maas-provisioningserver  <none>                         <none>       (no description available)
ii  python3-django-maas             2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS server Django web framework (Python 3)
ii  python3-maas-client             2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS python API client (Python 3)
ii  python3-maas-provisioningserver 2.1.4+bzr5591-0ubuntu1~16.04.1 all          MAAS server provisioning libraries (Python 3)

Rod Smith (rodsmith) wrote :
Andres Rodriguez (andreserl) wrote :

Since the machine boots successfully sometimes and fails at other times, this seems like it would be a GRUB issue. That said, a few questions:

1. Has the firmware version changed between the certification date and today?

2. Can you please attach the detailed node event log for the machine when:
2.1. it was able to deploy/commission
2.2. it was unable to deploy/commission

3. Can you obtain the grub version from the console (if possible)?

Changed in maas:
status: New → Incomplete
Rod Smith (rodsmith) wrote :

AFAIK, the firmware in the node has not changed; however, it was in the care of OIL between then and now, so I can't make any promises about that. I don't see anything in the BMC's web-accessible UI about a firmware update, but I don't know if such a message would show up there. The current installed firmware is version 1.18, dated July 8, 2015. That's also the latest available from Quanta. We received the system in November of 2016 -- or at least, that's what the CID number indicates. (My guess is we took delivery in October or earlier.)

As to the event log, I've attached the entire /var/log/maas directory tree. If you need something not there, could you point me to instructions on how to obtain it?

The GRUB version displayed when the node boots is 2.02~beta2-36ubuntu3.9.

Rod Smith (rodsmith) wrote :

Could somebody take another look at this, please? It's a show-stopping problem on a certified system.

Changed in maas:
status: Incomplete → Confirmed
Andres Rodriguez (andreserl) wrote :

The requested information is not attached to the bug report; as such, this is marked as Incomplete.

Additional information:

1. What's the latest BIOS version available from the manufacturer?
2. What's the BIOS version on the hardware?
3. Is the system configured for Secure Boot? If so, can you disable it, use EFI mode without Secure Boot, and report back?

Changed in maas:
status: Confirmed → Incomplete
Rod Smith (rodsmith) wrote :

Here's a log of a failed deployment:

 Queried node's BMC - Power state queried: on Fri, 09 Jun. 2017 16:20:52
 Node changed status - From 'Deploying' to 'Failed deployment' Fri, 09 Jun. 2017 16:20:32
 Marking node failed - Node operation 'Deploying' timed out after 40 minutes. Fri, 09 Jun. 2017 16:20:32
 TFTP Request - grubx64.efi Fri, 09 Jun. 2017 15:40:51
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:40:50
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:40:50
 Node powered on Fri, 09 Jun. 2017 15:39:52
 Powering node on Fri, 09 Jun. 2017 15:39:48
 User starting deployment - (rodsmith) Fri, 09 Jun. 2017 15:39:48
 User acquiring node - (rodsmith) Fri, 09 Jun. 2017 15:39:48
 Node powered off Fri, 09 Jun. 2017 15:39:42
 Node changed status - From 'Releasing' to 'Ready' Fri, 09 Jun. 2017 15:39:42
 Powering node off Fri, 09 Jun. 2017 15:39:37
 User releasing node - (rodsmith) Fri, 09 Jun. 2017 15:39:37
 TFTP Request - grubx64.efi Fri, 09 Jun. 2017 15:37:19
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:37:18
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:37:18
 TFTP Request - grubx64.efi Fri, 09 Jun. 2017 15:35:05
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:35:04
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:35:04
 Node powered on Fri, 09 Jun. 2017 15:34:06
 Powering node on Fri, 09 Jun. 2017 15:34:02
 User starting deployment - (rodsmith) Fri, 09 Jun. 2017 15:34:02
 User acquiring node - (rodsmith) Fri, 09 Jun. 2017 15:34:02
 Node powered off Thu, 08 Jun. 2017 17:48:14
 Node changed status - From 'Releasing' to 'Ready' Thu, 08 Jun. 2017 17:48:14
 Powering node off Thu, 08 Jun. 2017 17:48:10
 User releasing node - (rodsmith) Thu, 08 Jun. 2017 17:48:10
 TFTP Request - grubx64.efi Thu, 08 Jun. 2017 17:47:09
 TFTP Request - bootx64.efi Thu, 08 Jun. 2017 17:47:08
 TFTP Request - bootx64.efi Thu, 08 Jun. 2017 17:47:08
 Node powered on Thu, 08 Jun. 2017 17:46:09
 Powering node on Thu, 08 Jun. 2017 17:46:05
 User starting deployment - (rodsmith) Thu, 08 Jun. 2017 17:46:04
 User acquiring node - (rodsmith) Thu, 08 Jun. 2017 17:46:04
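
The pattern in this log (only the two boot loaders are ever requested) can be checked mechanically. The following is a minimal sketch, assuming the event log is pasted as text in the format above, and assuming that a boot which progressed past GRUB would show additional TFTP requests in the same log; no successful log is shown here, so that second point is an assumption. The function names are illustrative.

```python
import re
from collections import Counter

# Matches event-log lines such as:
#   " TFTP Request - grubx64.efi Fri, 09 Jun. 2017 15:40:51"
TFTP_LINE = re.compile(r"^\s*TFTP Request - (?P<name>\S+)\s")

# The two EFI boot loaders that appear in the failed-deployment log.
LOADERS = {"bootx64.efi", "grubx64.efi"}


def tftp_requests(log_text: str) -> Counter:
    """Count TFTP requests per file name in a pasted node event log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TFTP_LINE.match(line)
        if match:
            counts[match.group("name")] += 1
    return counts


def follow_on_files(counts: Counter) -> list:
    """Return requested files other than the boot loaders (empty if none)."""
    return sorted(set(counts) - LOADERS)


SAMPLE = """\
 TFTP Request - grubx64.efi Fri, 09 Jun. 2017 15:40:51
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:40:50
 TFTP Request - bootx64.efi Fri, 09 Jun. 2017 15:40:50
 Node powered on Fri, 09 Jun. 2017 15:39:52
"""

if __name__ == "__main__":
    counts = tftp_requests(SAMPLE)
    print(dict(counts))
    print("follow-on files:", follow_on_files(counts))
```

An empty follow-on list matches the symptom described in this report: GRUB itself is downloaded, but nothing after it is requested.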

The Secure Boot setting does NOT affect the problem -- the same issue occurs whether Secure Boot is enabled or not.

Concerning the firmware, as stated above:

> The current installed firmware is version 1.18, dated July 8, 2015. That's
> also the latest available from Quanta.

Jeff Lane  (bladernr)
Changed in maas:
status: Incomplete → Confirmed
Jeff Lane  (bladernr) wrote :

Hey Rod,

Is this an issue with the latest regression run for 17.10?

Changed in maas:
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired