Unreliable PXE-booting of Quanta S910-X31E
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Expired | Undecided | Unassigned |
Bug Description
In doing 17.04 regression testing of a Quanta STRATOS S910-X31E server, I'm having problems PXE-booting the server for enlistment, commissioning, and deployment. The symptom is that the server (which is configured to boot in EFI mode) successfully downloads GRUB, but GRUB seems to be unable to retrieve its configuration and follow-on files from the MAAS server computer, resulting in a "grub>" prompt on the Quanta server's display. This does not happen on *EVERY* boot, though; on about 1 in 10 attempts, the server *WILL* successfully PXE-boot from the MAAS server, at which point the requested operation (enlistment, commissioning, or deployment [with caveats]) will succeed. Note that on deployment, the system must PXE-boot twice -- once to install the OS and again to reboot into it. This second boot produces the same type of failure as the first one, so the deployment is likely to fail. I have gotten it to deploy by watching the screen and resetting when the "grub>" prompt appears; if I'm lucky, the system will boot with few enough failures to avoid a timeout in MAAS.
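To put rough numbers on why deployment is so unlikely to finish unattended, here is a back-of-the-envelope sketch. It assumes the roughly 1-in-10 success rate applies independently to each PXE boot, which is an assumption for illustration and not something that has been measured; the variable names are made up for this example.

```python
# Rough estimate only. Assumes each PXE boot succeeds independently at the
# ~1-in-10 rate described above (an assumption, not a measurement).
p_boot = 0.1      # approximate success rate per PXE attempt
boots_needed = 2  # deployment PXE-boots twice: install, then reboot

# Chance that both boots succeed with no manual reset at all.
p_unattended = p_boot ** boots_needed

# Expected total PXE attempts when each failure is followed by a manual
# reset: a geometric distribution has mean 1/p attempts per required boot.
expected_attempts = boots_needed / p_boot

print(f"unattended success: {p_unattended:.0%}")
print(f"expected attempts:  {expected_attempts:.0f}")
```

Under those assumptions an unattended deployment succeeds about 1% of the time, and a hand-reset deployment needs about 20 PXE attempts on average, which is consistent with deployments running into the MAAS timeout.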
I have attempted several workarounds, none of which has helped:
* I've reconfigured the system to boot in BIOS/CSM/legacy
mode rather than EFI mode. The server seems to be unable
to PXE-boot at all in this mode, though; it goes straight
to the firmware setup utility. I may be missing a key
setting.
* I've set the second of the server's two network
interfaces to enabled, but without PXE-booting. This has
no effect.
* I've set the server's second network interface to
disabled. This does not help and, if done during
commissioning, results in MAAS thinking the server has
just one network device.
* I've tried quitting out of GRUB. When multiple network
devices are enabled, this results in an attempt to PXE-
boot from the second device, which produces the same
failure as the first.
Although I discovered the problem with Ubuntu 17.04, I've tested with deployment of other versions (primarily 16.04), and that has not helped.
This server was successfully deployed and certified on November 3, 2016:
https:/
Thus, it appears that this problem is new.
Other servers I've been regression testing have not exhibited this problem, so I don't think it's a MAAS configuration option -- or if it is, it's something that's affecting just this one server (oil-moltres; MAC addresses 2c:60:0c:73:75:8c and 2c:60:0c:73:75:8d; BMC IP address 10.245.32.2). I'm attaching the /var/log/maas directory tree from the MAAS server. (Note that I'm taking that image during an ongoing deployment that has finally begun to work, but there were quite a few failures shortly before that.)
MAAS version information is:
$ dpkg -l '*maas*'|cat
Desired=
| Status=
|/ Err?=(none)
||/ Name Version Architecture Description
+++-===
ii maas 2.1.4+bzr5591-
ii maas-cert-server 0.2.30-
ii maas-cli 2.1.4+bzr5591-
un maas-cluster-
ii maas-common 2.1.4+bzr5591-
ii maas-dhcp 2.1.4+bzr5591-
ii maas-dns 2.1.4+bzr5591-
ii maas-proxy 2.1.4+bzr5591-
ii maas-rack-
ii maas-region-api 2.1.4+bzr5591-
ii maas-region-
un maas-region-
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-
ii python3-django-maas 2.1.4+bzr5591-
ii python3-maas-client 2.1.4+bzr5591-
ii python3-
Changed in maas:
status: Incomplete → Confirmed
Since the machine works some of the time but not always, this seems like it would be a GRUB issue. That said, a few questions:
1. Has the firmware version changed between the certification date and today?
2. Can you please attach the detailed node event log for the machine when:
2.1. it was able to deploy/commission
2.2. it was unable to deploy/commission
3. Can you obtain the grub version from the console (if possible)?