Subnet changed to wrong fabric, impacting DHCP

Bug #2031482 reported by Gregory Orange
Affects: MAAS
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: 3.5.0

Bug Description

Context:
2x rackd, 1x regiond
Ubuntu 22.04, installed from packages, version 1:3.3.3-13184-g.3e9972c19-0ubuntu1~22.04.1

https://discourse.maas.io/t/why-did-a-subnet-change-to-a-different-fabric/7327 describes the issue, and therein we were asked to submit this bug report.

After DHCP stopped working on a subnet served by our second rackd controller, we discovered that the subnet had somehow been moved into the wrong fabric. As a result, `dhcpd-interfaces` no longer listed the relevant interface (a bridge named `admin`), and dhcpd.conf no longer referenced that subnet.

When we moved the subnet back to the correct fabric, it all started working again as expected.

We did not look at dhcpd.conf on the first rackd controller, which houses that other fabric. The subnet in that other fabric continued to work fine during this time.
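
For anyone checking the same symptom, what we looked at on the affected rack controller was roughly the following (paths assume a package install like ours; the snap keeps these files elsewhere):

    # Interfaces MAAS tells dhcpd to listen on; the admin bridge had dropped out of this list
    cat /var/lib/maas/dhcpd-interfaces

    # Subnet declarations MAAS rendered for dhcpd; the affected subnet was missing here
    grep -n 'subnet ' /var/lib/maas/dhcpd.conf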

We do not know how the subnet was reconfigured. Only three people here know how to do that, and all know enough not to make the 6 or so clicks required to make the change in the web UI. We do not easily know how to do it via the CLI.
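
For completeness, it looks like the move can be made from the CLI as well, since a subnet hangs off a VLAN and the VLAN belongs to a fabric. A minimal sketch, untested here and assuming a logged-in CLI profile named `admin`:

    # Find the subnet id, the fabric ids, and the VLAN ids on the correct fabric
    maas admin subnets read
    maas admin fabrics read
    maas admin vlans read <fabric_id>

    # Point the subnet at a VLAN on that fabric, which is what moves it between fabrics
    maas admin subnet update <subnet_id> vlan=<vlan_id>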

I will attach logs from the times in question soon. From the virsh errors in the logs we believe the problem began at or very close to 2023-08-09T11:17:31.421297+08:00. Note that the second rackd controller and the regiond controller use UTC+8, while the first rackd controller uses UTC.

Revision history for this message
Gregory Orange (gregoryo2017) wrote (last edit):

We resolved the issue around 2023-08-14T09:21:36.043193+08:00, when the first virsh state change appeared: `Power state has changed from error to on.`

This (and the errors that preceded it) happened because the KVM Pod nodes hosting these machines lost the IP on their admin interface as part of the DHCP outage, then regained it once the outage was fixed.

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

The log files covering the time period come to 85 MB gzipped, so I have uploaded them to Google Drive. I gave each file a prefix indicating whether it came from rackd1, rackd2 or the region controller.

https://drive.google.com/file/d/1BRVhrKQLsaxR8rhk1ne3lfQ5feBfEYr7/view?usp=sharing

Let me know if you would prefer that I upload the file(s) directly here.

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

Is it somehow relevant that the subnet was erroneously moved to fabric 0? Perhaps some kind of error falls back to a default selection of the first fabric?

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

This exception is repeating in the regiond logs:

2023-08-13 00:00:47 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 293, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 62, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 370, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 860, in signal
    target_status = process(node, request, status)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 682, in _process_commissioning
    self._store_results(
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 565, in _store_results
    script_result.store_result(
  File "/usr/lib/python3/dist-packages/metadataserver/models/scriptresult.py", line 270, in store_result
    assert self.status in SCRIPT_STATUS_RUNNING_OR_PENDING
AssertionError

This is likely due to an error when fetching hardware info for the region controller; if the region controller has an IP within the subnet in question, the subnet can be erroneously moved for that reason. Can you confirm whether the region controller does in fact have an IP in this subnet?
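
If it helps, one way to check from the CLI (assuming a profile named `admin`; the subnet id comes from `maas admin subnets read`):

    # Addresses currently assigned on the subnet (should include which node holds each one)
    maas admin subnet ip-addresses <subnet_id>

    # Or, from the other side, list the region controller(s) and their interfaces
    maas admin region-controllers read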

Changed in maas:
status: New → Incomplete
Revision history for this message
Gregory Orange (gregoryo2017) wrote :

No, the region controller has only one IP, in subnet 28, fabric 20.

For comparison, the two rack controllers have IPs in a range of subnets they serve, but not in subnet 28.

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

It happened again, so here is some more context. DHCP leases on the subnet in question are set to 1 hour.

2023-09-06 AWST

11:39:39 We restarted the maas-rackd daemon on the second rackd server (acacia)
13:17:21 DHCP leases started to drop and we went about diagnosing the problem
13:45 We moved subnet 63 from fabric 0 back to fabric 55, where it had been before
13:47:41 DHCP leases began to be reissued

Here are the logs. I trimmed syslog a little to only show more recent entries.
https://drive.google.com/file/d/1_b7KueXlTPuayFv7u8KDbLbIpNDH6JLX/view?usp=drive_link
Remember that the nimbus (rackd1) logs are in UTC, while the others are in AWST (UTC+8).

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

It has just happened again, with no restarts to any daemons.
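
In the meantime, a cheap way to catch the move sooner might be to poll the subnet's fabric from cron and alert when it changes. A sketch, assuming a profile named `admin`, subnet id 63 as above, and that the subnet read output exposes the fabric name under `.vlan.fabric` (field name from memory, may differ by version):

    # Prints the fabric the subnet currently sits on; alert when it is no longer the expected fabric
    maas admin subnet read 63 | jq -r '.vlan.fabric'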

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

We increased our DHCP lease time (with a snippet) to 2 weeks while we are experiencing this issue, to reduce its impact. I have just discovered that the problem has occurred again.
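
For reference, the lease change itself is just standard dhcpd options applied as a global MAAS DHCP snippet; something along these lines (values in seconds, 1209600 = 14 days; a sketch, not necessarily our exact snippet):

    default-lease-time 1209600;
    max-lease-time 1209600;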

Changed in maas:
status: Incomplete → New
Bill Wear (billwear)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.5.0