Subnet changed to wrong fabric, impacting DHCP

Bug #2031482 reported by Gregory Orange
Affects: MAAS
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: 3.5.0

Bug Description

Context:
2x rackd, 1x regiond
Ubuntu 22.04, installed from packages, version 1:3.3.3-13184-g.3e9972c19-0ubuntu1~22.04.1

https://discourse.maas.io/t/why-did-a-subnet-change-to-a-different-fabric/7327 describes the issue, and therein we were asked to submit this bug report.

After DHCP stopped working on a subnet served by our second rackd controller, we discovered that the subnet had somehow been moved into the wrong fabric. As a result, `dhcpd-interfaces` no longer listed the relevant interface (a bridge named `admin`), and dhcpd.conf no longer referenced that subnet.

When we moved the subnet back to the correct fabric, it all started working again as expected.

We did not look at dhcpd.conf on the first rackd controller, which houses that other fabric. The subnet in that other fabric continued to work fine during this time.
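
For anyone checking the same symptom, what we looked at on the affected rack controller was roughly the following (paths assume a package install like ours; the snap keeps these files elsewhere):

    # Interfaces MAAS tells dhcpd to listen on; the admin bridge had dropped out of this list
    cat /var/lib/maas/dhcpd-interfaces

    # Subnet declarations MAAS rendered for dhcpd; the affected subnet was missing here
    grep -n 'subnet ' /var/lib/maas/dhcpd.conf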

We do not know how the subnet was reconfigured. Only three people here know how to do that, and all know enough not to make the 6 or so clicks required to make the change in the web UI. We do not easily know how to do it via the CLI.
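
For completeness, it looks like the move can be made from the CLI as well, since a subnet hangs off a VLAN and the VLAN belongs to a fabric. A minimal sketch, untested here and assuming a logged-in CLI profile named `admin`:

    # Find the subnet id, the fabric ids, and the VLAN ids on the correct fabric
    maas admin subnets read
    maas admin fabrics read
    maas admin vlans read <fabric_id>

    # Point the subnet at a VLAN on that fabric, which is what moves it between fabrics
    maas admin subnet update <subnet_id> vlan=<vlan_id>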

I will attach logs from the times in question soon. From the virsh errors in the logs we believe the problem began at or very close to 2023-08-09T11:17:31.421297+08:00. Note that the second rackd controller and the regiond controller use UTC+8, while the first rackd controller uses UTC.

Revision history for this message
Gregory Orange (gregoryo2017) wrote (last edit):

We resolved the issue around 2023-08-14T09:21:36.043193+08:00, when the first virsh state change appeared: `Power state has changed from error to on.`

This (and the errors that preceded it) happened because the KVM Pod nodes hosting these machines lost the IP on their admin interface as part of the DHCP outage, then regained it once the outage was fixed.

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

The log files covering the time period come to 85 MB gzipped, so I have uploaded them to Google Drive. I gave each file a prefix indicating whether it came from rackd1, rackd2 or the region controller.

https://drive.google.com/file/d/1BRVhrKQLsaxR8rhk1ne3lfQ5feBfEYr7/view?usp=sharing

Let me know if you would prefer that I upload the file(s) directly here.

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

Is it somehow relevant that the subnet was erroneously moved to fabric 0? Perhaps some kind of error falls back to a default selection of the first fabric?

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

This exception is repeating in the regiond logs:

2023-08-13 00:00:47 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 293, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 62, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 370, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 860, in signal
    target_status = process(node, request, status)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 682, in _process_commissioning
    self._store_results(
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 565, in _store_results
    script_result.store_result(
  File "/usr/lib/python3/dist-packages/metadataserver/models/scriptresult.py", line 270, in store_result
    assert self.status in SCRIPT_STATUS_RUNNING_OR_PENDING
AssertionError

This is likely due to an error when fetching hardware info for the region controller; if the region controller has an IP within the subnet in question, the subnet can be erroneously moved for that reason. Can you confirm whether the region controller does in fact have an IP in this subnet?
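
If it helps, one way to check from the CLI (assuming a profile named `admin`; the subnet id comes from `maas admin subnets read`):

    # Addresses currently assigned on the subnet (should include which node holds each one)
    maas admin subnet ip-addresses <subnet_id>

    # Or, from the other side, list the region controller(s) and their interfaces
    maas admin region-controllers read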

Changed in maas:
status: New → Incomplete
Revision history for this message
Gregory Orange (gregoryo2017) wrote :

No, the region controller has only one IP, in subnet 28, fabric 20.

For comparison, the two rack controllers have IPs in a range of subnets they serve, but not in subnet 28.

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

It happened again, so here is some more context. DHCP leases on the subnet in question are set to 1 hour.

2023-09-06 AWST

11:39:39 We restarted the maas-rackd daemon on the second rackd server (acacia)
13:17:21 DHCP leases started to drop and we went about diagnosing the problem
13:45 We moved subnet 63 from fabric 0 back to fabric 55, where it had been before
13:47:41 DHCP leases began to be reissued

Here are the logs. I trimmed syslog a little to only show more recent entries.
https://drive.google.com/file/d/1_b7KueXlTPuayFv7u8KDbLbIpNDH6JLX/view?usp=drive_link
Remember that the nimbus (rackd1) logs are in UTC, while the others are in AWST (UTC+8).

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

It has just happened again, with no restarts to any daemons.
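
In the meantime, a cheap way to catch the move sooner might be to poll the subnet's fabric from cron and alert when it changes. A sketch, assuming a profile named `admin`, subnet id 63 as above, and that the subnet read output exposes the fabric name under `.vlan.fabric` (field name from memory, may differ by version):

    # Prints the fabric the subnet currently sits on; alert when it is no longer the expected fabric
    maas admin subnet read 63 | jq -r '.vlan.fabric'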

Revision history for this message
Gregory Orange (gregoryo2017) wrote :

We increased our DHCP lease time (with a snippet) to 2 weeks while we are experiencing this issue, to reduce its impact. I have just discovered that the problem has occurred again.
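
For reference, the lease change itself is just standard dhcpd options applied as a global MAAS DHCP snippet; something along these lines (values in seconds, 1209600 = 14 days; a sketch, not necessarily our exact snippet):

    default-lease-time 1209600;
    max-lease-time 1209600;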

Changed in maas:
status: Incomplete → New
Bill Wear (billwear)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.5.0