regiond at 100% CPU after UI reconnect causing API errors
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Committed | High | Peter Makowski |
3.4 | Fix Committed | High | Unassigned |
3.5 | Fix Released | High | Unassigned |
Bug Description
Version and build:
=======
snap list maas
Name Version Rev Tracking Publisher Notes
maas 3.4.2-14353-
Interface Used:
=======
UI/API
I believe use of the UI is integral to creating the problem. Once created, API usage can be disrupted.
What happened:
=======
When the problem manifests, a python process on the MaaS server will peg at 100% CPU utilization for an extended period and postgres activity will be noticeably increased. The python process corresponds to regiond. The excessive utilization will disrupt other MaaS requests inducing intermittent API request timeouts and errors.
This problem originally presented on our production server and it seemed as though regiond was in some kind of infinite retry loop, querying the database as fast as possible, indefinitely. Once hung, the regiond process would remain in this state for days causing intermittent errors until MaaS services were restarted. Things would then behave normally for days or a week, and the problem would manifest again.
After the problem was reproduced on an isolated, unused MaaS server with the TestDB, it appears that the regiond CPU utilization will eventually fall from 100% back to a more normal value, though database usage still appears to be elevated above normal.
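Commands along these lines can be used to confirm which python process is regiond and to get a rough view of database activity (a sketch only; the second line assumes the maas-test-db snap's psql alias):
ps -eo pid,pcpu,etime,cmd --sort=-pcpu | head -n 10    # the offender shows up as a python3 regiond command line
sudo maas-test-db.psql -c "SELECT pid, state, left(query, 80) FROM pg_stat_activity;"    # snapshot of active queries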
I have added the following additional debugging flags to regiond.conf in an attempt to diagnose further.
cat /var/snap/
...
debug: True
debug_queries: True
debug_http: True
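If I understand the snap layout correctly, these flags only take effect after the services are restarted, after which the region log can be followed to watch the reconnect behaviour (paths and commands are a sketch):
sudo snap restart maas
sudo tail -f /var/snap/maas/common/log/regiond.log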
I have done some review of the MaaS code, and although my knowledge about how all the pieces work together is increasing, there is still a lot about the websocket usage, handlers and notifications that I don't fully understand. I suspect that the increased CPU may be due to an improper UI re-connection strategy that allows multiple websocket connections to be created to service one client UI. Multiple connections from the same client compounded by the general inefficiency of django/
I suspect this is the cause for two main reasons:
1) The Firefox Web Developer Tools shows multiple /ws URL GET requests outstanding simultaneously. In the test scenario above, it is not uncommon to see a dozen outstanding GET requests at a time, before any start failing with NS_ERROR_
2) The regiond.log shows multiple near-simultaneous connection openings shortly after network connectivity is restored, even though only one browser tab/URL access was involved. Based on my limited understanding of the MaaS code, I believe this results in unnecessary backend traffic and server utilization.
cat /var/snap/
2024-06-24 17:36:40 maasserver.
2024-06-24 17:36:41 maasserver.
2024-06-24 17:36:41 maasserver.
2024-06-24 17:36:42 maasserver.
2024-06-24 17:36:42 maasserver.
2024-06-24 17:36:42 maasserver.
2024-06-24 17:36:42 maasserver.
2024-06-24 17:36:43 maasserver.
2024-06-24 17:36:43 maasserver.
2024-06-24 17:36:43 maasserver.
2024-06-24 17:36:43 maasserver.
2024-06-24 17:36:43 maasserver.
2024-06-24 17:36:48 maasserver.
2024-06-24 17:36:48 maasserver.
2024-06-24 17:36:58 maasserver.
2024-06-24 17:36:58 maasserver.
2024-06-24 17:36:58 maasserver.
2024-06-24 17:36:58 maasserver.
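As a rough server-side cross-check, the number of established connections on the MaaS UI/API port (5240) can be counted while only a single browser tab is open; nginx-to-regiond hops may inflate the number, so treat it as an approximation:
sudo ss -tn state established | grep -c ':5240'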
It is somewhat interesting to note that once regiond is in the high CPU utilization state, closing the client browser seems to have no immediate remedial effect. If these websockets are indeed the cause of the problem, they appear to remain active between intermediate MaaS components (i.e. nginx and regiond?) without a running MaaS UI, requiring a regiond or MaaS restart to clean things up.
It is expected that the UI reconnect path should not end up with more than one websocket connection to service a single UI client. On a production system this can create excessive load that leads to API request timeouts and other confusing errors.
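For completeness, the only remediation that has worked here is restarting the MaaS services, roughly:
sudo snap services maas    # list the snap's services and their state
sudo snap restart maas     # restart them; this clears the stuck regiond worker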
Steps to reproduce:
=======
+Created a new Ubuntu 22.04 install on a dedicated VM
+Installed MaaS and MaaS Postgres TestDB
Steps include:
snap install --channel=3.4 maas
snap install maas-test-db
maas init region+rack --database-uri maas-test-db:///
maas createadmin --username admin --password <password> --email <email>
+Went to the MaaS UI and followed the install setup steps, picking Ubuntu 22.04 images, generally picking defaults, skipping user setup and skipping most everything else
+From a Firefox browser (which happened to be version 118.0.1), navigate to the MaaS UI and open web developer tools.
-Using filtering parameters when querying machines has been shown to increase the likelihood and duration of the problem, especially on a real MaaS system: http://<ip>:5240/
+Drop the network connection between the VM and the MaaS UI
-Originally this happened through VPN disconnects when running against our production MaaS server
-For reproduction convenience on a test VM, it proved easier to give the VM two network interfaces and simply down the interface used to connect the MaaS UI to the server, leaving the other available for ssh access to the VM for debugging (see the sketch after these steps)
+After ~15-25 minutes, the MaaS UI will discover connectivity issues, declare failure and display a "Trying to reconnect..." overlay
+Wait another hour or so. Throughout this time, observe periodic GET request attempts, presumably attempting to reconnect...
GET ws://<ip>
GET ws://<ip>
GET ws://<ip>
...
+Restore the network connection between the VM and the MaaS UI
+If a python3 regiond process pegs a CPU at 100% for many minutes and database usage is noticeably elevated, the problem has been reproduced. Attempts to use the MaaS API during this time will result in intermittent errors, seemingly based on whether they are assigned by socket to the disrupted regiond worker. The more machines/ resources/ load on the MaaS server, and the more outstanding GET requests when network connectivity is restored seem to increase the duration of the disruption.
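For reference, the interface down/up step referenced above amounts to something like the following; the interface name is just an example:
sudo ip link set ens4 down    # cut the path used by the browser for the MaaS UI
# ...wait for the "Trying to reconnect..." overlay and the accumulating GET ws:// attempts...
sudo ip link set ens4 up      # restore connectivity and watch regiond CPU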
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
status: New → In Progress
Changed in maas:
assignee: nobody → Peter Makowski (petermakowski)
status: Triaged → Fix Committed
Hi
Thank you very much for all the investigations and the very detailed bug report!
I have a first question: in the steps to reproduce instructions you said
> The more machines/ resources/ load on the MaaS server, and the more outstanding GET requests when network connectivity is restored seem to increase the duration of the disruption.
Did you manage to reproduce this issue on a MAAS instance with nothing going on (no deployments, no external API call to MAAS and similar)? How many machines did you create on this test instance?