MAAS rack is scaling up the number of connections without limit due to a race condition

Bug #2074122 reported by Jacopo Rota
Affects  Status   Importance  Assigned to  Milestone
MAAS     Triaged  High        Unassigned
3.3      Triaged  High        Unassigned
3.4      Triaged  High        Unassigned
3.5      Triaged  High        Unassigned

Bug Description

Since MAAS 3.3, rackd can scale up to a very large number of RPC connections due to a race condition in the scale-up logic.

Any of the rackd services may request a client to talk to the region via `getClientNow`:

    @deferred
    def getClientNow(self):
        """Returns a `Defer` that resolves to a :class:`common.Client`
        connected to a region.

        If a connection already exists to the region then this method
        will just return that current connection. If no connections exists
        this method will try its best to make a connection before returning
        the client.

        :raises: :py:class:`~.exceptions.NoConnectionsAvailable` when
            there no connections can be made to a region controller.
        """
        try:
            return self.getClient()
        except exceptions.NoConnectionsAvailable:
            return self._tryUpdate().addCallback(call, self.getClient)
        except exceptions.AllConnectionsBusy:
            log.info(f"There are {len(self.connections.items())} and they are all busy! Scaling up.")
            return self.connections.scale_up_connections().addCallback(
                call, self.getClient, busy_ok=True
            )

and

    @PROMETHEUS_METRICS.failure_counter("maas_rpc_pool_exhaustion_count")
    @inlineCallbacks
    def scale_up_connections(self):
        for ev, ev_conns in self.connections.items():
            # pick first group with room for additional conns
            if len(ev_conns) < self._max_connections:
                # spawn an extra connection
                conn_to_clone = random.choice(list(ev_conns))
                conn = yield self.connect(ev, conn_to_clone.address)
                self.clock.callLater(
                    self._keepalive, self._reap_extra_connection, ev, conn
                )
                return
        raise exceptions.MaxConnectionsOpen()

However, the cloned RPC connection is added to self.connections[eventloop] ONLY AFTER THE HANDSHAKE WITH THE REGION HAS COMPLETED. This means that multiple concurrent calls to `getClientNow` can trigger the creation of hundreds of RPC connections, because every one of them evaluates `if len(ev_conns) < self._max_connections:` to true before any new connection has been registered in the pool.
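The check-then-act gap can be sketched with a minimal asyncio model (hypothetical names and numbers; this is not the actual MAAS code, only an illustration of the race):

```python
import asyncio

MAX_CONNECTIONS = 4  # stand-in for rackd's per-eventloop connection limit


class Pool:
    """Toy model of the pool: a connection only shows up in
    `connections` after its (simulated) handshake completes."""

    def __init__(self):
        self.connections = []

    async def scale_up(self):
        # Check-then-act race: the length check and the append are
        # separated by an await, so every concurrent caller sees a
        # pool that still appears to have room.
        if len(self.connections) < MAX_CONNECTIONS:
            await asyncio.sleep(0.05)  # simulates the region handshake
            self.connections.append(object())


async def burst(n):
    pool = Pool()
    # n services all hit AllConnectionsBusy at the same moment
    await asyncio.gather(*(pool.scale_up() for _ in range(n)))
    return len(pool.connections)


if __name__ == "__main__":
    opened = asyncio.run(burst(100))
    print(f"opened {opened} connections (limit was {MAX_CONNECTIONS})")
```

All callers pass the length check before any handshake finishes, so the pool blows far past the limit. A common remedy for this pattern is to reserve the slot (for example, increment a counter or append a placeholder) before awaiting the handshake; whether MAAS takes that approach is not stated in this report.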

See https://git.launchpad.net/~r00ta/maas/commit/?h=rpc-scale-up-bug for a reproducer, and https://pastebin.canonical.com/p/MPXWycWhHJ/ for the collected logs.

In particular, this bug is responsible for the `Too many open files` exception that can show up in the rackd logs.

Revision history for this message
Jacopo Rota (r00ta) wrote :

For the time being, the best workaround is to increase the maximum number of open files allowed for the rackd process.
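As a concrete illustration of that workaround, on a deb-based install where rackd runs under a `maas-rackd` systemd unit, the limit could be raised with a drop-in override. The unit name and the limit value here are assumptions for illustration, not taken from this report:

```shell
# Create a systemd drop-in raising the open-file limit for rackd
# (unit name and limit value are assumptions, adjust for your install).
sudo mkdir -p /etc/systemd/system/maas-rackd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/maas-rackd.service.d/nofile.conf
[Service]
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart maas-rackd
```

Note this only delays the failure; the pool still grows without bound while the race is unfixed.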

Jacopo Rota (r00ta)
summary: - MAAS rack is scaling up the number of connections without limit
+ MAAS rack is scaling up the number of connections without limit due to a
+ race condition
description: updated
Changed in maas:
milestone: 3.6.0 → 3.6.x