[RFE] Move to an agent-based design

Bug #2024385 reported by Baptiste Jonglez
Affects: Networking ML2 Generic Switch
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

## Current design

Currently, the entire NGS code is dynamically loaded as an ML2 mechanism driver inside the Neutron process. This means that all SSH operations (via netmiko) are performed in the context of the Neutron process.

When Neutron is deployed with several threads or processes, we end up with several independent instances of NGS. There is an optional locking mechanism based on etcd to coordinate these NGS instances. There is also a batching mechanism, likewise based on etcd, that allows each NGS instance to maintain a per-switch queue of requests and send several commands over a single SSH connection for improved performance.

https://docs.openstack.org/networking-generic-switch/latest/configuration.html#synchronization
https://docs.openstack.org/networking-generic-switch/latest/configuration.html#batching

## Proposed agent-based design

We would like to move all SSH operations to an NGS agent, running as a single dedicated process that would take commands from Neutron through RPC calls over RabbitMQ.

It would bring the following benefits:

1) better performance: this is our main motivation.
a. the agent would run one eventlet thread per switch, so that it can parallelize configuration across different switches
b. each eventlet thread would be able to efficiently batch commands over a single SSH connection by continuously pulling commands from an in-memory queue
c. each eventlet thread could keep its SSH connection open for some time after the last command, so that future commands are faster
d. we benchmarked our agent-based prototype and it performs better than the current etcd-based batching mechanism, because we can batch more commands together.

2) simpler code and operation: no distributed locking is necessary, and etcd would no longer be required

3) uwsgi compatibility: the current batching mechanism does not work when running Neutron with uwsgi, because uwsgi does not support asynchronous eventlet tasks that run outside of an API call. The agent design would overcome this limitation because all asynchronous tasks would be executed outside of Neutron itself.

4) security: secrets such as SSH passwords would only be accessible from the agent process, while currently the secrets are accessible from the main Neutron processes

5) flexibility: the NGS agent could run on a different host than Neutron, which may be required for security (e.g. access-list filters that only allow SSH connections from a "bastion") or performance (moving the agent closer to the physical switches for lower latency)

## Design details

Only a small part of the NGS code would remain in the ML2 mechanism driver: it would simply listen for network/subnet/port creation or deletion events, perform basic checks, and then call the agent through the RPC bus and wait for the answer (success/failure) or a timeout.
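As an illustration of how thin that driver-side layer could be, here is a minimal sketch (not the actual NGS code), assuming oslo.messaging is used for the RPC bus; the topic, method name and timeout ("ngs_agent", "plug_port", 120 s) are hypothetical:

```python
# Minimal sketch of a driver-side RPC proxy, assuming oslo.messaging.
# Topic, method name and timeout are hypothetical placeholders.
import oslo_messaging
from oslo_config import cfg


class NgsAgentRpcApi(object):
    """Thin client the ML2 mechanism driver would use to reach the agent."""

    def __init__(self):
        transport = oslo_messaging.get_rpc_transport(cfg.CONF)
        target = oslo_messaging.Target(topic='ngs_agent', version='1.0')
        self._client = oslo_messaging.RPCClient(transport, target)

    def plug_port(self, context, switch_name, port_id, segmentation_id):
        # call() blocks until the agent answers (success/failure) or the
        # timeout expires, matching the synchronous design described above.
        cctxt = self._client.prepare(timeout=120)
        return cctxt.call(context, 'plug_port',
                          switch_name=switch_name,
                          port_id=port_id,
                          segmentation_id=segmentation_id)
```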

At startup, the agent spawns an eventlet greenthread + queue for each physical switch present in the configuration. Each greenthread loops indefinitely, waiting for commands to be pushed into its queue. When commands are received on the queue, the greenthread opens an SSH connection to the switch, performs as many commands in a row as possible (batching), then possibly waits a bit for new commands to arrive, and finally saves the configuration on the switch and closes the SSH connection.
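A minimal sketch of that per-switch worker loop, using eventlet; the switch connection helpers (open_ssh, send, save_configuration) are hypothetical placeholders, not real NGS code:

```python
from eventlet import queue


def switch_worker(switch, cmd_queue, idle_timeout=2.0):
    """One greenthread per switch: batch commands over a single SSH session."""
    while True:
        # Block until at least one request arrives for this switch.
        command, done = cmd_queue.get()
        connection = switch.open_ssh()       # hypothetical helper
        try:
            while True:
                connection.send(command)     # hypothetical helper
                done.send(True)              # tell the dispatcher it was run
                try:
                    # Keep batching as long as new requests arrive within
                    # the idle timeout; otherwise end this SSH session.
                    command, done = cmd_queue.get(timeout=idle_timeout)
                except queue.Empty:
                    break
            connection.save_configuration()  # hypothetical helper
        finally:
            connection.close()
```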

When the main agent thread receives a request, it looks up which switch(es) should be configured, puts the request in the right queue(s), and waits for the eventlet greenthread(s) to finish processing it. For a port creation/deletion, only a single switch needs to be configured. For a network creation/deletion, several switches need to be configured in parallel, and the agent performs a "join" to wait for all parallel requests to complete.
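And a matching sketch of the dispatch/"join" logic, reusing the switch_worker sketched above; the request format ({switch_name: command}) is a hypothetical simplification:

```python
import eventlet
from eventlet import event, queue


class NgsAgentManager(object):
    def __init__(self, switches):
        # One queue + one greenthread per switch found in the configuration.
        self._queues = {}
        for name, switch in switches.items():
            self._queues[name] = queue.Queue()
            eventlet.spawn(switch_worker, switch, self._queues[name])

    def handle_request(self, commands_per_switch):
        # A port operation touches a single switch, a network operation
        # touches several; each queued item carries an Event to wait on.
        pending = []
        for name, command in commands_per_switch.items():
            done = event.Event()
            self._queues[name].put((command, done))
            pending.append(done)
        # "Join": block until every involved switch has been configured.
        return [done.wait() for done in pending]
```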

## HA support

Some people want to run NGS with HA. The design should allow several agents to run in parallel and to fail over if an agent crashes or fails. There are a few problems to solve:

- good batching performance: all requests for a given switch should be routed to the same agent so that it can batch them. It has been suggested that a hash ring could help (a toy sketch follows this list).

- concurrent requests for the same switch: several agents might try to configure the same switch concurrently. It works on some models, but other models cannot handle this. Possible solutions: rely on retries, or use the distributed etcd locks to limit concurrency.
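Purely to illustrate the hash-ring idea (this is not part of the proposal), a toy consistent-hash ring could route each switch to a stable agent, so that the loss of one agent only remaps a fraction of the switches; in practice an existing implementation (e.g. the hash ring in tooz) would more likely be reused:

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.sha256(key.encode('utf-8')).hexdigest(), 16)


class SimpleHashRing(object):
    """Toy consistent-hash ring mapping each switch to one agent."""

    def __init__(self, agents, replicas=32):
        # Several virtual points per agent smooth the distribution and
        # limit how many switches move when an agent disappears.
        self._ring = sorted(
            (_hash('%s-%d' % (agent, i)), agent)
            for agent in agents for i in range(replicas))
        self._keys = [h for h, _ in self._ring]

    def agent_for(self, switch_name):
        idx = bisect.bisect(self._keys, _hash(switch_name)) % len(self._ring)
        return self._ring[idx][1]


# All requests for a given switch are routed to the same agent:
ring = SimpleHashRing(['agent-1', 'agent-2', 'agent-3'])
assert ring.agent_for('switch-a') == ring.agent_for('switch-a')
```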

## No backwards compatibility

It will be difficult to keep backwards compatibility: we plan to cleanly separate the code of the ML2 mechanism driver from the new code of the agent. Using the agent would become mandatory, which means that NGS users will have to start an additional ngs-agent process after upgrading.

This point is open for discussion.

Tags: needs-spec rfe
Revision history for this message
Baptiste Jonglez (bjonglez) wrote :

A previous RFE attempt to improve performance was https://bugs.launchpad.net/neutron/+bug/1976270 in Neutron, but modifying the interface of the Neutron ML2 mechanism drivers was not acceptable.

Revision history for this message
Baptiste Jonglez (bjonglez) wrote :

Our current prototype implementation of ngs-agent can also perform asynchronous configuration, i.e. NGS replies to the API call immediately, then performs the switch configuration in the background, and finally changes the status of the port from BUILD to ACTIVE (or ERROR) in Neutron.

This has very good performance, especially when using the bulk API in Neutron. We presented performance results at the Vancouver 2023 summit.

However, it requires clients to poll the state of the ports while they are in the BUILD state, which adds complexity to the clients. Also, this asynchronous implementation is complex and prone to race conditions. So, we are aiming to upstream a synchronous solution instead.

Changed in networking-generic-switch:
status: New → Confirmed
Revision history for this message
Harald Jensås (harald-jensas) wrote :

This is an intriguing approach. If the lack of backwards compatibility is too big to swallow for NGS, an alternative would be to implement this in ML2 networking-baremetal, which already has an agent with HA support. NOTE: the HA implementation used there might not be a fit for this use case. The switch configuration support in networking-baremetal is quite recent, _experimental_, and AFAIK not used by anyone in production, i.e. I would consider breaking backwards compatibility there a non-issue.

Revision history for this message
Baptiste Jonglez (bjonglez) wrote :

Thanks for the pointer. From my understanding, the agent in networking-baremetal is not used for switch configuration. The only purpose of this agent is to feed physnet mappings back to Neutron.

The recent switch configuration code in networking-baremetal is similar to what currently exists in NGS: it is synchronous and it runs in the context of the neutron/neutron-api process.

To be honest, I'm a bit surprised to see this netconf-based switch configuration code in networking-baremetal. What was the reasoning behind it? It seems to me that it would have been easier to integrate it in NGS, having a single "switch configuration" project with different backend drivers. That way, it would benefit from the existing infrastructure (config parsing, optional locking to control parallelism, performance optimisation, well-tested ML2 integration...)

Revision history for this message
Harald Jensås (harald-jensas) wrote :

That is correct, Baptiste: the agent in networking-baremetal does not do switch configuration, only the physnet mappings - hence my comment about the HA approach there potentially not meeting the requirements for an agent doing switch configuration... The switch config runs in the Neutron process just like NGS currently does.

There was no particular reason for putting the netconf-based switch config in networking-baremetal rather than NGS; I guess I just felt the approach was different enough compared to the netmiko approach used for devices in NGS, adding config parsing did not feel like a lot of overhead, and netconf has locking "built-in", etc.

Anyhow, I hope that your work finds its way into NGS! :)

Revision history for this message
Michael Sherman (msherman-uchicago) wrote :

This is a timely thread, as I was evaluating adding RESTCONF support to our fork of NGS to work with switches running the SONiC OS. That said, I don't want to clutter this thread up, so I'd appreciate a pointer to a good place for discussion.

Revision history for this message
Harald Jensås (harald-jensas) wrote :

@Michael, how about moving the RESTCONF discussion to the openstack-discuss mailing list?

Does the SONiC OS RESTCONF use the schemas from OpenConfig [1]?
If it does, then the openconfig "library" in networking-baremetal might be useful; we may want to break it out into a separate repo/package if it is something that can be shared.

[1] https://www.openconfig.net/

Revision history for this message
Baptiste Jonglez (bjonglez) wrote :

Here is our code submission with the new agent-based design:

https://review.opendev.org/c/openstack/networking-generic-switch/+/897047

This is joint work from me and Matthieu Imbert.

It took us longer than expected to figure out how to redesign unit tests without losing too much coverage.

We plan to submit a second batch of changes with the performance improvements presented in the design (batching and parallelization).

Revision history for this message
Jay Faulkner (jason-oldos) wrote :

IMO this change should require a spec. At a minimum, we need to have the necessary conversation about whether the RFE is approved or not.

tags: added: needs-spec
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to networking-generic-switch (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/networking-generic-switch/+/896112
Committed: https://opendev.org/openstack/networking-generic-switch/commit/4d320c5d2622e73fdb4891ce7812f26e9a12a8a4
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 4d320c5d2622e73fdb4891ce7812f26e9a12a8a4
Author: Baptiste Jonglez <email address hidden>
Date: Tue Jun 21 17:20:37 2022 +0200

    Do not make actual device changes in bind_port()

    According to the bind_port specification (as summed up in its docstring),
    we should not make any change when bind_port() is called, because binding
    results might end up not getting committed.

    To satisfy this specification, this patch moves actual device
    reconfiguration to update_port_postcommit().

    This makes the behaviour symmetrical for port creation (handled in
    update_port_postcommit) and port deletion (handled in delete_port_postcommit).

    In addition, this will make it easier to improve performance of port
    creation, see https://bugs.launchpad.net/networking-generic-switch/+bug/2024385

    Note that this change introduces a different retry behaviour when we fail
    to configure a device. Since bind_port() is retried several times by
    Neutron, we indirectly benefited from these retries, but this is no longer
    the case. However, NGS already has an internal retry mechanism for SSH
    connection errors, which should cover most network-related issues.

    To sum up, the errors that are no longer covered by a retry are the ones
    that happen after we successfully connect to the device. For instance,
    the switch port may not exist on the device, or the device could be in
    an unexpected state, such as a port currently being part of an
    unexpected VLAN. This kind of error is unlikely to be solved by
    retrying, so the new behaviour should be fine and will even allow the
    error to be returned to the end-user much faster.

    Related-Bug: #2024385
    Depends-on: https://review.opendev.org/c/openstack/networking-generic-switch/+/896110
    Change-Id: If4ca9c58d7f30b40992d0f1aee7e915c6978e0ca
    (cherry picked from commit 0088fde1debfd80ab2508bfea9f2cff2a1602e80)
