General scale issue on neutron-fwaas due to RPC broadcast usage (fanout)

Bug #1659760 reported by Bertrand Lallau
This bug affects 1 person
Affects             Status        Importance  Assigned to       Milestone
networking-midonet  Fix Released  Medium      YAMAMOTO Takashi
neutron             Fix Released  Medium      Bertrand Lallau

Bug Description

Currently, every CRUD method used on FWaaS resources (Firewall, FirewallPolicy, FirewallRule, Firewallgroup, ...) sends an AMQP fanout cast to all L3 agents.
This design is wrong: the AMQP cast should be sent only to the L3 agents that manage routers with firewalls related to the tenant.
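
The difference between the two approaches can be sketched in plain Python. This is only an illustration, not the neutron-fwaas code: the function names, the router-to-host mapping and the agent names are all hypothetical, and the real plugin would look this information up in the Neutron database.

```python
from typing import Dict, Iterable, List, Set, Tuple


def hosts_for_firewall(router_ids: Iterable[str],
                       router_to_host: Dict[str, str]) -> Set[str]:
    """Hosts of the L3 agents that manage at least one of the firewall's routers.

    `router_to_host` is a hypothetical mapping (router id -> agent host).
    """
    return {router_to_host[r] for r in router_ids if r in router_to_host}


def plan_notifications(method: str,
                       router_ids: Iterable[str],
                       router_to_host: Dict[str, str],
                       all_hosts: Iterable[str],
                       fanout: bool) -> List[Tuple[str, str]]:
    """Return the (host, method) messages the server would emit."""
    if fanout:
        # Current behaviour: every L3 agent receives the message.
        targets = set(all_hosts)
    else:
        # Proposed behaviour: only agents hosting an affected router.
        targets = hosts_for_firewall(router_ids, router_to_host)
    return [(host, method) for host in sorted(targets)]


if __name__ == "__main__":
    mapping = {"r1": "agent1"}
    agents = ["agent1", "agent2"]
    print(plan_notifications("delete_firewall", ["r1"], mapping, agents, True))
    print(plan_notifications("delete_firewall", ["r1"], mapping, agents, False))
```

With two agents and one router on agent1, the fanout variant produces two messages and the targeted variant produces one, addressed to agent1 only.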

This wrong design results in many bugs that have already been reported:

1) FirewallNotFound during firewall_deleted
https://bugs.launchpad.net/neutron/+bug/1622460
https://bugs.launchpad.net/neutron/+bug/1658060

Explanation using 2 L3 agents:
agent1: hosts a router with a firewall for the tenant
agent2: does not host any router of the tenant

  1. neutron firewall-delete <firewall>
  2. neutron-server sends an AMQP call "delete_firewall" to agent1 and agent2
  3. agent1 cleans the router firewall and sends "firewall_deleted" back to neutron-server
  4. neutron-server deletes the firewall resource from the database
  5. agent2 has nothing to clean and sends "firewall_deleted" back to neutron-server
  6. neutron-server raises a "FirewallNotFound" exception
     http://paste.openstack.org/raw/94663/

  But it does not end there :(
  7. agent2 gets the "FirewallNotFound" exception back
  8. on this RPC error it performs a kind of "full synchronisation" (process_services_sync)
     and sends an AMQP call "get_tenants_with_firewalls"
  9. neutron-server responds with ALL tenants (even those not related to this agent)
  10. for each tenant, agent2 sends an AMQP call:
     get_firewalls_for_tenant()

Full sync bug is already reported here:
https://bugs.launchpad.net/neutron/+bug/1618244
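
The sequence above can be replayed with a small, self-contained simulation. It is a toy model under stated assumptions, not the actual plugin or agent code: `ToyServer`, `agent_handles_delete` and `simulate_fanout_delete` are hypothetical names, agents answer one after another, and only the RPC method names are taken from this report.

```python
from typing import Dict, List


class FirewallNotFound(Exception):
    """Raised by the toy server when the firewall is no longer in its database."""


class ToyServer:
    """Stand-in for neutron-server; records every RPC call it handles."""

    def __init__(self, firewalls: Dict[str, str]) -> None:
        self.firewalls = dict(firewalls)  # firewall id -> tenant id
        self.rpc_log: List[str] = []

    def firewall_deleted(self, firewall_id: str) -> None:
        self.rpc_log.append("firewall_deleted")
        if firewall_id not in self.firewalls:
            raise FirewallNotFound(firewall_id)
        del self.firewalls[firewall_id]

    def get_tenants_with_firewalls(self) -> List[str]:
        self.rpc_log.append("get_tenants_with_firewalls")
        return sorted(set(self.firewalls.values()))

    def get_firewalls_for_tenant(self, tenant_id: str) -> List[str]:
        self.rpc_log.append("get_firewalls_for_tenant")
        return sorted(fw for fw, t in self.firewalls.items() if t == tenant_id)


def agent_handles_delete(server: ToyServer, firewall_id: str) -> None:
    """Every agent acknowledges the delete; on error it runs a full sync."""
    try:
        server.firewall_deleted(firewall_id)
    except FirewallNotFound:
        # Steps 7-10: ask for all tenants, then for each tenant's firewalls.
        for tenant_id in server.get_tenants_with_firewalls():
            server.get_firewalls_for_tenant(tenant_id)


def simulate_fanout_delete(firewalls: Dict[str, str],
                           firewall_id: str,
                           agent_count: int) -> List[str]:
    """Fan the delete out to `agent_count` agents and return the server's RPC log."""
    server = ToyServer(firewalls)
    for _ in range(agent_count):
        agent_handles_delete(server, firewall_id)
    return server.rpc_log


if __name__ == "__main__":
    db = {"fw-a": "tenant-a", "fw-b": "tenant-b", "fw-c": "tenant-c"}
    for call in simulate_fanout_delete(db, "fw-a", agent_count=2):
        print(call)
```

In this model, deleting one firewall with two agents and two other tenants makes the server handle five calls, three of which come from the agent that had nothing to do.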

2) Intermittent failures of the Tempest check are probably related:
https://bugs.launchpad.net/neutron/+bug/1649703

3) More generally, on FWaaS CRUD operations neutron-server floods the agents with, and is in turn flooded by, many AMQP requests.
=> this results in neutron-server RPC workers being fully busy
=> AMQP messages accumulate in the q-firewall-plugin queue
=> RPC timeouts appear on the agents (after 60s)
=> a full synchronisation is triggered
=> etc.
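
A rough back-of-the-envelope estimate of the load for a single firewall delete, under simplifying assumptions that are mine and not measured: exactly one agent hosts the firewall and answers first, every other agent gets FirewallNotFound and runs one full synchronisation, and `tenant_count` is the number of tenants that still have firewalls. The function name is hypothetical.

```python
def server_rpc_calls_for_delete(agent_count: int, tenant_count: int) -> int:
    """Estimate the RPC calls neutron-server handles for one fanout delete.

    Assumes one hosting agent; the remaining agents each trigger a full sync
    (1 get_tenants_with_firewalls + 1 get_firewalls_for_tenant per tenant).
    """
    if agent_count < 1 or tenant_count < 0:
        raise ValueError("agent_count must be >= 1 and tenant_count >= 0")
    acknowledgements = agent_count  # one firewall_deleted per agent
    resync_calls = (agent_count - 1) * (1 + tenant_count)
    return acknowledgements + resync_calls


if __name__ == "__main__":
    print(server_rpc_calls_for_delete(agent_count=50, tenant_count=100))
```

Under these assumptions, 50 agents and 100 tenants give 4999 calls for one delete, against a single acknowledgement when only the hosting agent is notified.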

Changed in neutron:
assignee: nobody → Bertrand Lallau (bertrand-lallau)
Changed in neutron:
status: New → In Progress
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-fwaas (master)

Fix proposed to branch: master
Review: https://review.openstack.org/426287

description: updated
Changed in neutron:
assignee: Bertrand Lallau (bertrand-lallau) → Cedric Brandily (cbrandily)
Changed in neutron:
assignee: Cedric Brandily (cbrandily) → Bertrand Lallau (bertrand-lallau)
tags: added: fwaas
Changed in neutron:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-fwaas (master)

Reviewed: https://review.openstack.org/426287
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=da425fd913c168d5477588df5d8574fce21e2eb7
Submitter: Jenkins
Branch: master

commit da425fd913c168d5477588df5d8574fce21e2eb7
Author: Bertrand Lallau <email address hidden>
Date: Fri Jan 27 16:52:09 2017 +0100

    Fix RPC scale issue using cast instead of fanout v1

    Actually all CRUDs methods used on FWaaS v1 resources (Firewall,
    FirewallPolicy, FirewallRule) results on AMQP fanout cast requests
    sent to all L3 agents (even if they don't have routers or firewalls).

    This fix send AMQP cast only to L3 agents affected by the corresponding
    firewall.

    Such trouble also impacts FWaaS v2 and will be solved in a follow-up
    change.

    Change-Id: Id6cb991aee959319997bb15ece240c09d4ac5e39
    Closes-Bug: #1659760

Changed in neutron:
status: In Progress → Fix Released
Changed in networking-midonet:
importance: Undecided → Medium
assignee: nobody → YAMAMOTO Takashi (yamamoto)
milestone: none → 5.0.0
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-midonet (master)

Fix proposed to branch: master
Review: https://review.openstack.org/444081

OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-midonet (master)

Reviewed: https://review.openstack.org/444081
Committed: https://git.openstack.org/cgit/openstack/networking-midonet/commit/?id=bc33639d02a6d6a64aec63d2c15eafbb54247d61
Submitter: Jenkins
Branch: master

commit bc33639d02a6d6a64aec63d2c15eafbb54247d61
Author: YAMAMOTO Takashi <email address hidden>
Date: Fri Mar 10 12:14:22 2017 +0900

    fwaas: Add "host" argument for agent_rpc methods

    This is a prepartion for the neutron-fwaas change. [1]

    [1] I68cbf7403a17ddba49cc5943fb110c1d772e8834

    Closes-Bug: #1659760
    Change-Id: I710c7dc0f07781e5ed8deb0b91ad4889c865ce59

Changed in networking-midonet:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-fwaas (master)

Reviewed: https://review.openstack.org/443121
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=20339322b32f29f63f5df1518d1892cd77fe57ca
Submitter: Jenkins
Branch: master

commit 20339322b32f29f63f5df1518d1892cd77fe57ca
Author: Bertrand Lallau <email address hidden>
Date: Wed Mar 8 14:02:40 2017 +0100

    Revert "Revert "Fix RPC scale issue using cast instead of fanout v1""

    This reverts commit 9ab80c7d9814b4db6b4a75ff5693bfa7a8076853.

    Address FirewallAgentApi signature changes issue

    Partial-Bug: #1659760
    Change-Id: I68cbf7403a17ddba49cc5943fb110c1d772e8834

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-fwaas (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/447931

OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-midonet (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/450572

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-midonet 5.0.0.0b1

This issue was fixed in the openstack/networking-midonet 5.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron-fwaas 11.0.0.0b1

This issue was fixed in the openstack/neutron-fwaas 11.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-midonet (stable/ocata)

Reviewed: https://review.openstack.org/450572
Committed: https://git.openstack.org/cgit/openstack/networking-midonet/commit/?id=9bfd907e6a1783223f8bf813e021036e766ea2d2
Submitter: Jenkins
Branch: stable/ocata

commit 9bfd907e6a1783223f8bf813e021036e766ea2d2
Author: YAMAMOTO Takashi <email address hidden>
Date: Fri Mar 10 12:14:22 2017 +0900

    fwaas: Add "host" argument for agent_rpc methods

    This is a prepartion for the neutron-fwaas change. [1]

    [1] I68cbf7403a17ddba49cc5943fb110c1d772e8834

    Closes-Bug: #1659760
    Change-Id: I710c7dc0f07781e5ed8deb0b91ad4889c865ce59
    (cherry picked from commit bc33639d02a6d6a64aec63d2c15eafbb54247d61)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-fwaas (stable/ocata)

Reviewed: https://review.openstack.org/447931
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=427b35964a499b12b725de57c48a31323ffcd95d
Submitter: Jenkins
Branch: stable/ocata

commit 427b35964a499b12b725de57c48a31323ffcd95d
Author: Bertrand Lallau <email address hidden>
Date: Wed Mar 8 14:02:40 2017 +0100

    Fix RPC scale issue using cast instead of fanout v1

    Actually all CRUDs methods used on FWaaS v1 resources (Firewall,
    FirewallPolicy, FirewallRule) results on AMQP fanout cast requests sent
    to all L3 agents (even if they don't have routers or firewalls). This
    fix send AMQP cast only to L3 agents affected by the corresponding
    firewall. Such trouble also impacts FWaaS v2 and will be solved in a
    follow-up change.

    Partial-Bug: #1659760
    Change-Id: I68cbf7403a17ddba49cc5943fb110c1d772e8834
    (cherry picked from commit 47581636ba666f40ece132867da4ce7830a756e7)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-midonet 4.1.0

This issue was fixed in the openstack/networking-midonet 4.1.0 release.
