ha backup router ipv6 accept_ra broken

Bug #1958149 reported by Maximilian Stinsky
This bug affects 1 person
Affects  Status       Importance  Assigned to         Milestone
neutron  In Progress  Medium      Maximilian Stinsky

Bug Description

When we restart the neutron-l3-agent, we observe that backup routers start accepting router advertisements. As a result, the router namespace ends up with routes that carry an expiry time.
For example:
$ ip netns exec qrouter-a5f7fb32-3e30-4e15-89f9-4ae888c2cac6 ip -6 r
x:x:1002:1::/64 dev qr-72f85121-ce proto kernel metric 256 expires 86355sec pref medium
x:x:1002:1::/64 dev qr-4e84792f-aa proto kernel metric 256 expires 86355sec pref medium
fe80::/64 dev ha-9d085c9d-15 proto kernel metric 256 pref medium
default via fe80::f816:3eff:fed3:3fa6 dev qr-4e84792f-aa proto ra metric 1024 expires 255sec hoplimit 64 pref medium
default via fe80::f816:3eff:fed3:3fa6 dev qr-72f85121-ce proto ra metric 1024 expires 255sec hoplimit 64 pref medium

When we now fail over to such a backup router, the kernel does not create the necessary directly attached routes because they already exist. The problem is that the existing routes carry a lifetime: because the router is now the master, they are no longer refreshed by router advertisements and expire after 24 hours, which breaks IPv6 for those routers.
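To spot affected namespaces before a failover, the `ip -6 r` output can be scanned for routes that have a lifetime. The following is a minimal sketch, not part of neutron; the function name `expiring_routes` and the `SAMPLE` text (shortened from the output above) are only illustrative.

```python
def expiring_routes(ip_route_output):
    """Return (destination, device, seconds_left) for routes with a lifetime.

    Expects the text printed by `ip -6 route`; routes without an
    `expires` field are treated as permanent and skipped.
    """
    found = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if "expires" not in fields or "dev" not in fields:
            continue
        device = fields[fields.index("dev") + 1]
        raw = fields[fields.index("expires") + 1]  # e.g. "86355sec"
        seconds = int(raw[:-3]) if raw.endswith("sec") else int(raw)
        found.append((fields[0], device, seconds))
    return found


SAMPLE = """\
x:x:1002:1::/64 dev qr-72f85121-ce proto kernel metric 256 expires 86355sec pref medium
fe80::/64 dev ha-9d085c9d-15 proto kernel metric 256 pref medium
default via fe80::f816:3eff:fed3:3fa6 dev qr-4e84792f-aa proto ra metric 1024 expires 255sec hoplimit 64 pref medium
"""

if __name__ == "__main__":
    for destination, device, seconds in expiring_routes(SAMPLE):
        print(f"{destination} on {device} expires in {seconds}s")
```

Any `proto kernel` prefix route that shows up here on a qr- interface of a backup router is a candidate for the failure described above.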

After digging deeper into this issue, we found that the function [1] that disables accept_ra on backup routers always returns false. As a result, backup routers never get the acceptance of router advertisements disabled.

master router:
$ ip netns exec qrouter-92ed5c1f-c705-4ab9-a0e1-56e905d43abd sysctl net.ipv6.conf.qr-c7eb60ab-f1.accept_ra
net.ipv6.conf.qr-c7eb60ab-f1.accept_ra = 1

backup router:
$ ip netns exec qrouter-92ed5c1f-c705-4ab9-a0e1-56e905d43abd sysctl net.ipv6.conf.qr-c7eb60ab-f1.accept_ra
net.ipv6.conf.qr-c7eb60ab-f1.accept_ra = 1
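The comparison above can be automated by parsing the sysctl output and checking it against the value expected for the router's role. This is a rough sketch only; `parse_accept_ra` and `accept_ra_mismatch` are hypothetical helper names, and the expected value is passed in by the caller (0 for a backup router, going by this report).

```python
def parse_accept_ra(sysctl_line):
    """Split 'net.ipv6.conf.<dev>.accept_ra = <n>' into (dev, n)."""
    key, _, value = sysctl_line.partition("=")
    key = key.strip()
    prefix, suffix = "net.ipv6.conf.", ".accept_ra"
    if not (key.startswith(prefix) and key.endswith(suffix)):
        raise ValueError(f"not an accept_ra sysctl line: {sysctl_line!r}")
    device = key[len(prefix):-len(suffix)]
    return device, int(value.strip())


def accept_ra_mismatch(sysctl_line, expected):
    """Return (device, actual, expected) on a mismatch, otherwise None."""
    device, actual = parse_accept_ra(sysctl_line)
    if actual == expected:
        return None
    return device, actual, expected


if __name__ == "__main__":
    line = "net.ipv6.conf.qr-c7eb60ab-f1.accept_ra = 1"
    # Checking the backup router's output against the expected value 0.
    print(accept_ra_mismatch(line, expected=0))
```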

[1] https://github.com/openstack/neutron/blob/stable/train/neutron/agent/l3/ha_router.py#L318

Tags: l3-ha
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824947

Changed in neutron:
status: New → In Progress
tags: added: l3-ha
Changed in neutron:
importance: Undecided → Medium
Maximilian Stinsky (mstinsky) wrote :

A couple more observations regarding the accept_ra sysctl value.
In our environment we set accept_ra to 0 in the root namespace. With my patch applied, the master instance of a newly created router is not set to accept_ra=1. My assumption is that at the moment the function is executed, neutron still considers the router to be in standby state, and the value is not set again once the router transitions to master.

A second observation is that the values also do not seem to be updated on a router failover. The old master stays at accept_ra=1 and the new master stays at accept_ra=0. I can again fix this state by restarting the neutron-l3-agent with the patch applied.

It seems to me that the router state change function [1] does not trigger setting the accept_ra values again.

[1] https://github.com/openstack/neutron/blob/59959ce88d9d8010793e6e5c8ddedc26fc97b668/neutron/agent/l3/ha.py#L167
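To illustrate what re-applying the value on a state change could look like, here is a small sketch that builds the sysctl commands for a given HA state. It is not the neutron implementation and does not use neutron's API: `accept_ra_commands` and `ACCEPT_RA_BY_STATE` are hypothetical names, and the mapping (master 1, backup 0) is an assumption taken from the values discussed in this report.

```python
# Assumed target values per HA state, based on this report.
ACCEPT_RA_BY_STATE = {"master": 1, "backup": 0}


def accept_ra_commands(namespace, devices, ha_state):
    """Build one `sysctl -w` command (as an argv list) per device.

    The commands are only constructed here, not executed, so the
    sketch can be inspected without root privileges.
    """
    value = ACCEPT_RA_BY_STATE[ha_state]
    return [
        ["ip", "netns", "exec", namespace,
         "sysctl", "-w", f"net.ipv6.conf.{device}.accept_ra={value}"]
        for device in devices
    ]


if __name__ == "__main__":
    namespace = "qrouter-92ed5c1f-c705-4ab9-a0e1-56e905d43abd"
    for argv in accept_ra_commands(namespace, ["qr-c7eb60ab-f1"], "backup"):
        print(" ".join(argv))
```

Running something equivalent on every master/backup transition, rather than only at agent start, would cover both the new-router case and the failover case described above.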

Changed in neutron:
assignee: nobody → Maximilian Stinsky (mstinsky)
description: updated