DVR router takes too long to learn octavia LB VIP

Bug #1979002 reported by Alexandre Perreault
This bug affects 1 person
Affects: neutron
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Summary
Hi,
We are facing a connectivity problem when trying to communicate with an Octavia load balancer over a DVR router. When making an initial request to the LB, the DVR router takes too long to learn the MAC address of the LB VIP, and the request fails with a "No route to host" error.
The DVR router does end up learning the MAC, and communication works on the second and third requests, but the problem reappears if there are no requests to the LB for over a minute. At that point, the ARP entry disappears from the router's table and it must learn the MAC again.
I expect the DVR router to learn the MAC in milliseconds, not seconds.

I currently see this problem in the Yoga version but it is not a new problem. I detected this in Ussuri as well. I was expecting improvements in Yoga.

OpenStack version: Yoga
octavia topology: ACTIVE_STANDBY

Step by step
Create NetworkA
Create two instances with apache (web server) on NetworkA. These will be our LB members.
Create an LB on NetworkA. Create an HTTP listener. Create a pool with that listener. Create two LB members in the pool. The members should be the IP addresses of the two instances created previously.

Create NetworkB
Create an instance on NetworkB. This will be used to curl http://<LB-VIP>.

Create a DVR router. Connect NetworkA and NetworkB to this router.

At this point the ARP table of the router will have permanent ARP entries for all instances on NetworkA and NetworkB including the amphora instances.
It will not have the ARP entry for the LB VIP. I assume that is normal.
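
One way to check this state is to dump the neighbour table inside the qrouter namespace with `ip neigh show` and look up the VIP. The following is a minimal Python sketch, not part of the original report. The function names `parse_ip_neigh` and `router_neighbours` are made up for illustration, and the sample text is hypothetical: the MAC addresses and the member address 10.86.86.23 are invented, and only the interface names, 10.87.87.16 and the VIP come from this report. It assumes the usual `ip neigh` line layout, with the state as the last field.

```python
import subprocess


def parse_ip_neigh(output):
    """Map each IP to (MAC or None, state) from `ip neigh show` text."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        ip = fields[0]
        # Entries that are still unresolved or failed carry no lladdr field.
        mac = fields[fields.index("lladdr") + 1] if "lladdr" in fields else None
        table[ip] = (mac, fields[-1])
    return table


def router_neighbours(namespace):
    """Run `ip neigh show` inside a router namespace (needs root on the compute node)."""
    result = subprocess.run(
        ["ip", "netns", "exec", namespace, "ip", "neigh", "show"],
        capture_output=True, text=True, check=True,
    )
    return parse_ip_neigh(result.stdout)


# Hypothetical sample output, for illustration only (invented MACs).
SAMPLE = """\
10.86.86.23 dev qr-1622f027-67 lladdr fa:16:3e:00:00:01 PERMANENT
10.87.87.16 dev qr-053b19e7-67 lladdr fa:16:3e:00:00:02 PERMANENT
10.86.86.196 dev qr-1622f027-67 FAILED
"""

neighbours = parse_ip_neigh(SAMPLE)
vip = "10.86.86.196"
print(vip, "->", neighbours.get(vip, "no entry"))
```

On a compute node, `router_neighbours("qrouter-499e1a42-005b-42e9-be29-a347963c455e")` would return the live table instead of the sample.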

Now, on the instance on NetworkB, curl the LB VIP. In my case: curl http://10.86.86.196.
I usually receive the following error.
Failed to connect to 10.86.86.196 port 80: No route to host.

If I try again right after the failed attempt, it works! I see the output of my web server.

I did some packet captures on the dvr router on the compute server and I also watched its ARP table.
At first there is no ARP entry for the LB VIP.
Then I made the request. Here is the tcpdump output of the router interface connected to NetworkB.
ip netns exec qrouter-499e1a42-005b-42e9-be29-a347963c455e tcpdump -nni qr-053b19e7-67
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on qr-053b19e7-67, link-type EN10MB (Ethernet), capture size 262144 bytes
18:07:36.888353 IP 10.87.87.16.56450 > 10.86.86.196.80: Flags [S], seq 4098439773, win 26730, options [mss 8910,sackOK,TS val 1768131196 ecr 0,nop,wscale 7], length 0
18:07:37.913880 IP 10.87.87.16.56450 > 10.86.86.196.80: Flags [S], seq 4098439773, win 26730, options [mss 8910,sackOK,TS val 1768132221 ecr 0,nop,wscale 7], length 0
18:07:39.929891 IP 10.87.87.16.56450 > 10.86.86.196.80: Flags [S], seq 4098439773, win 26730, options [mss 8910,sackOK,TS val 1768134237 ecr 0,nop,wscale 7], length 0
18:07:39.946622 IP 10.87.87.1 > 10.87.87.16: ICMP host 10.86.86.196 unreachable, length 68
18:07:39.946775 IP 10.87.87.1 > 10.87.87.16: ICMP host 10.86.86.196 unreachable, length 68
18:07:39.946869 IP 10.87.87.1 > 10.87.87.16: ICMP host 10.86.86.196 unreachable, length 68

We can see that my instance sends 3 SYN packets to the LB VIP: the initial attempt and two retransmissions, roughly 1 and 2 seconds apart.
Then the router responds with ICMP host 10.86.86.196 unreachable. This is why I see the error "No route to host" on the instance.
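
The timing in this capture can be checked with a short script. This is a minimal Python sketch, not part of the original report: it hard-codes the timestamps from the capture above and computes the gaps between the SYN attempts and the delay until the first ICMP unreachable. The helper name `seconds` is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the NetworkB capture above.
SYN_TIMES = ["18:07:36.888353", "18:07:37.913880", "18:07:39.929891"]
FIRST_ICMP_UNREACHABLE = "18:07:39.946622"


def seconds(stamp):
    """Convert a tcpdump HH:MM:SS.ffffff timestamp to seconds since midnight."""
    t = datetime.strptime(stamp, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6


syn = [seconds(s) for s in SYN_TIMES]
# Gap between each SYN and the next one, rounded to milliseconds.
syn_gaps = [round(later - earlier, 3) for earlier, later in zip(syn, syn[1:])]
# Delay from the first SYN to the first ICMP host unreachable.
time_to_icmp = round(seconds(FIRST_ICMP_UNREACHABLE) - syn[0], 3)

print("gaps between SYNs (s):", syn_gaps)
print("first SYN to ICMP unreachable (s):", time_to_icmp)
```

The gaps come out at roughly 1 s and 2 s, which is consistent with TCP SYN retransmission backoff, and the ICMP error arrives about 3 s after the first SYN.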

Here is the tcpdump output of the router interface connected to NetworkA for the same request.
ip netns exec qrouter-499e1a42-005b-42e9-be29-a347963c455e tcpdump -nni qr-1622f027-67
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on qr-1622f027-67, link-type EN10MB (Ethernet), capture size 262144 bytes
18:07:36.888401 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:37.898590 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:38.926591 ARP, Request who-has 10.86.86.196 tell 10.86.86.1, length 28
18:07:41.345337 ARP, Request who-has 10.86.86.196 (ff:ff:ff:ff:ff:ff) tell 10.86.86.196, length 28
18:07:41.345399 ARP, Request who-has 10.86.86.196 (ff:ff:ff:ff:ff:ff) tell 10.86.86.196, length 28

We see that the router makes 3 ARP requests for the MAC, about 1 second apart, but gets no reply.
The last two packets in the capture are the LB itself checking that no one else is using the IP 10.86.86.196.
I do see that the router eventually learns the MAC, but by then it is too late.
What is strange is that when it does learn it, I do not see an ARP reply.
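
A possible reading of these timings, offered as an assumption and not confirmed anywhere in this report: the qrouter namespace uses the Linux neighbour defaults of 3 broadcast probes (`mcast_solicit`) sent 1 second apart (`retrans_time_ms`), after which the entry is marked FAILED and the queued packets are answered with ICMP host unreachable. The minimal Python sketch below compares that budget with the timestamps from the two captures above. The constant names are made up for illustration.

```python
# Assumed Linux neighbour defaults (NOT taken from the report). The actual
# values can be read under /proc/sys/net/ipv4/neigh/<interface>/ in the namespace.
MCAST_SOLICIT = 3       # broadcast ARP probes before giving up
RETRANS_TIME_S = 1.0    # delay between probes, in seconds

# Seconds after 18:07:00, copied from the two captures above.
ARP_PROBES = [36.888401, 37.898590, 38.926591]
ICMP_UNREACHABLE = 39.946622
VIP_ANNOUNCEMENT = 41.345337  # first ARP sent by 10.86.86.196 itself

# Time the router waits in total before declaring the neighbour unreachable.
resolution_budget = MCAST_SOLICIT * RETRANS_TIME_S
expected_failure = ARP_PROBES[0] + resolution_budget
announcement_delay = VIP_ANNOUNCEMENT - ARP_PROBES[0]

print(f"probes seen: {len(ARP_PROBES)} (assumed limit {MCAST_SOLICIT})")
print(f"expected failure at ~{expected_failure:.3f}, ICMP seen at {ICMP_UNREACHABLE:.3f}")
print(f"VIP announced itself {announcement_delay:.3f} s after the first probe")
print("announcement arrived within the budget:", announcement_delay <= resolution_budget)
```

Under these assumed defaults, the ARP from the VIP arrives well after the roughly 3-second resolution window has closed, which would match the first request failing and the retry succeeding.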

Since the ARP entry disappears after a minute or so, this problem happens often. There are times when it works on the first try, but that is rare. Even when it works, it still takes the router 2 seconds to learn the MAC, which is only slightly faster than the 3 seconds it takes when it fails.

Note: the LB, the LB members and the instance are not on the same compute server.

Lastly, I do not see any problems if my instance communicating with the LB is on the same network as the LB.
Furthermore, if I assign a FIP to the LB and communicate to the LB from the internet, I do not see any problems. The SNAT router namespace is able to learn the MAC quickly, every time.
This is very specific to the DVR router (qrouter namespace) on the compute servers.

We have quite a few users on different installations with similar architecture complaining that they are facing random communication problems because of the no route to host error explained above.

Miguel Lavalle (minsel) wrote :

Have you reported this with Octavia as well? We might need their help to fix this

Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Gregory Thiemonge (gthiemonge) wrote :

Is it with ML2/OVS?

A similar issue was reported against Octavia in https://storyboard.openstack.org/#!/story/2009765

See Michael Johnson's comment at https://storyboard.openstack.org/#!/story/2009765#comment-183692

It looks like an issue with DVR and allowed_address_pairs ARP entries
https://bugs.launchpad.net/neutron/+bug/1774459

Alexandre Perreault (alexperreault) wrote :

Hi,

Thanks for the quick comments.

Hi Miguel, this looks more like a Neutron DVR ARP issue. I think Octavia just uses VRRP for its VIP, which is fairly standard, but I don't know the exact details.

Hi Gregory, the first link is interesting. One big difference is that the user actually has a permanent ARP entry in the qrouter and that entry is wrong.
In my case, I do not have a permanent ARP entry. My qrouter tries to learn the VIP MAC but it is too slow most of the time so the query fails.
As you pointed out, octavia team puts the blame on neutron.

The bug you sent (3rd link) is long and dates back to 2018. I hope it resolves the issue, as I have multiple users complaining about it. I will post a comment there and hopefully someone is actively working on it.

Thanks for the response.

Lajos Katona (lajos-katona) wrote :

Just for reference the patch which seems to be the solution:
https://review.opendev.org/c/openstack/neutron/+/601336

Alexandre Perreault (alexperreault) wrote :

Hi Lajos,

Thanks. If I understand correctly, this patch is not complete yet?

tags: added: l3-dvr-backlog
Lajos Katona (lajos-katona) wrote :

You are right, the patch is not ready, but I see no technical or architectural reason against it.
The patch itself is quite daunting, as it is big and hard to review and test.
