[3.0.2.0-32~liberty] 100K DHCP Request: Agent stops responding and is in deadlock

Bug #1576332 reported by chhandak
Affects              Status          Importance   Assigned to         Milestone
Juniper Openstack (status tracked in Trunk)
  R3.0               Fix Committed   High         Hari Prasad Killi
  Trunk              Fix Committed   High         Hari Prasad Killi

Bug Description

While sending 100K DHCP requests from the BMS to the TSN, the TSN stops responding to DHCP requests after some time.

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p514p1, link-type EN10MB (Ethernet), capture size 65535 bytes
08:35:30.379967 IP 32.32.32.32.7893 > 172.17.90.6.4789: VXLAN, flags [I] (0x08), vni 1021
IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:03:01:01:02:f6, length 261
08:35:30.379973 IP 32.32.32.32.7893 > 172.17.90.6.4789: VXLAN, flags [I] (0x08), vni 1021
IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:03:01:01:02:f6, length 261

root@5b7s6:~# vxlan --get 1021
VXLAN Table

 VNID NextHop
----------------
   1021 30
root@5b7s6:~# nh --get 30
Id:30 Type:Vrf_Translate Fmly: AF_INET Rid:0 Ref_cnt:2 Vrf:63
              Flags:Valid, Vxlan,
              Vrf:63

root@5b7s6:~# rt --dump 63 --family bridge
Flags: L=Label Valid, Df=DHCP flood
vRouter bridge table 0/63
Index DestMac Flags Label/VNID Nexthop
192464 0:0:5e:0:1:0 Df - 3
212412 ff:ff:ff:ff:ff:ff LDf 1021 178
380336 90:e2:ba:a7:32:24 Df - 3
531528 0:3:1:1:2:f6 L 1021 13
988496 0:3:1:1:2:f7 - 1
root@5b7s6:~#
root@5b7s6:~# nh --get 13
Id:13 Type:Tunnel Fmly: AF_INET Rid:0 Ref_cnt:252 Vrf:0
              Flags:Valid, Vxlan,
              Oif:0 Len:14 Flags Valid, Vxlan, Data:0c 86 10 3c 2b 00 90 e2 ba a7 32 24 08 00
              Vrf:0 Sip:172.17.90.6 Dip:32.32.32.32

root@5b7s6:~# dropstats | grep -v '0$'

Discards 951025
Cloned Original 5249517

Invalid NH 31436
Invalid Mcast Source 1830216

Duplicated 14

Invalid VNID 452397
No L2 Route 36811

root@5b7s6:~# dropstats | grep -v '0$'

Discards 951025
Cloned Original 5249517

Invalid NH 31436
Invalid Mcast Source 1830216

Duplicated 14

Invalid VNID 452397
No L2 Route 36811

(gdb) info thr
  Id Target Id Frame
  23 Thread 0x7f87c30cd700 (LWP 28327) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  22 Thread 0x7f87c2ccc700 (LWP 28328) "contrail-vroute" 0x00007f87ca8383bd in read () at ../sysdeps/unix/syscall-template.S:81
  21 Thread 0x7f87c28cb700 (LWP 28329) "contrail-vroute" __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
  20 Thread 0x7f87c24ca700 (LWP 28330) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  19 Thread 0x7f87c20c9700 (LWP 28332) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  18 Thread 0x7f87c1cc8700 (LWP 28331) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  17 Thread 0x7f87c18c7700 (LWP 28333) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  16 Thread 0x7f87c14c6700 (LWP 28334) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  15 Thread 0x7f87c0ab0700 (LWP 28693) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  14 Thread 0x7f87c06af700 (LWP 28694) "contrail-vroute" _int_malloc (av=0x7f8780000020, bytes=24) at malloc.c:3472
  13 Thread 0x7f872e3ff700 (LWP 31642) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  12 Thread 0x7f872dffe700 (LWP 31643) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  11 Thread 0x7f86c66ff700 (LWP 1463) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  10 Thread 0x7f86c62fe700 (LWP 1464) "contrail-vroute" __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
  9 Thread 0x7f85b25ff700 (LWP 7979) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  8 Thread 0x7f85b21fe700 (LWP 7980) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7 Thread 0x7f85e3fff700 (LWP 26173) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  6 Thread 0x7f85e3bfe700 (LWP 26174) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  5 Thread 0x7f85c3bfe700 (LWP 2733) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4 Thread 0x7f85c37fd700 (LWP 2734) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3 Thread 0x7f85c3fff700 (LWP 2735) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2 Thread 0x7f85c33fc700 (LWP 2736) "contrail-vroute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
* 1 Thread 0x7f87cc7ad7c0 (LWP 28308) "contrail-vroute" _int_malloc (av=0x7f87c9dce760 <main_arena>, bytes=9060) at malloc.c:3775

(gdb) thr 10
[Switching to thread 10 (Thread 0x7f86c62fe700 (LWP 1464))]
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95 ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f87c9a94bcb in _L_lock_4651 () at malloc.c:5206
#2 0x00007f87c9a8f3e3 in _int_free (av=0x7f87c9dce760 <main_arena>, p=0xaa7708f480, have_lock=0) at malloc.c:3943
#3 0x0000000000c3142a in PacketBuffer::~PacketBuffer() ()
#4 0x000000000082bb8e in boost::detail::sp_counted_base::release() ()
#5 0x0000000000c31efc in PktInfo::~PktInfo() ()
#6 0x000000000082bb8e in boost::detail::sp_counted_base::release() ()
#7 0x0000000000c4755e in tbb::strict_ppl::internal::micro_queue<boost::shared_ptr<PktInfo> >::pop(void*, unsigned long, tbb::strict_ppl::internal::concurrent_queue_base_v3<boost::shared_ptr<PktInfo> >&) ()
#8 0x0000000000c476b6 in tbb::strict_ppl::internal::concurrent_queue_base_v3<boost::shared_ptr<PktInfo> >::internal_try_pop(void*) ()
#9 0x0000000000c47fab in QueueTaskRunner<boost::shared_ptr<PktInfo>, WorkQueue<boost::shared_ptr<PktInfo> > >::RunQueue() ()
#10 0x000000000118d89c in TaskImpl::execute() ()
#11 0x00007f87ca615b3a in ?? () from /usr/lib/libtbb.so.2
#12 0x00007f87ca611816 in ?? () from /usr/lib/libtbb.so.2
#13 0x00007f87ca610f4b in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f87ca60d0ff in ?? () from /usr/lib/libtbb.so.2
#15 0x00007f87ca60d2f9 in ?? () from /usr/lib/libtbb.so.2
#16 0x00007f87ca831182 in start_thread (arg=0x7f86c62fe700) at pthread_create.c:312
#17 0x00007f87c9b0a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) thr 21
[Switching to thread 21 (Thread 0x7f87c28cb700 (LWP 28329))]
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95 in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007f87c9a94bcb in _L_lock_4651 () at malloc.c:5206
#2 0x00007f87c9a8f3e3 in _int_free (av=0x7f87c9dce760 <main_arena>, p=0x1c4b2bdbb0, have_lock=0) at malloc.c:3943
#3 0x0000000000c3142a in PacketBuffer::~PacketBuffer() ()
#4 0x000000000082bb8e in boost::detail::sp_counted_base::release() ()
#5 0x0000000000c31efc in PktInfo::~PktInfo() ()
#6 0x000000000082bb8e in boost::detail::sp_counted_base::release() ()
#7 0x0000000000c47fd2 in QueueTaskRunner<boost::shared_ptr<PktInfo>, WorkQueue<boost::shared_ptr<PktInfo> > >::RunQueue() ()
#8 0x000000000118d89c in TaskImpl::execute() ()
#9 0x00007f87ca615b3a in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f87ca611816 in ?? () from /usr/lib/libtbb.so.2
#11 0x00007f87ca610f4b in ?? () from /usr/lib/libtbb.so.2
#12 0x00007f87ca60d0ff in ?? () from /usr/lib/libtbb.so.2
#13 0x00007f87ca60d2f9 in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f87ca831182 in start_thread (arg=0x7f87c28cb700) at pthread_create.c:312
#15 0x00007f87c9b0a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

chhandak (chhandak)
Changed in juniperopenstack:
importance: Undecided → High
assignee: nobody → Hari Prasad Killi (haripk)
information type: Proprietary → Public
summary: - [3.0.2.0-32~liberty] 100K DHCP Request: Agent stope responding and is in
- dead lock
+ [3.0.2.0-32~liberty] 100K DHCP Request: Agent stops responding and is in
+ deadlock
chhandak (chhandak) wrote :
Hari Prasad Killi (haripk) wrote :

There is no deadlock. DHCP requests were sent at 100K per second but processed at only 30K per second, which built up a large backlog that takes time to clear. This case needs to be handled (rate control, or discarding requests beyond a threshold).
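
The arithmetic behind the comment above can be sketched roughly as follows. This is only an illustration that assumes constant arrival and processing rates and an unbounded queue; the function names and the 10-second burst duration are made up for the example and do not come from the agent code.

```python
def backlog_after(seconds, arrival_rate, service_rate):
    """Packets still queued after `seconds`, assuming constant rates
    and an unbounded queue (both are simplifying assumptions)."""
    return max(0, (arrival_rate - service_rate) * seconds)


def drain_time(backlog, service_rate):
    """Seconds needed to clear `backlog` once new arrivals stop."""
    return backlog / service_rate


# Rates quoted in the comment: 100K requests/s sent, 30K requests/s processed.
ARRIVAL_RATE = 100_000
SERVICE_RATE = 30_000

# The 10 s burst length is a hypothetical value, not taken from the report.
queued = backlog_after(10, ARRIVAL_RATE, SERVICE_RATE)
print(queued)                            # 700000
print(drain_time(queued, SERVICE_RATE))  # roughly 23.3 seconds
```

Under these assumptions the backlog grows by 70K requests for every second of load, so even a short burst leaves the agent busy for a while after the sender stops, which could look like a hang from the outside.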

Jeba Paulaiyan (jebap)
tags: added: blocker
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/19998
Submitter: Hari Prasad Killi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/19998
Committed: http://github.org/Juniper/contrail-controller/commit/0b5f81777d8b411cc4bb84364aae990db203e72c
Submitter: Zuul
Branch: R3.0

commit 0b5f81777d8b411cc4bb84364aae990db203e72c
Author: Hari <email address hidden>
Date: Mon May 9 11:36:58 2016 +0530

Limit the number of entries in the packet handler queue.

If the backlog on the packet handler queue grows, start dropping new
enqueue requests. Test case to check the same.

Change-Id: I164270a006224c8770a37ba412f0e354be14c825
closes-bug: #1576332
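
The idea in this commit message (drop new enqueue requests once the backlog reaches a limit) might look roughly like the sketch below. It is not the agent's actual WorkQueue code; the class name, the `limit` parameter and the `dropped` counter are hypothetical names used only for illustration.

```python
from collections import deque


class BoundedWorkQueue:
    """FIFO queue that rejects new entries once `limit` items are pending.

    Hypothetical illustration only; not the contrail-controller WorkQueue API.
    """

    def __init__(self, limit):
        self.limit = limit
        self.items = deque()
        self.dropped = 0  # count of rejected enqueue requests

    def enqueue(self, item):
        # Tail drop: refuse the new item instead of evicting an older one.
        if len(self.items) >= self.limit:
            self.dropped += 1
            return False
        self.items.append(item)
        return True

    def dequeue(self):
        # Return the oldest pending item, or None when the queue is empty.
        return self.items.popleft() if self.items else None


q = BoundedWorkQueue(limit=3)  # tiny limit, chosen only for the demo
results = [q.enqueue(f"dhcp-request-{i}") for i in range(5)]
print(results)    # [True, True, True, False, False]
print(q.dropped)  # 2
```

Dropping at enqueue time keeps the backlog, and therefore the memory held by queued packets, bounded regardless of how fast requests arrive.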

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/20634
Submitter: Hari Prasad Killi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/20634
Committed: http://github.org/Juniper/contrail-controller/commit/0d170faa831afb4790874676f93b160f753a8d3c
Submitter: Zuul
Branch: R3.0

commit 0d170faa831afb4790874676f93b160f753a8d3c
Author: Hari <email address hidden>
Date: Thu May 26 00:09:40 2016 +0530

Set the work queue limit for different services.

As we do not want the limit to be applied for flow, remove the
limit applied in the packet handler and put it in the services queues.

Change-Id: Iaf9fe4366e128806325d00bc58539eb81a42efe5
closes-bug: #1576332
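
A rough sketch of what this second change describes: the limit is applied per service queue, and the flow queue is left unlimited. The service names and limit values below are assumptions made for the example, not values taken from the commit.

```python
from collections import deque

# Hypothetical per-service limits; None means "no limit" (used here for flow).
QUEUE_LIMITS = {"flow": None, "dhcp": 2, "arp": 2}

queues = {name: deque() for name in QUEUE_LIMITS}
drops = {name: 0 for name in QUEUE_LIMITS}


def dispatch(service, packet):
    """Enqueue `packet` on its service queue unless that queue is at its limit."""
    limit = QUEUE_LIMITS[service]
    if limit is not None and len(queues[service]) >= limit:
        drops[service] += 1
        return False
    queues[service].append(packet)
    return True


# Offer the same four packets to a limited queue and to the unlimited one.
for i in range(4):
    dispatch("dhcp", i)
    dispatch("flow", i)

print(len(queues["dhcp"]), drops["dhcp"])  # 2 2
print(len(queues["flow"]), drops["flow"])  # 4 0
```

With the check moved out of the shared packet handler, a flood of DHCP requests is dropped at the DHCP queue while flow packets are never subject to the limit.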

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/20742
Submitter: Hari Prasad Killi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/20742
Committed: http://github.org/Juniper/contrail-controller/commit/eefdf6c4ee22415da3694064fad2c1399fc841a7
Submitter: Zuul
Branch: master

commit eefdf6c4ee22415da3694064fad2c1399fc841a7
Author: Hari <email address hidden>
Date: Mon May 9 11:36:58 2016 +0530

Set the work queue limit for different services.

As we do not want the limit to be applied for flow, remove the
limit applied in the packet handler and put it in the services queues.

(cherry picked from commit 0b5f81777d8b411cc4bb84364aae990db203e72c)
(cherry picked from commit 0d170faa831afb4790874676f93b160f753a8d3c)

Change-Id: Iaf9fe4366e128806325d00bc58539eb81a42efe5
closes-bug: #1576332
