Agent crash in virtual tbb::task* TaskImpl::execute()

Bug #1533495 reported by amit surana
This bug affects 2 people
Affects: Juniper Openstack (status tracked in Trunk)

Series    Status         Importance  Assigned to  Milestone
R2.20     Fix Committed  High        Praveen
R2.21.x   Fix Committed  High        Praveen
R2.22.x   Fix Committed  High        Praveen
Trunk     Fix Committed  High        Praveen

Bug Description

contrail 2.21.1-22

core file location: 10.84.5.112:/cs-shared/bugs/1533495/

(gdb) bt
#0 0x00007ff680a19cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ff680a1d0d8 in __GI_abort () at abort.c:89
#2 0x00007ff680a12b86 in __assert_fail_base (fmt=0x7ff680b63830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x101aa95 "0", file=file@entry=0x10e85f1 "controller/src/base/task.cc", line=line@entry=253,
    function=function@entry=0x10e9ca0 <TaskImpl::execute()::__PRETTY_FUNCTION__> "virtual tbb::task* TaskImpl::execute()") at assert.c:92
#3 0x00007ff680a12c32 in __GI___assert_fail (assertion=0x101aa95 "0", file=0x10e85f1 "controller/src/base/task.cc", line=253,
    function=0x10e9ca0 <TaskImpl::execute()::__PRETTY_FUNCTION__> "virtual tbb::task* TaskImpl::execute()") at assert.c:101
#4 0x0000000000fde523 in TaskImpl::execute (this=<optimized out>) at controller/src/base/task.cc:253
#5 0x00007ff6815e8b3a in ?? () from /usr/lib/libtbb.so.2
#6 0x00007ff6815e4816 in ?? () from /usr/lib/libtbb.so.2
#7 0x00007ff6815e3f4b in ?? () from /usr/lib/libtbb.so.2
#8 0x00007ff6815e00ff in ?? () from /usr/lib/libtbb.so.2
#9 0x00007ff6815e02f9 in ?? () from /usr/lib/libtbb.so.2
#10 0x00007ff681804182 in start_thread (arg=0x7ff63abfa700) at pthread_create.c:312
#11 0x00007ff680add47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

amit surana (asurana-t)
description: updated
Hari Prasad Killi (haripk) wrote :

(gdb) fr 4
#4 0x0000000000fde523 in TaskImpl::execute (this=<optimized out>) at controller/src/base/task.cc:253
253 controller/src/base/task.cc: No such file or directory.
(gdb) p what
$1 = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
    _M_p = 0x7ff60cb3e708 "boost::filesystem::directory_iterator::construct: Too many open files: \"/var/run/netns\""}}
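
The "Too many open files" text in the exception above indicates that the process ran out of file descriptors. A minimal sketch for checking this on a Linux host follows. It is not part of the agent: the function names are hypothetical, and it relies only on the standard `/proc/<pid>/fd` directory and Python's `resource` module.

```python
import os
import resource


def count_open_fds(pid="self"):
    """Count the file descriptors currently open in a process (Linux only).

    Each entry in /proc/<pid>/fd is one open descriptor. When pid is "self",
    the count can include the descriptor used to read the directory itself.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def own_soft_fd_limit():
    """Soft RLIMIT_NOFILE of the calling process (not of another pid)."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft_limit


if __name__ == "__main__":
    print(f"open fds: {count_open_fds()}, soft limit: {own_soft_fd_limit()}")
```

To inspect a running agent, pass its process ID (for example from `pidof contrail-vrouter-agent`) to `count_open_fds`. Reading another process's `/proc/<pid>/fd` usually requires root.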

amit surana (asurana-t)
information type: Proprietary → Public
Hari Prasad Killi (haripk) wrote :

In R2.22 and earlier releases, the agent opened one socket for every TBB thread it ran. The test compute nodes here have 40+ cores, so 40+ FDs were opened for these sockets. The agent sets aside 64 FDs for normal operations, and link-local flows can take up to (max – 64). Since 40+ of the 64 reserved FDs are already used and link-local flows take the rest up to max, only a few FDs are left for normal operations, and we hit the assert when no FD is available.

To avoid this, add the following to /etc/contrail/supervisord_vrouter.conf and restart supervisor_vrouter:
environment=TBB_THREAD_COUNT=8 ; (key value pairs to add to environment)

In R3.0, the agent does not open a socket for every TBB thread.
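
The arithmetic behind this workaround can be sketched as follows. This is a simplified model, not agent code: the function names are hypothetical, 40 is used as the lower bound of "40+" threads, the value 1024 for max is only an illustrative assumption, and other descriptors the agent holds are ignored.

```python
RESERVED_FDS = 64  # FDs the agent sets aside for normal operations (per the comment above)


def fds_left_for_normal_ops(tbb_threads, reserved=RESERVED_FDS):
    """FDs left in the reserved pool after one socket per TBB thread."""
    return max(reserved - tbb_threads, 0)


def linklocal_fd_cap(max_fds, reserved=RESERVED_FDS):
    """Upper bound on FDs that link-local flows may consume: max - reserved."""
    return max(max_fds - reserved, 0)


if __name__ == "__main__":
    for threads in (40, 8):
        left = fds_left_for_normal_ops(threads)
        print(f"{threads} TBB threads -> {left} reserved FDs left")
    # 1024 is a hypothetical value for max, chosen only for illustration.
    print("link-local cap:", linklocal_fd_cap(1024))
```

With 40 threads, 24 of the 64 reserved FDs remain. With TBB_THREAD_COUNT=8, 56 remain.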

Jeba Paulaiyan (jebap) wrote :

This crash also happened in the 3.0 build 2701 Ubuntu 14.04 Kilo sanity run.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f96a8b5ccc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
Traceback (most recent call last):
  File "/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19-gdb.py", line 63, in <module>
    from libstdcxx.v6.printers import register_libstdcxx_printers
ImportError: No module named 'libstdcxx'
(gdb) bt
#0 0x00007f96a8b5ccc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f96a8b600d8 in __GI_abort () at abort.c:89
#2 0x00007f96a8b55b86 in __assert_fail_base (fmt=0x7f96a8ca6830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x11bf3f5 "0",
    file=file@entry=0x12b30e0 "controller/src/base/task.cc", line=line@entry=276,
    function=function@entry=0x12b47e0 <TaskImpl::execute()::__PRETTY_FUNCTION__> "virtual tbb::task* TaskImpl::execute()") at assert.c:92
#3 0x00007f96a8b55c32 in __GI___assert_fail (assertion=0x11bf3f5 "0", file=0x12b30e0 "controller/src/base/task.cc", line=276,
    function=0x12b47e0 <TaskImpl::execute()::__PRETTY_FUNCTION__> "virtual tbb::task* TaskImpl::execute()") at assert.c:101
#4 0x000000000116dd03 in TaskImpl::execute (this=<optimized out>) at controller/src/base/task.cc:276
#5 0x00007f96a972bb3a in ?? () from /usr/lib/libtbb.so.2
#6 0x00007f96a9727816 in ?? () from /usr/lib/libtbb.so.2
#7 0x00007f96a9726f4b in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f96a97230ff in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f96a97232f9 in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f96a9947182 in start_thread (arg=0x7f96a19e1700) at pthread_create.c:312
#11 0x00007f96a8c2047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

Core copied to /cs-shared/bugs/1533495-3.0-2701

tags: added: sanity
Hari Prasad Killi (haripk) wrote :

We see a different exception in the latter core:

(gdb) p what
$1 = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
    _M_p = 0x7f96980129c8 "basic_string::_S_construct null not valid"}}

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/16481
Submitter: Praveen K V (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/16482
Submitter: Praveen K V (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16481
Committed: http://github.org/Juniper/contrail-controller/commit/9d3971c19a99d3cc59e04781be4caf782a5ed543
Submitter: Zuul
Branch: R2.21.x

commit 9d3971c19a99d3cc59e04781be4caf782a5ed543
Author: Praveen K V <email address hidden>
Date: Mon Jan 25 14:42:22 2016 +0530

Create only 1 ksync-socket

We are creating as many ksync sockets as the number of TBB threads even
though we use only the first socket for all operations. Modified the
code to create only one ksync socket.

Change-Id: I2f1bf8558c219fc97402f8192c3d9d6cebacaf98
Fixes-Bug: #1533495

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/16482
Committed: http://github.org/Juniper/contrail-controller/commit/4216a9c3b42c909de5b3a7b501ffd127b676a44c
Submitter: Zuul
Branch: R2.20

commit 4216a9c3b42c909de5b3a7b501ffd127b676a44c
Author: Praveen K V <email address hidden>
Date: Mon Jan 25 14:42:22 2016 +0530

Create only 1 ksync-socket

We are creating as many ksync sockets as the number of TBB threads even
though we use only the first socket for all operations. Modified the
code to create only one ksync socket.

Change-Id: I2f1bf8558c219fc97402f8192c3d9d6cebacaf98
Fixes-Bug: #1533495

Ankit Jain (ankitja) wrote :

The same core was observed in mainline build 2704.

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16610
Submitter: Hari Prasad Killi (<email address hidden>)

Nipa (nipak) wrote :

https://bugs.launchpad.net/juniperopenstack/+bug/1539308 will track the Discovery Client string NULL exception cause.

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16610
Committed: http://github.org/Juniper/contrail-controller/commit/6aa8881194866dba6e7feda310decc975cb98d6f
Submitter: Zuul
Branch: master

commit 6aa8881194866dba6e7feda310decc975cb98d6f
Author: Hari <email address hidden>
Date: Thu Jan 28 17:15:08 2016 +0530

Do not initialize string with NULL.

This throws "basic_string::_S_construct null not valid" exception.

Change-Id: I942c88561c46e26d6ab40cbe42e09a182d1e65ba
closes-bug: 1533495

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/16686
Submitter: Vinay Vithal Mahuli (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16686
Committed: http://github.org/Juniper/contrail-controller/commit/156ad0b760f9b532572116d813d7afa695555bea
Submitter: Zuul
Branch: R2.22.x

commit 156ad0b760f9b532572116d813d7afa695555bea
Author: Atul Moghe <email address hidden>
Date: Mon Dec 21 14:29:14 2015 -0800

Cherry pick controller commits from R2.20 to R2.22.x
updating version.info from 2.22 to 2.23 in 2.20 branch
Closes-Bug:#1528370

Change-Id: Ic649422979a926cc5f5b8457c01610b848dc206b

Storage stats daemon fix

Partial-Bug: #1528327
Fixed latency monitor code based on the Ceph 0.94.3 version.
Fixed issues in OSD throughput/IOPs calculation.
Updated code based on the latest Sandesh apis.

Change-Id: I12caf951f84c8b213b1b5ec01371bb68b4c48cb3

Fix contrail-collector back pressure mechanism

The contrail-collector DB queue back pressure mechanism was not
working since the DB drop level is initialized to INVALID and
even the water mark levels are INVALID, and hence the defer/undefer
callbacks are not called.

Change-Id: Ib28141a69aeed3c4ad6f50abbaed2a285e3e7db2
Partial-Bug: #1528380

Fix Agent crash for flow index tree management

Issue:
------
During a flow index change vrouter-agent triggers a delete
on index tree using new flow handle instead of currently
held flow_handle resulting in flow entry getting associated
to two slots in the flow index tree, which further on flow
entry delete due to aging or eviction never releases the
slot for old flow handle, causing failures for further
insertions in the flow index tree

Fix:
----
Avoid taking flow handle as argument to DeleteByIndex and
use the currently associated flow_handle to remove from tree
Adding assert in DeleteByIndex to catch delete failure
Avoid doing delete from index tree in code paths other than
flow entry index update of flow entry delete.

Add logic for KSync Sock User to Mock vrouter behavior
returning index for an entry if it is already allocated
instead of allocating a new one.

Closes-Bug: 1527425
Change-Id: I10e77fb59650acfdd924a5f1d35d6b8dea03a3f0

Fix discovery dependency issue. Originally made in master branch
via https://review.opencontrail.org/#/c/15749

Change-Id: I5d874de3714074c66fa73bfd7c9119772dc681fd
Partial-Bug: #1530186

Avoid calling get_routing_instances on VN object

Calling get_routing_instances could trigger another read of the VN
if the VN has no routing instance. This is not only inefficient, but
could also cause exception if the VN has disappeared. We can avoid
this by calling getattr.

Change-Id: Ie5500585b9e6c578576276c2c04ec03f32c75112
Partial-Bug: 1528950

Fix Centos 65 agent compilation issues.
Closes-Bug: #1532159

Change-Id: Ia8b77619c80737000d5bd949534c9e0a16967359

Closes-Bug: #1524063, contrail-status is showing contrail-web-ui, even it is not configured, in case of SMLite

Change-Id: I55afc19140b1ce52b3b529a644124705de5ce6a8

Fix a corner case with routing instance delete

Sequence of events that causes the crash
1. Static route config deleted
2. Static Route manager triggers resolve_trigger_ to re-evaluate static
route config
3. Before the resolve trigger is invoked routing instance is deleted

Resolve trigger calls ProcessStaticRouteConfi...


OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/18197
Submitter: Hari Prasad Killi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/18198
Submitter: Hari Prasad Killi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/18199
Submitter: Hari Prasad Killi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/18197
Committed: http://github.org/Juniper/contrail-controller/commit/48bc165b5e00362bc78d3ad5db955ab5081beec1
Submitter: Zuul
Branch: R2.20

commit 48bc165b5e00362bc78d3ad5db955ab5081beec1
Author: Hari <email address hidden>
Date: Sun Mar 6 20:35:40 2016 +0530

Update linklocal flow count only when local port is bound and when local port is closed.

Avoid decrementing it earlier, as otherwise we may overshoot the maximum
number of file descriptors we open for linklocal purposes.

Change-Id: Ie4afccde7ce9e51706fdc85f2d0aac599364509a
closes-bug: 1533495
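
The accounting rule in this commit message can be illustrated with a small model. This is a sketch under stated assumptions, not the agent's implementation: `LinkLocalFdBudget` and `peak_open_sockets` are hypothetical names, and the replayed event sequence is invented only to show why an early decrement can overshoot the limit.

```python
class LinkLocalFdBudget:
    """Hypothetical model of a counter that caps link-local sockets."""

    def __init__(self, max_linklocal_fds):
        self.max_linklocal_fds = max_linklocal_fds
        self.count = 0  # sockets currently accounted as holding an FD

    def bind_local_port(self):
        """Account for a socket once its local port is bound."""
        if self.count >= self.max_linklocal_fds:
            return False  # budget exhausted: do not open another socket
        self.count += 1
        return True

    def close_local_port(self):
        """Release the slot only when the socket is actually closed."""
        if self.count > 0:
            self.count -= 1


def peak_open_sockets(limit, decrement_on_delete):
    """Replay: fill the budget, delete one flow, try one more bind, then close.

    decrement_on_delete=True models decrementing before the socket is closed.
    decrement_on_delete=False models decrementing only at close time.
    """
    budget = LinkLocalFdBudget(limit)
    open_sockets = 0
    for _ in range(limit):
        if budget.bind_local_port():
            open_sockets += 1
    peak = open_sockets

    # A flow is deleted, but its socket has not been closed yet.
    if decrement_on_delete:
        budget.count -= 1  # counter no longer matches the open FDs

    # A new link-local flow arrives while the old socket is still open.
    if budget.bind_local_port():
        open_sockets += 1
    peak = max(peak, open_sockets)

    # The old socket is finally closed.
    open_sockets -= 1
    if not decrement_on_delete:
        budget.close_local_port()
    return peak


if __name__ == "__main__":
    print("early decrement peak:", peak_open_sockets(2, True))
    print("decrement at close peak:", peak_open_sockets(2, False))
```

In this model, with a limit of 2, decrementing on flow delete lets a third socket open while the old one is still held. Decrementing only at close keeps the peak at the limit.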

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/18198
Committed: http://github.org/Juniper/contrail-controller/commit/31cd3272a5b168e7ae66dcb2652ee5d781bb276e
Submitter: Zuul
Branch: R2.22.x

commit 31cd3272a5b168e7ae66dcb2652ee5d781bb276e
Author: Hari <email address hidden>
Date: Sun Mar 6 20:35:40 2016 +0530

Update linklocal flow count only when local port is bound and when local port is closed.

Avoid decrementing it earlier, as otherwise we may overshoot the maximum
number of file descriptors we open for linklocal purposes.

Change-Id: Ie4afccde7ce9e51706fdc85f2d0aac599364509a
closes-bug: 1533495
(cherry picked from commit 48bc165b5e00362bc78d3ad5db955ab5081beec1)

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/18199
Committed: http://github.org/Juniper/contrail-controller/commit/d4d574cd9b571919afb5c7d0f7d24af2cda4e5da
Submitter: Zuul
Branch: R2.21.x

commit d4d574cd9b571919afb5c7d0f7d24af2cda4e5da
Author: Hari <email address hidden>
Date: Sun Mar 6 20:35:40 2016 +0530

Update linklocal flow count only when local port is bound and when local port is closed.

Avoid decrementing it earlier, as otherwise we may overshoot the maximum
number of file descriptors we open for linklocal purposes.

Change-Id: Ie4afccde7ce9e51706fdc85f2d0aac599364509a
closes-bug: 1533495
(cherry picked from commit 48bc165b5e00362bc78d3ad5db955ab5081beec1)
