Comment 10 for bug 1764493

Andrey Pavlov (apavlov-e) wrote : Re: Debugging required on k8s sanity setup which failed for R5.0-16

BTW, the memory change for Cassandra was merged recently:
https://review.opencontrail.org/#/c/41767/1/containers/external/cassandra/contrail-entrypoint.sh
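
For context, heap sizing in a Cassandra container is normally driven by environment variables that cassandra-env.sh honors when set. A minimal sketch of the pattern (the defaults below are illustrative, not the values from that review):

# sketch only: cassandra-env.sh picks these up from the environment
export MAX_HEAP_SIZE=${MAX_HEAP_SIZE:-1g}    # total JVM heap
export HEAP_NEWSIZE=${HEAP_NEWSIZE:-200m}    # young-generation size
exec cassandra -f                            # foreground, as a container entrypoint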

Regards,
Andrey Pavlov.

On Tue, Apr 17, 2018 at 4:19 PM, Andrey Pavlov <email address hidden> wrote:

> root@node-10-1-56-124:/# nodetool -p 7200 status
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
> UN  10.1.56.125  3.11 MiB  256     68.5%             468a1809-53ee-4242-971f-3015ccedc6c2  rack1
> UN  10.1.56.124  1.89 MiB  256     72.2%             9aa41a48-3e9c-417d-b25c-7abf5e1f94aa  rack1
> UN  10.1.56.126  3.63 MiB  256     59.3%             33e498c9-f3e2-4430-86b4-261b0ffbaa0e  rack1
>
> root@node-10-1-56-124:/# nodetool -p 7200 statusgossip
> running
> root@node-10-1-56-124:/# nodetool -p 7200 statusthrift
> running
> root@node-10-1-56-124:/# nodetool -p 7200 statusbinary
> running
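>
> Since the ring looks healthy from nodetool, one more sanity check would be
> to probe the client ports from the nodes whose services report the
> connection as down (9042 is the stock CQL port and 9160 the Thrift port;
> Contrail deployments may remap them, so adjust to your config):
>
> for h in 10.1.56.124 10.1.56.125 10.1.56.126; do
>   for p in 9042 9160; do nc -zv $h $p; done
> done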
>
>
> Regards,
> Andrey Pavlov.
>
> On Tue, Apr 17, 2018 at 4:16 PM, Michael Henkel <email address hidden>
> wrote:
>
>> Hi Andrey, did you check nodetool status?
>>
>> Regards,
>> Michael
>>
>> On 17.04.2018 at 06:04, Andrey Pavlov <email address hidden> wrote:
>>
>> Hey Michael,
>>
>> I have similar problems in my 3-node setup:
>>
>> == Contrail control ==
>> control: active
>> nodemgr: active
>> named: active
>> dns: active
>>
>> == Contrail analytics ==
>> snmp-collector: initializing (Database:Cassandra[] connection down)
>> query-engine: active
>> api: active
>> alarm-gen: initializing (Database:Cassandra[] connection down)
>> nodemgr: active
>> collector: initializing (Database:Cassandra connection down)
>> topology: initializing (Database:Cassandra[] connection down)
>>
>> == Contrail config ==
>> api: initializing (Database:Cassandra[] connection down)
>> zookeeper: active
>> svc-monitor: backup
>> nodemgr: active
>> device-manager: backup
>> cassandra: active
>> rabbitmq: active
>> schema: backup
>>
>> == Contrail webui ==
>> web: active
>> job: active
>>
>> == Contrail database ==
>> kafka: active
>> nodemgr: active
>> zookeeper: active
>> cassandra: active
>>
>> [root@node-10-1-56-124 ~]# free -hw
>>               total   used   free  shared  buffers  cache  available
>> Mem:            15G    11G   3.3G     28M       0B   892M       3.7G
>> Swap:            0B     0B     0B
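>>
>> With 11G of 15G used and no swap configured, it may help to see which
>> containers hold the memory; standard docker is enough for that:
>>
>> docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}'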
>>
>>
>> Regards,
>> Andrey Pavlov.
>>
>> On Tue, Apr 17, 2018 at 3:57 PM, Michael Henkel <email address hidden>
>> wrote:
>>
>>> Pulkit,
>>>
>>> How many resources did you assign to your instances?
>>>
>>> Regards,
>>> Michael
>>>
>>> On 17.04.2018 at 05:37, Pulkit Tandon <email address hidden> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> I need your help and expertise debugging the k8s sanity setup, which is
>>> in a really bad state. Things have become messier starting with build 15.
>>>
>>> I observed multiple problems on the current attempt; I am not sure
>>> whether they are linked or independent.
>>>
>>> I have left the setup as-is so that you can debug the failures live.
>>>
>>>
>>>
>>> *K8s HA Setup details:*
>>>
>>> 3 Controller+kube managers:
>>>
>>> 10.204.217.52(nodeg12)
>>>
>>> 10.204.217.71(nodeg31)
>>>
>>> 10.204.217.98(nodec58)
>>>
>>> 2 agents / k8s slaves:
>>>
>>> 10.204.217.100(nodec60)
>>>
>>> 10.204.217.101(nodec61)
>>>
>>> Multi-interface setup
>>>
>>>
>>>
>>> The key observations are as follows:
>>>
>>> 1. The RabbitMQ cluster formed only between nodeg12 and nodeg31;
>>> nodec58 has rabbitmq as inactive.
>>>
>>> rabbitmq: inactive
>>>
>>> Docker logs for the rabbitmq container on nodec58:
>>>
>>> {"init terminating in do_boot",{error,{inconsistent_cluster,"Node
>>> contrail@nodec58 thinks it's clustered with node contrail@nodeg31, but
>>> contrail@nodeg31 disagrees"}}}
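>>>
>>> If the cluster state on nodec58 is disposable, the usual way out of an
>>> inconsistent_cluster error is to reset that node and rejoin it. A sketch
>>> with stock rabbitmqctl commands, run inside the rabbitmq container on
>>> nodec58 and assuming contrail@nodeg31 is a healthy member:
>>>
>>> rabbitmqctl stop_app
>>> rabbitmqctl force_reset                    # wipes this node's cluster metadata
>>> rabbitmqctl join_cluster contrail@nodeg31
>>> rabbitmqctl start_app
>>> rabbitmqctl cluster_status                 # verify all three nodes appear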
>>>
>>>
>>>
>>> 2. On all 3 controllers, the Cassandra connection was not established
>>> for 2 hours after provisioning. The issue flaps over time, and
>>> sometimes I see the services as active too:
>>> control: initializing (Database:Cassandra connection down)
>>> collector: initializing (Database:Cassandra connection down)
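>>>
>>> To check whether the flapping lines up with Cassandra restarts or GC
>>> pauses, the container logs would be the first place to look (the
>>> container name is deployment-specific, hence the lookup):
>>>
>>> docker ps --format '{{.Names}}' | grep -i cassandra
>>> docker logs --since 2h <cassandra-container> 2>&1 | grep -iE 'gc|outofmemory|exception' | tail -20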
>>>
>>>
>>>
>>> 3. If I create a k8s pod, pod creation often fails and a vrouter
>>> crash follows immediately. The trace is below.
>>> Pod creation fails whether or not the crash happens.
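>>>
>>> When a pod fails this way, the kubelet events usually carry the raw CNI
>>> error; standard kubectl on the master should surface it:
>>>
>>> kubectl get pods -o wide
>>> kubectl describe pod <failing-pod> | grep -A10 Events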
>>>
>>>
>>>
>>> 4. On the CNI of both agents, I see this error:
>>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:79: VRouter request.
>>> Operation : GET Url : http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a
>>>
>>> E : 24646 : 2018/04/17 17:35:44 vrouter.go:147: Failed HTTP Get
>>> operation. Return code 404
>>>
>>> I : 24646 : 2018/04/17 17:35:44 vrouter.go:181: Iteration 14 : Get
>>> vrouter failed
>>>
>>> E : 24633 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
>>>
>>> I : 24633 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
>>>
>>> E : 24633 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed
>>> processing Add command.
>>>
>>> E : 24646 : 2018/04/17 17:35:49 vrouter.go:287: Error in polling VRouter
>>>
>>> I : 24646 : 2018/04/17 17:35:49 cni.go:175: Error in Add to VRouter
>>>
>>> E : 24646 : 2018/04/17 17:35:49 contrail-kube-cni.go:67: Failed
>>> processing Add command.
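>>>
>>> Since the CNI is just polling the agent's port service and getting a
>>> 404, the same request can be replayed by hand on the agent node to see
>>> whether the VMI ever shows up (UUID taken from the log above):
>>>
>>> curl -v http://127.0.0.1:9091/vm/7a271412-4237-11e8-8997-002590c55f6a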
>>>
>>>
>>>
>>> NOTE: Most of the issues are observed on the k8s HA multi-interface setup.
>>> Things are better with a non-HA / single-interface setup.
>>>
>>>
>>>
>>>
>>>
>>> Agent crash trace:
>>>
>>> (gdb) bt full
>>>
>>> #0 0x00007fb9817761f7 in raise () from /lib64/libc.so.6
>>>
>>> No symbol table info available.
>>>
>>> #1 0x00007fb9817778e8 in abort () from /lib64/libc.so.6
>>>
>>> No symbol table info available.
>>>
>>> #2 0x00007fb98176f266 in __assert_fail_base () from /lib64/libc.so.6
>>>
>>> No symbol table info available.
>>>
>>> #3 0x00007fb98176f312 in __assert_fail () from /lib64/libc.so.6
>>>
>>> No symbol table info available.
>>>
>>> #4 0x0000000000c15440 in AgentOperDBTable::ConfigEventHandler(IFMapNode*, DBEntry*) ()
>>>
>>> No symbol table info available.
>>>
>>> #5 0x0000000000c41714 in IFMapDependencyManager::ProcessChangeList() ()
>>>
>>> No symbol table info available.
>>>
>>> #6 0x0000000000ea4a57 in TaskTrigger::WorkerTask::Run() ()
>>>
>>> No symbol table info available.
>>>
>>> #7 0x0000000000e9e64f in TaskImpl::execute() ()
>>>
>>> No symbol table info available.
>>>
>>> #8 0x00007fb9823458ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
>>>
>>> No symbol table info available.
>>>
>>> #9 0x00007fb9823415b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
>>>
>>> No symbol table info available.
>>>
>>> #10 0x00007fb982340c8b in tbb::internal::market::process(rml::job&) ()
>>> from /lib64/libtbb.so.2
>>>
>>> No symbol table info available.
>>>
>>> #11 0x00007fb98233e67f in tbb::internal::rml::private_worker::run() ()
>>> from /lib64/libtbb.so.2
>>>
>>> No symbol table info available.
>>>
>>> #12 0x00007fb98233e879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
>>>
>>> No symbol table info available.
>>>
>>> #13 0x00007fb982560e25 in start_thread () from /lib64/libpthread.so.0
>>>
>>> No symbol table info available.
>>>
>>> #14 0x00007fb98183934d in clone () from /lib64/libc.so.6
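>>>
>>> Since the abort comes from __assert_fail inside
>>> AgentOperDBTable::ConfigEventHandler, the failed assertion text should
>>> have gone to the agent's stderr just before the crash; if stderr is
>>> captured in the agent log (path assumed from the default Contrail
>>> layout), something like this should recover it:
>>>
>>> grep -B2 -A2 'Assertion' /var/log/contrail/contrail-vrouter-agent*.log* | tail -20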
>>>
>>>
>>>
>>>
>>>
>>> Thanks!
>>>
>>> Pulkit Tandon
>>>
>>>
>>>
>>>
>>
>