nodemgr stays in initializing state (Cassandra state detected DOWN.)

Bug #1780948 reported by vimal
This bug affects 1 person
Affects                                      Status   Importance  Assigned to  Milestone
Juniper Openstack (status tracked in Trunk)
  R5.0                                       Invalid  Critical    vimal
  Trunk                                      Invalid  Critical    vimal

Bug Description

nodemgr stays in the initializing state (Cassandra state detected DOWN). java.lang.OutOfMemoryError exceptions are seen in analytics_database_cassandra.

version
----------
ocata-5.0-134

commands
-----------

[root@nodem14 ~]#
[root@nodem14 ~]# contrail-status
Pod        Service         Original Name                          State    Status
analytics  alarm-gen       contrail-analytics-alarm-gen           running  Up 5 hours
analytics  api             contrail-analytics-api                 running  Up 5 hours
analytics  collector       contrail-analytics-collector           running  Up 5 hours
analytics  nodemgr         contrail-nodemgr                       running  Up 5 hours
analytics  query-engine    contrail-analytics-query-engine        running  Up 5 hours
analytics  snmp-collector  contrail-analytics-snmp-collector      running  Up 5 hours
analytics  topology        contrail-analytics-topology            running  Up 5 hours
config     api             contrail-controller-config-api         running  Up About an hour
config     cassandra       contrail-external-cassandra            running  Up 5 hours
config     device-manager  contrail-controller-config-devicemgr   running  Up 5 hours
config     nodemgr         contrail-nodemgr                       running  Up 5 hours
config     rabbitmq        contrail-external-rabbitmq             running  Up 5 hours
config     schema          contrail-controller-config-schema      running  Up 5 hours
config     svc-monitor     contrail-controller-config-svcmonitor  running  Up 5 hours
config     zookeeper       contrail-external-zookeeper            running  Up 5 hours
control    control         contrail-controller-control-control    running  Up 46 minutes
control    dns             contrail-controller-control-dns        running  Up 5 hours
control    named           contrail-controller-control-named      running  Up 5 hours
control    nodemgr         contrail-nodemgr                       running  Up 5 hours
database   cassandra       contrail-external-cassandra            running  Up 5 hours
database   kafka           contrail-external-kafka                running  Up 5 hours
database   nodemgr         contrail-nodemgr                       running  Up 5 hours
database   zookeeper       contrail-external-zookeeper            running  Up 5 hours
webui      job             contrail-controller-webui-job          running  Up 5 hours
webui      web             contrail-controller-webui-web          running  Up 5 hours

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail config ==
api: active
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: initializing (Cassandra state detected DOWN. )
zookeeper: active
cassandra: active
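
For triage, the per-section summary above can be scanned mechanically for services that are not healthy. The following is only a minimal sketch: the helper name `unhealthy_services` is hypothetical, the input is an excerpt pasted from the output above, and treating both "active" and "backup" as healthy is an assumption rather than something stated in this report.

```python
import re

# Excerpt of the contrail-status summary shown above.
STATUS_TEXT = """\
== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: active
nodemgr: initializing (Cassandra state detected DOWN. )
zookeeper: active
cassandra: active
"""

# Assumption: "active" and "backup" both count as healthy states.
HEALTHY = {"active", "backup"}


def unhealthy_services(text):
    """Return (section, service, state) for every service that is not healthy."""
    problems = []
    section = None
    for line in text.splitlines():
        header = re.fullmatch(r"== Contrail (.+) ==", line.strip())
        if header:
            section = header.group(1)
            continue
        if section is None or ":" not in line:
            continue
        service, _, state = line.partition(":")
        state = state.strip()
        first_word = state.split()[0] if state else ""
        if first_word not in HEALTHY:
            problems.append((section, service.strip(), state))
    return problems


for section, service, state in unhealthy_services(STATUS_TEXT):
    print(f"{section}/{service}: {state}")
```

On this excerpt only the database nodemgr entry is reported, which matches the symptom in the bug title.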

The following logs are seen in analytics_database_cassandra_1:

INFO [Service Thread] 2018-07-10 07:10:17,733 StatusLogger.java:51 - GossipStage 1 72 17282 0 0

WARN [ScheduledTasks:2] 2018-07-10 07:10:30,933 NoSpamLogger.java:94 - Some operations timed out, details available at debug level (debug.log)
INFO [Service Thread] 2018-07-10 07:10:32,541 StatusLogger.java:51 - SecondaryIndexManagement 0 0 0 0 0

INFO [HintsDispatcher:10] 2018-07-10 07:10:46,187 HintsDispatchExecutor.java:289 - Finished hinted handoff of file 89f61382-4bfa-4534-805b-6ddb9c5f06ba-1531200407225-1.hints to endpoint /10.204.216.95: 89f61382-4bfa-4534-805b-6ddb9c5f06ba, partially

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.96"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.95"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reference-Reaper:1"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.95"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message can't create byte arrau at JPLISAgent.c line: 813
*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message can't create byte arrau at JPLISAgent.c line: 813

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.95"
Jul 10, 2018 7:37:12 AM sun.rmi.transport.tcp.TCPTransport$AcceptLoop executeAcceptLoop
WARNING: RMI TCP Accept-7200: accept loop for ServerSocket[addr=localhost/127.0.0.1,localport=7200] throws
java.lang.OutOfMemoryError: Java heap space

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Jul 10, 2018 7:39:44 AM sun.rmi.transport.tcp.TCPTransport$AcceptLoop executeAcceptLoop
WARNING: RMI TCP Accept-7200: accept loop for ServerSocket[addr=localhost/127.0.0.1,localport=7200] throws
java.lang.OutOfMemoryError: Java heap space

ERROR [BatchlogTasks:1] 2018-07-10 07:20:23,338 JVMStabilityInspector.java:74 - OutOfMemory error letting the JVM handle the error:
java.lang.OutOfMemoryError: Java heap space

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "GossipStage:1"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
ERROR [Native-Transport-Requests-18] 2018-07-10 07:20:23,338 JVMStabilityInspector.java:74 - OutOfMemory error letting the JVM handle the error:
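
To see which threads are hitting the OutOfMemoryError most often, the log lines above can be tallied by thread name. This is a rough sketch over a short excerpt copied from the log above; the function name `oom_by_thread` is hypothetical, and the regular expression only assumes the exact message format visible in this report.

```python
import re
from collections import Counter

# Short excerpt copied from the Cassandra container log above.
LOG_EXCERPT = '''\
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "MessagingService-Incoming-/10.204.216.96"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RMI TCP Connection(idle)"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Reference-Reaper:1"
java.lang.OutOfMemoryError: Java heap space
'''

# Matches the thread name in the "thrown from the UncaughtExceptionHandler" lines.
THREAD_RE = re.compile(
    r'java\.lang\.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "([^"]+)"'
)


def oom_by_thread(log_text):
    """Count OutOfMemoryError reports per thread name."""
    return Counter(THREAD_RE.findall(log_text))


counts = oom_by_thread(LOG_EXCERPT)
for thread, n in counts.most_common():
    print(f"{n:3d}  {thread}")

# Lines that name the exhausted region explicitly.
heap_space = LOG_EXCERPT.count("java.lang.OutOfMemoryError: Java heap space")
print("heap space errors:", heap_space)
```

Running the same tally over the full log file would show whether the errors are concentrated in the RMI (JMX) connections or spread across Cassandra's own stages.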

logs
------

/cs-shared/bugs/1780948
[vappachan@nodem3 1780948]$ ls
contrail-analytics-nodemgr.log contrail-collector.log contrail-config-nodemgr.log contrail-database-nodemgr.log logs

Tags: sanity
Changed in juniperopenstack:
assignee: nobody → Sundaresan Rajangam (srajanga)
vimal (vappachan)
description: updated
vimal (vappachan)
description: updated
Sundaresan Rajangam (srajanga) wrote :

Cassandra is started with -Xms1g and -Xmx2g.
This is incorrect; both -Xms and -Xmx should be 8g.

cassand+ 1 0 99 04:57 ? 1-00:25:02 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB -XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:+CMSClassUnloadingEnabled -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Xms8192M -Xmx8192M -Xmn2048M -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -Dcassandra.jmx.local.port=7199 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -Xms1g -Xmx2g -Dcassandra.rpc_port=9160 -Dcassandra.native_transport_port=9042 -Dcassandra.ssl_storage_port=7011 -Dcassandra.storage_port=7010 -Dcassandra.jmx.local.port=7200 -Dcassandra.libjemalloc=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 -XX:OnOutOfMemoryError=kill -9 %p -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir=/var/lib/cassandra -Dcassandra-foreground=yes -cp 
/etc/cassandra:/usr/share/cassandra/lib/HdrHistogram-2.1.9.jar:/usr/share/cassandra/lib/ST4-4.0.8.jar:/usr/share/cassandra/lib/airline-0.6.jar:/usr/share/cassandra/lib/antlr-runtime-3.5.2.jar:/usr/share/cassandra/lib/asm-5.0.4.jar:/usr/share/cassandra/lib/caffeine-2.2.6.jar:/usr/share/cassandra/lib/cassandra-driver-core-3.0.1-shaded.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.9.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/commons-math3-3.2.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrent-trees-2.4.0.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/ecj-4.4.2.jar:/usr/share/cassandra/lib/guava-18.0.jar:/usr/share/cassandra/lib/high-scale-lib-1.0.6.jar:/usr/share/cassandra/lib/hppc-0.5.4.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.13.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/cassandra/lib/jamm-0.3.0.jar:/usr/share/cassandra/lib/javax.inject.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/jctools-core-1.2.1.jar:/usr/share/cassandra/lib/jflex-1.6.0.jar:/usr/share/cassandra/lib/jna-4.2.2.jar:/usr/share/cassandra/lib/joda-time-2.4.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jsta...
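
The command line above carries the heap flags twice: first `-Xms8192M -Xmx8192M`, then `-Xms1g -Xmx2g`. The sketch below lists every occurrence and picks the last one as the effective value. That "last occurrence wins" rule is an assumption about typical HotSpot behaviour, consistent with the 1g/2g observation in this comment, not something verified here. The helper names are hypothetical and the command line is abbreviated to the relevant flags.

```python
import re

# Abbreviated copy of the command line above, keeping only a few flags.
CMDLINE = (
    "java -Xss256k -Xms8192M -Xmx8192M -Xmn2048M "
    "-Xms1g -Xmx2g -Dcassandra.rpc_port=9160"
)

UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def to_bytes(value):
    """Convert a JVM size such as '8192M' or '2g' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", value)
    if match is None:
        raise ValueError(f"unrecognized JVM size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS.get(unit.lower(), 1)


def heap_settings(cmdline):
    """Return every -Xms/-Xmx value in order of appearance, keyed by flag."""
    found = {"Xms": [], "Xmx": []}
    for token in cmdline.split():
        match = re.fullmatch(r"-(Xms|Xmx)(\w+)", token)
        if match:
            found[match.group(1)].append(match.group(2))
    return found


settings = heap_settings(CMDLINE)
for flag, values in settings.items():
    effective = values[-1]  # assumption: the last occurrence takes effect
    mib = to_bytes(effective) // 1024 ** 2
    print(f"-{flag}: seen {values}, effective {effective} ({mib} MiB)")
```

A check like this makes it easy to spot an extra-options variable silently overriding the intended 8g heap.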


Sudheendra Rao (sudheendra-k) wrote :

The problem is not seen after removing JVM_EXTRA_OPT,
hence removing the sanityblocker tag.

tags: removed: sanityblocker
vimal (vappachan) wrote :

This issue is seen intermittently. JVM_EXTRA_OPT has been removed from instances.yaml. Below is the status with ocata-5.0-137. Logs are in /cs-shared/bugs/1780948/build137

[root@nodem14 contrail-ansible-deployer]# contrail-status
Pod              Service         Original Name                          State       Status
analytics        alarm-gen       contrail-analytics-alarm-gen           running     Up 10 hours
analytics        api             contrail-analytics-api                 running     Up 10 hours
analytics        collector       contrail-analytics-collector           running     Up 10 hours
analytics        nodemgr         contrail-nodemgr                       running     Up 10 hours
analytics        query-engine    contrail-analytics-query-engine        running     Up 10 hours
analytics        snmp-collector  contrail-analytics-snmp-collector      running     Up 10 hours
analytics        topology        contrail-analytics-topology            running     Up 10 hours
config           api             contrail-controller-config-api         running     Up 7 hours
config           device-manager  contrail-controller-config-devicemgr   running     Up 10 hours
config           nodemgr         contrail-nodemgr                       running     Up 10 hours
config           schema          contrail-controller-config-schema      running     Up 10 hours
config           svc-monitor     contrail-controller-config-svcmonitor  running     Up 10 hours
config-database  cassandra       contrail-external-cassandra            running     Up 10 hours
config-database  nodemgr         contrail-nodemgr                       restarting  Restarting (0) 3 hours ago
config-database  rabbitmq        contrail-external-rabbitmq             running     Up 10 hours
config-database  zookeeper       contrail-external-zookeeper            running     Up 10 hours
control          control         contrail-controller-control-control    running     Up 7 hours
control          dns             contrail-controller-control-dns        running     Up 10 hours
control          named           contrail-controller-control-named      running     Up 10 hours
control          nodemgr         contrail-nodemgr                       running     Up 10 hours
database         cassandra       contrail-external-cassandra            running     Up 10 hours
database         kafka           contrail-external-kafka                running     Up 10 hours
database         nodemgr         contrail-nodemgr                       running     Up 10 hours
database         zookeeper       contrail-external-zookeeper            running     Up 10 hours
webui            job             contrail-controller-webui-job          running     Up 10 hours
webui            web             contrail-controller-webui-web          running     Up 10 hours

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail config-database ==

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail webui ...


tags: added: sanityblocker
Santosh Gupta (sangupta) wrote :

I see this in system.log on config-cassandra:

WARN [main] 2018-07-12 09:26:12,892 NativeLibrary.java:187 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.

[root@nodec7 sangupta]# free -g
              total        used        free      shared  buff/cache   available
Mem:             31          25           0           0           4           5
Swap:             0           0           0

[root@nodec7 sangupta]# top -o %MEM

top - 00:12:05 up 10:23, 4 users, load average: 0.35, 0.61, 0.66
Tasks: 346 total, 1 running, 345 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20.6 us, 2.9 sy, 0.0 ni, 76.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32753116 total, 1031864 free, 26521596 used, 5199656 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 5721816 avail Mem

The VM is an all-in-one setup and looks under-resourced. Have you run all-in-one on this setup before?
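
The memory figures from the top output above can be turned into a quick headroom estimate. This is a small sketch that parses only the two `KiB` summary lines as pasted; the function name `parse_kib` and the key names are hypothetical. On these numbers it works out to about 81% of RAM in use and roughly 5.5 GiB available, with no swap configured.

```python
import re

# The two memory summary lines copied from the top output above.
TOP_LINES = """\
KiB Mem : 32753116 total, 1031864 free, 26521596 used, 5199656 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 5721816 avail Mem
"""


def parse_kib(text):
    """Collect '<number> <label>' pairs from the top memory lines, in KiB."""
    fields = {}
    for line in text.splitlines():
        prefix = "swap_" if line.startswith("KiB Swap") else "mem_"
        pairs = re.findall(r"(\d+) (total|free|used|buff/cache|avail Mem)", line)
        for number, label in pairs:
            # "avail Mem" is printed on the swap line but describes RAM.
            key = "mem_available" if label == "avail Mem" else prefix + label
            fields[key] = int(number)
    return fields


mem = parse_kib(TOP_LINES)
KIB_PER_GIB = 1024 ** 2

used_pct = 100 * mem["mem_used"] / mem["mem_total"]
available_gib = mem["mem_available"] / KIB_PER_GIB

print(f"used: {used_pct:.1f}% of RAM")
print(f"available: {available_gib:.2f} GiB")
print(f"swap total: {mem['swap_total']} KiB")
```

With no swap, any JVM heap sized larger than the available figure would compete directly with the other services on the node.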

vimal (vappachan) wrote :

This issue is seen in 2 testbeds. We were running sanity without any issues on these 2 testbeds.

Santosh Gupta (sangupta) wrote :

Services look good inside the containers.
contrail-status always shows an error for config_database_cassandra_1 and config_database_zookeeper_1.
Assigning to Andrey to check whether contrail-status needs a fix for the new config cassandra/zookeeper roles.

Andrey Pavlov (apavlov-e) wrote :

Please provide full information about the setup. I see that the containers are 5.0-137. Which version of ansible-deployer are you using? How much memory/CPU/disk does the setup have?

@Santosh, new nodemgr is present in build 5.0-18? and above.

@Sudhee, @Vimal - without JVM_EXTRA_OPT this all-in-one VM can be over-committed. You can set this option at least for configdb.

Sudheendra Rao (sudheendra-k) wrote :

Removing sanityblocker as the problem is not seen in the recent build; will monitor the bug for a few more builds before closing.

tags: added: sanity
removed: sanityblocker
Sudheendra Rao (sudheendra-k) wrote :

The problem was due to a partial commit for bug 1765487. It is no longer seen now that bug 1765487 is fixed, hence closing this bug.
