on the customer's ceph-osd server all OSD get down.
Error: on logs:
grep 'void Thread::create(size_t)' *|grep -v '>'|grep " 20:"
ceph-osd.32.log:2016-05-04 20:42:35.135375 7fbd54ea7700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fbd54ea7700 time 2016-05-04 20:42:35.133589
ceph-osd.33.log:2016-05-04 20:42:41.183548 7f8a7d63f700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f8a7d63f700 time 2016-05-04 20:42:41.181656
ceph-osd.34.log:2016-05-04 20:42:38.969135 7fd4d9794700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd4d9794700 time 2016-05-04 20:42:38.967367
ceph-osd.35.log:2016-05-04 20:42:39.191213 7fa632e59700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fa632e59700 time 2016-05-04 20:42:39.189330
ceph-osd.36.log:2016-05-04 20:41:49.592106 7fe44d0d6700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fe44d0d6700 time 2016-05-04 20:41:49.590262
ceph-osd.37.log:2016-05-04 20:41:49.976383 7fa717482700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fa717482700 time 2016-05-04 20:41:49.974213
ceph-osd.37.log:2016-05-04 20:42:08.494309 7fa77aff7700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fa77aff7700 time 2016-05-04 20:42:08.492618
ceph-osd.38.log:2016-05-04 20:41:50.552626 7faed7a0c700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7faed7a0c700 time 2016-05-04 20:41:50.550463
ceph-osd.38.log:2016-05-04 20:41:53.079255 7faf05c11700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7faf05c11700 time 2016-05-04 20:41:53.077452
ceph-osd.39.log:2016-05-04 20:41:49.817492 7f66df053700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f66df053700 time 2016-05-04 20:41:49.815616
ceph-osd.40.log:2016-05-04 20:42:09.614569 7f1217f73700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f1217f73700 time 2016-05-04 20:42:09.612584
ceph-osd.41.log:2016-05-04 20:42:43.608912 7f2b0dfe1700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f2b0dfe1700 time 2016-05-04 20:42:43.607110
ceph-osd.42.log:2016-05-04 20:42:37.984996 7f0e57a3d700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f0e57a3d700 time 2016-05-04 20:42:37.983147
ceph-osd.42.log:2016-05-04 20:42:45.305307 7f0e56a3b700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f0e56a3b700 time 2016-05-04 20:42:45.303521
ceph-osd.43.log:2016-05-04 20:41:49.610410 7fd6c3fac700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fd6c3fac700 time 2016-05-04 20:41:49.608545
ceph-osd.44.log:2016-05-04 20:42:43.303795 7f8888c48700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f8888c48700 time 2016-05-04 20:42:43.301679
ceph-osd.45.log:2016-05-04 20:42:37.684938 7f4eef652700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4eef652700 time 2016-05-04 20:42:37.683107
ceph-osd.46.log:2016-05-04 20:42:35.544142 7fc6ba59a700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fc6ba59a700 time 2016-05-04 20:42:35.525877
ceph-osd.47.log:2016-05-04 20:42:35.479270 7f865cd1f700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f865cd1f700 time 2016-05-04 20:42:35.477467
ceph-osd.48.log:2016-05-04 20:42:38.507663 7ff7e4160700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7ff7e4160700 time 2016-05-04 20:42:38.505868
ceph-osd.49.log:2016-05-04 20:42:41.636908 7f71c767d700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f71c767d700 time 2016-05-04 20:42:41.635148
ceph-osd.50.log:2016-05-04 20:41:49.774139 7ff7eebbb700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7ff7eebbb700 time 2016-05-04 20:41:49.772331
ceph-osd.51.log:2016-05-04 20:42:34.925049 7f4f51293700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f4f51293700 time 2016-05-04 20:42:34.923227
ceph-osd.52.log:2016-05-04 20:41:53.055201 7f337fdc1700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f337fdc1700 time 2016-05-04 20:41:53.053058
ceph-osd.52.log:2016-05-04 20:42:35.038611 7f33ea9c8700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f33ea9c8700 time 2016-05-04 20:42:35.036934
ceph-osd.53.log:2016-05-04 20:42:39.954108 7f1b0cc71700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f1b0cc71700 time 2016-05-04 20:42:39.952291
ceph-osd.54.log:2016-05-04 20:42:40.744019 7f2890034700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f2890034700 time 2016-05-04 20:42:40.742188
for osd-36
2016-05-04 20:41:49.592106 7fe44d0d6700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fe44d0d6700 time 2016-05-04 20:41:49.590262
common/Thread.cc: 129: FAILED assert(ret == 0)
ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (Thread::create(unsigned long)+0x8a) [0x7fe45f4f82aa]
2: (SimpleMessenger::add_accept_pipe(int)+0x59) [0x7fe45f4ef6e9]
3: (Accepter::entry()+0x363) [0x7fe45f59d653]
4: (()+0x8182) [0x7fe45e37f182]
5: (clone()+0x6d) [0x7fe45caf247d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-10000> 2016-05-04 20:41:39.451877 7fe4470ca700 1 -- 192.168.1.2:6808/32128 --> osd.80 192.168.1.25:6876/31636 -- osd_sub_op(client.5865660.0:518880 12.e9 917234e9/rbd_data.4da8a446de3ea6.00000000000006ab/head//12
-9999> 2016-05-04 20:41:39.451982 7fe4470ca700 5 -- op tracker -- , seq: 5368144, time: 2016-05-04 20:41:39.451982, event: commit_queued_for_journal_write, request: osd_op(client.5865660.0:518880 rbd_data.4da8a446
....
-1> 2016-05-04 20:41:49.573983 7fe44b0d2700 1 -- 192.168.0.12:6853/32128 --> 192.168.0.29:0/1043 -- osd_ping(ping_reply e51632 stamp 2016-05-04 20:41:49.566069) v2 -- ?+0 0x7fe47316b2c0 con 0x7fe46cddbfa0
0> 2016-05-04 20:41:49.592106 7fe44d0d6700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fe44d0d6700 time 2016-05-04 20:41:49.590262
common/Thread.cc: 129: FAILED assert(ret == 0)
ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (Thread::create(unsigned long)+0x8a) [0x7fe45f4f82aa]
2: (SimpleMessenger::add_accept_pipe(int)+0x59) [0x7fe45f4ef6e9]
3: (Accepter::entry()+0x363) [0x7fe45f59d653]
4: (()+0x8182) [0x7fe45e37f182]
5: (clone()+0x6d) [0x7fe45caf247d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
99/99 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.36.log
--- end dump of recent events ---
2016-05-04 20:42:09.919710 7fe45abfc700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe4470ca700' had timed out after 15
2016-05-04 20:51:41.270916 7fe45abfc700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe4468c9700' had timed out after 15
2016-05-04 20:51:41.271454 7fe43d381700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.27:6811/1000475 pipe(0x7fe47289f980 sd=227 :6808 s=2 pgs=2571 cs=1 l=0 c=0x7fe464040c60).fault with nothing to send, going to standby
2016-05-04 20:51:41.271503 7fe432dfd700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.25:6860/31162 pipe(0x7fe46ec97680 sd=271 :6808 s=2 pgs=20 cs=1 l=0 c=0x7fe473f16f20).fault with nothing to send, going to standby
2016-05-04 20:51:41.271524 7fe4403b1700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.16:6811/45083371 pipe(0x7fe46a247180 sd=51 :48193 s=2 pgs=153273 cs=1 l=0 c=0x7fe46a1bf600).fault with nothing to send, going to stan
2016-05-04 20:51:41.271552 7fe42c494700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.25:6890/31971 pipe(0x7fe46cb59900 sd=289 :6808 s=2 pgs=32 cs=1 l=0 c=0x7fe473f17340).fault with nothing to send, going to standby
2016-05-04 20:51:41.271648 7fe42b383700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.26:6876/3004902 pipe(0x7fe46e469480 sd=375 :6808 s=2 pgs=482 cs=1 l=0 c=0x7fe469e87de0).fault with nothing to send, going to standby
2016-05-04 20:51:41.271667 7fe4305d5700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.25:6872/31489 pipe(0x7fe46e41ea80 sd=319 :6808 s=2 pgs=32 cs=1 l=0 c=0x7fe463fcd700).fault with nothing to send, going to standby
2016-05-04 20:51:41.271789 7fe42c292700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.25:6876/31636 pipe(0x7fe471759180 sd=290 :6808 s=2 pgs=6 cs=1 l=0 c=0x7fe463fcd2e0).fault with nothing to send, going to standby
2016-05-04 20:51:41.271813 7fe42e9b9700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.27:6829/2012410 pipe(0x7fe4730c1680 sd=407 :6808 s=2 pgs=1380 cs=1 l=0 c=0x7fe4685f82c0).fault with nothing to send, going to standby
2016-05-04 20:51:41.272412 7fe427343700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.24:6801/1023577 pipe(0x7fe46fb7c300 sd=70 :6808 s=2 pgs=445 cs=1 l=0 c=0x7fe464047c80).fault with nothing to send, going to standby
2016-05-04 20:51:41.272415 7fe433c0b700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.16:6800/94084658 pipe(0x7fe472c82300 sd=158 :6808 s=2 pgs=110229 cs=1 l=0 c=0x7fe464040420).fault with nothing to send, going to stan
2016-05-04 20:51:41.273139 7fe433b0a700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.12:6822/64075135 pipe(0x7fe472c81180 sd=263 :6808 s=2 pgs=82737 cs=1 l=0 c=0x7fe4640402c0).fault with nothing to send, going to stand
2016-05-04 20:51:41.273179 7fe43f7a5700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.24:6800/2014843 pipe(0x7fe46a248580 sd=121 :52404 s=2 pgs=3348 cs=1 l=0 c=0x7fe46a1c1c80).fault with nothing to send, going to standb
2016-05-04 20:51:41.273184 7fe43b55d700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.27:6819/1001043 pipe(0x7fe46d668f80 sd=589 :6808 s=2 pgs=3065 cs=1 l=0 c=0x7fe4640411e0).fault with nothing to send, going to standby
2016-05-04 20:51:41.273312 7fe3f67cd700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.27:6844/9736 pipe(0x7fe46fb8c280 sd=663 :6808 s=2 pgs=25 cs=1 l=0 c=0x7fe46cf1dde0).fault with nothing to send, going to standby
2016-05-04 20:51:41.272914 7fe434514700 0 -- 192.168.1.2:6808/32128 >> 192.168.1.24:6849/22096 pipe(0x7fe46caab680 sd=257 :53349 s=2 pgs=54 cs=1 l=0 c=0x7fe46ca92c60).fault with nothing to send, going to standby
2016-05-04 20:51:41.279274 7fe4448c5700 -1 osd.36 51632 skipping heartbeat_check, hbqueue max age: 580.288
2016-05-04 20:51:41.279379 7fe44a0d0700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe4470ca700' had timed out after 15
2016-05-04 20:51:41.279389 7fe44a0d0700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe4470ca700' had suicide timed out after 150
2016-05-04 20:51:41.282171 7fe44a0d0700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe44a0d0700 time 2016-05-04 20:51:41.279483
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0x7fe45f4587e9]
2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x7fe45f459056]
3: (OSD::handle_osd_ping(MOSDPing*)+0x7bb) [0x7fe45f0d87eb]
4: (OSD::heartbeat_dispatch(Message*)+0x343) [0x7fe45f0d9c53]
5: (DispatchQueue::entry()+0x57a) [0x7fe45f5aff6a]
6: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fe45f4f407d]
7: (()+0x8182) [0x7fe45e37f182]
8: (clone()+0x6d) [0x7fe45caf247d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-05-04 20:51:41.284821 7fe44d0d6700 -1 *** Caught signal (Aborted) **
in thread 7fe44d0d6700
this case is similar to
https://bugzilla.redhat.com/show_bug.cgi?id=1306242
http://tracker.ceph.com/issues/10988
I will provide logs on demand.
This looks like resources exhaustion. Most probably, pid_max limit has been reached (can be increased via sysctl). atop, dmesg might be useful to investigate. Still, it is not clear why it is reported here.