One of the ceph-osd processes doesn't start during deployment

Bug #1419884 reported by Dennis Dmitriev
This bug affects 2 people
Affects                    Status        Importance  Assigned to      Milestone
Fuel for OpenStack         Fix Released  Medium      Stanislav Makar
Fuel for OpenStack 5.1.x   Invalid       High        MOS Maintenance
Fuel for OpenStack 6.0.x   Invalid       High        MOS Maintenance

Bug Description

System tests 'ceph_ha_one_controller_compact' and 'migrate_vm_backed_with_ceph' failed on CI: http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.centos.thread_1/32/

When the Ceph service was started on the controller, one of the two ceph-osd processes did not start:

 [root@node-2 ~]# service ceph status
=== mon.node-2 ===
mon.node-2: running {"version":"0.80.7"}
=== osd.1 ===
osd.1: not running.
=== osd.0 ===
osd.0: running {"version":"0.80.7"}
[root@node-2 ~]# echo $?
3
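
Note: the exit status 3 above follows the LSB convention for "program is not running", so a deployment check can detect the failure without parsing the output. A minimal sketch (not part of the original report):

# Fail the deployment step early if any local Ceph daemon listed in
# ceph.conf is not running; relies on the init script's non-zero
# exit status (3 per LSB) shown above.
if ! service ceph status >/dev/null 2>&1; then
    echo "ERROR: a Ceph daemon on $(hostname) is not running" >&2
    exit 1
fi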

[root@node-1 ~]# ceph status
    cluster 75098087-bb84-4fe5-8be6-55fc900ef80d
     health HEALTH_OK
     monmap e1: 1 mons at {node-2=10.108.17.4:6789/0}, election epoch 1, quorum 0 node-2
     osdmap e22: 6 osds: 5 up, 5 in
      pgmap v46: 1728 pgs, 6 pools, 12859 kB data, 5 objects
            10470 MB used, 236 GB / 246 GB avail
                1728 active+clean
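
The osdmap line ("6 osds: 5 up, 5 in") shows that the cluster itself sees one OSD as down. To identify which one, the standard CLI of this release can be used, e.g. (a sketch):

# Show the up/down state per OSD as the monitors see it;
# the missing daemon appears with state "down".
ceph osd tree | grep -w down
# The OSD map dump carries the same up/down and in/out flags.
ceph osd dump | grep -w down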

As found in the Ceph logs, the command 'osd crush create-or-move' never appeared for osd.1:

==== osd.0 is starting:
Feb 5 21:53:26 node-2 ceph-mon: 2015-02-05 18:53:26.559788 7eff644ef700 0 mon.node-2@0(leader) e1 handle_command mon_command({"prefix": "auth add", "entity": "osd.0", "caps": ["osd", "allow *", "mon", "allow profile osd"]} v 0) v1
Feb 5 21:53:27 node-2 ceph-mon: 2015-02-05 18:53:27.392325 7eff644ef700 0 mon.node-2@0(leader) e1 handle_command mon_command({"prefix": "osd crush create-or-move", "args": ["host=node-2", "root=default"], "id": 0, "weight": 0.050000000000000003} v 0) v1
Feb 5 21:53:27 node-2 ceph-mon: 2015-02-05 18:53:27.392494 7eff644ef700 0 mon.node-2@0(leader).osd e4 create-or-move crush item name 'osd.0' initial_weight 0.05 at location {host=node-2,root=default}

==== osd.1 is starting:
Feb 5 21:53:32 node-2 ceph-mon: 2015-02-05 18:53:32.797480 7eff644ef700 0 mon.node-2@0(leader) e1 handle_command mon_command({"prefix": "auth add", "entity": "osd.1", "caps": ["osd", "allow *", "mon", "allow profile osd"]} v 0) v1
Feb 5 21:53:32 node-2 puppet-user[18429]: (/Stage[main]/Ceph/Service[ceph]) Triggered 'refresh' from 1 events
Feb 5 21:53:32 node-2 puppet-user[18429]: (/Stage[main]/Ceph/Service[ceph]) Evaluated in 11.57 seconds
Feb 5 21:53:32 node-2 puppet-user[18429]: (Class[Ceph]) Starting to evaluate the resource
Feb 5 21:53:33 node-2 puppet-user[18429]: (Class[Ceph]) Evaluated in 0.06 seconds
Feb 5 21:53:33 node-2 puppet-user[18429]: (Stage[main]) Starting to evaluate the resource
Feb 5 21:53:33 node-2 puppet-user[18429]: (Stage[main]) Evaluated in 0.05 seconds
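
This can be confirmed from the node's syslog by grepping for the command per OSD id (a sketch; the log path assumes CentOS syslog, as in the excerpt above):

# "id" appears in the mon_command JSON of the log lines above, so a
# missing match means the monitor never handled create-or-move for
# that OSD.
for id in 0 1; do
    echo "== osd.$id =="
    grep 'create-or-move' /var/log/messages | grep "\"id\": $id" \
        || echo "create-or-move never issued for osd.$id"
done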

Tags: ceph
description: updated
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Ryan Moe (rmoe) wrote :

There are lots of errors like "IOError: [Errno 28] No space left on device" in the console log (http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.centos.thread_1/32/consoleFull). What's up with those?

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The problem is still reproducible (with no 'No space left on device' errors):
[root@node-1 ~]# service ceph status
=== mon.node-1 ===
mon.node-1: running {"version":"0.80.7"}
=== osd.3 ===
osd.3: running {"version":"0.80.7"}
=== osd.7 ===
osd.7: not running.

Stanislav Makar (smakar)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)
status: Confirmed → In Progress
Revision history for this message
Stanislav Makar (smakar) wrote :

Investigated the problem: it is intermittent ("floating").
It mainly appears on controller nodes that also carry the ceph-osd role.
There are some ideas on how to fix it; I will provide a patch soon.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I think this bug should have Medium priority and no backports, since the related use case is not recommended: the ceph-osd role should not be placed on controller nodes, and there is a note about that in the docs.

Changed in fuel:
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/169226

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/169226
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c20b2f8463e962734222382a77f26b91a53fd264
Submitter: Jenkins
Branch: master

commit c20b2f8463e962734222382a77f26b91a53fd264
Author: Stanislav Makar <email address hidden>
Date: Tue Mar 31 09:07:18 2015 +0000

    Fix floating problem with OSD down

    * Add OSD activation and check that the osd process for each OSD has
      started.
    * Change the process of adding OSDs to the cluster: Ceph OSDs are now
      added one by one instead of all together, which allows checking each
      OSD's status as it is added to the cluster.

    Change-Id: I8e64c1b15ed92e6fb5939b6f41728efacae64319
    Closes-bug: #1419884
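
In effect the fix serializes OSD bring-up. A rough shell equivalent of the described behaviour (a sketch only, not the actual Puppet code from the review; OSD_IDS is a hypothetical list of local OSD ids):

for id in $OSD_IDS; do
    service ceph start osd.$id
    # Wait until this OSD's daemon is confirmed running before
    # adding the next one, so a silently dead osd process is
    # caught immediately instead of being masked by its peers.
    for attempt in $(seq 1 30); do
        service ceph status osd.$id >/dev/null 2>&1 && break
        sleep 2
    done
    service ceph status osd.$id >/dev/null 2>&1 \
        || { echo "osd.$id failed to start" >&2; exit 1; }
done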

Changed in fuel:
status: In Progress → Fix Committed
Stanislav Makar (smakar)
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/186792
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c110d0ae1a8d5b98f1ea5e4f4d762b5ccfdd2ffc
Submitter: Jenkins
Branch: master

commit c110d0ae1a8d5b98f1ea5e4f4d762b5ccfdd2ffc
Author: Stanislav Makar <email address hidden>
Date: Fri May 29 13:36:05 2015 +0000

    Add tests for ceph module

    Related-bug: #1419884
    Fuel-CI: disable
    Change-Id: I17a6f026c278f73fa1dd82c49c338a29bb9c4898

Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Please cherry-pick the Ceph OSD deployment order fix from 6.1 and add it to the next maintenance updates for 6.0 and 5.1.1.

Set to High since it is related to an already issued Tech Bulletin, may require a lot of time to fix, and may lead to data loss.
https://online.mirantis.com/hubfs/Technical_Bulletins/Mirantis-Technical-Bulletin-5-Removing-Ceph-OSD-node.pdf

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

We don't support master node updates in 5.1/5.1.1 and 6.0 maintenance updates. Setting to Invalid for 5.1.1 and 6.0 updates.
