One of the ceph-osd processes doesn't start during deployment

Bug #1419884 reported by Dennis Dmitriev
This bug affects 2 people
Affects                    Status        Importance  Assigned to      Milestone
Fuel for OpenStack         Fix Released  Medium      Stanislav Makar
Fuel for OpenStack 5.1.x   Invalid       High        MOS Maintenance
Fuel for OpenStack 6.0.x   Invalid       High        MOS Maintenance

Bug Description

System tests 'ceph_ha_one_controller_compact' and 'migrate_vm_backed_with_ceph' failed on CI: http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.centos.thread_1/32/

When the Ceph service was started on the controller, one of the two ceph-osd processes did not start:

 [root@node-2 ~]# service ceph status
=== mon.node-2 ===
mon.node-2: running {"version":"0.80.7"}
=== osd.1 ===
osd.1: not running.
=== osd.0 ===
osd.0: running {"version":"0.80.7"}
[root@node-2 ~]# echo $?
3
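
Note: the exit status 3 above follows the LSB convention for "program is not running", so a deployment check can detect the failure without parsing the output. A minimal sketch (not part of the original report):

# Fail the deployment step early if any local Ceph daemon listed in
# ceph.conf is not running; relies on the init script's non-zero
# exit status (3 per LSB) shown above.
if ! service ceph status >/dev/null 2>&1; then
    echo "ERROR: a Ceph daemon on $(hostname) is not running" >&2
    exit 1
fi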

[root@node-1 ~]# ceph status
    cluster 75098087-bb84-4fe5-8be6-55fc900ef80d
     health HEALTH_OK
     monmap e1: 1 mons at {node-2=10.108.17.4:6789/0}, election epoch 1, quorum 0 node-2
     osdmap e22: 6 osds: 5 up, 5 in
      pgmap v46: 1728 pgs, 6 pools, 12859 kB data, 5 objects
            10470 MB used, 236 GB / 246 GB avail
                1728 active+clean
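
The osdmap line ("6 osds: 5 up, 5 in") shows that the cluster itself sees one OSD as down. To identify which one, the standard CLI of this release can be used, e.g. (a sketch):

# Show the up/down state per OSD as the monitors see it;
# the missing daemon appears with state "down".
ceph osd tree | grep -w down
# The OSD map dump carries the same up/down and in/out flags.
ceph osd dump | grep -w down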

As found in the Ceph logs, the command 'osd crush create-or-move' never appeared for osd.1:

==== osd.0 is starting:
Feb 5 21:53:26 node-2 ceph-mon: 2015-02-05 18:53:26.559788 7eff644ef700 0 mon.node-2@0(leader) e1 handle_command mon_command({"prefix": "auth add", "entity": "osd.0", "caps": ["osd", "allow *", "mon", "allow profile osd"]} v 0) v1
Feb 5 21:53:27 node-2 ceph-mon: 2015-02-05 18:53:27.392325 7eff644ef700 0 mon.node-2@0(leader) e1 handle_command mon_command({"prefix": "osd crush create-or-move", "args": ["host=node-2", "root=default"], "id": 0, "weight": 0.050000000000000003} v 0) v1
Feb 5 21:53:27 node-2 ceph-mon: 2015-02-05 18:53:27.392494 7eff644ef700 0 mon.node-2@0(leader).osd e4 create-or-move crush item name 'osd.0' initial_weight 0.05 at location {host=node-2,root=default}

==== osd.1 is starting:
Feb 5 21:53:32 node-2 ceph-mon: 2015-02-05 18:53:32.797480 7eff644ef700 0 mon.node-2@0(leader) e1 handle_command mon_command({"prefix": "auth add", "entity": "osd.1", "caps": ["osd", "allow *", "mon", "allow profile osd"]} v 0) v1
Feb 5 21:53:32 node-2 puppet-user[18429]: (/Stage[main]/Ceph/Service[ceph]) Triggered 'refresh' from 1 events
Feb 5 21:53:32 node-2 puppet-user[18429]: (/Stage[main]/Ceph/Service[ceph]) Evaluated in 11.57 seconds
Feb 5 21:53:32 node-2 puppet-user[18429]: (Class[Ceph]) Starting to evaluate the resource
Feb 5 21:53:33 node-2 puppet-user[18429]: (Class[Ceph]) Evaluated in 0.06 seconds
Feb 5 21:53:33 node-2 puppet-user[18429]: (Stage[main]) Starting to evaluate the resource
Feb 5 21:53:33 node-2 puppet-user[18429]: (Stage[main]) Evaluated in 0.05 seconds
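
This can be confirmed from the node's syslog by grepping for the command per OSD id (a sketch; the log path assumes CentOS syslog, as in the excerpt above):

# "id" appears in the mon_command JSON of the log lines above, so a
# missing match means the monitor never handled create-or-move for
# that OSD.
for id in 0 1; do
    echo "== osd.$id =="
    grep 'create-or-move' /var/log/messages | grep "\"id\": $id" \
        || echo "create-or-move never issued for osd.$id"
done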

Tags: ceph
description: updated
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Ryan Moe (rmoe) wrote :

There are lots of errors like "IOError: [Errno 28] No space left on device" in the console log (http://jenkins-product.srt.mirantis.net:8080/job/6.1.system_test.centos.thread_1/32/consoleFull). What's up with those?

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The problem is still reproducible (with no 'No space left on device' errors):
[root@node-1 ~]# service ceph status
=== mon.node-1 ===
mon.node-1: running {"version":"0.80.7"}
=== osd.3 ===
osd.3: running {"version":"0.80.7"}
=== osd.7 ===
osd.7: not running.

Stanislav Makar (smakar)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)
status: Confirmed → In Progress
Revision history for this message
Stanislav Makar (smakar) wrote :

Investigated the problem: it is intermittent ("floating").
It mainly appears on controller nodes that also carry the ceph-osd role.
There are some ideas on how to fix it; I will provide a patch soon.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I think this bug should have Medium priority and no backports, since the related use case is not recommended: the ceph-osd role should not be placed on controller nodes, and there is a note about that in the docs.

Changed in fuel:
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/169226

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/169226
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c20b2f8463e962734222382a77f26b91a53fd264
Submitter: Jenkins
Branch: master

commit c20b2f8463e962734222382a77f26b91a53fd264
Author: Stanislav Makar <email address hidden>
Date: Tue Mar 31 09:07:18 2015 +0000

    Fix floating problem with OSD down

    * Add OSD activation and check that the osd process for each OSD has
      started.
    * Change the process of adding OSDs to the cluster: Ceph OSDs are now
      added one by one instead of all together, which allows checking each
      OSD's status as it is added to the cluster.

    Change-Id: I8e64c1b15ed92e6fb5939b6f41728efacae64319
    Closes-bug: #1419884
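
In effect the fix serializes OSD bring-up. A rough shell equivalent of the described behaviour (a sketch only, not the actual Puppet code from the review; OSD_IDS is a hypothetical list of local OSD ids):

for id in $OSD_IDS; do
    service ceph start osd.$id
    # Wait until this OSD's daemon is confirmed running before
    # adding the next one, so a silently dead osd process is
    # caught immediately instead of being masked by its peers.
    for attempt in $(seq 1 30); do
        service ceph status osd.$id >/dev/null 2>&1 && break
        sleep 2
    done
    service ceph status osd.$id >/dev/null 2>&1 \
        || { echo "osd.$id failed to start" >&2; exit 1; }
done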

Changed in fuel:
status: In Progress → Fix Committed
Stanislav Makar (smakar)
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/186792
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c110d0ae1a8d5b98f1ea5e4f4d762b5ccfdd2ffc
Submitter: Jenkins
Branch: master

commit c110d0ae1a8d5b98f1ea5e4f4d762b5ccfdd2ffc
Author: Stanislav Makar <email address hidden>
Date: Fri May 29 13:36:05 2015 +0000

    Add tests for ceph module

    Related-bug: #1419884
    Fuel-CI: disable
    Change-Id: I17a6f026c278f73fa1dd82c49c338a29bb9c4898

Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Please cherry-pick the Ceph OSD deployment order fix from 6.1 and add it to the next maintenance updates for 6.0 and 5.1.1.

Set to High since it is related to an already issued Tech Bulletin, may require a lot of time to fix, and may lead to data loss.
https://online.mirantis.com/hubfs/Technical_Bulletins/Mirantis-Technical-Bulletin-5-Removing-Ceph-OSD-node.pdf

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

We don't support master node updates in 5.1/5.1.1 and 6.0 maintenance updates. Setting to Invalid for 5.1.1 and 6.0 updates.
