mgr crashes in 16.2.5 / clock-skew

Bug #1943423 reported by sascha arthur
This bug affects 1 person
Affects: ceph (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Hello,

Running inside a KVM guest on Ubuntu Impish with the latest Ceph version.
I can reproduce it on at least 3 freshly reinstalled Ceph clusters.

Here's the crash info for my mgrs:

ceph crash info 2021-09-12T21:09:22.866793Z_2419107c-082c-457a-b5e6-a376d779b32f

{
    "archived": "2021-09-13 07:59:37.681606",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libc.so.6(+0x46510) [0x7f62e074e510]",
        "(std::_Rb_tree_increment(std::_Rb_tree_node_base*)+0x13) [0x7f62e0af5573]",
        "(PGMap::apply_incremental(ceph::common::CephContext*, PGMap::Incremental const&)+0xb60) [0x5633639f6320]",
        "(ClusterState::notify_osdmap(OSDMap const&)+0x29d) [0x563363a89f1d]",
        "(Mgr::handle_osd_map()+0x854) [0x563363ae0694]",
        "(Mgr::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x568) [0x563363ae0eb8]",
        "(MgrStandby::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xb8) [0x563363af1118]",
        "(Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x450) [0x7f62e1101d30]",
        "(DispatchQueue::entry()+0x647) [0x7f62e10ff0e7]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f62e11be921]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x988d7) [0x7f62e07a08d7]",
        "/lib/x86_64-linux-gnu/libc.so.6(+0x129510) [0x7f62e0831510]"
    ],
    "ceph_version": "16.2.5",
    "crash_id": "2021-09-12T21:09:22.866793Z_2419107c-082c-457a-b5e6-a376d779b32f",
    "entity_name": "mgr.ceph-00002",
    "os_id": "21.10",
    "os_name": "Ubuntu Impish Indri (development branch)",
    "os_version": "21.10 (Impish Indri)",
    "os_version_id": "21.10",
    "process_name": "ceph-mgr",
    "stack_sig": "eccaccb958ebf382237486176ce43b704db9b0ec4b004a7697e77140821e88e9",
    "timestamp": "2021-09-12T21:09:22.866793Z",
    "utsname_hostname": "ceph-00002.dc-003.xxx",
    "utsname_machine": "x86_64",
    "utsname_release": "5.13.0-14-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#14-Ubuntu SMP Mon Aug 2 12:43:35 UTC 2021"
}
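For anyone comparing this crash against other reports, a minimal sketch that reduces the "backtrace" array of a saved `ceph crash info` JSON dump to bare function names may help. The regex is derived only from the frame format shown above, and the function name `summarize_backtrace` is hypothetical, not part of any Ceph tooling.

```python
import json
import re

# Frame format assumed from the dump above: "module(symbol+0xOFFSET) [0xADDRESS]".
# Either the module or the symbol may be empty.
FRAME_RE = re.compile(
    r"^(?P<module>[^()]*)\((?P<symbol>.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]$"
)


def summarize_backtrace(crash):
    """Return one short name per backtrace frame (hypothetical helper)."""
    names = []
    for raw in crash.get("backtrace", []):
        match = FRAME_RE.match(raw)
        if match is None:
            # Unknown format: keep the raw frame rather than guess.
            names.append(raw)
            continue
        # Fall back to the module path when the frame has no symbol.
        name = match.group("symbol") or match.group("module")
        # Drop the C++ argument list, keeping only the qualified name.
        names.append(name.split("(", 1)[0])
    return names


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py crash.json
    with open(sys.argv[1]) as handle:
        for frame_name in summarize_backtrace(json.load(handle)):
            print(frame_name)
```

Saving the output of the `ceph crash info <crash_id>` command above to a file and passing that file to the script would print one name per frame, which makes it easier to see that the crash passes through `PGMap::apply_incremental`.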

On top of that, the MGRs sometimes have "clock-skew" issues: even though the daemon is still running, it loses its connection and is kicked out of the cluster. The host and the KVM guest are definitely NTP-synchronized.

I'm not sure whether this "clock-skew" is related to the crash here, but I will post the log as soon as I have it again.

sascha arthur (sarthur) wrote :

Here are the last log lines from when the MGR is running but has been kicked out of the cluster:

2021-09-14T00:00:42.062+0000 7f05361b8640 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw|rbd-mirror|cephfs-mirror (PID: 1137578) UID: 0
2021-09-14T00:00:42.530+0000 7f0526ffd640 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-09-13T23:00:42.534987+0000)
2021-09-14T00:00:43.530+0000 7f0526ffd640 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2021-09-13T23:00:43.535155+0000)
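In both `_check_auth_rotating` lines above, the "before" cutoff sits roughly one hour behind the timestamp at which the line was logged. A minimal sketch for extracting that gap from a longer mgr log follows; the regex is based only on the two lines shown here, and the function name `cutoff_gaps` is hypothetical.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the log lines above.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

# Captures the logging timestamp and the "(before ...)" cutoff.
LINE_RE = re.compile(
    r"^(?P<logged>\S+) \S+ +-1 monclient: _check_auth_rotating "
    r"possible clock skew.*\(before (?P<cutoff>[^)]+)\)"
)


def cutoff_gaps(lines):
    """Return, per matching line, seconds between log time and cutoff."""
    gaps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a clock-skew warning
        logged = datetime.strptime(match["logged"], TS_FORMAT)
        cutoff = datetime.strptime(match["cutoff"], TS_FORMAT)
        gaps.append((logged - cutoff).total_seconds())
    return gaps


if __name__ == "__main__":
    import sys

    # Usage: python gaps.py < ceph-mgr.log
    for gap in cutoff_gaps(sys.stdin):
        print(f"{gap:.3f} s")
```

A gap that stays constant across many lines would suggest the cutoff is derived from the local clock by a fixed interval; it does not by itself show whether the host clock is actually skewed.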

sascha arthur (sarthur) wrote :

I added the following cron job on top of ntpd to see if it solves the issue:

0 * * * * /usr/sbin/hwclock -w --verbose --update-drift >> /tmp/hwclock.log

sascha arthur (sarthur) wrote :

Sadly, it didn't solve the issue. Here is more information about it:

https://tracker.ceph.com/issues/23460

Sadly, I haven't found a solution so far.
