Ceph dashboard is not enabled and hook fails "dashboard-relation-changed"

Bug #1952282 reported by Bas de Bruijne
This bug affects 6 people
Affects               Status         Importance  Assigned to  Milestone
Ceph Dashboard Charm  Fix Committed  High        Unassigned
Quincy.2              Fix Released   Undecided   Unassigned

Bug Description

Run fails on juju wait timeout because ceph dashboard dies:

------------------------------------------------
ceph-mon/0* waiting executing 0/lxd/1 10.246.65.28 Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (6)
  ceph-dashboard/0* blocked idle 10.246.65.28 Dashboard is not enabled
  logrotated/15 active idle 10.246.65.28 Unit is ready.
ceph-mon/1 active executing 2/lxd/1 10.246.65.55 Unit is ready and clustered
  ceph-dashboard/1 error idle 10.246.65.55 hook failed: "dashboard-relation-changed"
  logrotated/20 active idle 10.246.65.55 Unit is ready.
ceph-mon/2 active executing 4/lxd/1 10.246.65.52 Unit is ready and clustered
  ceph-dashboard/2 error idle 10.246.65.52 hook failed: "dashboard-relation-changed"
  logrotated/22 active idle 10.246.65.52 Unit is ready.
------------------------------------------------

Ceph dashboard log:
------------------------------------------------
2021-11-24 17:26:00 INFO unit.ceph-dashboard/2.juju-log server.go:327 dashboard:74: Requesting a CA certificate. Common name: juju-af480f-4-lxd-1.prodymcprodface.solutionsqa, SANS: ['10.246.65.20', 'juju-af480f-4-lxd-1']
2021-11-24 17:26:01 ERROR unit.ceph-dashboard/2.juju-log server.go:327 dashboard:74: Command failed: b"Error ENOTSUP: Module 'dashboard' is not enabled (required by command 'dashboard debug'): use `ceph mgr module enable dashboard` to enable it\n"
Traceback (most recent call last):
  File "./src/charm.py", line 378, in _run_cmd
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ceph', 'dashboard', 'debug', 'disable']' returned non-zero exit status 95.
------------------------------------------------

Later:
------------------------------------------------
2021-11-24 17:26:11 ERROR unit.ceph-dashboard/2.juju-log server.go:327 dashboard:74: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 597, in <module>
    main(CephDashboardCharm)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/main.py", line 406, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/main.py", line 140, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 278, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 722, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 767, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/src/interface_dashboard.py", line 50, in on_changed
    self.on.mon_ready.emit()
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 278, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 722, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 767, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 427, in _configure_dashboard
    self._configure_tls()
  File "./src/charm.py", line 539, in _configure_tls
    ceph_utils.dashboard_set_ssl_certificate(
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/charms_ceph/utils.py", line 3527, in _dashboard_set_ssl_artifact
    subprocess.check_call(cmd)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ceph', 'dashboard', 'set-ssl-certificate', 'juju-af480f-4-lxd-1', '-i', PosixPath('/etc/ceph/ceph-dashboard.crt')]' returned non-zero exit status 95.
2021-11-24 17:26:11 ERROR juju.worker.uniter.operation runhook.go:146 hook "dashboard-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
------------------------------------------------
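
The exit statuses in the logs above appear to be errno values: status 95 is reported alongside "Error ENOTSUP". A minimal sketch of telling the two failure modes in this bug apart by return code follows. It assumes the ceph CLI exits with the errno of the failure, which the logs suggest but which is not confirmed here. The function name and category strings are hypothetical, and status 5 (EIO) is included because it shows up in later comments on this bug.

```python
import errno

# Assumption: the ceph CLI exits with the errno value of the failure,
# as the logs suggest (status 95 next to "Error ENOTSUP").
_NOT_SUPPORTED = {errno.ENOTSUP, errno.EOPNOTSUPP}


def classify_ceph_failure(returncode: int) -> str:
    """Map a ceph CLI exit status to a coarse, hypothetical category."""
    if returncode == 0:
        return "ok"
    if returncode in _NOT_SUPPORTED:
        # Seen here when the dashboard mgr module was not enabled yet.
        return "module-not-enabled"
    if returncode == errno.EIO:
        # Seen in later comments when the module itself had errored out.
        return "module-error"
    return "unknown"
```

A caller catching `subprocess.CalledProcessError` could pass `exc.returncode` to this helper to decide whether to retry later or surface the error.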

Testruns where this happened (3 today):
https://solutions.qa.canonical.com/testruns/testRun/6a264856-6a03-4995-b531-fa5f936ba7ad
https://solutions.qa.canonical.com/testruns/testRun/eaf26a04-7428-41e4-8fb7-93941b0112eb
https://solutions.qa.canonical.com/testruns/testRun/abf5ac6c-3b5a-405b-a565-719123d8703d

With artifacts respectively:
https://oil-jenkins.canonical.com/artifacts/6a264856-6a03-4995-b531-fa5f936ba7ad/index.html
https://oil-jenkins.canonical.com/artifacts/eaf26a04-7428-41e4-8fb7-93941b0112eb/index.html
https://oil-jenkins.canonical.com/artifacts/abf5ac6c-3b5a-405b-a565-719123d8703d/index.html

We also had a run with the same configuration where this did not happen and ceph-dashboard is happy:
https://solutions.qa.canonical.com/testruns/testRun/276098a9-923d-4d92-86e0-3d177b13e6b9
with artifacts: https://oil-jenkins.canonical.com/artifacts/276098a9-923d-4d92-86e0-3d177b13e6b9/index.html

Future occurrences can be found here: https://solutions.qa.canonical.com/bugs/bugs/bug/1952282

Billy Olsen (billy-olsen) wrote :

It looks like this is happening on the non-leader units, which suggests there's a bit of a race. The leader unit checks whether the dashboard is enabled and, if it's not, enables it. However, all units attempt to apply the charm options [0] to the dashboard, regardless of whether the dashboard module is enabled, which is where things run into problems.

From the logs, the leader unit (ceph-dashboard/0, machine 0/lxd/1) successfully executes this sequence of events at 04:51:11 whereas ceph-dashboard/1 fails this sequence of events at 04:49:31. I think the non-leader units need to check whether or not the dashboard is enabled prior to setting the configuration or defer the event.

[0] - https://opendev.org/openstack/charm-ceph-dashboard/src/branch/master/src/charm.py#L426
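
The check-or-defer idea described above can be sketched as follows. This is not the charm's actual fix, only an illustration under assumptions: it assumes `ceph mgr module ls --format json` returns a JSON object with an `enabled_modules` list, and the helper names `dashboard_enabled`, `is_dashboard_enabled`, and `configure_if_ready` are hypothetical. `event.defer()` is the ops framework call for re-queuing an event.

```python
import json
import subprocess


def dashboard_enabled(module_ls_json: str) -> bool:
    """Return True if 'dashboard' is listed under enabled_modules."""
    data = json.loads(module_ls_json)
    return "dashboard" in data.get("enabled_modules", [])


def is_dashboard_enabled() -> bool:
    """Ask the cluster; needs a working ceph CLI and keyring on the unit."""
    out = subprocess.check_output(
        ["ceph", "mgr", "module", "ls", "--format", "json"])
    return dashboard_enabled(out.decode())


def configure_if_ready(event, configure, check=is_dashboard_enabled) -> bool:
    """Hypothetical guard: defer the event until the module is enabled.

    `event` is any object with a defer() method (e.g. an ops event);
    `configure` is the callable that applies the dashboard options.
    """
    if not check():
        event.defer()  # retry on a later hook instead of failing
        return False
    configure()
    return True
```

Passing `check` as a parameter keeps the guard testable without a cluster; in a charm, the default would query the real CLI.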

Changed in charm-ceph-dashboard:
status: New → Triaged
importance: Undecided → High
Alexander Balderson (asbalderson) wrote :

We're using the next version of the charm on these deployments.

Bas de Bruijne (basdbruijne) wrote :

I'm also seeing this with the message `hook failed: "grafana-dashboard-relation-changed"`, in testrun https://solutions.qa.canonical.com/testruns/testRun/9e2d7f5d-7076-43f0-a522-b44ef20c5d37.

The messages in the logs look the same:
```
2022-07-01 16:43:10 ERROR unit.ceph-dashboard/2.juju-log server.go:319 grafana-dashboard:92: Command failed: b"Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: [('x509 certificate routines', 'X509_check_private_key', 'key values mismatch')]\n"
Traceback (most recent call last):
  File "./src/charm.py", line 379, in _run_cmd
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ceph', 'dashboard', 'motd', 'clear']' returned non-zero exit status 5.
```

and
```
2022-07-01 16:43:11 ERROR unit.ceph-dashboard/2.juju-log server.go:319 grafana-dashboard:92: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 645, in <module>
    main(get_charm_class_for_release())
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/main.py", line 431, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 283, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 743, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 790, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/src/interface_grafana_dashboard.py", line 41, in _on_relation_changed
    self.on.dash_ready.emit()
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 283, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 743, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/framework.py", line 790, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 428, in _configure_dashboard
    self._configure_tls()
  File "./src/charm.py", line 552, in _configure_tls
    ceph_utils.dashboard_set_ssl_certificate(
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/charms_ceph/utils.py", line 3569, in _dashboard_set_ssl_artifact
    subprocess.check_call(cmd)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ceph', 'dashboard', 'set-ssl-certificate', 'juju-a8584f-1-lxd-1', '-i', PosixPath('/etc/ceph/ceph-dashboard.crt')]' returned non-zero exit status 5.
```

The crashdumps for this testrun can be f...


Jeffrey Chang (modern911) wrote :

This bug occurred 11 times, on both focal and jammy SKUs, out of 250+ yoga runs in the last 2 weeks.
Not sure what change made it more frequent than we saw earlier.

Listing some recent runs here:
https://solutions.qa.canonical.com/v2/testruns/3b28aba4-9d69-4c9e-ace9-db2814e479bf
https://solutions.qa.canonical.com/v2/testruns/ae9bfc4a-f4be-41a3-8350-f19cd0c956f5
https://solutions.qa.canonical.com/v2/testruns/3786d698-a14e-4a19-8ce0-6b195752713b

Alexander Litvinov (alitvinov) wrote :

Seeing the same issue on a customer deployment.
Channel quincy/stable Rev 25

subprocess.CalledProcessError: Command '['ceph', 'dashboard', 'set-ssl-certificate', 'juju-3199d9-4-lxd-1', '-i', PosixPath('/etc/ceph/ceph-dashboard.crt')]' returned non-zero exit status 5.

Executing it manually gives:
ubuntu@juju-3199d9-7-lxd-1:~$ sudo ceph dashboard set-ssl-certificate juju-3199d9-4-lxd-1 -i /etc/ceph/ceph-dashboard.crt
Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: [('x509 certificate routines', '', 'key values mismatch')]

I also tried running hooks on the leader and rebooting the units.
I have not been able to find any workaround so far.

subscribing ~field-high

Nobuto Murata (nobuto) wrote :

A similar error was reported to the Rook project, FWIW:
https://github.com/rook/rook/issues/4207

There is no mention of an upstream bug number, but some workarounds are listed:
> ceph config-key rm mgr/dashboard/key
> ceph config-key rm mgr/dashboard/crt
> ceph dashboard create-self-signed-cert

Which is basically clearing out the key/value store before executing the command :(
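
The "key values mismatch" text in these errors is OpenSSL reporting that a certificate's public key does not match the private key it is paired with. A minimal sketch for checking a certificate/key pair locally before handing it to the dashboard follows. The helper name `cert_matches_key` is hypothetical, and the sketch assumes PEM files and an unencrypted private key. It uses Python's standard `ssl` module, whose `load_cert_chain` raises `ssl.SSLError` when the pair does not match.

```python
import ssl


def cert_matches_key(cert_path: str, key_path: str) -> bool:
    """Return True if the PEM certificate and private key load as a pair.

    Assumes the key is not password protected; an encrypted key would
    make OpenSSL prompt for a passphrase.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # Loads both files and verifies that the private key matches
        # the certificate's public key.
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except (ssl.SSLError, OSError):
        # Covers a mismatched pair, malformed PEM, or unreadable files.
        return False
    return True
```

A `False` result does not say which side is stale, only that the two files should not be pushed to the cluster as a pair.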

Alexander Litvinov (alitvinov) wrote :

@Nobuto

ubuntu@juju-3199d9-7-lxd-1:~$ sudo ceph config-key rm mgr/dashboard/key
key deleted
ubuntu@juju-3199d9-7-lxd-1:~$ sudo ceph config-key rm mgr/dashboard/crt
key deleted
ubuntu@juju-3199d9-7-lxd-1:~$ sudo ceph dashboard create-self-signed-cert
Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: [('x509 certificate routines', '', 'key values mismatch')]

Natalia Litvinova (natalytvinova) wrote :

Subscribing field-critical: this just broke the deployment on a final redeploy before the handover.

tags: added: cdo-qa foundations-engine
Changed in charm-ceph-dashboard:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-dashboard (master)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-dashboard (master)

Change abandoned by "utkarsh bhatt <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/884364
Reason: Reraised at https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/885009

Bas de Bruijne (basdbruijne) wrote :

While testing the quincy/edge/chrome0 channel, we noticed a similar issue except that the failed hook is "radosgw-dashboard-relation-changed" (we can open a new bug if you prefer).

In the logs we see a couple of ceph commands failing with exit status 5, ending with:
==============
Traceback (most recent call last):
  File "./src/charm.py", line 378, in _run_cmd
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ceph', 'dashboard', 'set-audit-api-log-payload', 'True']' returned non-zero exit status 5.
2023-06-10 02:37:29 ERROR unit.ceph-dashboard/2.juju-log server.go:316 dashboard:103: Command failed: b"Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: [('x509 certificate routines', 'X509_check_private_key', 'key values mismatch')]\n"
Traceback (most recent call last):
  File "./src/charm.py", line 378, in _run_cmd
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ceph', 'dashboard', 'motd', 'clear']' returned non-zero exit status 5.
2023-06-10 02:37:29 DEBUG unit.ceph-dashboard/2.juju-log server.go:316 dashboard:103: Attempting to collect TLS config from relation
2023-06-10 02:37:29 DEBUG unit.ceph-dashboard/2.dashboard-relation-changed logger.go:60 Updating certificates in /etc/ssl/certs...
2023-06-10 02:37:30 DEBUG unit.ceph-dashboard/2.dashboard-relation-changed logger.go:60 0 added, 0 removed; done.
2023-06-10 02:37:30 DEBUG unit.ceph-dashboard/2.dashboard-relation-changed logger.go:60 Running hooks in /etc/ca-certificates/update.d...
2023-06-10 02:37:30 DEBUG unit.ceph-dashboard/2.dashboard-relation-changed logger.go:60 done.
2023-06-10 02:37:30 DEBUG unit.ceph-dashboard/2.juju-log server.go:316 dashboard:103: ['ceph', 'dashboard', 'set-ssl-certificate', 'juju-f98180-0-lxd-1', '-i', PosixPath('/etc/ceph/ceph-dashboard.crt')]
2023-06-10 02:37:31 WARNING unit.ceph-dashboard/2.dashboard-relation-changed logger.go:60 Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: [('x509 certificate routines', 'X509_check_private_key', 'key values mismatch')]
2023-06-10 02:37:31 ERROR unit.ceph-dashboard/2.juju-log server.go:316 dashboard:103: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 632, in <module>
    main(CephDashboardCharm)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/main.py", line 431, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-ceph-dashboard-2/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  ...


Alan Baghumian (alanbach) wrote :

FWIW, the initial deployment using Vault went just fine:

$ juju deploy --series jammy --channel quincy/stable ceph-dashboard ceph-dashboard-ssd
$ juju add-relation ceph-dashboard-ssd:dashboard ceph-mon-ssd:dashboard
$ juju add-relation ceph-dashboard-ssd:certificates vault:certificates

However, I then decided to add my Let's Encrypt certificate, which is where things went south:

$ juju config ceph-dashboard-ssd ssl_ca="$(sudo openssl crl2pkcs7 -nocrl -certfile /etc/letsencrypt/live/int.hrizn.cloud/fullchain.pem | openssl pkcs7 -print_certs -outform PEM | base64)" ssl_cert="$(sudo openssl x509 -in /etc/letsencrypt/live/int.hrizn.cloud/fullchain.pem -outform PEM | base64)" ssl_key="$(sudo cat /etc/letsencrypt/live/int.hrizn.cloud/privkey.pem | base64)"

$ juju config ceph-dashboard-ssd public-hostname="ceph.int.hrizn.cloud"

$ juju remove-relation ceph-dashboard-ssd:certificates vault:certificates

These messages flooded the Mon logs and the dashboard units went into a relation error state:

2023-07-23T23:39:46.364+0000 7f8df3621640 -1 mgr.server reply reply (5) Input/output error Module 'dashboard' has experienced an error and cannot handle commands: [('x509 certificate routines', '', 'key values mismatch')]

Resetting the SSL juju config keys, deleting the mgr config keys, and then adding the juju vault relation back did not fix the issue:

$ juju config ceph-dashboard-ssd --reset ssl_ca
$ juju config ceph-dashboard-ssd --reset ssl_key
$ juju config ceph-dashboard-ssd --reset ssl_cert

root@juju-b096f0-88-lxd-0:/var/log/ceph# ceph config-key rm mgr/dashboard/ca
key deleted

root@juju-b096f0-88-lxd-0:/var/log/ceph# ceph config-key rm mgr/dashboard/key
key deleted

root@juju-b096f0-88-lxd-0:/var/log/ceph# ceph config-key rm mgr/dashboard/crt
key deleted

Any updates on the progress?

Thanks,
Alan

utkarsh bhatt (utkarshbhatthere) wrote :

Hey, this bug is one of the active items we are working on in our current pulse. Unfortunately, we have trouble reproducing it reliably (I have not been able to yet), so your steps here, @alanbach, will be helpful. I am hopeful we'll pinpoint the root cause soon.

Thanks,
Utkarsh Bhatt

Alan Baghumian (alanbach) wrote :

I was able to return to a healthy Vault-based deployment by:

- Removing the juju application.
- Purging the ceph-mgr-dashboard packages:

$ juju run -a ceph-mon-ssd 'sudo apt-get -y --purge remove ceph-mgr-dashboard'

- Re-deploying as shown in my previous comment.

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-dashboard (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/884479
Committed: https://opendev.org/openstack/charm-ceph-dashboard/commit/53f664d0d19f0f066284fa1e79856b852d3c5c1a
Submitter: "Zuul (22348)"
Branch: master

commit 53f664d0d19f0f066284fa1e79856b852d3c5c1a
Author: utkarshbhatthere <email address hidden>
Date: Fri May 26 16:47:16 2023 +0530

    Fixes SSL conflicts between relation and config data.

    The fix adds event based handling of SSL configuration using charm
    config and cleanup of SSL for relation and config based key/certs.
    It also adds logical abstractions to analyse SSL setup and emit
    relevant events.

    Closes-Bug: 1952282
    Change-Id: Ic486434526f639f5985cfe355e303c1d6ff5fa0d
    Signed-off-by: utkarshbhatthere <email address hidden>
    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/1090

Changed in charm-ceph-dashboard:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-dashboard (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896617

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-dashboard (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896617
Committed: https://opendev.org/openstack/charm-ceph-dashboard/commit/47bf770ca75d2cb606ffeacca780c9929295f453
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 47bf770ca75d2cb606ffeacca780c9929295f453
Author: Peter Sabaini <email address hidden>
Date: Wed Sep 27 10:22:34 2023 +0200

    Fixes SSL conflicts between relation and config data.

    The fix adds event based handling of SSL configuration using charm
    config and cleanup of SSL for relation and config based key/certs.
    It also adds logical abstractions to analyse SSL setup and emit
    relevant events.

    Closes-Bug: 1952282
    Cherry-pick from: Ic486434526f639f5985cfe355e303c1d6ff5fa0d

    Change-Id: I2ad140b23a5d3e2e078d923afd039c4c904e0652

Changed in charm-ceph-dashboard:
status: Fix Committed → Fix Released
Ponnuvel Palaniyappan (pponnuvel) wrote :

Are there plans to backport this fix to octopus and pacific channels?

Felipe Reyes (freyes)
Changed in charm-ceph-dashboard:
status: Fix Released → Fix Committed