Ceph Dashboard Charm

[COS] Ceph Grafana dashboard has "no data" panels

Bug #2041500 reported by Nobuto Murata on 2023-10-27


Affects		Status	Importance	Assigned to	Milestone
	Ceph Dashboard Charm	New	Undecided	Unassigned

Bug Description

The Ceph Grafana dashboard has had known issues for some time, e.g.:

https://bugs.launchpad.net/charm-ceph-dashboard/+bug/1982910
https://bugs.launchpad.net/charm-ceph-dashboard/+bug/1982912
https://bugs.launchpad.net/charm-ceph-dashboard/+bug/1989648
https://bugs.launchpad.net/charm-ceph-dashboard/+bug/1982537

That has not changed much even after migrating from Telegraf to node-exporter as part of COS.

This bug tracks the work to make the dashboards work with Charmed Ceph. I captured screenshots with both the current JSONs in the charm (in https://review.opendev.org/c/openstack/charm-ceph-dashboard/+/896248/5) and the upstream JSONs.

https://drive.google.com/drive/folders/1ds2gSRnOX_L4SfRv7HfptkrI9mA7mZ5Q?usp=sharing




Nobuto Murata (nobuto) wrote on 2023-10-27:

Screenshot 2023-10-27 at 15-19-45 OSD Overview - Dashboards - Dashboards - Grafana.png (271.7 KiB, image/png)


Nobuto Murata (nobuto) wrote on 2023-10-27:

Subscribing ~field-high as Ceph (Grafana) dashboard is not functioning.

Nobuto Murata (nobuto) on 2023-10-31: description updated


Nobuto Murata (nobuto) wrote on 2023-10-31:

One example of a "no data" query is "sum(irate(ceph_osd_recovery_ops[1m]))", which appears in both the charm's dashboard and the upstream one.

[charm]
https://github.com/openstack/charm-ceph-dashboard/blob/4ee08c02972ba174ba379728e9ab1f045bacd1a4/src/dashboards/ceph-cluster.json#L1426

[upstream]
https://github.com/ceph/ceph/blob/21548fe806cf259deac1421530d5ce720be17997/monitoring/ceph-mixin/dashboards_out/ceph-cluster.json#L1107

This happens because the scrape_interval in COS is 1m, whereas Ceph upstream expects 15s. As a result, the 1m range in the query above does not contain the two data points that irate needs to compute a rate.
https://prometheus.io/docs/prometheus/latest/querying/functions/#irate
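
To illustrate the reasoning above, here is a minimal sketch in plain shell arithmetic. The helper names min_samples and irate_has_data are hypothetical (they are not part of Prometheus or Juju), and the model is simplified: it only counts how many scrapes are guaranteed to fall inside the range, ignoring timestamp jitter and alignment.

```shell
#!/bin/sh
# Hypothetical helper: lower bound on the number of samples inside a range
# selector of WINDOW seconds when the target is scraped every INTERVAL seconds.
min_samples() {
    interval=$1
    window=$2
    echo $(( window / interval ))
}

# irate() uses the last two samples in the range, so it needs at least two.
irate_has_data() {
    if [ "$(min_samples "$1" "$2")" -ge 2 ]; then
        echo "data"
    else
        echo "no data"
    fi
}

echo "scrape_interval=60s, range=1m: $(irate_has_data 60 60)"
echo "scrape_interval=15s, range=1m: $(irate_has_data 15 60)"
```

Under this simplified model, the COS default (60s) leaves only one guaranteed sample in a 1m range, while the 15s interval expected by Ceph upstream leaves four.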

Customizing the scrape_interval is "strongly discouraged", so a workaround is to place the prometheus-scrape-config-k8s charm between ceph-mon and Prometheus.
https://github.com/canonical/prometheus-k8s-operator/blob/16ba0e867b571d17ac8e87af7ab5720228d53d52/lib/charms/prometheus_k8s/v0/prometheus_scrape.py#L172-L190

# LP: #2041500
# the interval is from:
# https://docs.ceph.com/en/latest/mgr/prometheus/#confval-mgr-prometheus-scrape_interval
juju deploy -m cos prometheus-scrape-config-k8s prometheus-scrape-config --config scrape_interval=15s
juju integrate -m cos prometheus:metrics-endpoint prometheus-scrape-config:metrics-endpoint

juju offer -m cos prometheus-scrape-config:configurable-scrape-jobs
juju consume cos.prometheus-scrape-config cos-prometheus-scrape-config
juju integrate ceph-mon:metrics-endpoint cos-prometheus-scrape-config:configurable-scrape-jobs