Prometheus module for scraping won't be enabled when metrics-endpoint-relation is added at the deployment time

Bug #2042891 reported by Samuel Allan
This bug affects 2 people
Affects: Ceph Monitor Charm
Status: Fix Committed
Importance: High
Assigned to: Peter Sabaini

Bug Description

On a fresh deployment, prometheus scraping to collect metrics doesn't work. If the relation between prometheus and ceph-mon is bounced, it begins working:

```
juju remove-relation ceph-mon cos-prometheus

# wait a few minutes

juju relate ceph-mon cos-prometheus
```

I haven't had time to investigate further yet.

Samuel Allan (samuelallan) wrote :

It seems the metrics server simply isn't running on the ceph-mon unit in this case.

Nobuto Murata (nobuto) wrote :

I took some time to look into this on another occasion.

[cos prometheus side]

- honor_labels: true
  job_name: juju_ceph-build-plus_415bb55c_ceph-mon_prometheus_scrape-0
  metrics_path: /metrics
  relabel_configs:
  - regex: (.*)
    separator: _
    source_labels:
    - juju_model
    - juju_model_uuid
    - juju_application
    - juju_unit
    target_label: instance
  scrape_interval: 15s
  static_configs:
  - labels:
      juju_application: ceph-mon
      juju_charm: ceph-mon
      juju_model: ceph-build-plus
      juju_model_uuid: 415bb55c-341e-4aa4-8851-679bec38c471
      juju_unit: ceph-mon/0
    targets:
    - <IP_ADDRESS_OF_CEPH_MON_0>:9283

[ceph-mon/0]

$ sudo ceph mgr module ls
MODULE
balancer on (always on)
crash on (always on)
devicehealth on (always on)
orchestrator on (always on)
pg_autoscaler on (always on)
progress on (always on)
rbd_support on (always on)
status on (always on)
telemetry on (always on)
volumes on (always on)
dashboard on
iostat on
nfs on
restful on
alerts -
...
prometheus -
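
To check this state programmatically, the table output above can be parsed as in the following minimal sketch. It assumes the plain-text layout shown in this report (module name followed by `on` or `-`); the function name `parse_mgr_modules` and the `SAMPLE` string are illustrative only and are not part of the charm or of Ceph.

```python
def parse_mgr_modules(text):
    """Map module name -> True (on) / False (off) from `ceph mgr module ls` table text."""
    modules = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the "MODULE" header, blank lines, and elided rows such as "...".
        if len(parts) < 2:
            continue
        name, state = parts[0], parts[1]
        if state == "on":
            modules[name] = True
        elif state == "-":
            modules[name] = False
    return modules


# Abbreviated copy of the output quoted in this comment (hypothetical sample input).
SAMPLE = """\
MODULE
balancer on (always on)
dashboard on
alerts -
...
prometheus -
"""

modules = parse_mgr_modules(SAMPLE)
print("prometheus enabled:", modules.get("prometheus", False))
```

With the sample above, the prometheus module is reported as not enabled, which matches nothing listening on port 9283 for the scrape job.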

[juju show-unit ceph-mon/0]

  - relation-id: 30
    endpoint: metrics-endpoint
    cross-model: true
    related-endpoint: configurable-scrape-jobs
    application-data: {}
    related-units:
      cos-prometheus-scrape-config/0:
        in-scope: true
        data:
          egress-subnets: <cluster IP of service/prometheus-scrape-config>/32
          ingress-address: <cluster IP of service/prometheus-scrape-config>
          private-address: <cluster IP of service/prometheus-scrape-config>

^^^ this looks the same as another environment where there is no race condition.

Nobuto Murata (nobuto) wrote :

This part of the code is skipped if Ceph is not yet bootstrapped when the hook runs, and nothing ever retries it afterwards.

    def _on_relation_changed(self, event):
        """Enable prometheus on relation change"""
        if self._charm.unit.is_leader() and ceph_utils.is_bootstrapped():
            logger.debug(
                "is_leader and is_bootstrapped, running rel changed: %s", event
            )
            mgr_config_set_rbd_stats_pools()
            ceph_utils.mgr_enable_module("prometheus")
            logger.debug("module_enabled")
            self.update_alert_rules()
            super()._on_relation_changed(event)

[when the relation is added at the deployment time (via bundle, etc.)]

# cat /var/log/juju/unit-ceph-mon-0.log | egrep 'metrics-endpoint-relation-|leader-|not bootstrapped'; ll /var/lib/ceph/mon/ceph-ceph-mon-1/done
2024-02-28 10:08:11 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-created" hook (via hook dispatching script: dispatch)
2024-02-28 10:08:16 INFO juju.worker.uniter resolver.go:165 found queued "leader-elected" hook
2024-02-28 10:08:35 INFO juju.worker.uniter.operation runhook.go:186 ran "leader-elected" hook (via hook dispatching script: dispatch)
2024-02-28 10:08:39 INFO unit.ceph-mon/0.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 10:08:40 INFO unit.ceph-mon/0.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 10:08:50 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)
2024-02-28 10:09:18 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)

-rw-r--r-- 1 root root 0 Feb 28 10:10 /var/lib/ceph/mon/ceph-ceph-mon-1/done

^^^ bootstrap was complete *after* running metrics-endpoint-relation-changed

[when the relation is added after the Ceph model settled]

# cat /var/log/juju/unit-ceph-mon-*.log | egrep 'metrics-endpoint-relation-|leader-|not bootstrapped'; ll /var/lib/ceph/mon/ceph-*/done
2024-02-28 04:43:43 INFO juju.worker.uniter resolver.go:165 found queued "leader-elected" hook
2024-02-28 04:43:46 INFO juju.worker.uniter.operation runhook.go:186 ran "leader-elected" hook (via hook dispatching script: dispatch)
2024-02-28 04:43:48 INFO unit.ceph-mon/1.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 04:43:49 INFO unit.ceph-mon/1.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 04:54:05 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-created" hook (via hook dispatching script: dispatch)
2024-02-28 04:54:07 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)
2024-02-28 04:54:15 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)

-rw-r--r-- 1 root root 0 Feb 28 04:44 /var/lib/ceph/mon/ceph-juju-704761-1-lxd-1/done

^^^ bootstrap was complete *before* running metrics-endpoint-relation-changed
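
The comparison between the two cases can be sketched as a small check: find when the relation-changed hook ran and compare it with the mtime of the `done` file. This is only an illustration. The helper names (`hook_time`, `missed_bootstrap`) are made up, the `done` file times are entered by hand from the `ll` output (minute resolution), and it assumes the unit log and the file listing use the same timezone.

```python
import re
from datetime import datetime

# Matches juju unit log lines of the form: <timestamp> ... ran "<hook>" hook ...
HOOK_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .* ran "([^"]+)" hook'
)


def hook_time(log_lines, hook_name):
    """Return the timestamp of the first 'ran "<hook_name>" hook' line, or None."""
    for line in log_lines:
        match = HOOK_RE.match(line)
        if match and match.group(2) == hook_name:
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return None


def missed_bootstrap(log_lines, done_mtime):
    """True if relation-changed ran before the bootstrap 'done' file was written."""
    changed = hook_time(log_lines, "metrics-endpoint-relation-changed")
    return changed is not None and changed < done_mtime


# Log lines copied from the two cases in this comment.
DEPLOY_LOG = [
    '2024-02-28 10:09:18 INFO juju.worker.uniter.operation runhook.go:186 '
    'ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)',
]
SETTLED_LOG = [
    '2024-02-28 04:54:15 INFO juju.worker.uniter.operation runhook.go:186 '
    'ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)',
]

print(missed_bootstrap(DEPLOY_LOG, datetime(2024, 2, 28, 10, 10)))   # relation added at deployment time
print(missed_bootstrap(SETTLED_LOG, datetime(2024, 2, 28, 4, 44)))   # relation added after the model settled
```

The first call reports the race (hook ran before bootstrap finished); the second does not.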

summary: - Prometheus scraping not working at first
+ Prometheus module for scraping won't be enabled when metrics-endpoint-relation is added at the deployment time
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high since it's always reproducible with a bundle deployment.

Nobuto Murata (nobuto) wrote :

This is when the relation is removed and re-added by hand.

2024-03-01 08:08:14 DEBUG unit.ceph-mon/0.juju-log server.go:325 metrics-endpoint:30: module_disabled
2024-03-01 08:08:15 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-departed" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:16 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-broken" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:38 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-created" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:39 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:41 DEBUG unit.ceph-mon/0.juju-log server.go:325 metrics-endpoint:31: module_enabled

-rw-r--r-- 1 root root 0 Feb 28 10:10 /var/lib/ceph/mon/ceph-ceph-mon-1/done

Peter Sabaini (peter-sabaini) wrote :

Thanks. It looks like we need to defer the relation-changed event if the cluster isn't bootstrapped yet, so that it is retried later.
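
A minimal sketch of that idea follows. It is not the merged patch: `FakeEvent` and `MetricsHandler` are hypothetical stand-ins so the example runs without the `ops` library or a Ceph cluster. The only real API it mirrors is the `defer()` method on ops events, which asks the framework to re-emit the event on a later hook run.

```python
class FakeEvent:
    """Stand-in for an ops relation event; only records whether defer() was called."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


class MetricsHandler:
    """Hypothetical handler; leadership and bootstrap checks are injected as callables."""

    def __init__(self, is_leader, is_bootstrapped):
        self._is_leader = is_leader
        self._is_bootstrapped = is_bootstrapped
        self.enabled_modules = []

    def on_relation_changed(self, event):
        if not self._is_leader():
            return
        if not self._is_bootstrapped():
            # Too early: ask for the event to be retried instead of dropping it.
            event.defer()
            return
        # Stands in for ceph_utils.mgr_enable_module("prometheus").
        self.enabled_modules.append("prometheus")


# Simulate the bundle deployment: relation-changed arrives before bootstrap.
state = {"bootstrapped": False}
handler = MetricsHandler(lambda: True, lambda: state["bootstrapped"])
event = FakeEvent()
handler.on_relation_changed(event)

# Bootstrap finishes; the framework would re-emit the deferred event on a later hook.
state["bootstrapped"] = True
if event.deferred:
    event.deferred = False
    handler.on_relation_changed(event)

print(handler.enabled_modules)
```

The important difference from the original handler is the explicit not-bootstrapped branch: the event is queued for a retry rather than silently falling through.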

Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Peter Sabaini (peter-sabaini)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/910706
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/762ad83c19cb5b699a1bdbe8c28e2d8dbef10e2b
Submitter: "Zuul (22348)"
Branch: master

commit 762ad83c19cb5b699a1bdbe8c28e2d8dbef10e2b
Author: Peter Sabaini <email address hidden>
Date: Fri Mar 1 09:57:09 2024 +0100

    Fix: defer cos-prometheus for bootstrap

    If a COS prometheus changed event is processed but bootstrap hasn't
    completed yet, we need to retry the event at a later time.

    Closes-bug: #2042891

    Change-Id: I3d274c09522f9d7ef56bc66f68d8488150c125d8

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914556

Nobuto Murata (nobuto)
tags: added: field-ceph-dashboard
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914556
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/83990c48051653f9d5c2d559a66f74047a414c7c
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 83990c48051653f9d5c2d559a66f74047a414c7c
Author: Peter Sabaini <email address hidden>
Date: Fri Mar 1 09:57:09 2024 +0100

    Fix: defer cos-prometheus for bootstrap

    If a COS prometheus changed event is processed but bootstrap hasn't
    completed yet, we need to retry the event at a later time.

    Closes-bug: #2042891
    (cherry-picked from commit 762ad83c19cb5b699a1bdbe8c28e2d8dbef10e2b)

    Change-Id: I5790e56f7879904504bee69924c2454c97c30474

Yoshi Kadokawa (yoshikadokawa) wrote :

I'm still seeing the same issue with the new relation between grafana-agent:cos-agent and ceph-mon:cos-agent.
For now, the workaround is to run the config-changed hook on the leader unit:

juju exec -u ceph-mon/leader "JUJU_DISPATCH_PATH=hooks/config-changed ./dispatch"

Nobuto Murata (nobuto) wrote :

The "new" issue has been moved out to a separate report:
https://bugs.launchpad.net/charm-ceph-mon/+bug/2074337
