Prometheus module for scraping won't be enabled when metrics-endpoint-relation is added at the deployment time

Bug #2042891 reported by Samuel Allan
This bug affects 2 people
Affects             Status         Importance  Assigned to    Milestone
Ceph Monitor Charm  Fix Committed  High        Peter Sabaini

Bug Description

On a fresh deployment, Prometheus scraping of ceph-mon metrics doesn't work. If the relation between prometheus and ceph-mon is removed and re-added, it begins working:

```
juju remove-relation ceph-mon cos-prometheus

# wait a few minutes

juju relate ceph-mon cos-prometheus
```

I haven't had time to investigate further yet.

Samuel Allan (samuelallan) wrote :

It seems the metrics server simply isn't running on the ceph-mon unit in this case.

Nobuto Murata (nobuto) wrote :

I took some time to look into this on another occasion.

[cos prometheus side]

- honor_labels: true
  job_name: juju_ceph-build-plus_415bb55c_ceph-mon_prometheus_scrape-0
  metrics_path: /metrics
  relabel_configs:
  - regex: (.*)
    separator: _
    source_labels:
    - juju_model
    - juju_model_uuid
    - juju_application
    - juju_unit
    target_label: instance
  scrape_interval: 15s
  static_configs:
  - labels:
      juju_application: ceph-mon
      juju_charm: ceph-mon
      juju_model: ceph-build-plus
      juju_model_uuid: 415bb55c-341e-4aa4-8851-679bec38c471
      juju_unit: ceph-mon/0
    targets:
    - <IP_ADDRESS_OF_CEPH_MON_0>:9283

[ceph-mon/0]

$ sudo ceph mgr module ls
MODULE
balancer on (always on)
crash on (always on)
devicehealth on (always on)
orchestrator on (always on)
pg_autoscaler on (always on)
progress on (always on)
rbd_support on (always on)
status on (always on)
telemetry on (always on)
volumes on (always on)
dashboard on
iostat on
nfs on
restful on
alerts -
...
prometheus -
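
The listing above can be checked mechanically. The following is a minimal sketch, assuming the column layout shown in this comment (module name, then `on` or `-`); the function name `enabled_modules` and the shortened `SAMPLE` text are illustrative and not part of the charm or of Ceph.

```python
from __future__ import annotations


def enabled_modules(listing: str) -> set[str]:
    """Return the names of modules whose state column is 'on'."""
    enabled = set()
    for line in listing.splitlines():
        parts = line.split()
        # Skip the "MODULE" header, blank lines, and elided rows such as "...".
        if len(parts) < 2:
            continue
        name, state = parts[0], parts[1]
        if state == "on":
            enabled.add(name)
    return enabled


# A shortened copy of the output quoted above (assumed format).
SAMPLE = """\
MODULE
balancer on (always on)
dashboard on
alerts -
prometheus -
"""

print("prometheus" in enabled_modules(SAMPLE))
```

For the listing in this comment the check reports that the prometheus module is not enabled, which matches the scrape target on port 9283 not responding.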

[juju show-unit ceph-mon/0]

  - relation-id: 30
    endpoint: metrics-endpoint
    cross-model: true
    related-endpoint: configurable-scrape-jobs
    application-data: {}
    related-units:
      cos-prometheus-scrape-config/0:
        in-scope: true
        data:
          egress-subnets: <cluster IP of service/prometheus-scrape-config>/32
          ingress-address: <cluster IP of service/prometheus-scrape-config>
          private-address: <cluster IP of service/prometheus-scrape-config>

^^^ this looks the same as another environment where there is no race condition.

Nobuto Murata (nobuto) wrote :

This part of the code will never be retried if Ceph is not bootstrapped at that point.

    def _on_relation_changed(self, event):
        """Enable prometheus on relation change"""
        if self._charm.unit.is_leader() and ceph_utils.is_bootstrapped():
            logger.debug(
                "is_leader and is_bootstrapped, running rel changed: %s", event
            )
            mgr_config_set_rbd_stats_pools()
            ceph_utils.mgr_enable_module("prometheus")
            logger.debug("module_enabled")
            self.update_alert_rules()
            super()._on_relation_changed(event)

[when the relation is added at the deployment time (via bundle, etc.)]

# cat /var/log/juju/unit-ceph-mon-0.log | egrep 'metrics-endpoint-relation-|leader-|not bootstrapped'; ll /var/lib/ceph/mon/ceph-ceph-mon-1/done
2024-02-28 10:08:11 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-created" hook (via hook dispatching script: dispatch)
2024-02-28 10:08:16 INFO juju.worker.uniter resolver.go:165 found queued "leader-elected" hook
2024-02-28 10:08:35 INFO juju.worker.uniter.operation runhook.go:186 ran "leader-elected" hook (via hook dispatching script: dispatch)
2024-02-28 10:08:39 INFO unit.ceph-mon/0.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 10:08:40 INFO unit.ceph-mon/0.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 10:08:50 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)
2024-02-28 10:09:18 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)

-rw-r--r-- 1 root root 0 Feb 28 10:10 /var/lib/ceph/mon/ceph-ceph-mon-1/done

^^^ bootstrap was complete *after* running metrics-endpoint-relation-changed

[when the relation is added after the Ceph model settled]

# cat /var/log/juju/unit-ceph-mon-*.log | egrep 'metrics-endpoint-relation-|leader-|not bootstrapped'; ll /var/lib/ceph/mon/ceph-*/done
2024-02-28 04:43:43 INFO juju.worker.uniter resolver.go:165 found queued "leader-elected" hook
2024-02-28 04:43:46 INFO juju.worker.uniter.operation runhook.go:186 ran "leader-elected" hook (via hook dispatching script: dispatch)
2024-02-28 04:43:48 INFO unit.ceph-mon/1.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 04:43:49 INFO unit.ceph-mon/1.juju-log server.go:325 Ceph is not bootstrapped, skipping upgrade checks.
2024-02-28 04:54:05 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-created" hook (via hook dispatching script: dispatch)
2024-02-28 04:54:07 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)
2024-02-28 04:54:15 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)

-rw-r--r-- 1 root root 0 Feb 28 04:44 /var/lib/ceph/mon/ceph-juju-704761-1-lxd-1/done

^^^ bootstrap was complete *before* running metrics-endpoint-relation-changed
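
The comparison made in the two cases above can be expressed as a small check. This is only a sketch: the helper names are hypothetical, the log lines are copied from this comment, and the bootstrap times are taken from the `ll` output with the year 2024 and zero seconds assumed, since the listing shows neither.

```python
from datetime import datetime

HOOK = 'ran "metrics-endpoint-relation-changed" hook'


def changed_hook_times(log_lines):
    """Yield the timestamp of each metrics-endpoint-relation-changed run."""
    for line in log_lines:
        if HOOK in line:
            # Juju unit log lines start with "YYYY-MM-DD HH:MM:SS".
            yield datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def ran_before_bootstrap(log_lines, bootstrap_done):
    """True if any relation-changed hook ran before the bootstrap marker time."""
    return any(t < bootstrap_done for t in changed_hook_times(log_lines))


bundle_log = [
    '2024-02-28 10:08:50 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)',
    '2024-02-28 10:09:18 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)',
]
settled_log = [
    '2024-02-28 04:54:07 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)',
    '2024-02-28 04:54:15 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)',
]

# Assumed marker times: "Feb 28 10:10" and "Feb 28 04:44" from the listings.
print(ran_before_bootstrap(bundle_log, datetime(2024, 2, 28, 10, 10)))
print(ran_before_bootstrap(settled_log, datetime(2024, 2, 28, 4, 44)))
```

On a live unit the marker time could instead be read from the modification time of the `done` file, but that path differs per unit, as the two listings show.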

summary: - Prometheus scraping not working at first
+ Prometheus module for scraping won't be enabled when metrics-endpoint-relation is added at the deployment time
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high since it is always reproducible with a bundle deployment.

Nobuto Murata (nobuto) wrote :

This is when the relation is removed and re-added by hand.

2024-03-01 08:08:14 DEBUG unit.ceph-mon/0.juju-log server.go:325 metrics-endpoint:30: module_disabled
2024-03-01 08:08:15 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-departed" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:16 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-broken" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:38 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-created" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:39 INFO juju.worker.uniter.operation runhook.go:186 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)
2024-03-01 08:08:41 DEBUG unit.ceph-mon/0.juju-log server.go:325 metrics-endpoint:31: module_enabled

-rw-r--r-- 1 root root 0 Feb 28 10:10 /var/lib/ceph/mon/ceph-ceph-mon-1/done

Peter Sabaini (peter-sabaini) wrote :

Thanks. It looks like we need to defer the relation-changed event if the cluster isn't bootstrapped.
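
The idea of deferring can be illustrated with a small stand-alone simulation. This is not the charm's code and does not import the ops framework: `FakeEvent`, `Cluster`, and `dispatch` are hypothetical stand-ins, with `FakeEvent.defer()` mimicking the role that `event.defer()` plays in ops (asking for the event to be delivered again on a later run).

```python
class FakeEvent:
    """Stand-in for a framework event; only tracks whether defer() was called."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


class Cluster:
    """Hypothetical holder for bootstrap state and enabled mgr modules."""

    def __init__(self):
        self.bootstrapped = False
        self.modules = set()


def on_relation_changed(event, cluster, is_leader=True):
    if not is_leader:
        return
    if not cluster.bootstrapped:
        # Ask for redelivery instead of silently dropping the event.
        event.defer()
        return
    cluster.modules.add("prometheus")


def dispatch(event, cluster):
    """Deliver one event; return True if it must be queued for redelivery."""
    event.deferred = False
    on_relation_changed(event, cluster)
    return event.deferred


cluster = Cluster()
event = FakeEvent()
pending = dispatch(event, cluster)  # before bootstrap: handler defers
cluster.bootstrapped = True
if pending:
    pending = dispatch(event, cluster)  # redelivered after bootstrap
print(pending, sorted(cluster.modules))
```

Without the `defer()` branch, the first delivery would return without doing anything and nothing would trigger the handler again, which is the behaviour described in the earlier comments.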

Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Peter Sabaini (peter-sabaini)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/910706
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/762ad83c19cb5b699a1bdbe8c28e2d8dbef10e2b
Submitter: "Zuul (22348)"
Branch: master

commit 762ad83c19cb5b699a1bdbe8c28e2d8dbef10e2b
Author: Peter Sabaini <email address hidden>
Date: Fri Mar 1 09:57:09 2024 +0100

    Fix: defer cos-prometheus for bootstrap

    If a COS prometheus changed event is processed but bootstrap hasn't
    completed yet, we need to retry the event at a later time.

    Closes-bug: #2042891

    Change-Id: I3d274c09522f9d7ef56bc66f68d8488150c125d8

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914556

Nobuto Murata (nobuto)
tags: added: field-ceph-dashboard
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/914556
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/83990c48051653f9d5c2d559a66f74047a414c7c
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 83990c48051653f9d5c2d559a66f74047a414c7c
Author: Peter Sabaini <email address hidden>
Date: Fri Mar 1 09:57:09 2024 +0100

    Fix: defer cos-prometheus for bootstrap

    If a COS prometheus changed event is processed but bootstrap hasn't
    completed yet, we need to retry the event at a later time.

    Closes-bug: #2042891
    (cherry-picked from commit 762ad83c19cb5b699a1bdbe8c28e2d8dbef10e2b)

    Change-Id: I5790e56f7879904504bee69924c2454c97c30474

Yoshi Kadokawa (yoshikadokawa) wrote :

I'm still seeing the same issue with the new relation between grafana-agent:cos-agent and ceph-mon:cos-agent.
For now, the workaround is to run the config-changed hook on the leader unit.

juju exec -u ceph-mon/leader "JUJU_DISPATCH_PATH=hooks/config-changed ./dispatch"

Nobuto Murata (nobuto) wrote :

The "new" issue has been moved out to a separate report:
https://bugs.launchpad.net/charm-ceph-mon/+bug/2074337

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926243

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926243
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/8f59007236bae09cfaf1df38c650c4144ca84ee7
Submitter: "Zuul (22348)"
Branch: master

commit 8f59007236bae09cfaf1df38c650c4144ca84ee7
Author: Nobuto Murata <email address hidden>
Date: Wed Aug 14 13:32:21 2024 +0900

    Defer cos-prometheus for bootstrap

    When the cluster is not yet bootstrapped, we need to defer the event of
    enabling the prometheus module. This is the same logic as
    I3d274c09522f9d7ef56bc66f68d8488150c125d8.

    Closes-bug: #2074337
    Related-bug: #2042891
    Change-Id: Id9fd3c8bad504bfe7610de856798114f2b8c0fd3

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (stable/squid-jammy)

Related fix proposed to branch: stable/squid-jammy
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926450

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (stable/reef)

Related fix proposed to branch: stable/reef
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926451

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (stable/quincy.2)

Related fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926452

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (stable/reef)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926451
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/8b9e5f8d844cf23fd00abb55fa1c22228f51814f
Submitter: "Zuul (22348)"
Branch: stable/reef

commit 8b9e5f8d844cf23fd00abb55fa1c22228f51814f
Author: Nobuto Murata <email address hidden>
Date: Wed Aug 14 13:32:21 2024 +0900

    Defer cos-prometheus for bootstrap

    When the cluster is not yet bootstrapped, we need to defer the event of
    enabling the prometheus module. This is the same logic as
    I3d274c09522f9d7ef56bc66f68d8488150c125d8.

    Closes-bug: #2074337
    Related-bug: #2042891
    Change-Id: Id9fd3c8bad504bfe7610de856798114f2b8c0fd3
    (cherry picked from commit 8f59007236bae09cfaf1df38c650c4144ca84ee7)

tags: added: in-stable-reef
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (stable/squid-jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926450
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/6c66dcc18dcd80f86d2296b62e0b7f2b61d43029
Submitter: "Zuul (22348)"
Branch: stable/squid-jammy

commit 6c66dcc18dcd80f86d2296b62e0b7f2b61d43029
Author: Nobuto Murata <email address hidden>
Date: Wed Aug 14 13:32:21 2024 +0900

    Defer cos-prometheus for bootstrap

    When the cluster is not yet bootstrapped, we need to defer the event of
    enabling the prometheus module. This is the same logic as
    I3d274c09522f9d7ef56bc66f68d8488150c125d8.

    Closes-bug: #2074337
    Related-bug: #2042891
    Change-Id: Id9fd3c8bad504bfe7610de856798114f2b8c0fd3
    (cherry picked from commit 8f59007236bae09cfaf1df38c650c4144ca84ee7)

tags: added: in-stable-squid-jammy
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/926452
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/ce568ed348bbccbd1cc8a012d187a9c2105cc972
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit ce568ed348bbccbd1cc8a012d187a9c2105cc972
Author: Nobuto Murata <email address hidden>
Date: Wed Aug 14 13:32:21 2024 +0900

    Defer cos-prometheus for bootstrap

    When the cluster is not yet bootstrapped, we need to defer the event of
    enabling the prometheus module. This is the same logic as
    I3d274c09522f9d7ef56bc66f68d8488150c125d8.

    Closes-bug: #2074337
    Related-bug: #2042891
    Change-Id: Id9fd3c8bad504bfe7610de856798114f2b8c0fd3
    (cherry picked from commit 8f59007236bae09cfaf1df38c650c4144ca84ee7)
