[SRU] Watcher crashes on creation of multiple audits and gets stuck in PENDING

Bug #2091947 reported by Bryan Fraschetti
This bug affects 1 person
Affects                Status          Importance   Assigned to          Milestone
Ubuntu Cloud Archive (status tracked in Epoxy)
  Antelope             New             Undecided    Unassigned
  Bobcat               New             Undecided    Unassigned
  Caracal              New             Undecided    Unassigned
  Dalmatian            Fix Released    Undecided    Unassigned
  Epoxy                Fix Released    Undecided    Unassigned
  Yoga                 New             Undecided    Bryan Fraschetti
  Zed                  New             Undecided    Unassigned
watcher (Ubuntu) (status tracked in Plucky)
  Focal                New             Undecided    Unassigned
  Jammy                New             Undecided    Bryan Fraschetti
  Noble                New             Undecided    Unassigned
  Oracular             Fix Released    Undecided    Unassigned
  Plucky               Fix Released    Undecided    Unassigned

Bug Description

A customer is facing an issue where the watcher-decision-engine service crashes when a second audit with the audit type set to CONTINUOUS is created while another continuous audit is already ongoing. The second audit then remains stuck in PENDING.

Steps to reproduce:
1. Deploy OpenStack Yoga on Jammy with Watcher, using Gnocchi as Watcher's metrics datasource

2. Create an audit
openstack optimize audit create --name workload_stabilization_test_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger

3. Check the audit state
openstack optimize audit list
Observe that the audit is listed with audit type CONTINUOUS and state ONGOING

4. Create a second audit
openstack optimize audit create --name workload_stabilization_test_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger

5. Check the audit state
openstack optimize audit list
Observe that the second audit (audit type CONTINUOUS) is stuck in state PENDING

6. Check the status of the watcher-decision-engine service and observe that it has crashed with the following traceback:
systemctl status watcher-decision-engine.service

Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self.run()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3.10/threading.py", line 953, in run
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: self._target(*self._args, **self._kwargs)
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/blocking.py", line 32, in _main_loop
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: wait_seconds = self._process_jobs()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/schedulers/base.py", line 1006, in _process_jobs
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: jobstore_next_run_time = jobstore.get_next_run_time()
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: File "/usr/lib/python3/dist-packages/apscheduler/jobstores/sqlalchemy.py", line 84, in get_next_run_time
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: return utc_timestamp_to_datetime(float(next_run_time))
Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: TypeError: float() argument must be a string or a real number, not 'NoneType'
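
To make the last frame of the traceback concrete, here is a minimal sketch of the failing pattern in plain Python. It assumes, for illustration only, a simplified hypothetical `jobs` table in an in-memory SQLite database (not APScheduler's or Watcher's real job-store schema) and a hypothetical helper `next_run_time_naive`. A "next run time" query that matches no rows yields `None`, and passing that to `float()` raises the same kind of `TypeError` that escapes the scheduler's main loop.

```python
import sqlite3


def next_run_time_naive(conn):
    """Hypothetical helper mirroring the failing line in the traceback.

    Fetch the earliest non-NULL next_run_time and convert it with float().
    If no row matches, the fetched value is None and float(None) raises.
    """
    row = conn.execute(
        "SELECT next_run_time FROM jobs "
        "WHERE next_run_time IS NOT NULL "
        "ORDER BY next_run_time LIMIT 1"
    ).fetchone()
    value = row[0] if row is not None else None  # emulate .scalar(): None when no rows
    return float(value)  # TypeError when value is None


# Simplified, hypothetical schema (not APScheduler's real job table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, next_run_time REAL)")

try:
    next_run_time_naive(conn)  # empty table: nothing is scheduled
except TypeError as exc:
    print(f"TypeError: {exc}")
```

In this sketch the failure depends only on the table contents: once a row with a non-NULL `next_run_time` exists, the same call returns a float.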

This was fixed upstream in 2024.2 (Dalmatian) by https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c, which corrects the type conversion, together with https://opendev.org/openstack/watcher/commit/fbb290b2238e9e72054892e9ae6108a8907f47d7, which adjusts the unit tests to accommodate the fix.
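
The authoritative fix is in the two upstream commits. As a minimal sketch of the general guard pattern only (this is not the upstream patch), the hypothetical helper `to_utc_datetime` treats a missing timestamp as "nothing scheduled" and returns `None` instead of passing `None` to `float()`.

```python
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional, Union

Timestamp = Union[int, float, Decimal, str, None]


def to_utc_datetime(timestamp: Timestamp) -> Optional[datetime]:
    """Convert a stored epoch timestamp into an aware UTC datetime.

    Hypothetical helper: returns None when there is no timestamp (for
    example, the job store has nothing scheduled) instead of letting
    float(None) raise a TypeError.
    """
    if timestamp is None:
        return None
    # float() accepts int, float, Decimal and numeric strings alike.
    return datetime.fromtimestamp(float(timestamp), tz=timezone.utc)


print(to_utc_datetime(None))  # None
print(to_utc_datetime(0))     # 1970-01-01 00:00:00+00:00
```

In a design like this, the scheduler loop would interpret a `None` result as "no job due" and keep waiting, rather than letting the exception terminate the loop's thread.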

Changed in watcher (Ubuntu Oracular):
status: New → Fix Released
Changed in watcher (Ubuntu Plucky):
status: New → Fix Released
no longer affects: watcher
summary: - Watcher crashes on creation of multiple audits and gets stuck in PENDING
+ [SRU] Watcher crashes on creation of multiple audits and gets stuck in PENDING
Changed in watcher (Ubuntu Jammy):
assignee: nobody → Bryan Fraschetti (bryanfraschetti)
Changed in watcher (Ubuntu Plucky):
assignee: Bryan Fraschetti (bryanfraschetti) → nobody
Bryan Fraschetti (bryanfraschetti) wrote :

Debdiff for Bobcat that applies the upstream fix and its accompanying unit-test changes, squashed into a single patch named omnibus-fixes.patch.

Bryan Fraschetti (bryanfraschetti) wrote :

Debdiff for Antelope that applies the upstream fix and its accompanying unit-test changes, squashed into a single patch named omnibus-fixes.patch.

Bryan Fraschetti (bryanfraschetti) wrote :

Debdiff for Zed that applies the upstream fix and its accompanying unit-test changes, squashed into a single patch named omnibus-fixes.patch.

Bryan Fraschetti (bryanfraschetti) wrote :

Debdiff for Yoga that applies the upstream fix and its accompanying unit-test changes, squashed into a single patch named omnibus-fixes.patch.
