HA deployments taking close to the CI timeout
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | High | Emilien Macchi | ocata-rc2
Bug Description
I've noticed some HA jobs timing out with no apparent errors. Looking at the graphite data, this is likely because our HA deployments are simply taking close to the 1 hour 20 minute timeout, so any slowdown in a job is going to cause it to hit the timeout.
Performance seems to have regressed pretty significantly over the past month or so. You can see this in the graph at https:/
Unfortunately, until a few days ago we didn't split things out by release, so the older metrics probably look somewhat better than they should because stable branch jobs are also included, and those are consistently faster to deploy than master (see below for details on this).
I do have some suspicion that part of the regression may be related to moving around the nova and keystone initialization in https:/
Note that we don't collect metrics on failed jobs, so any job that exceeds 4800 seconds for the deploy won't show up here, which means the average might still be skewed a little low.
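To illustrate that skew, here's a minimal sketch; the deploy times below are hypothetical, not taken from the real metrics, and it just shows how dropping jobs that exceed the 4800 second timeout pulls the reported average below the true one:

```python
# Illustrative only: the numbers below are made up, not real CI metrics.
TIMEOUT = 4800  # seconds; jobs longer than this time out and report nothing

deploy_times = [3900, 4100, 4400, 4700, 5200, 5600]  # hypothetical job sample

reported = [t for t in deploy_times if t <= TIMEOUT]  # what graphite actually sees
true_avg = sum(deploy_times) / len(deploy_times)
reported_avg = sum(reported) / len(reported)

print(f"true average:     {true_avg:.0f}s")      # 4650s
print(f"reported average: {reported_avg:.0f}s")  # 4275s -> skewed low
```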
Further supporting the possibility that the step changes are involved, here are the deployment times for the longest four steps as of a week ago:
2017-02-07 06:50:21.000 | ControllerDeplo
2017-02-07 06:50:21.000 | ControllerDeplo
2017-02-07 06:50:21.000 | ControllerDeplo
2017-02-07 06:50:21.000 | ControllerDeplo
And a recent job:
2017-02-13 21:10:27.000 | ControllerDeplo
2017-02-13 21:10:27.000 | ControllerDeplo
2017-02-13 21:10:27.000 | ControllerDeplo
2017-02-13 21:10:27.000 | ControllerDeplo
Steps 3 and 4 have nearly doubled in length. Step 2 is a little longer, which suggests the testenv for this job may have been somewhat slower too, but even if you subtract a corresponding amount of time from each of the other three steps you still end up much longer than before.
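For what it's worth, here's a rough sketch of that "subtract the testenv slowdown" reasoning. The per-step durations are hypothetical placeholders (the logged values above are truncated), and the growth in step 2 is treated as the generic testenv slowdown for the job:

```python
# Hypothetical per-step durations in seconds; not the (truncated) values above.
baseline = {"step2": 300, "step3": 700, "step4": 800}    # job from a week ago
recent   = {"step2": 360, "step3": 1350, "step4": 1500}  # recent job

# Treat the growth in step 2 as the generic testenv slowdown for this job.
env_slowdown = recent["step2"] - baseline["step2"]

for step in ("step3", "step4"):
    regression = (recent[step] - baseline[step]) - env_slowdown
    print(f"{step}: ~{regression}s slower even after discounting the testenv")
```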
So, while the immediate concern is CI jobs timing out, this is also a pretty significant performance regression in a release that was already running noticeably slower than newton (which, in turn, was noticeably slower than mitaka - sensing a trend here?). For reference, here's a graph including the newton- and mitaka-specific HA deploy times: https:/
I'm not sure if we have time to do anything in Ocata so late in the cycle, but it's something we definitely need to look into.
tags: added: alert
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
I think the oooq jobs are experiencing the same problem; half of them finish with a timeout. I'm tracking it at https://bugs.launchpad.net/tripleo/+bug/1663310 because it also has another problem: connecting to the undercloud fails in the postci function.