_wait_all_computers gives up too soon

Bug #1661378 reported by Francis Ginther
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Autopilot Log Analyser
Fix Committed
Undecided
Francis Ginther
Landscape Server
Fix Released
Medium
Chad Smith

Bug Description

This was seen in CI: https://ci.lscape.net/job/landscape-system-tests/5175
Landscape server: 17.01~bzr10865+jenkins3423-2

This is very similar to lp:1660000 in which a slowly progress rabbitmq-server charm prevents landscape-client from making forward progress. But in this case, rabbitmqctl was not impacted by the known bug and the charm was progressing as fast as possible, it just has a lot to do.

Charm hooks are executing in serial, so as long as rabbitmq-server is busy processing hooks, the landscape-client suboridinate hooks don't get a chance to run and it doesn't make it far enough to register the client until it's too late.

Here's what happened:

09:48:30 first retry of _wait_all_computers (job-handler.log)
09:49:28 start amqp-relation-changed hooks (unit-rabbitmq-server-0.log)
10:39:36 amqp-relation-changed hooks finish (unit-rabbitmq-server-0.log)
10:39:39 landscape-client/21's install hook starts running (unit-landscape-client-21.log)
10:40:00 another amqp-relation-changed hook runs (unit-rabbitmq-server-0.log)
10:48:02 Last retry of _wait_all_computers (job-handler.log)
10:49:02 amqp-relation-changed hook finishes (unit-rabbitmq-server-0.log)
10:49:05 landscape-client/21 runs container-relation-joined (unit-landscape-client-21.log)
10:49:06 landscape-client/21 registers computer (unit-landscape-client-21.log)

The last retry of _wait_all_computers occurred just over a minute before the client was actually registered.

Logs from the rabbitmq-server/0 unit and the job-handler are attached.

Revision history for this message
Francis Ginther (fginther) wrote :
information type: Proprietary → Public
Changed in autopilot-log-analyser:
assignee: nobody → Francis Ginther (fginther)
status: New → In Progress
Revision history for this message
Adam Collard (adam-collard) wrote :

It waits an hour, how long are you suggesting it waits for?

Revision history for this message
Francis Ginther (fginther) wrote :

We modified the sequence of events in https://bugs.launchpad.net/landscape/+bug/1654399, maybe we should have bumped up this retry time too.

Do we have an upper bound on the autopilot deployment time? Would doubling the _wait_all_computers retry window cause us to hit that?

There is now a second example of this:
https://ci.lscape.net/job/landscape-system-tests/5215/

In 5215, the "Wait for computers to register" activity is:
    {
        "children": [],
        "schedule_after_time": null,
        "region_id": 1,
        "creator": {
            "id": 1,
            "name": "John Doe",
            "email": "<email address hidden>"
        },
        "result_text": "1 machines not yet registered",
        "activity_status": "failed",
        "creation_time": "2017-02-07T19:53:50Z",
        "schedule_before_time": null,
        "summary": "Wait for computers to register",
        "completion_time": "2017-02-07T22:06:13Z",
        "modification_time": "2017-02-07T22:06:13Z",
        "parent_id": 1,
        "type": "WaitForRegisteredComputersRequest",
        "id": 250,
        "result_code": null
    },

[from landscape-0-inner-logs/rabbitmq-server-0/var/log/juju/unit-landscape-client-20.log]
2017-02-07 22:06:26 INFO juju-log container:93: In container-relation-joined for landscape-client/20
2017-02-07 22:06:26 INFO juju-log container:93: Updating client configuration for landscape-client/20
2017-02-07 22:06:26 INFO juju-log container:93: landscape-client/20: landscape-client is not configured to run as per sysvconfig
2017-02-07 22:06:27 INFO container-relation-joined Restarting landscape-client (via systemctl): landscape-client.service.
2017-02-07 22:06:27 INFO juju-log container:93: Registering landscape-client/20

The missing client registered just seconds after the activity failed. I have not studied successful tests to determine if we are generally running close to the retry cap or if these are just outliers.

Changed in autopilot-log-analyser:
status: In Progress → Fix Committed
Chad Smith (chad.smith)
Changed in landscape:
milestone: 17.01 → 17.02
David Britton (dpb)
Changed in landscape:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Chad Smith (chad.smith)
Changed in landscape:
status: In Progress → Fix Committed
Changed in landscape:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.