Comment 0 for bug 1455260

Curtis Hovey (sinzui) wrote :

This is a meta bug that describes a problem with many symptoms and many advisable workarounds.

Enterprises commonly automate the deployment of services. They can use juju-quickstart, juju-deployer, landscape autopilot, or their own bespoke script to bootstrap an environment and deploy services. It works repeatedly and reliably for many weeks or months, until a new micro version of juju is placed in the streams. The enterprise then sees failures in many ways. Commonly, the deployment fails because the script lost connection to its watcher, or juju failed to upgrade at the same moment that charms were installing and configuring services.

When a juju state-server is bootstrapped, one of its first actions is to query streams and start an upgrade to the highest micro version for its major.minor version. For example, if the juju client installed 1.22.1 and the current version in streams is 1.22.3, the state-server starts upgrading. This upgrade will complete in less than a minute. A savvy script bootstrapping an env would wait for the upgrade to complete, because upgrading a single state-server is faster and more reliable than upgrading services too.
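The selection rule and the "wait for the upgrade" step described above can be sketched in Python. This is only an illustration, not juju's implementation: it assumes plain three-part numeric versions such as 1.22.3, and `get_agent_version` is a hypothetical callable the deployment script would have to supply (it is not a juju API).

```python
import time


def parse_version(text):
    """Split a 'major.minor.micro' string into a tuple of ints."""
    major, minor, micro = (int(part) for part in text.split("."))
    return major, minor, micro


def implicit_upgrade_target(current, available):
    """Return the highest micro release in current's major.minor series.

    Returns None when nothing in `available` is newer than `current`.
    """
    cur = parse_version(current)
    candidates = [
        v for v in available
        if parse_version(v)[:2] == cur[:2] and parse_version(v) > cur
    ]
    return max(candidates, key=parse_version) if candidates else None


def wait_for_version(get_agent_version, target, timeout=300.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the state-server reports `target`; True on success.

    `get_agent_version` is a hypothetical callable supplied by the script;
    it is not a juju API.
    """
    deadline = clock() + timeout
    while True:
        if get_agent_version() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A script would call `implicit_upgrade_target` with the client version and the versions it sees in streams, then block on `wait_for_version` before deploying any services.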

Enterprises do not like this default behaviour, however, and their tools were not written to account for this "surprising" behaviour. There are several strategies employed to ensure the state-server is exactly the version that was tested previously:

A. Juju CI and a few others set "agent-version: 1.22.1" to ensure the state-server matches the version under test. But many parties do not like this method because environments.yaml must change each time they upgrade to a new juju client (and server).

B. Canonical IS and many customers use --upload-tools to force the state-server to be a known version. This, however, makes explicit upgrades VERY unpredictable because the juju client selects agents based on the localhost's arch, series, and $PATH (which might include development jujus). There are several bugs about failed upgrades in which --upload-tools was a factor.

C. The company never uses current juju. They choose a version that they expect not to get updates, like 1.22.x when 1.23.x is current, except that we have delivered updates to it, to their surprise.

Juju chooses to implicitly upgrade because it is a way to deliver compatibility fixes. Azure, AWS, and HP have changed their clouds, and we delivered a new micro version of juju a few days later to ensure juju "Just worked".

Several changes are needed to ensure enterprises have a reliable and repeatable experience:

1. All API clients must reconnect watchers when they are disconnected. They will be disconnected because of network issues as well as explicit disconnects during upgrade-juju. The clients need to resume their work.

2.A If juju must upgrade, it needs to prevent clients from starting work until the upgrade is complete. This might mean bootstrap doesn't let go until the state-server is upgraded.

2.B Or juju stops implicit upgrades. No enterprise uses this feature; they work hard to disable it. Juju could instead inform the party that an upgrade is available (as is done for charms).
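
For change 1, a client-side reconnect loop might look roughly like the Python sketch below. The names are assumptions for illustration: `connect` is a hypothetical callable that opens a watcher and returns an iterable of events, raising ConnectionError when the link drops, and `handle` is whatever the client does with each event. Neither is a juju API, and a real client would probably also need to resynchronise its view of the environment after reconnecting.

```python
import time


def watch_with_reconnect(connect, handle, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Consume watcher events, reconnecting when the connection drops.

    `connect` is a hypothetical callable returning an iterable of events and
    raising ConnectionError on disconnect; `handle` processes one event.
    Neither is a juju API. Returns when the event stream ends normally.
    """
    failures = 0
    while True:
        try:
            for event in connect():
                failures = 0  # a delivered event proves the link is healthy
                handle(event)
            return
        except ConnectionError:
            failures += 1
            if failures >= max_attempts:
                raise
            # Exponential backoff: 1s, 2s, 4s, ... between attempts.
            sleep(base_delay * 2 ** (failures - 1))
```

The failure counter resets after every delivered event, so a short disconnect during upgrade-juju costs one brief pause, while a state-server that never comes back still surfaces as an error after `max_attempts` consecutive failures.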