juju occasionally switches a unit's public-address if an additional interface is added post-deployment

Bug #1435283 reported by Liam Young
This bug affects 5 people
Affects         Status        Importance  Assigned to    Milestone
juju-core       Fix Released  High        Michael Foord  -
juju-core 1.24  Fix Released  High        Michael Foord  -
juju-core 1.25  Fix Released  High        Michael Foord  -

Bug Description

If an additional port is added to a guest, the juju public-address of that unit will occasionally change, which breaks some of the OpenStack team's tests (see below). It would be good if the public-address didn't flip like this.

In the example below, additional NICs were added to neutron-gateway/0 and neutron-gateway/1. The IP of neutron-gateway/0 flipped but neutron-gateway/1's did not.

$ juju status neutron-gateway
environment: lytrusty
machines:
  "11":
    agent-state: started
    agent-version: 1.23-beta1.1
    dns-name: 10.5.21.19
    instance-id: f9b14208-53fd-4a04-8fe1-ccabf4c8a32d
    instance-state: ACTIVE
    series: trusty
    hardware: arch=amd64 cpu-cores=1 mem=1536M root-disk=10240M availability-zone=nova
  "12":
    agent-state: started
    agent-version: 1.23-beta1.1
    dns-name: 10.5.21.10
    instance-id: 67b917f1-95fd-4f2a-82fa-daf7f1e75437
    instance-state: ACTIVE
    series: trusty
    hardware: arch=amd64 cpu-cores=1 mem=1536M root-disk=10240M availability-zone=nova
services:
  neutron-gateway:
    charm: local:trusty/quantum-gateway-64
    exposed: false
    relations:
      amqp:
      - rabbitmq-server
      cluster:
      - neutron-gateway
      neutron-plugin-api:
      - neutron-api
      quantum-network-service:
      - nova-cloud-controller
      shared-db:
      - mysql
    units:
      neutron-gateway/0:
        agent-state: started
        agent-version: 1.23-beta1.1
        machine: "11"
        public-address: 10.5.21.19
      neutron-gateway/1:
        agent-state: started
        agent-version: 1.23-beta1.1
        machine: "12"
        public-address: 10.5.21.10

$ nova list | grep -E '\-(11|12)'
| f9b14208-53fd-4a04-8fe1-ccabf4c8a32d | juju-lytrusty-machine-11 | ACTIVE | - | Running | gnuoy_admin_net=10.5.21.9, 10.5.21.19 |
| 67b917f1-95fd-4f2a-82fa-daf7f1e75437 | juju-lytrusty-machine-12 | ACTIVE | - | Running | gnuoy_admin_net=10.5.21.10, 10.5.21.20 |

neutron-gateway/0 is still up and running, but since juju has switched to an IP on which the host's services are not listening, juju commands fail:

$ juju ssh neutron-gateway/0 "uname -n"
ERROR subprocess encountered error code 1
ssh_exchange_identification: Connection closed by remote host
ERROR subprocess encountered error code 255

$ juju ssh neutron-gateway/1 "uname -n"
juju-lytrusty-machine-12
Connection to 10.5.21.10 closed.

$ ssh 10.5.21.9 "uname -n"
juju-lytrusty-machine-11

Why does this matter? The OpenStack team's CI tests sometimes break because the neutron-gateway guest becomes inaccessible via juju {run,ssh}. The reason is that during post-deployment network setup an additional NIC (eth1) is added to the guest. The additional NIC is on the same network as eth0 but acts as an external port and cannot be contacted directly for guest access.

Revision history for this message
Liam Young (gnuoy) wrote :

I believe I've seen this on multiple versions of juju, but the one the debug output above was taken from was 1.23-beta1-trusty-amd64. The environment type was openstack.

I'll attach logs from the bootstrap node and from neutron-gateway/0

Martin Packman (gz)
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
tags: added: network
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.24-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Michael, once you're done with the forward-port of the feature flag stuff, please have a look at this one.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I had a quick chat with Liam on this one. So far it appears the cause might be an ordering issue: we sort addresses in lexicographical order when we see new ones, before updating them in the state DB.

It would be useful to run $ juju set-env logging-config '<root>=TRACE' on the environment and post the unit (and its host machine) logs for the affected unit once it happens. At TRACE level we log in detail which address we pick for private/public when we have a list of possible addresses.
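
[Editor's note: to illustrate the suspected cause, here is a minimal Go sketch (not juju's actual code) showing how sorting IP addresses as plain strings reorders the addresses from the report above - strings compare character by character, so '1' < '9' puts the new address first on machine 11 but not on machine 12.]

package main

import (
	"fmt"
	"sort"
)

func main() {
	// neutron-gateway/0 (machine 11): the original address 10.5.21.9
	// loses first place to the new external-port address 10.5.21.19,
	// because "10.5.21.19" < "10.5.21.9" lexicographically.
	flipped := []string{"10.5.21.9", "10.5.21.19"}
	sort.Strings(flipped)
	fmt.Println(flipped) // [10.5.21.19 10.5.21.9]

	// neutron-gateway/1 (machine 12): 10.5.21.10 still sorts before
	// 10.5.21.20, so its public-address happens not to change.
	stable := []string{"10.5.21.10", "10.5.21.20"}
	sort.Strings(stable)
	fmt.Println(stable) // [10.5.21.10 10.5.21.20]
}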

Revision history for this message
James Page (james-page) wrote :

OK - so I reproduced this on 1.23.2 - it happens in a very specific set of circumstances - four units have a second port allocated:

| 2f0e5d14-c830-4e29-bca9-d661b70a3b50 | juju-devel3-machine-12 | ACTIVE | - | Running | james-page_admin_net=10.5.19.99, 10.5.19.110 |
| 4d1a00f3-f6bd-4ed2-8db9-a01358d81beb | juju-devel3-machine-14 | ACTIVE | - | Running | james-page_admin_net=10.5.19.101, 10.5.19.112 |
| d99e5fb0-b35c-4ef8-8d16-8d424a23cc6b | juju-devel3-machine-15 | ACTIVE | - | Running | james-page_admin_net=10.5.19.102, 10.5.19.111 |
| 03b1db44-ef2d-42c1-9c5e-c58df08e5ecf | juju-devel3-machine-16 | ACTIVE | - | Running | james-page_admin_net=10.5.19.103, 10.5.19.113 |

Only juju-devel3-machine-12 observed a change in unit private-address from juju - this was the only one that rolled over the 99->100 barrier: lexicographically "10.5.19.110" sorts before "10.5.19.99" (because '1' < '9' at the first differing character), whereas the other units' original 10.5.19.10x addresses still sort first.

Revision history for this message
Michael Foord (mfoord) wrote :

Would it be sufficient to change address setting to leave the *first* address in place and only sort subsequent addresses?

Revision history for this message
Michael Foord (mfoord) wrote :

(So long as the first address is still in the new list of addresses to be set of course.)
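
[Editor's note: a quick sketch of what that suggestion could mean in practice - purely illustrative, not juju's code; sortKeepingFirst is a hypothetical helper.]

package main

import (
	"fmt"
	"sort"
)

// sortKeepingFirst sorts addrs but pins the current first address in
// place, provided it still appears in the new list.
func sortKeepingFirst(current string, addrs []string) []string {
	rest := make([]string, 0, len(addrs))
	found := false
	for _, a := range addrs {
		if a == current && !found {
			found = true
			continue
		}
		rest = append(rest, a)
	}
	sort.Strings(rest)
	if !found {
		return rest
	}
	return append([]string{current}, rest...)
}

func main() {
	// The pre-existing address stays first even though "10.5.21.19"
	// would sort ahead of it lexicographically.
	fmt.Println(sortKeepingFirst("10.5.21.9", []string{"10.5.21.19", "10.5.21.9"}))
	// [10.5.21.9 10.5.21.19]
}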

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Preserving the order so the first address stays on top is one option, but the real problem is that we're not acting consistently. After a charm runs $ unit-get private-address (or public-address), the address we return should be the same every time (assuming it's still there - e.g. if it was on a NIC which is now down, we should pick another valid one, I guess). So it might be a good idea to add "the address we picked initially for private/public" as metadata on the address in state. It has to be backwards-compatible though, in both how addresses stored in mongo are interpreted and how they are passed over the API. I did suggest adding a "global-key-like" tag to the address.Value field (e.g. "1.2.3.4#default", where "#default" is a "tag" of sorts saying "this address is the one to use for its respective scope" w.r.t. which one is considered public, local-cloud, etc.).
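
[Editor's note: a minimal Go sketch of what that tagging scheme could look like - purely illustrative, since this was only a suggestion and not necessarily what was implemented; the tagDefault/splitTag helpers are hypothetical.]

package main

import (
	"fmt"
	"strings"
)

// tagDefault marks an address value as the chosen one for its scope,
// piggybacking on the string value for backwards compatibility.
func tagDefault(value string) string {
	return value + "#default"
}

// splitTag recovers the plain address and whether it carried the tag.
// Untagged values (old records) pass through unchanged.
func splitTag(value string) (addr string, isDefault bool) {
	addr = strings.TrimSuffix(value, "#default")
	return addr, addr != value
}

func main() {
	stored := tagDefault("1.2.3.4")
	addr, isDefault := splitTag(stored)
	fmt.Println(stored, addr, isDefault) // 1.2.3.4#default 1.2.3.4 true
}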

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

We won't manage to fix this for the scheduled 1.24 release on May 25; it will be in a follow-up point release or in 1.25. I'm dropping the 1.24 milestone from it for that reason.

no longer affects: juju-core/1.24
Revision history for this message
Edward Hope-Morley (hopem) wrote :

@dimitern If you want to rely solely on what information the API can provide, I think a good approach would be as follows (see the sketch after this list):

1. deploy a service; juju creates the instance with 1 interface attached
2. juju gets the address allocated to that interface by Nova and uses it as the unit address
3. juju gets the port-id of that interface and remembers it
4. if the address of the interface remembered in (3) changes, the unit address changes accordingly

This should give us the behaviour we want and be sufficiently deterministic and persistent, assuming the primary interface (port-id) never changes.
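
[Editor's note: a rough Go sketch of that approach - the Port struct, UnitAddress type, and method names are hypothetical stand-ins, not juju's actual provider API.]

package main

import "fmt"

// Port is a simplified view of a Neutron port: a NIC attached to an
// instance, with a stable ID and a (possibly changing) address.
type Port struct {
	ID      string
	Address string
}

// UnitAddress remembers the port ID chosen at deploy time (step 3) and
// always resolves the address through it (step 4).
type UnitAddress struct {
	primaryPortID string
}

// Pick records the first attached port as the primary one (steps 1-2).
func (u *UnitAddress) Pick(ports []Port) {
	if u.primaryPortID == "" && len(ports) > 0 {
		u.primaryPortID = ports[0].ID
	}
}

// Resolve returns the current address of the remembered port, so adding
// extra ports later can never change the unit address.
func (u *UnitAddress) Resolve(ports []Port) (string, bool) {
	for _, p := range ports {
		if p.ID == u.primaryPortID {
			return p.Address, true
		}
	}
	return "", false
}

func main() {
	u := &UnitAddress{}
	u.Pick([]Port{{ID: "port-1", Address: "10.5.21.9"}})

	// Later an external port is attached; the unit address is unaffected.
	addr, _ := u.Resolve([]Port{
		{ID: "port-1", Address: "10.5.21.9"},
		{ID: "port-2", Address: "10.5.21.19"},
	})
	fmt.Println(addr) // 10.5.21.9
}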

tags: added: addressability openstack-provider
Revision history for this message
Darryl Weaver (dweaver) wrote :

This also applies to a MAAS environment; for example, deploying a multi-network OpenStack bundle exhibits the same inconsistency with addresses, and the private address can change to another network that is plugged in.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

This might be a dup of bug #1463480

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

We aim to address this issue (most likely in the way suggested in comment #11) as soon as the feature freeze for 1.25.0 kicks in (on or around August 20).

Curtis Hovey (sinzui)
tags: added: bug-squad
Michael Foord (mfoord)
Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
Revision history for this message
Michael Foord (mfoord) wrote :

The current way we pick public/private addresses for a unit looks for the "best match" for the requested scope (public/private) and type (IPv4/IPv6), allowing fallbacks if an exact match isn't available.

So we can't just pick one address and always return it: an exact match might not be available the *first time* we're asked, but one may become available later.

My suggestion is to switch to something like the following (sketched below):

1. The first time we're asked for an address, use the current algorithm to find the best match on scope and type. Whatever is found, store it as the "default address" (we will store a default public and a default private address).
2. On subsequent requests, check whether the stored default is still available and an exact match for the requested scope/type.
3. If it is still available and an exact match, just return it.
4. If it is no longer available, remove the default and start again (we'll address keeping the same NIC for changed addresses at another point, as that's more complex).
5. If it is still available but wasn't an exact match, and an exact match is now available, replace the current default with the exact match and return that. Subsequent requests will then always see the new default.

How does this sound Ed?
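
[Editor's note: a minimal Go sketch of that selection scheme - the Match rating, picker type, and matchFn callback are simplified stand-ins for illustration, not juju's actual types.]

package main

import "fmt"

// Match describes how well a candidate address fits the requested
// scope (public/private) and type (IPv4/IPv6).
type Match int

const (
	NoMatch Match = iota
	Fallback
	Exact
)

// matchFn rates a candidate address for one scope/type request.
type matchFn func(addr string) Match

// picker remembers the chosen default address, per the scheme above.
type picker struct {
	def string
}

func (p *picker) pick(addrs []string, match matchFn) string {
	var exact, fallback, still string
	for _, a := range addrs {
		switch match(a) {
		case Exact:
			if exact == "" {
				exact = a
			}
		case Fallback:
			if fallback == "" {
				fallback = a
			}
		}
		if a == p.def {
			still = a
		}
	}
	switch {
	case still != "" && match(still) == Exact:
		// Stored default is still present and an exact match: keep it.
	case exact != "":
		// An exact match is (now) available: it becomes the new default.
		p.def = exact
	case still != "":
		// Default still present, and no exact match exists yet: keep it.
	default:
		// Default gone: start over with the best fallback available.
		p.def = fallback
	}
	return p.def
}

func main() {
	// Hypothetical rating: only 203.0.113.5 is an exact public match.
	isPublic := func(a string) Match {
		if a == "203.0.113.5" {
			return Exact
		}
		return Fallback
	}
	p := &picker{}
	fmt.Println(p.pick([]string{"10.0.0.7"}, isPublic))                // 10.0.0.7 (fallback default)
	fmt.Println(p.pick([]string{"10.0.0.7", "203.0.113.5"}, isPublic)) // 203.0.113.5 (exact replaces it)
	fmt.Println(p.pick([]string{"10.0.0.7", "203.0.113.5"}, isPublic)) // 203.0.113.5 (stable thereafter)
}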

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25-alpha1 → 1.25-beta1
Revision history for this message
Liam Young (gnuoy) wrote :

Michael, that sounds like it would work perfectly for me, thanks.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Fix for 1.25 is proposed and should be landing early next week: http://reviews.vapour.ws/r/2593/

Michael Foord (mfoord)
Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
milestone: 1.25-beta1 → 1.25-beta2
Revision history for this message
Michael Foord (mfoord) wrote :

A fix for this is committed to 1.24. Forward ports to 1.25 and master are "in progress".

Michael Foord (mfoord)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Michael Foord (mfoord) wrote :

On 1.25 and master as well now.

Martin Packman (gz)
Changed in juju-core:
milestone: 1.25-beta2 → 1.26-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: sts