juju occasionally switches a unit's public-address if an additional interface is added post-deployment

Bug #1435283 reported by Liam Young
This bug affects 5 people
Affects         Status        Importance  Assigned to    Milestone
juju-core       Fix Released  High        Michael Foord  -
juju-core 1.24  Fix Released  High        Michael Foord  -
juju-core 1.25  Fix Released  High        Michael Foord  -

Bug Description

If an additional port is added to a guest, the juju public-address of that unit will occasionally change, which breaks some of the OpenStack team's tests (see below). It would be good if the public-address didn't flip like this.

In the example below, additional NICs were added to neutron-gateway/0 and neutron-gateway/1. The IP of neutron-gateway/0 flipped but neutron-gateway/1's did not.

$ juju status neutron-gateway
environment: lytrusty
machines:
  "11":
    agent-state: started
    agent-version: 1.23-beta1.1
    dns-name: 10.5.21.19
    instance-id: f9b14208-53fd-4a04-8fe1-ccabf4c8a32d
    instance-state: ACTIVE
    series: trusty
    hardware: arch=amd64 cpu-cores=1 mem=1536M root-disk=10240M availability-zone=nova
  "12":
    agent-state: started
    agent-version: 1.23-beta1.1
    dns-name: 10.5.21.10
    instance-id: 67b917f1-95fd-4f2a-82fa-daf7f1e75437
    instance-state: ACTIVE
    series: trusty
    hardware: arch=amd64 cpu-cores=1 mem=1536M root-disk=10240M availability-zone=nova
services:
  neutron-gateway:
    charm: local:trusty/quantum-gateway-64
    exposed: false
    relations:
      amqp:
      - rabbitmq-server
      cluster:
      - neutron-gateway
      neutron-plugin-api:
      - neutron-api
      quantum-network-service:
      - nova-cloud-controller
      shared-db:
      - mysql
    units:
      neutron-gateway/0:
        agent-state: started
        agent-version: 1.23-beta1.1
        machine: "11"
        public-address: 10.5.21.19
      neutron-gateway/1:
        agent-state: started
        agent-version: 1.23-beta1.1
        machine: "12"
        public-address: 10.5.21.10

$ nova list | grep -E '\-(11|12)'
| f9b14208-53fd-4a04-8fe1-ccabf4c8a32d | juju-lytrusty-machine-11 | ACTIVE | - | Running | gnuoy_admin_net=10.5.21.9, 10.5.21.19 |
| 67b917f1-95fd-4f2a-82fa-daf7f1e75437 | juju-lytrusty-machine-12 | ACTIVE | - | Running | gnuoy_admin_net=10.5.21.10, 10.5.21.20 |

neutron-gateway/0 is still up and running, but since juju has switched to an IP on which the host's services are not listening, juju commands fail:

$ juju ssh neutron-gateway/0 "uname -n"
ERROR subprocess encountered error code 1
ssh_exchange_identification: Connection closed by remote host
ERROR subprocess encountered error code 255

$ juju ssh neutron-gateway/1 "uname -n"
juju-lytrusty-machine-12
Connection to 10.5.21.10 closed.

$ ssh 10.5.21.9 "uname -n"
juju-lytrusty-machine-11

Why does this matter? The OpenStack team's CI tests sometimes break because the neutron-gateway guest becomes inaccessible via juju {run,ssh}. The reason is that during post-deployment network setup an additional NIC (eth1) is added to the guest. The additional NIC is on the same network as eth0 but acts as an external port and cannot be contacted directly for guest access.

Revision history for this message
Liam Young (gnuoy) wrote :

I believe I've seen this on multiple versions of juju, but the one the debug output above was taken from was 1.23-beta1-trusty-amd64. The environment type was openstack.

I'll attach logs from the bootstrap node and from neutron-gateway/0

Martin Packman (gz)
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
tags: added: network
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.24-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Michael, once you're done with the forward-port of the feature flag stuff, please have a look at this one.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I had a quick chat with Liam on this one. So far it appears the cause might be an ordering issue: we sort addresses in lexicographical order when we see new ones, before updating them in the state DB.

It would be useful to run $ juju set-env logging-config '<root>=TRACE' on the environment and post the unit (and its host machine) logs for the affected unit once it happens. At TRACE level we log in detail which address we pick for private/public when we have a list of possible addresses.
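
[Editor's note: to illustrate the suspected cause, here is a minimal Go sketch (not juju's actual code) showing how sorting IP addresses as plain strings reorders the addresses from the report above - strings compare character by character, so '1' < '9' puts the new address first on machine 11 but not on machine 12.]

package main

import (
	"fmt"
	"sort"
)

func main() {
	// neutron-gateway/0 (machine 11): the original address 10.5.21.9
	// loses first place to the new external-port address 10.5.21.19,
	// because "10.5.21.19" < "10.5.21.9" lexicographically.
	flipped := []string{"10.5.21.9", "10.5.21.19"}
	sort.Strings(flipped)
	fmt.Println(flipped) // [10.5.21.19 10.5.21.9]

	// neutron-gateway/1 (machine 12): 10.5.21.10 still sorts before
	// 10.5.21.20, so its public-address happens not to change.
	stable := []string{"10.5.21.10", "10.5.21.20"}
	sort.Strings(stable)
	fmt.Println(stable) // [10.5.21.10 10.5.21.20]
}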

Revision history for this message
James Page (james-page) wrote :

OK - so I reproduced this on 1.23.2 - it happens in a very specific set of circumstances - four units have a second port allocated:

| 2f0e5d14-c830-4e29-bca9-d661b70a3b50 | juju-devel3-machine-12 | ACTIVE | - | Running | james-page_admin_net=10.5.19.99, 10.5.19.110 |
| 4d1a00f3-f6bd-4ed2-8db9-a01358d81beb | juju-devel3-machine-14 | ACTIVE | - | Running | james-page_admin_net=10.5.19.101, 10.5.19.112 |
| d99e5fb0-b35c-4ef8-8d16-8d424a23cc6b | juju-devel3-machine-15 | ACTIVE | - | Running | james-page_admin_net=10.5.19.102, 10.5.19.111 |
| 03b1db44-ef2d-42c1-9c5e-c58df08e5ecf | juju-devel3-machine-16 | ACTIVE | - | Running | james-page_admin_net=10.5.19.103, 10.5.19.113 |

Only juju-devel3-machine-12 observed a change in unit private-address from juju - this was the only one that rolled over the 99->100 barrier: lexicographically "10.5.19.110" sorts before "10.5.19.99" (because '1' < '9' at the first differing character), whereas the other units' original 10.5.19.10x addresses still sort first.

Revision history for this message
Michael Foord (mfoord) wrote :

Would it be sufficient to change address setting to leave the *first* address in place and only sort subsequent addresses?

Revision history for this message
Michael Foord (mfoord) wrote :

(So long as the first address is still in the new list of addresses to be set of course.)
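
[Editor's note: a quick sketch of what that suggestion could mean in practice - purely illustrative, not juju's code; sortKeepingFirst is a hypothetical helper.]

package main

import (
	"fmt"
	"sort"
)

// sortKeepingFirst sorts addrs but pins the current first address in
// place, provided it still appears in the new list.
func sortKeepingFirst(current string, addrs []string) []string {
	rest := make([]string, 0, len(addrs))
	found := false
	for _, a := range addrs {
		if a == current && !found {
			found = true
			continue
		}
		rest = append(rest, a)
	}
	sort.Strings(rest)
	if !found {
		return rest
	}
	return append([]string{current}, rest...)
}

func main() {
	// The pre-existing address stays first even though "10.5.21.19"
	// would sort ahead of it lexicographically.
	fmt.Println(sortKeepingFirst("10.5.21.9", []string{"10.5.21.19", "10.5.21.9"}))
	// [10.5.21.9 10.5.21.19]
}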

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Preserving the order so the first address stays on top is one option, but the real problem is that we're not acting consistently. After a charm runs $ unit-get private-address (or public-address), the address we return should be the same every time (assuming it's still there - e.g. if it was on a NIC which is now down, we should pick another valid one, I guess). So it might be a good idea to add "the address we picked initially for private/public" as metadata on the address in state. It has to be backwards-compatible though, in both how addresses stored in mongo are interpreted and how they are passed over the API. I did suggest adding a "global-key-like" tag to the address.Value field (e.g. "1.2.3.4#default", where "#default" is a "tag" of sorts saying "this address is the one to use for its respective scope" w.r.t. which one is considered public, local-cloud, etc.).
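
[Editor's note: a minimal Go sketch of what that tagging scheme could look like - purely illustrative, since this was only a suggestion and not necessarily what was implemented; the tagDefault/splitTag helpers are hypothetical.]

package main

import (
	"fmt"
	"strings"
)

// tagDefault marks an address value as the chosen one for its scope,
// piggybacking on the string value for backwards compatibility.
func tagDefault(value string) string {
	return value + "#default"
}

// splitTag recovers the plain address and whether it carried the tag.
// Untagged values (old records) pass through unchanged.
func splitTag(value string) (addr string, isDefault bool) {
	addr = strings.TrimSuffix(value, "#default")
	return addr, addr != value
}

func main() {
	stored := tagDefault("1.2.3.4")
	addr, isDefault := splitTag(stored)
	fmt.Println(stored, addr, isDefault) // 1.2.3.4#default 1.2.3.4 true
}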

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

We won't manage to fix this for the scheduled 1.24 release on May 25; it will be in a follow-up point release or in 1.25. I'm dropping the 1.24 milestone from it for that reason.

no longer affects: juju-core/1.24
Revision history for this message
Edward Hope-Morley (hopem) wrote :

@dimitern If you want to rely solely on what information the API can provide, I think a good approach would be as follows (see the sketch after this list):

1. deploy a service; juju creates the instance with 1 interface attached
2. juju gets the address allocated to that interface by Nova and uses it as the unit address
3. juju gets the port-id of that interface and remembers it
4. if the address of the interface remembered in (3) changes, the unit address changes accordingly

This should give us the behaviour we want and be sufficiently deterministic and persistent, assuming the primary interface (port-id) never changes.
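
[Editor's note: a rough Go sketch of that approach - the Port struct, UnitAddress type, and method names are hypothetical stand-ins, not juju's actual provider API.]

package main

import "fmt"

// Port is a simplified view of a Neutron port: a NIC attached to an
// instance, with a stable ID and a (possibly changing) address.
type Port struct {
	ID      string
	Address string
}

// UnitAddress remembers the port ID chosen at deploy time (step 3) and
// always resolves the address through it (step 4).
type UnitAddress struct {
	primaryPortID string
}

// Pick records the first attached port as the primary one (steps 1-2).
func (u *UnitAddress) Pick(ports []Port) {
	if u.primaryPortID == "" && len(ports) > 0 {
		u.primaryPortID = ports[0].ID
	}
}

// Resolve returns the current address of the remembered port, so adding
// extra ports later can never change the unit address.
func (u *UnitAddress) Resolve(ports []Port) (string, bool) {
	for _, p := range ports {
		if p.ID == u.primaryPortID {
			return p.Address, true
		}
	}
	return "", false
}

func main() {
	u := &UnitAddress{}
	u.Pick([]Port{{ID: "port-1", Address: "10.5.21.9"}})

	// Later an external port is attached; the unit address is unaffected.
	addr, _ := u.Resolve([]Port{
		{ID: "port-1", Address: "10.5.21.9"},
		{ID: "port-2", Address: "10.5.21.19"},
	})
	fmt.Println(addr) // 10.5.21.9
}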

tags: added: addressability openstack-provider
Revision history for this message
Darryl Weaver (dweaver) wrote :

This also applies to a MAAS environment; for example, deploying a multi-network OpenStack bundle exhibits the same inconsistency with addresses, and the private address can change to another network that is plugged in.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

This might be a dup of bug #1463480

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

We aim to address this issue (most likely in the way suggested in comment #11) as soon as the feature freeze for 1.25.0 kicks in (on or around August 20).

Curtis Hovey (sinzui)
tags: added: bug-squad
Michael Foord (mfoord)
Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
Revision history for this message
Michael Foord (mfoord) wrote :

The current way we pick public/private addresses for a unit looks for the "best match" for the requested scope (public/private) and type (IPv4/IPv6), allowing fallbacks if an exact match isn't available.

So we can't just pick one address and always return it: an exact match might not be available the *first time* we're asked, but one may become available later.

My suggestion is to switch to something like the following (sketched below):

1. The first time we're asked for an address, use the current algorithm to find the best match on scope and type. Whatever is found, store it as the "default address" (we will store a default public and a default private address).
2. On subsequent requests, check whether the stored default is still available and an exact match for the requested scope/type.
3. If it is still available and an exact match, just return it.
4. If it is no longer available, remove the default and start again (we'll address keeping the same NIC for changed addresses at another point, as that's more complex).
5. If it is still available but wasn't an exact match, and an exact match is now available, replace the current default with the exact match and return that. Subsequent requests will then always see the new default.

How does this sound Ed?
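
[Editor's note: a minimal Go sketch of that selection scheme - the Match rating, picker type, and matchFn callback are simplified stand-ins for illustration, not juju's actual types.]

package main

import "fmt"

// Match describes how well a candidate address fits the requested
// scope (public/private) and type (IPv4/IPv6).
type Match int

const (
	NoMatch Match = iota
	Fallback
	Exact
)

// matchFn rates a candidate address for one scope/type request.
type matchFn func(addr string) Match

// picker remembers the chosen default address, per the scheme above.
type picker struct {
	def string
}

func (p *picker) pick(addrs []string, match matchFn) string {
	var exact, fallback, still string
	for _, a := range addrs {
		switch match(a) {
		case Exact:
			if exact == "" {
				exact = a
			}
		case Fallback:
			if fallback == "" {
				fallback = a
			}
		}
		if a == p.def {
			still = a
		}
	}
	switch {
	case still != "" && match(still) == Exact:
		// Stored default is still present and an exact match: keep it.
	case exact != "":
		// An exact match is (now) available: it becomes the new default.
		p.def = exact
	case still != "":
		// Default still present, and no exact match exists yet: keep it.
	default:
		// Default gone: start over with the best fallback available.
		p.def = fallback
	}
	return p.def
}

func main() {
	// Hypothetical rating: only 203.0.113.5 is an exact public match.
	isPublic := func(a string) Match {
		if a == "203.0.113.5" {
			return Exact
		}
		return Fallback
	}
	p := &picker{}
	fmt.Println(p.pick([]string{"10.0.0.7"}, isPublic))                // 10.0.0.7 (fallback default)
	fmt.Println(p.pick([]string{"10.0.0.7", "203.0.113.5"}, isPublic)) // 203.0.113.5 (exact replaces it)
	fmt.Println(p.pick([]string{"10.0.0.7", "203.0.113.5"}, isPublic)) // 203.0.113.5 (stable thereafter)
}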

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25-alpha1 → 1.25-beta1
Revision history for this message
Liam Young (gnuoy) wrote :

Michael, that sounds like it would work perfectly for me, thanks.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Fix for 1.25 is proposed and should be landing early next week: http://reviews.vapour.ws/r/2593/

Michael Foord (mfoord)
Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
milestone: 1.25-beta1 → 1.25-beta2
Revision history for this message
Michael Foord (mfoord) wrote :

A fix for this is committed to 1.24. Forward ports to 1.25 and master are "in progress".

Michael Foord (mfoord)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Michael Foord (mfoord) wrote :

On 1.25 and master as well now.

Martin Packman (gz)
Changed in juju-core:
milestone: 1.25-beta2 → 1.26-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: sts