Allowing duplicate secgroups via neutron breaks 2.0.3 models with 2.1 controllers

Bug #1671265 reported by Colin Watson
This bug affects 10 people
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  Critical    Ian Booth
  2.1           Fix Released  Critical    Heather Lanigan

Bug Description

Last night, the Juju 2 controllers in PS4.5 were upgraded to Juju 2.1 (I think 2.1.1). Some of the models are still on 2.0.3. I don't know for sure that this bug only affects the older models, but that's all the data I have so far.

My first attempted deployment after the controller upgrade failed, with "juju status --format=yaml" reporting (elided the silly byte-by-byte UserData serialisations for length; more complete version in https://pastebin.canonical.com/181839/):

  "359":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 08 Mar 2017 13:52:50Z
    instance-id: pending
    machine-status:
      current: provisioning error
      message: |-
        cannot run instance: failed to run a server with nova.RunServerOpts{Name:"juju-83af10-stg-ols-snap-build-359", FlavorId:"e0ec0afc-6baa-4dda-aeae-164eed4590e3", ImageId:"df6158c5-1733-4972-84d8-480ead5af756", UserData:[]uint8{...}, SecurityGroupNames:[]nova.SecurityGroupName{nova.SecurityGroupName{Name:"juju-4955af7b-a72f-4076-88e4-99854e92d50f-ffaf8edc-1e06-490e-8991-1700fd83af10"}, nova.SecurityGroupName{Name:"juju-4955af7b-a72f-4076-88e4-99854e92d50f-ffaf8edc-1e06-490e-8991-1700fd83af10-359"}}, Networks:[]nova.ServerNetworks{nova.ServerNetworks{NetworkId:"cf014c32-f1ab-4447-a12e-5a2832f61452", FixedIp:"", PortId:""}}, AvailabilityZone:"prodstack-zone-2", Metadata:map[string]string{"juju-controller-uuid":"4955af7b-a72f-4076-88e4-99854e92d50f", "juju-model-uuid":"ffaf8edc-1e06-490e-8991-1700fd83af10", "juju-units-deployed":"snap-build-app-r6923965/0"}, ConfigDrive:false}
        caused by: request (http://10.24.0.176:8774/v2/f7efb9e5b0a7457795db11179089c583/servers) returned unexpected status: 409; error info: {"conflictingRequest": {"message": "Multiple security_group matches found for name 'juju-4955af7b-a72f-4076-88e4-99854e92d50f-ffaf8edc-1e06-490e-8991-1700fd83af10', use an ID to be more specific.", "code": 409}}
      since: 08 Mar 2017 13:48:50Z
    series: xenial
  "360":
    juju-status:
      current: down
      message: agent is not communicating with the server
      since: 08 Mar 2017 13:52:52Z
    instance-id: pending
    machine-status:
      current: provisioning error
      message: |-
        cannot run instance: failed to run a server with nova.RunServerOpts{Name:"juju-83af10-stg-ols-snap-build-360", FlavorId:"e0ec0afc-6baa-4dda-aeae-164eed4590e3", ImageId:"df6158c5-1733-4972-84d8-480ead5af756", UserData:[]uint8{...}, SecurityGroupNames:[]nova.SecurityGroupName{nova.SecurityGroupName{Name:"juju-4955af7b-a72f-4076-88e4-99854e92d50f-ffaf8edc-1e06-490e-8991-1700fd83af10"}, nova.SecurityGroupName{Name:"juju-4955af7b-a72f-4076-88e4-99854e92d50f-ffaf8edc-1e06-490e-8991-1700fd83af10-360"}}, Networks:[]nova.ServerNetworks{nova.ServerNetworks{NetworkId:"cf014c32-f1ab-4447-a12e-5a2832f61452", FixedIp:"", PortId:""}}, AvailabilityZone:"prodstack-zone-2", Metadata:map[string]string{"juju-controller-uuid":"4955af7b-a72f-4076-88e4-99854e92d50f", "juju-model-uuid":"ffaf8edc-1e06-490e-8991-1700fd83af10", "juju-units-deployed":"snap-build-app-r6923965/1"}, ConfigDrive:false}
        caused by: request (http://10.24.0.176:8774/v2/f7efb9e5b0a7457795db11179089c583/servers) returned unexpected status: 409; error info: {"conflictingRequest": {"message": "Multiple security_group matches found for name 'juju-4955af7b-a72f-4076-88e4-99854e92d50f-ffaf8edc-1e06-490e-8991-1700fd83af10', use an ID to be more specific.", "code": 409}}
      since: 08 Mar 2017 13:49:30Z
    series: xenial

I tried removing the new secgroup, but that didn't help; juju just put it right back on the next run. I then tried removing the old secgroup. That required replacing it on all machines, and for some reason this mysteriously caused the units within the deployment to be unable to send any packets to each other. Chris Stratford eventually suggested the workaround of running "juju add-unit" on a random application, which apparently caused juju to fix up the inter-unit networking although we couldn't work out exactly what it had done.

Since all that, I've heard the same report from two other people, so I investigated further. I tracked down commit 3e8e8e5c9c68237fb2cc7d10588711a900eacc1b ("OpenStack Provider: Use Neutron networking instead of Nova for FloatingIPs, SecurityGroups, SecurityGroupRules and Networks"), which I'm pretty sure is responsible for this. Some portions of the diff jump out at me:

-    defaultGroup, err := c.environ.nova().SecurityGroupByName("default")
+    // Security Group Names in Neutron do not have to be unique. This
+    // function returns an array
+    defaultGroups, err := c.environ.neutron().SecurityGroupByNameV2("default")

-    group, err := novaClient.SecurityGroupByName(name)
+    groupsFound, err := neutronClient.SecurityGroupByNameV2(name)
     if err == nil {
-        // Group exists, so assume it is correctly set up and return it.
-        // TODO(jam): 2013-09-18 http://pad.lv/121795
-        // We really should verify the group is set up correctly,
-        // because deleting and re-creating environments can get us bad
-        // groups (especially if they were set up under Python)
-        return *group, nil
+        for _, group := range groupsFound {
+            if c.verifyGroupRules(group.Rules, rules) {
+                return group, nil
+            }
+        }
     }

So I suspect that a better workaround in this situation would be to delete the new secgroup and add the correct rules (possibly the IPv6 rules, which weren't present in the old secgroup) to the old secgroup using neutron. However, this seems to be a pretty bad incompatibility. Perhaps juju should be more conservative and use the old "find existing group and append to it" behaviour with neutron just as it used to do with nova, even if neutron itself is happy with duplicate secgroup names?
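
To make that concrete, the shape of the logic I have in mind is roughly the following (a sketch only, with made-up type and method names rather than the real goose/provider API):

  // Sketch only: all names here are stand-ins, not the real provider or
  // goose API. The point is the shape: reuse (and update) an existing
  // group with the right name instead of creating another group with
  // the same name.
  type secGroup struct {
      ID    string
      Name  string
      Rules []string // stand-in for the neutron rule structs
  }

  type secGroupClient interface {
      GroupsByName(name string) ([]secGroup, error) // Neutron may return several
      ReplaceRules(groupID string, rules []string) error
      CreateGroup(name string, rules []string) (secGroup, error)
  }

  func ensureGroup(client secGroupClient, name string, rules []string) (secGroup, error) {
      groups, err := client.GroupsByName(name)
      if err != nil {
          return secGroup{}, err
      }
      if len(groups) > 0 {
          // A group with this name already exists: bring its rules up to
          // date, as the old nova path effectively did, rather than
          // creating a duplicate.
          group := groups[0]
          if err := client.ReplaceRules(group.ID, rules); err != nil {
              return secGroup{}, err
          }
          group.Rules = rules
          return group, nil
      }
      return client.CreateGroup(name, rules)
  }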

Revision history for this message
Colin Watson (cjwatson) wrote :

I believe that even Juju 2.1 passes the security group name rather than its ID to nova when starting instances, so it seems to me that this will probably break even with 2.1 on the model, though I can't prove that at the moment. My guess is that you can reproduce this by creating an OpenStack-based model with 2.0, upgrading the controller to 2.1, and then trying to deploy anything more to the model.
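
To illustrate the name-vs-ID point, resolving the name to a single ID up front would at least fail with a clearer error (again just a sketch with made-up names; nothing here is the actual provider code):

  // Sketch only (made-up names; assumes "fmt" is imported): resolve a
  // group name to exactly one ID before the run-server call, so nova is
  // never handed an ambiguous name.
  type group struct{ ID, Name string }

  func resolveGroupID(groups []group, name string) (string, error) {
      var ids []string
      for _, g := range groups {
          if g.Name == name {
              ids = append(ids, g.ID)
          }
      }
      switch len(ids) {
      case 1:
          return ids[0], nil
      case 0:
          return "", fmt.Errorf("no security group named %q", name)
      default:
          // Exactly the ambiguity nova rejects with a 409 above.
          return "", fmt.Errorf("%d security groups named %q, need an ID", len(ids), name)
      }
  }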

Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Heather Lanigan (hmlanigan)
affects: juju-core → juju
Changed in juju:
milestone: none → 2.2-alpha1
importance: Undecided → High
status: New → Triaged
Revision history for this message
Ian Booth (wallyworld) wrote :

The issue is that some logic was added to account for changes to the security group rules. Instead of updating any existing group with the changed rules, a new one is always created, so a logic error was introduced there. The issue only applies to neutron groups, not nova ones. The code fix is to simply replace the rules in the existing group, as is done for AWS, for example.
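
Schematically, that amounts to something like this (a sketch with stand-in names, not the actual provider or goose code):

  // Sketch with stand-in names: bring an existing group's rules in line
  // with what Juju wants, instead of creating a second group with the
  // same name when the rules differ.
  type rule struct {
      Direction, Proto, EtherType string
      PortMin, PortMax            int
  }

  func replaceRules(current, wanted []rule, del, add func(rule) error) error {
      have := make(map[rule]bool, len(current))
      for _, r := range current {
          have[r] = true
      }
      want := make(map[rule]bool, len(wanted))
      for _, r := range wanted {
          want[r] = true
      }
      // Drop rules Juju no longer wants...
      for _, r := range current {
          if !want[r] {
              if err := del(r); err != nil {
                  return err
              }
          }
      }
      // ...and add the ones that are missing (e.g. the new IPv6 rules).
      for _, r := range wanted {
          if !have[r] {
              if err := add(r); err != nil {
                  return err
              }
          }
      }
      return nil
  }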

To fix an existing deployment with this issue, you need to delete the wrong security group and ensure existing machines have the correct group. More explicit instructions will be posted on how to do that.

Revision history for this message
Colin Watson (cjwatson) wrote :

Here are some home-grown explicit instructions on ensuring that an existing global security group is "correct" from the point of view of Juju 2.1. (This is in no way approved by the Juju team, but I wanted to write up what I did.)

Juju currently requires an existing global security group to have exactly the following ingress rules (no more, no less; notation in neutron style; I believe the API port is configurable at some level and may differ from the default of 17070):

  { "direction": "ingress", "protocol": "tcp", "port_range_min": 22, "port_range_max": 22, "remote_ip_prefix": "0.0.0.0/0", "ethertype": "IPv4" }
  { "direction": "ingress", "protocol": "tcp", "port_range_min": 22, "port_range_max": 22, "remote_ip_prefix": "::/0", "ethertype": "IPv6" }
  { "direction": "ingress", "protocol": "tcp", "port_range_min": 17070, "port_range_max": 17070, "remote_ip_prefix": "0.0.0.0/0", "ethertype": "IPv4" }
  { "direction": "ingress", "protocol": "tcp", "port_range_min": 17070, "port_range_max": 17070, "remote_ip_prefix": "::/0", "ethertype": "IPv6" }
  { "direction": "ingress", "protocol": "tcp", "port_range_min": 1, "port_range_max": 65535, "ethertype": "IPv4" }
  { "direction": "ingress", "protocol": "tcp", "port_range_min": 1, "port_range_max": 65535, "ethertype": "IPv6" }
  { "direction": "ingress", "protocol": "udp", "port_range_min": 1, "port_range_max": 65535, "ethertype": "IPv4" }
  { "direction": "ingress", "protocol": "udp", "port_range_min": 1, "port_range_max": 65535, "ethertype": "IPv6" }
  { "direction": "ingress", "protocol": "icmp", "ethertype": "IPv4" }
  { "direction": "ingress", "protocol": "icmp", "ethertype": "IPv6" }

So. Let TENANT_ID be the basename of the publicURL field in the output of "keystone catalog --service=compute", and SECGROUP_ID be the ID of the original global security group shown in "neutron security-group-list" (whose name will be something like juju-4955af7b-a72f-4076-88e4-99854e92d50f-569d4ba6-1b1d-4837-8190-4a902ba34d61, i.e. without a -<machine-id> suffix). Then add the missing rules from 2.0:

  neutron security-group-rule-create --tenant-id $TENANT_ID --direction ingress --ethertype IPv6 --protocol icmp --remote-group-id $SECGROUP_ID $SECGROUP_ID
  neutron security-group-rule-create --tenant-id $TENANT_ID --direction ingress --ethertype IPv6 --protocol tcp --port-range-min 1 --port-range-max 65535 --remote-group-id $SECGROUP_ID $SECGROUP_ID
  neutron security-group-rule-create --tenant-id $TENANT_ID --direction ingress --ethertype IPv6 --protocol udp --port-range-min 1 --port-range-max 65535 --remote-group-id $SECGROUP_ID $SECGROUP_ID
  neutron security-group-rule-create --tenant-id $TENANT_ID --direction ingress --ethertype IPv4 --protocol tcp --port-range-min 22 --port-range-max 22 --remote-ip-prefix 0.0.0.0/0 $SECGROUP_ID
  neutron security-group-rule-create --tenant-id $TENANT_ID --direction ingress --ethertype IPv6 --protocol tcp --port-range-min 22 --port-range-max 22 --remote-ip-prefix ::/0 $SECGROUP_ID
  neutron security-group-rule-create --tenant-id $TENANT_ID --direction ingress --ethertype IPv6 --protocol tcp --port-range-min 17070 --port-range-max 17070 --remote-ip-prefix ::/0 $SECGROUP...


Changed in juju:
status: Triaged → In Progress
John A Meinel (jameinel)
Changed in juju:
importance: High → Critical
Revision history for this message
Colin Watson (cjwatson) wrote :

Canonical staff affected by this bug in prodstack may wish to look at the "Mitigating Juju 2.1 security group handling bug" email that I sent to the canonical-tech and is-discuss lists today.

Ryan Beisner (1chb1n)
tags: added: openstack-provider uosci
Ryan Beisner (1chb1n)
tags: added: upgrade-juju
tags: added: eda
Revision history for this message
Roger Peppe (rogpeppe) wrote :

BTW those "silly byte-by-byte UserData serialisations" are a potentially serious security hole - the user data contains (amongst much else) the agent password.

Changed in juju:
milestone: 2.2-alpha1 → 2.2-beta1
Revision history for this message
Christian Reis (kiko) wrote :

Roger, is there a bug to track the information disclosure issue you pointed out?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Fix landed in 2.2-beta1 as part of a bigger merge in PR: https://github.com/juju/juju/pull/7141

Changed in juju:
assignee: Heather Lanigan (hmlanigan) → Ian Booth (wallyworld)
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released