Swift proxy cannot start due to missing ringfiles

Bug #1496004 reported by Chad Smith
This bug affects 1 person
Affects                                  Status        Importance  Assigned to    Milestone
Landscape Server                         Fix Released  High        David Britton
15.07                                    Fix Released  Critical    David Britton
swift-proxy (Juju Charms Collection)     Invalid       Undecided   Unassigned
swift-storage (Juju Charms Collection)   Fix Released  High        Liam Young

Bug Description

On a ceph/swift autopilot deployment of Juno, ring files are not created, resulting in the inability to start the swift-proxy service.

juju config for swift-proxy:
    openstack-origin: cloud:trusty-juno

Attached are the /etc/swift/swift-proxy.cfg and the juju unit-swift-proxy/0 log files.

swift-proxy/0 claims it is the leader per the juju logs:
 ... "Leader established, generating ring builders" ...

yet no rings are created on this unit.

/var/log/upstart/swift-proxy.log:

No proxy-server running
Starting proxy-server...(/etc/swift/proxy-server.conf)
Traceback (most recent call last):
  File "/usr/bin/swift-proxy-server", line 23, in <module>
    sys.exit(run_wsgi(conf_file, 'proxy-server', **options))
  File "/usr/lib/python2.7/dist-packages/swift/common/wsgi.py", line 445, in run_wsgi
    loadapp(conf_path, global_conf=global_conf)
  File "/usr/lib/python2.7/dist-packages/swift/common/wsgi.py", line 357, in loadapp
    app = ctx.app_context.create()
  File "/usr/lib/python2.7/dist-packages/paste/deploy/loadwsgi.py", line 710, in create
    return self.object_type.invoke(self)
  File "/usr/lib/python2.7/dist-packages/paste/deploy/loadwsgi.py", line 146, in invoke
    return fix_call(context.object, context.global_conf, **context.local_conf)
  File "/usr/lib/python2.7/dist-packages/paste/deploy/util.py", line 55, in fix_call
    val = callable(*args, **kw)
  File "/usr/lib/python2.7/dist-packages/swift/proxy/server.py", line 566, in app_factory
    app = Application(conf)
  File "/usr/lib/python2.7/dist-packages/swift/proxy/server.py", line 104, in __init__
    ring_name='container')
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 152, in __init__
    self._reload(force=True)
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 157, in _reload
    ring_data = RingData.load(self.serialized_path)
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 65, in load
    gz_file = GzipFile(filename, 'rb')
  File "/usr/lib/python2.7/gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: '/etc/swift/container.ring.gz'

Related branches

Revision history for this message
Chad Smith (chad.smith) wrote :
Revision history for this message
Chad Smith (chad.smith) wrote :
tags: added: cloud-install-failure
Revision history for this message
Chad Smith (chad.smith) wrote :
Revision history for this message
Chad Smith (chad.smith) wrote :
Revision history for this message
Chad Smith (chad.smith) wrote :

Kilo openstack-origin deployments are also affected.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Here are the cloud logs from another run that failed.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju status shows no errors.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

It feels weird that the service startup is signaling the error, but there are no charm hook errors:
root@juju-machine-1-lxc-2:~# service swift-proxy start
start: Job failed to start
root@juju-machine-1-lxc-2:~# echo $?
1

Starting proxy-server...(/etc/swift/proxy-server.conf)
Traceback (most recent call last):
  File "/usr/bin/swift-proxy-server", line 23, in <module>
    sys.exit(run_wsgi(conf_file, 'proxy-server', **options))
  File "/usr/lib/python2.7/dist-packages/swift/common/wsgi.py", line 445, in run_wsgi
    loadapp(conf_path, global_conf=global_conf)
  File "/usr/lib/python2.7/dist-packages/swift/common/wsgi.py", line 357, in loadapp
    app = ctx.app_context.create()
  File "/usr/lib/python2.7/dist-packages/paste/deploy/loadwsgi.py", line 710, in create
    return self.object_type.invoke(self)
  File "/usr/lib/python2.7/dist-packages/paste/deploy/loadwsgi.py", line 146, in invoke
    return fix_call(context.object, context.global_conf, **context.local_conf)
  File "/usr/lib/python2.7/dist-packages/paste/deploy/util.py", line 55, in fix_call
    val = callable(*args, **kw)
  File "/usr/lib/python2.7/dist-packages/swift/proxy/server.py", line 580, in app_factory
    app = Application(conf)
  File "/usr/lib/python2.7/dist-packages/swift/proxy/server.py", line 106, in __init__
    ring_name='container')
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 152, in __init__
    self._reload(force=True)
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 157, in _reload
    ring_data = RingData.load(self.serialized_path)
  File "/usr/lib/python2.7/dist-packages/swift/common/ring/ring.py", line 65, in load
    gz_file = GzipFile(filename, 'rb')
  File "/usr/lib/python2.7/gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: '/etc/swift/container.ring.gz'

Signal proxy-server pid: 38711 signal: 15
No proxy-server running

Changed in landscape:
importance: Undecided → High
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

swift proxy charm config as deployed

no longer affects: landscape/cisco-odl
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi Andreas, can you give me the output of the following from swift-proxy/0:

 * sudo swift-ring-builder /etc/swift/object.builder

 * sudo python -c "import cPickle as pickle; print pickle.load(open('/etc/swift/object.builder', 'rb'))['replicas']"

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

root@juju-machine-2-lxc-1:~# swift-ring-builder /etc/swift/object.builder
/etc/swift/object.builder, build version 0
256 partitions, 3.000000 replicas, 0 regions, 0 zones, 0 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0
The overload factor is 0.00% (0.000000)

root@juju-machine-2-lxc-1:~# python -c "import cPickle as pickle; print pickle.load(open('/etc/swift/object.builder', 'rb'))['replicas']"
3

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This is the current juju status for the environment of the above output. This is a new deploy where the same problem happened, so previous logs attached to this bug do not correspond to it.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

swift-proxy/0 is the leader:
$ juju run --service swift-proxy is-leader
- MachineId: 2/lxc/1
  Stdout: |
    True
  UnitId: swift-proxy/0
- MachineId: 4/lxc/2
  Stdout: |
    False
  UnitId: swift-proxy/1
- MachineId: 5/lxc/3
  Stdout: |
    False
  UnitId: swift-proxy/2

So the commands from comment #11 were run on the leader. Repeating here:

root@juju-machine-2-lxc-1:~# swift-ring-builder /etc/swift/object.builder
/etc/swift/object.builder, build version 0
256 partitions, 3.000000 replicas, 0 regions, 0 zones, 0 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0
The overload factor is 0.00% (0.000000)

root@juju-machine-2-lxc-1:~# python -c "import cPickle as pickle; print pickle.load(open('/etc/swift/object.builder', 'rb'))['replicas']"
3

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Andreas, having looked at the logs and info you have provided, it appears that the problem is due to the fact that you only have 1 storage zone (with 3 units in it) and your builder requires 3 replicas (with at least one in each zone). So you have two options: option 1 is to set 'replicas' to 3 (not recommended), or option 2 is to deploy 3 zones of storage. If you only have 3 storage hosts at your disposal, you can take your 3 nodes and deploy them as 3 services, e.g. storage-zone1, storage-zone2, etc.

Changed in swift-proxy (Juju Charms Collection):
status: New → Invalid
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Sorry, option 1 above should say set 'replicas' to 1.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Ok, so a bit more info. Deploying a single storage service with <replicas> units will work if zone-assignment is 'auto'. I have tried this with Icehouse and Kilo and it works fine. Taking another look at your proxy logs, I see this each time a storage relation joins:

2015-09-15 11:11:13 INFO juju-log swift-storage:47: Leader established, updating ring builders
2015-09-15 11:11:13 INFO juju-log swift-storage:47: Relation not ready - some required values not provided by relation (missing=object_port, account_port, container_port)

So the rings never get built. We need to figure out why the storage units are not providing this info. Can you please provide the juju unit log from one of the storage units and also the output of:

juju run --unit swift-proxy/0 "relation-ids swift-storage"
juju run --unit swift-proxy/0 "relation-get -r <rid-from-above> - swift-storage/0"

Changed in swift-proxy (Juju Charms Collection):
status: Invalid → Incomplete
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

$ juju run --unit swift-proxy/0 "relation-ids swift-storage"
swift-storage:46

$ juju run --unit swift-proxy/0 "relation-get -r swift-storage:46 - swift-storage/0"
account_port: "6002"
container_port: "6001"
object_port: "6000"
private-address: 10.96.10.183
zone: "1"

Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Changed in swift-proxy (Juju Charms Collection):
status: Incomplete → New
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Andreas, the swift-proxy and swift-storage logs do not appear to be from the same deployment since their timestamps and relation ids do not match. If I am to cross-reference the logs, they need to be from the same deployment.

Changed in swift-proxy (Juju Charms Collection):
status: New → Incomplete
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This bug has logs from two deployments. The only logs that match the current deployment that is still live are the ones I uploaded today, namely the last juju status and the three storage logs.

The cloud-logs.tar.xz attachment has all logs (but no juju run output) from yesterday's deployment.

I'll attach a new bundle for today's deployment then; it will be a few dozen megabytes.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This has all the swift-* logs from the deployment that is currently up.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Andreas I just need the following:

  1. juju log of swift-proxy leader unit
  2. juju log of any one swift-storage unit
  3. output of relation between units from 1. and 2. using juju run command in comment #16
  4. swift-ring-builder object.builder output from leader proxy unit

Changed in swift-proxy (Juju Charms Collection):
status: Incomplete → New
David Britton (dpb)
tags: added: landscape-release-29
no longer affects: landscape/release-29
no longer affects: landscape
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This deploy is now 4 days old; logs have rotated and it will be hard to find the right info. I'll just deploy again.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Ok, new deployment. Ignore any logs from before this comment.

Relation commands from (3):
landscape@juju-machine-0-lxc-1:/tmp/bug-1496004$ juju run --unit swift-proxy/0 "relation-ids swift-storage"
swift-storage:89

landscape@juju-machine-0-lxc-1:/tmp/bug-1496004$ juju run --unit swift-proxy/0 "relation-get -r swift-storage:89 - swift-storage/0"
account_port: "6002"
container_port: "6001"
object_port: "6000"
private-address: 10.96.10.3
zone: "1"

swift-ring-builder from (4):
root@juju-machine-0-lxc-3:/etc/swift# swift-ring-builder object.builder
object.builder, build version 0
256 partitions, 3.000000 replicas, 0 regions, 0 zones, 0 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 0
The overload factor is 0.00% (0.000000)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju log from swift-storage/0

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju log from swift-proxy/0, the leader

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Leader verification:

$ juju run --service swift-proxy is-leader
- MachineId: 0/lxc/3
  Stdout: |
    True
  UnitId: swift-proxy/0
- MachineId: 1/lxc/4
  Stdout: |
    False
  UnitId: swift-proxy/1
- MachineId: 3/lxc/5
  Stdout: |
    False
  UnitId: swift-proxy/2

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

And juju status for completeness.

no longer affects: landscape/cisco-odl
Changed in landscape:
importance: Undecided → High
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Ok, the problem here is simple. The storage units are not declaring that they have any devices configured. Looking at the swift-storage log I see:

2015-09-22 13:50:13 INFO juju-log Valid ensured block devices: ['/dev/sdb']
2015-09-22 13:50:13 INFO install Failed to read physical volume "/dev/sdb"
2015-09-22 13:50:15 INFO install GPT data structures destroyed! You may now partition the disk using fdisk or
2015-09-22 13:50:15 INFO install other utilities.
2015-09-22 13:50:16 INFO install Creating new GPT entries.
...

Then a while later, when the proxy joins I see:
2015-09-22 14:20:39 INFO juju-log swift-storage:89: Valid ensured block devices: []
...

Which explains why the proxy does not receive any device info. The question is why /dev/sdb is no longer found or considered valid. Can you (a) confirm that block-device is still set to '/dev/sdb' and (b) check that /dev/sdb is a valid block device?
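
For reference, a minimal sketch of the proxy-side readiness gate implied by the "Relation not ready" log line above. The key names are taken from the relation-get output in this bug; the function itself is a hypothetical illustration, not the charm's actual code:

# Illustrative only: a readiness check of the kind suggested by the
# "Relation not ready - some required values not provided by relation"
# log message seen in the proxy logs.
REQUIRED_KEYS = ('zone', 'account_port', 'container_port',
                 'object_port', 'device')

def storage_relation_ready(settings):
    """Return True only if the storage unit provided every required value."""
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        print('Relation not ready - some required values not provided '
              'by relation (missing=%s)' % ', '.join(missing))
        return False
    return True

With relation data like the relation-get output earlier in this bug (ports and zone present but no 'device'), a check of this kind would keep reporting the relation as not ready, so the rings would never be populated.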

Changed in swift-proxy (Juju Charms Collection):
status: New → Invalid
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Can you provide your swift-storage config as well?

Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :

None of the /dev/sdb devices are partitioned:

landscape@juju-machine-0-lxc-1:/home/ubuntu$ juju run --service swift-storage 'cat /proc/partitions'
- MachineId: "2"
  Stdout: |
    major minor #blocks name

       8 0 117220824 sda
       8 1 117219800 sda1
       8 16 976762584 sdb
  UnitId: swift-storage/0
- MachineId: "3"
  Stdout: |
    major minor #blocks name

       8 0 117220824 sda
       8 1 117219800 sda1
       8 16 976762584 sdb
  UnitId: swift-storage/1
- MachineId: "4"
  Stdout: |
    major minor #blocks name

       8 0 117220824 sda
       8 1 117219800 sda1
       8 16 976762584 sdb
  UnitId: swift-storage/2
landscape@juju-machine-0-lxc-1:/home/ubuntu$

Revision history for this message
David Britton (dpb) wrote :

After setting "block-device=sda sdb sdc sdd ..." on a new deploy, I get the same behavior. If anyone has time to debug it, I appear to be able to reproduce this every time.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

/dev/sdb is no longer "found" because it's mounted the second time find_block_devices() is called:

/dev/sdb on /srv/node/sdb type xfs (rw)

Did some logic change there in the charm?
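
A minimal sketch of the kind of mount check that would produce this behaviour, assuming the charm consults /proc/mounts (the helper name is hypothetical, not necessarily the charm's actual implementation):

def is_device_mounted(device):
    """Return True if `device` appears as a mount source in /proc/mounts."""
    with open('/proc/mounts') as mounts:
        return any(line.split()[0] == device for line in mounts)

# After the install hook has mounted /dev/sdb on /srv/node/sdb,
# is_device_mounted('/dev/sdb') is True, so a "guess"-style scan that
# skips mounted devices will no longer return it.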

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Deploying swift-storage with a list like "sdb sdc sdd" instead of "guess" for block-device worked. Note that I excluded sda.

My analysis so far:

The install hook calls setup_storage(), which uses determine_block_devices() to get which block devices it should use.

When using "guess", determine_block_devices() will call out to find_block_devices() to get a list of devices it can use. That last method excludes devices that are mounted. Since this is the install hook, sdb (in my deployment) is "free" and is returned in the list. It's then formatted and mounted.

Later the swift-storage relation is joined. That brings us to swift_storage_relation_joined(), which again uses determine_block_devices() to get a list of devices. This time, however, /dev/sdb is mounted (setup_storage() did that in the install hook), so "sdb" is not returned. As a result, the relation_set() call does not have any device for the "device" relation parameter, which remains empty and that breaks the deployment.

When using a list of devices instead of "guess" for the block-device charm config parameter, then determine_block_devices() just ensures that each device in that list is a block device, not bothering to check if it's mounted or not. That's why "sdb" is returned in the list when the swift-storage relation is joined, and set correctly in the relation:

landscape@juju-machine-0-lxc-17:~$ juju run --service swift-proxy is-leader
- MachineId: 2/lxc/2
  Stdout: |
    True
  UnitId: swift-proxy/0
- MachineId: 3/lxc/5
  Stdout: |
    False
  UnitId: swift-proxy/1
- MachineId: 4/lxc/1
  Stdout: |
    False
  UnitId: swift-proxy/2
landscape@juju-machine-0-lxc-17:~$ juju run --unit swift-proxy/0 "relation-ids swift-storage"
swift-storage:59
landscape@juju-machine-0-lxc-17:~$ juju run --unit swift-proxy/0 "relation-get -r swift-storage:59 - swift-storage/0"
account_port: "6002"
container_port: "6001"
device: sdb <============== THERE YOU ARE
object_port: "6000"
private-address: 10.96.10.188
zone: "1"

What I don't know yet is what changed in the past few months that is causing this. I checked bzr log and blame on these methods and they seem to have been like this forever.
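
To make the two code paths described above concrete, here is a rough, self-contained sketch of the behaviour (the function names follow the analysis above; the bodies are assumptions for illustration, not the charm's actual source):

import os
import stat

def is_block_device(path):
    """True if path exists and is a block device node."""
    try:
        return stat.S_ISBLK(os.stat(path).st_mode)
    except OSError:
        return False

def is_device_mounted(device):
    """True if device appears as a mount source in /proc/mounts."""
    with open('/proc/mounts') as mounts:
        return any(line.split()[0] == device for line in mounts)

def find_block_devices():
    """'guess' path: whole disks from /proc/partitions that are block
    devices and are NOT currently mounted. Once the install hook has
    mounted /dev/sdb on /srv/node/sdb, sdb no longer passes this filter."""
    devices = []
    with open('/proc/partitions') as partitions:
        for line in list(partitions)[2:]:
            fields = line.split()
            if len(fields) < 4 or fields[3][-1].isdigit():
                continue                      # skip partitions such as sda1
            device = '/dev/' + fields[3]
            if is_block_device(device) and not is_device_mounted(device):
                devices.append(device)
    return devices

def determine_block_devices(block_device_config):
    """'guess' delegates to find_block_devices(); an explicit list only
    checks that each entry is a block device, mounted or not."""
    if block_device_config == 'guess':
        return find_block_devices()           # empty on the second call
    devices = ['/dev/' + name for name in block_device_config.split()]
    return [device for device in devices if is_block_device(device)]

Under this sketch, determine_block_devices('guess') returns an empty list from the relation-joined hook, while determine_block_devices('sdb sdc sdd') still returns the devices, which matches the difference observed above.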

Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Download full text (3.3 KiB)

Another datapoint is that a non-HA deployment with juju 1.22, swift-storage-11 and swift-proxy-8 (charm store revisions) worked using "guess" for block-device. The code around that area in these older charm revisions looks similar, so I'm leaning towards something related to HA.

$ juju run --unit swift-proxy/0 "relation-get -r swift-storage:54 - swift-storage/0"
account_port: "6002"
container_port: "6001"
device: sdb
object_port: "6000"
private-address: 10.96.10.188
zone: "1"

juju get swift-storage:
  block-device:
    description: |
(...)
    type: string
    value: guess

swift-storage logs. Note that while in the swift-storage relation-joined hook, sdb is still considered a valid block device:

2015-09-29 21:04:26 INFO unit.swift-storage/0.juju-log cmd.go:247 Valid ensured block devices: ['/dev/sdb']
2015-09-29 21:04:26 INFO unit.swift-storage/0.install logger.go:40 Failed to read physical volume "/dev/sdb"
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 GPT data structures destroyed! You may now partition the disk using fdisk or
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 other utilities.
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 The operation has completed successfully.
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 1+0 records in
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 1+0 records out
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 1048576 bytes (1.0 MB) copied, 0.0235954 s, 44.4 MB/s
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 100+0 records in
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 100+0 records out
2015-09-29 21:04:28 INFO unit.swift-storage/0.install logger.go:40 51200 bytes (51 kB) copied, 0.0223138 s, 2.3 MB/s
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 meta-data=/dev/sdb isize=1024 agcount=4, agsize=61047662 blks
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 = sectsz=512 attr=2, projid32bit=0
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 data = bsize=4096 blocks=244190646, imaxpct=25
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 = sunit=0 swidth=0 blks
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 naming =version 2 bsize=4096 ascii-ci=0
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 log =internal log bsize=4096 blocks=119233, version=2
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 = sectsz=512 sunit=0 blks, lazy-count=1
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 realtime =none extsz=4096 blocks=0, rtextents=0
2015-09-29 21:04:49 INFO unit.swift-storage/0.juju-log cmd.go:247 Making dir /srv/node/sdb swift:swift 555
2015-09-29 21:04:49 INFO unit.swift-storage/0.install logger.go:40 <open file '/proc/partitions', mode 'r' at 0x7f8199751c90>
(...)
2015-09-29 21:16:29 INFO unit.swift-storage/0....

Read more...

Liam Young (gnuoy)
Changed in swift-storage (Juju Charms Collection):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

I can reproduce this in a virtual environment with: http://paste.ubuntu.com/12626039/

David Britton (dpb)
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
David Britton (dpb)
tags: added: kanban-cross-team
David Britton (dpb)
tags: removed: kanban-cross-team
Changed in swift-storage (Juju Charms Collection):
status: Confirmed → In Progress
Changed in swift-storage (Juju Charms Collection):
milestone: none → 15.10
status: In Progress → Fix Committed
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We will need a backport of this fix to stable.

tags: added: backport-potential
David Britton (dpb)
Changed in landscape:
status: New → Fix Committed
status: Fix Committed → New
Revision history for this message
David Britton (dpb) wrote :

Fix in Landscape release-29:r9093, trunk:r9270

Changed in landscape:
status: New → Fix Committed
assignee: nobody → David Britton (davidpbritton)
milestone: none → 15.08
tags: removed: landscape-release-29
Changed in landscape:
status: Fix Committed → Fix Released
milestone: 15.08 → 15.07
James Page (james-page)
Changed in swift-storage (Juju Charms Collection):
status: Fix Committed → Fix Released