Ringsync puppet failed during storage node installation in HA deploys

Bug #1218981 reported by Chip
This bug affects 2 people
Affects            Status         Importance   Assigned to   Milestone
Cisco Openstack    Confirmed      Medium       Chip
Grizzly            Fix Released   Medium       Chip

Bug Description

During installation of Swift storage nodes, a puppet failure occurs.

err: Could not retrieve catalog from remote server: Error 400 on SERVER: Exported resource Swift::Ringsync[account] cannot override local resource on node <node_name>

The current HA documentation suggests commenting out
Swift::Ringsync<<||>> in swift-storage.pp.

Commenting this out is not a solution, as it disables the ring sync and causes start-up failures on the storage nodes.

Manually copying the rings from the proxy server allows the nodes to operate properly.
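
For context, a rough sketch of the Puppet pattern that produces this error; the resource title comes from the error message above, but the surrounding declarations and parameter names are assumed, not quoted from the actual manifests:

# On the proxy node, the ring sync is exported for other nodes to collect:
@@swift::ringsync { 'account':
  ring_server => $ring_server_ip,   # assumed parameter name
}

# In swift-storage.pp, the collector realizes every exported ringsync:
Swift::Ringsync<<||>>

# If the storage node's own catalog also declares Swift::Ringsync[account]
# locally, Puppet will not let the exported copy override it and aborts
# catalog compilation with the Error 400 shown above. Removing the
# collector avoids the compile error but leaves the node with no rings.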

Tags: ha
Chip (cbaesema)
Changed in openstack-cisco:
milestone: none → g.2
Revision history for this message
Shweta P (shweta-ap05) wrote :

Observed on my setup too. Copying the rings and restarting all services worked for me as well.

Revision history for this message
Mark T. Voelker (mvoelker) wrote :

Both of these failures were seen on HA deployments, correct?

Revision history for this message
Chip (cbaesema) wrote :

Yes, both of these failures were seen on an HA deployment.

Changing the launch order (proxy then storage, or storage then proxy) makes no difference.

Revision history for this message
Chris Ricker (chris-ricker) wrote :

This also occurs on non-HA. When we run puppet on the proxy, we get:

err: Could not retrieve catalog from remote server: Error 400 on SERVER: Exported resource Ring_object_device[172.29.75.141:6000/sdb] cannot override local resource on node ci-os-sw-proxy1.ctocllab.cisco.com

Revision history for this message
Chris Ricker (chris-ricker) wrote :

Ignore https://bugs.launchpad.net/openstack-cisco/+bug/1218981/comments/4 -- I had a mistake in site.pp

I don't get this issue on non-HA when I have a proper site.pp

Revision history for this message
Mark T. Voelker (mvoelker) wrote :

I can also verify that this doesn't occur in non-HA deployments.

Changed in openstack-cisco:
status: New → Triaged
importance: Undecided → Medium
tags: added: ha
Revision history for this message
Daneyon Hansen (danehans) wrote :

I built my HA environment from scratch last night, using the latest modules and manifests. My Swift nodes were deployed without issue:

From Proxy 1

root@swiftproxy01:~# swift-ring-builder /etc/swift/account.builder
/etc/swift/account.builder, build version 15
262144 partitions, 3.000000 replicas, 1 regions, 3 zones, 15 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 1
Devices: id region zone ip address port name weight partitions balance meta
             0 1 3 192.168.222.73 6002 sde 1.00 52429 0.00
             1 1 2 192.168.222.72 6002 sdd 1.00 52428 -0.00
             2 1 3 192.168.222.73 6002 sdc 1.00 52429 0.00
             3 1 2 192.168.222.72 6002 sdb 1.00 52429 0.00
             4 1 3 192.168.222.73 6002 sdb 1.00 52429 0.00
             5 1 1 192.168.222.71 6002 sdb 1.00 52429 0.00
             6 1 1 192.168.222.71 6002 sdc 1.00 52429 0.00
             7 1 2 192.168.222.72 6002 sdf 1.00 52429 0.00
             8 1 1 192.168.222.71 6002 sdd 1.00 52429 0.00
             9 1 2 192.168.222.72 6002 sdc 1.00 52429 0.00
            10 1 1 192.168.222.71 6002 sde 1.00 52428 -0.00
            11 1 1 192.168.222.71 6002 sdf 1.00 52429 0.00
            12 1 3 192.168.222.73 6002 sdf 1.00 52429 0.00
            13 1 2 192.168.222.72 6002 sde 1.00 52429 0.00
            14 1 3 192.168.222.73 6002 sdd 1.00 52428 -0.00
root@swiftproxy01:~# swift-ring-builder /etc/swift/container.builder
/etc/swift/container.builder, build version 15
262144 partitions, 3.000000 replicas, 1 regions, 3 zones, 15 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 1
Devices: id region zone ip address port name weight partitions balance meta
             0 1 2 192.168.222.72 6001 sdb 1.00 52429 0.00
             1 1 1 192.168.222.71 6001 sdc 1.00 52429 0.00
             2 1 3 192.168.222.73 6001 sdf 1.00 52429 0.00
             3 1 3 192.168.222.73 6001 sdb 1.00 52429 0.00
             4 1 3 192.168.222.73 6001 sde 1.00 52429 0.00
             5 1 1 192.168.222.71 6001 sdb 1.00 52429 0.00
             6 1 2 192.168.222.72 6001 sdc 1.00 52429 0.00
             7 1 1 192.168.222.71 6001 sdf 1.00 52429 0.00
             8 1 2 192.168.222.72 6001 sdf 1.00 52429 0.00
             9 1 1 192.168.222.71 6001 sdd 1.00 52429 0.00
            10 1 1 192.168.222.71 6001 sde 1.00 52428 -0.00
            11 1 2 192.168.222.72 6001 sde 1.00 52429 0.00
            12 1 2 192.16...


Revision history for this message
Mark T. Voelker (mvoelker) wrote :

OK, so current status is that:
1.) We don't see this on non-HA
2.) Not everyone sees this on HA

And as a workaround in case you do run into this, you can copy the rings manually by doing something like:

scp /etc/swift/*.gz swift-storage01:/etc/swift/
scp /etc/swift/*.gz swift-storage02:/etc/swift/
scp /etc/swift/*.gz swift-storage03:/etc/swift/

ssh swift-storage01 "swift-init all restart"
ssh swift-storage02 "swift-init all restart"
ssh swift-storage03 "swift-init all restart"

Given all that, I'm going to retarget, as I don't think this is a showstopper for today's scheduled release.

Changed in openstack-cisco:
milestone: g.2 → g.3
Revision history for this message
Chip (cbaesema) wrote :

Pull https://github.com/CiscoSystems/puppet-openstack/pull/29 for testing before submission to upstream.

Revision history for this message
Daneyon Hansen (danehans) wrote :

I think we are missing a dependency. I should be able to have a clean puppet run after the initial storage run and proxy run, but I get the following errors:

err: /Service[swift-container-replicator]/ensure: change from stopped to running failed: Could not start Service[swift-container-replicator]: Execution of '/sbin/start swift-container-replicator' returned 1: at /usr/share/puppet/modules/swift/manifests/storage/generic.pp:61

err: /Service[swift-object-replicator]/ensure: change from stopped to running failed: Could not start Service[swift-object-replicator]: Execution of '/sbin/start swift-object-replicator' returned 1: at /usr/share/puppet/modules/swift/manifests/storage/generic.pp:61

err: /Service[swift-account-replicator]/ensure: change from stopped to running failed: Could not start Service[swift-account-replicator]: Execution of '/sbin/start swift-account-replicator' returned 1: at /usr/share/puppet/modules/swift/manifests/storage/generic.pp:61

err: /Service[swift-container-sync]/ensure: change from stopped to running failed: Could not start Service[swift-container-sync]: Execution of '/sbin/start swift-container-sync' returned 1: at /usr/share/puppet/modules/swift/manifests/storage/container.pp:45

notice: /Stage[main]/Openstack::Swift::Storage-node/Swift::Ringsync[account]/Rsync::Get[/etc/swift/account.ring.gz]/Exec[rsync /etc/swift/account.ring.gz]/returns: executed successfully

notice: /Stage[main]/Openstack::Swift::Storage-node/Swift::Ringsync[object]/Rsync::Get[/etc/swift/object.ring.gz]/Exec[rsync /etc/swift/object.ring.gz]/returns: executed successfully

notice: /Stage[main]/Openstack::Swift::Storage-node/Swift::Ringsync[container]/Rsync::Get[/etc/swift/container.ring.gz]/Exec[rsync /etc/swift/container.ring.gz]/returns: executed successfully

After I perform a second puppet run on the storage nodes, the errors clear up.
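
If this is just an ordering problem, a fix would presumably add a dependency so the ring sync completes before the storage services are started. A rough sketch (the service titles are taken from the errors above; the chaining itself is an assumption about the fix, not quoted from any patch):

# Ensure the collected ring files are in place before the affected
# services are started, so the first puppet run comes up clean:
Swift::Ringsync<<||>> ->
Service['swift-account-replicator',
        'swift-container-replicator',
        'swift-object-replicator',
        'swift-container-sync']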

Changed in openstack-cisco:
assignee: nobody → Chip (cbaesema)
status: Triaged → In Progress
Revision history for this message
Chip (cbaesema) wrote :

Upstream Bug: puppet-openstack 1224592

Revision history for this message
Chip (cbaesema) wrote :

Upstream proposed patch 47085

Revision history for this message
Mark T. Voelker (mvoelker) wrote :
summary: - Ringsync puppet failured during storage node installation
+ Ringsync puppet failured during storage node installation in HA deploys
summary: - Ringsync puppet failured during storage node installation in HA deploys
+ Ringsync puppet failed during storage node installation in HA deploys
Changed in openstack-cisco:
status: In Progress → Fix Committed
Revision history for this message
Shweta P (shweta-ap05) wrote :

I see this on an i.1 full HA setup as well.

My workaround was to add this line to user.full_ha.yaml, setting ring_server to the real IP of the first swift proxy server (not the VIP):
openstack::swift::storage-node::ring_server: 172.29.XX.XXX

If this additional line is not added, the storage nodes try to reach the VIP to copy over the common files, but they really need to point at one of the swift proxy servers.

Once this was done, subsequent puppet runs on the storage nodes were able to copy over the expected files.
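
For what it's worth, a minimal sketch of where that hiera key lands, based on the Swift::Ringsync/Rsync::Get resource paths in the log above (the class and parameter structure is assumed, not copied from the module):

# user.full_ha.yaml sets openstack::swift::storage-node::ring_server,
# which ends up as the rsync source the storage node pulls *.ring.gz
# from. Pointing it at a real proxy IP rather than the VIP makes that
# rsync succeed.
class { 'openstack::swift::storage-node':
  ring_server => '172.29.XX.XXX',   # first swift proxy's real IP, not the VIP
}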

Changed in openstack-cisco:
status: Fix Released → Confirmed
milestone: g.3 → i.1