Ceph upgrade failed during minor (Rocky to Rocky) overcloud upgrade

Bug #1832509 reported by Anton Antonov
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: John Fulton

Bug Description

Description:

"openstack overcloud external-update run --tags ceph" command fails when trying to do a minor Rocky to Rocky upgrade.

Steps:

1. Upgraded the undercloud
2. Prepared the update
  openstack overcloud update prepare ...
3. Upgraded all overcloud nodes
  openstack overcloud update run --nodes Controller
  openstack overcloud update run --nodes Compute
  openstack overcloud update run --nodes CephStorage
4. Ran "openstack overcloud external-update run --tags ceph"

Expected results:

A successfully updated containerized Red Hat Ceph Storage 3 cluster.

Actual results:

```
TASK [set facts for swift back up of ceph-ansible fetch directory] *************
Tuesday 11 June 2019 22:44:07 +0100 (0:00:00.068) 0:01:02.993 **********
ok: [undercloud] => {"ansible_facts": {"new_ceph_ansible_tarball_name": "temporary_dir_new.tar.gz", "old_ceph_ansible_tarball_name": "temporary_dir_old.tar.gz", "swift_get_url": "https://10.35.5.2:13808/v1/AUTH_0d06a24bb33c4b9ebf922cb7c3bcf118/overcloud_ceph_ansible_fetch_dir/temporary_dir.tar.gz?temp_url_sig=7fd5deec39a646cc71b094dd6a0ae8bc8df3d4b6&temp_url_expires=1560282811", "swift_put_url": "https://10.35.5.2:13808/v1/AUTH_0d06a24bb33c4b9ebf922cb7c3bcf118/overcloud_ceph_ansible_fetch_dir/temporary_dir.tar.gz?temp_url_sig=caea8c31d996d2bc07e5a2152cd85d42d81f2d60&temp_url_expires=1560282839"}, "changed": false}

TASK [attempt download of fetch directory tarball from swift backup] ***********
Tuesday 11 June 2019 22:44:07 +0100 (0:00:00.079) 0:01:03.073 **********
 [WARNING]: Consider using the get_url or uri module rather than running curl.
If you need to use command because get_url or uri is insufficient you can add
warn=False to this command task or set command_warnings=False in ansible.cfg to
get rid of this message.
changed: [undercloud] => {"changed": true, "cmd": "curl -s -o /tmp/temporary_dir_old.tar.gz -w '%{http_code}' -X GET \"https://10.35.5.2:13808/v1/AUTH_0d06a24bb33c4b9ebf922cb7c3bcf118/overcloud_ceph_ansible_fetch_dir/temporary_dir.tar.gz?temp_url_sig=7fd5deec39a646cc71b094dd6a0ae8bc8df3d4b6&temp_url_expires=1560282811\"", "delta": "0:00:00.183758", "end": "2019-06-11 22:44:07.674133", "rc": 0, "start": "2019-06-11 22:44:07.490375", "stderr": "", "stderr_lines": [], "stdout": "401", "stdout_lines": ["401"]}

TASK [ensure we create a new fetch_directory or use the old fetch_directory] ***
Tuesday 11 June 2019 22:44:07 +0100 (0:00:00.409) 0:01:03.482 **********
fatal: [undercloud]: FAILED! => {"changed": false, "msg": "Received HTTP: 401 when attempting to GET from https://10.35.5.2:13808/v1/AUTH_0d06a24bb33c4b9ebf922cb7c3bcf118/overcloud_ceph_ansible_fetch_dir/temporary_dir.tar.gz?temp_url_sig=7fd5deec39a646cc71b094dd6a0ae8bc8df3d4b6&temp_url_expires=1560282811"}

NO MORE HOSTS LEFT *************************************************************
```

Environment:

openstack-tripleo-puppet-elements-9.0.2-0.20190425202749.1ab58f2.el7.noarch
ceph-ansible-3.2.5-1.el7.noarch
python-tripleoclient-10.6.2-0.20190522234411.12e3f68.el7.noarch
openstack-tripleo-common-containers-9.5.1-0.20190507224322.cd24177.el7.noarch
ansible-tripleo-ipsec-9.1.1-0.20190513182453.ffe104c.el7.noarch
python2-tripleo-common-9.5.1-0.20190507224322.cd24177.el7.noarch
openstack-tripleo-common-9.5.1-0.20190507224322.cd24177.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190531141856.f33cad7.el7.noarch
openstack-tripleo-image-elements-9.1.1-0.20190420053043.aa75390.el7.noarch
puppet-ceph-2.6.1-0.20190425104853.6d67b24.el7.noarch
openstack-tripleo-heat-templates-9.3.1-0.20190531051851.bb4fb9d.el7.noarch
python-tripleoclient-heat-installer-10.6.2-0.20190522234411.12e3f68.el7.noarch
python2-tripleo-repos-0.0.1-0.20190520152004.8a48b48.el7.noarch
openstack-tripleo-validations-9.3.2-0.20190523001404.bf11998.el7.noarch
puppet-tripleo-9.4.1-0.20190601011855.c5986da.el7.noarch

Anton Antonov (anta-nok)
description: updated
John Fulton (jfulton-org) wrote :

Anton,

The Swift URLs which are generated by the workflow expire and that's why you got 401 unauthorized.

The tail of the URL, temp_url_expires=1560282811, converts from epoch time to Tuesday, June 11, 2019 19:53:31 GMT. Your failed task ran at Tuesday 11 June 2019 22:44:07 +0100 (21:44:07 GMT), almost two hours after the URL expired.
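
The expiry arithmetic can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `temp_url_expiry` is hypothetical, and the URL and the failure timestamp are copied from the log in the bug description.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def temp_url_expiry(url):
    """Return the temp_url_expires value of a Swift temp URL as a UTC datetime."""
    query = parse_qs(urlparse(url).query)
    return datetime.fromtimestamp(int(query["temp_url_expires"][0]), tz=timezone.utc)


# swift_get_url taken from the failed run in the bug description
swift_get_url = (
    "https://10.35.5.2:13808/v1/AUTH_0d06a24bb33c4b9ebf922cb7c3bcf118/"
    "overcloud_ceph_ansible_fetch_dir/temporary_dir.tar.gz"
    "?temp_url_sig=7fd5deec39a646cc71b094dd6a0ae8bc8df3d4b6"
    "&temp_url_expires=1560282811"
)

expires_at = temp_url_expiry(swift_get_url)

# The failed task was logged at 22:44:07 +0100
failed_at = datetime(2019, 6, 11, 22, 44, 7, tzinfo=timezone(timedelta(hours=1)))

print(expires_at.isoformat())  # 2019-06-11T19:53:31+00:00
print(failed_at - expires_at)  # 1:50:36
```

Comparing timezone-aware datetimes avoids the easy mistake of reading the +0100 log timestamp as if it were GMT.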

New URLs are generated when `openstack overcloud external-upgrade run` is executed and more details are in:

 https://github.com/openstack/tripleo-common/commit/d1619ed9eac7ebbf8d8efae1476e1981d0a980e4

The URLs should be good for 1 hour:

 https://github.com/openstack/tripleo-common/blob/master/workbooks/deployment.yaml#L633

So it must have taken >1 hour to get to this state. We should probably increase the timeout.
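
For context, here is a rough sketch of how a Swift temporary URL and its expiry are typically built, following the HMAC-SHA1 scheme described in the Swift TempURL middleware documentation. It is not the tripleo-common workflow code: the function name `make_temp_url`, the key, and the generation time passed as `now` are all assumptions made for illustration (the `now` value is simply the expiry seen in the log minus the 1 hour lifetime).

```python
import hmac
from hashlib import sha1
from time import time


def make_temp_url(base, path, key, method="GET", ttl_seconds=3600, now=None):
    """Build a Swift temp URL that stops working ttl_seconds after `now`."""
    now = int(time()) if now is None else int(now)
    expires = now + ttl_seconds
    # TempURL signs the method, the expiry timestamp, and the object path
    hmac_body = f"{method}\n{expires}\n{path}"
    sig = hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()
    return f"{base}{path}?temp_url_sig={sig}&temp_url_expires={expires}"


url = make_temp_url(
    "https://10.35.5.2:13808",
    "/v1/AUTH_0d06a24bb33c4b9ebf922cb7c3bcf118/"
    "overcloud_ceph_ansible_fetch_dir/temporary_dir.tar.gz",
    key="hypothetical-temp-url-key",  # assumption: the real key is not in this report
    now=1560279211,  # assumption: 1560282811 minus 3600 seconds
)
print(url)
```

Because the expiry is part of the signed body, an expired URL cannot be extended by editing the query string; a new URL has to be generated.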

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.opendev.org/664942

Changed in tripleo:
assignee: nobody → John Fulton (jfulton-org)
status: New → In Progress
Changed in tripleo:
importance: Undecided → High
milestone: none → train-2
tags: added: queens-backport-potential
Anton Antonov (anta-nok) wrote :

John,

  It's very likely that the first run of `openstack overcloud external-upgrade run --tags ceph` took more than an hour. But after that I ran it again a few times, and every time it failed in less than 2 minutes.
  It looks like new URLs are not generated when `openstack overcloud external-upgrade run --tags ceph` is executed; instead, the ones generated at some previous upgrade step are used. Which one? 'overcloud update prepare'? Please have a look at the attached full output of one of these failed runs.

This is the full list of commands I ran (some of them, of course, took more than a few hours to run):

```
openstack overcloud update prepare --templates ...

openstack overcloud external-update run --tags container_image_prepare

openstack overcloud update run --nodes Controller

openstack overcloud update run --nodes some,compute,nodes
openstack overcloud update run --nodes other,compute,nodes
openstack overcloud update run --nodes more,compute,nodes
openstack overcloud update run --nodes even,more,compute,nodes

openstack overcloud update run --nodes CephStorage

openstack overcloud external-update run --tags ceph

# yet to be run
#openstack overcloud external-update run --tags online_upgrade

#openstack overcloud update converge --templates ...
```

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/664942
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=8f2e1064a2473a8218056ecfd932c714b11e0fe6
Submitter: Zuul
Branch: master

commit 8f2e1064a2473a8218056ecfd932c714b11e0fe6
Author: John Fulton <email address hidden>
Date: Wed Jun 12 10:24:02 2019 -0400

    Increase timeout of temp swift URLs from 1 to 4 hours

    It can take longer than 1 hour for a temporary swift
    URL which stores the backup of the fetch directory to
    be created vs used. Increase the timeout from 1 hour
    to 4 hours so that the URL does not expire before it
    is used.

    Change-Id: Ib989bd673c694a9dc5af2a4a63ed84888f102a50
    Fixes-Bug: #1832509
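
The effect of the longer lifetime on the timestamps from this report can be sketched as follows. The generation time is an assumption (the logged expiry minus the original 1 hour lifetime), and `still_valid` is a hypothetical helper, not part of tripleo-common.

```python
from datetime import datetime, timedelta, timezone


def still_valid(generated_at, ttl, used_at):
    """True if a URL generated at generated_at with lifetime ttl is usable at used_at."""
    return used_at <= generated_at + ttl


# Assumption: URL generated 1 hour before the logged expiry of 19:53:31 GMT
generated_at = datetime(2019, 6, 11, 18, 53, 31, tzinfo=timezone.utc)
# Failed task time from the report, 22:44:07 +0100, expressed in GMT
used_at = datetime(2019, 6, 11, 21, 44, 7, tzinfo=timezone.utc)

print(still_valid(generated_at, timedelta(hours=1), used_at))  # False
print(still_valid(generated_at, timedelta(hours=4), used_at))  # True
```

Under that assumption, the download attempted in the report would have fallen inside the 4 hour window. The check only covers URLs that are used within 4 hours of being generated.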

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/665670

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/665671

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/stein)

Reviewed: https://review.opendev.org/665670
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=5cdd46920b0eee5cb2cf76e0d5f957f5ce057da7
Submitter: Zuul
Branch: stable/stein

commit 5cdd46920b0eee5cb2cf76e0d5f957f5ce057da7
Author: John Fulton <email address hidden>
Date: Wed Jun 12 10:24:02 2019 -0400

    Increase timeout of temp swift URLs from 1 to 4 hours

    It can take longer than 1 hour for a temporary swift
    URL which stores the backup of the fetch directory to
    be created vs used. Increase the timeout from 1 hour
    to 4 hours so that the URL does not expire before it
    is used.

    Change-Id: Ib989bd673c694a9dc5af2a4a63ed84888f102a50
    Fixes-Bug: #1832509
    (cherry picked from commit 8f2e1064a2473a8218056ecfd932c714b11e0fe6)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.opendev.org/665671
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=a7eccfdf1d520825010f76b6cd0eab34e30c5610
Submitter: Zuul
Branch: stable/rocky

commit a7eccfdf1d520825010f76b6cd0eab34e30c5610
Author: John Fulton <email address hidden>
Date: Wed Jun 12 10:24:02 2019 -0400

    Increase timeout of temp swift URLs from 1 to 4 hours

    It can take longer than 1 hour for a temporary swift
    URL which stores the backup of the fetch directory to
    be created vs used. Increase the timeout from 1 hour
    to 4 hours so that the URL does not expire before it
    is used.

    Change-Id: Ib989bd673c694a9dc5af2a4a63ed84888f102a50
    Fixes-Bug: #1832509
    (cherry picked from commit 8f2e1064a2473a8218056ecfd932c714b11e0fe6)

tags: added: in-stable-rocky