[19.04] keystone leader unit sometimes fails to add endpoints when "certificates" relation is present but certs not installed and TLS not configured.

Bug #1827668 reported by Dmitrii Shcherbakov
This bug affects 7 people
Affects: OpenStack Keystone Charm
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

1) deployed an HA bundle without TLS enabled;
2) related all OpenStack services to vault via the "certificates" relation;
3) observed keystone trying to use http://localhost:35337, which was not yet set up.

LISTEN 0 128 *:35347 *:* users:(("apache2",pid=646064,fd=8),("apache2",pid=646063,fd=8),("apache2",pid=646058,fd=8))
LISTEN 0 128 *:35357 *:* users:(("haproxy",pid=652648,fd=8))

https://paste.ubuntu.com/p/ftQZnC7DCz/
unit-keystone-0: 18:23:09 DEBUG unit.keystone/0.identity-service-relation-changed keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://localhost:35337/v3/services?
# ...
unit-keystone-0: 18:23:09 ERROR juju.worker.uniter.operation hook "identity-service-relation-changed" failed: exit status 1

This keystone unit tried to add the new (https) endpoints of other OpenStack services to the catalog using the following code (clustered & https => public_port - 20 = 35337):
https://github.com/juju/charm-helpers/blob/stable/19.04/charmhelpers/contrib/hahelpers/cluster.py#L244-L262
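
For reference, here is a paraphrase (from memory, not a verbatim copy) of the linked determine_api_port() logic: each layer sitting in front of the API shifts the internal port down by 10, so clustered + https yields 35357 - 20 = 35337.

    def determine_api_port(public_port, clustered, https_enabled):
        """Paraphrase of charm-helpers' port arithmetic (illustrative only)."""
        offset = 0
        if clustered:         # haproxy sits in front of the service
            offset += 10
        if https_enabled:     # apache terminates TLS in front of haproxy
            offset += 10
        return public_port - offset

    # clustered + https: 35357 -> 35337, the port keystone tried to use
    assert determine_api_port(35357, clustered=True, https_enabled=True) == 35337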

However, /etc/apache2/sites-enabled/wsgi-openstack-api.conf still contains

Listen 35347
Listen 4990
<VirtualHost *:35347>
# ...

juju show-status-log keystone/0 --days 1
Time Type Status Message
03 May 2019 11:24:08Z workload waiting waiting for machine
03 May 2019 11:24:08Z juju-unit allocating
03 May 2019 11:42:56Z workload waiting installing agent
03 May 2019 11:42:59Z workload waiting agent initializing
03 May 2019 11:43:04Z workload maintenance installing charm software
03 May 2019 11:43:04Z juju-unit executing running install hook
03 May 2019 11:43:56Z workload maintenance Executing pre-install
03 May 2019 11:44:01Z workload maintenance Installing apt packages
03 May 2019 11:47:36Z juju-unit executing running leader-elected hook
03 May 2019 11:47:52Z juju-unit executing running config-changed hook
03 May 2019 11:49:22Z juju-unit executing running start hook
03 May 2019 11:49:32Z workload blocked Missing relations: database
03 May 2019 11:49:36Z juju-unit executing running cluster-relation-joined hook
03 May 2019 11:53:21Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 11:53:36Z juju-unit executing running commands
03 May 2019 11:54:36Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 11:55:27Z juju-unit executing running commands
03 May 2019 11:55:46Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 11:56:13Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 11:57:43Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 11:58:25Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 11:58:53Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 11:59:07Z juju-unit executing running cluster-relation-changed hook
03 May 2019 11:59:30Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 11:59:44Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:00:13Z juju-unit executing running commands
03 May 2019 12:00:20Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:01:03Z juju-unit executing running shared-db-relation-joined hook
03 May 2019 12:01:19Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:01:35Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:01:49Z juju-unit executing running ha-relation-joined hook
03 May 2019 12:02:37Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:02:51Z juju-unit executing running ha-relation-changed hook
03 May 2019 12:03:10Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:03:26Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:03:41Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:03:57Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:04:28Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:05:13Z workload waiting Incomplete relations: database
03 May 2019 12:05:15Z juju-unit executing running commands
03 May 2019 12:05:23Z juju-unit executing running shared-db-relation-changed hook
03 May 2019 12:08:11Z juju-unit executing running shared-db-relation-joined hook
03 May 2019 12:08:28Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:08:43Z juju-unit executing running shared-db-relation-changed hook
03 May 2019 12:11:48Z juju-unit executing running commands
03 May 2019 12:11:59Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:13:04Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:13:20Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:15:13Z juju-unit executing running commands
03 May 2019 12:15:19Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:16:44Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:17:15Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:19:06Z juju-unit executing running identity-service-relation-joined hook
03 May 2019 12:19:23Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 12:19:46Z juju-unit idle
03 May 2019 12:20:02Z juju-unit executing running commands
03 May 2019 12:20:08Z juju-unit idle

#...

03 May 2019 17:40:02Z juju-unit executing running commands
03 May 2019 17:40:08Z juju-unit idle
03 May 2019 17:45:01Z juju-unit executing running commands
03 May 2019 17:45:08Z juju-unit idle
03 May 2019 17:45:35Z juju-unit executing running certificates-relation-joined hook
03 May 2019 17:45:53Z juju-unit executing running certificates-relation-changed hook
03 May 2019 17:46:11Z juju-unit executing running certificates-relation-joined hook
03 May 2019 17:46:25Z workload active Unit is ready
03 May 2019 17:46:40Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 17:47:44Z juju-unit error hook failed: "identity-service-relation-changed"
03 May 2019 17:47:49Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 17:48:47Z juju-unit error hook failed: "identity-service-relation-changed"
03 May 2019 17:48:52Z juju-unit executing running action juju-run
03 May 2019 17:48:52Z juju-unit error hook failed: "identity-service-relation-changed"
03 May 2019 17:48:56Z juju-unit executing running identity-service-relation-changed hook
03 May 2019 17:49:52Z juju-unit error hook failed: "identity-service-relation-changed"
03 May 2019 17:50:02Z juju-unit executing running commands

# keystone/0

root@juju-d252c7-0-lxd-5:/var/lib/juju/agents/unit-keystone-0/charm# relation-ids certificates
certificates:106
root@juju-d252c7-0-lxd-5:/var/lib/juju/agents/unit-keystone-0/charm# relation-list -r certificates:106
vault/0
vault/1
root@juju-d252c7-0-lxd-5:/var/lib/juju/agents/unit-keystone-0/charm# relation-get -r certificates:106 - vault/0
ca: |-
  -----BEGIN CERTIFICATE-----
#...
  -----END CERTIFICATE-----
client.cert: |-
  -----BEGIN CERTIFICATE-----
# ...
  -----END CERTIFICATE-----
client.key: |-
  -----BEGIN RSA PRIVATE KEY-----
# ...
  -----END RSA PRIVATE KEY-----
egress-subnets: 10.232.7.208/32
ingress-address: 10.232.7.208
private-address: 10.232.7.208

root@juju-d252c7-0-lxd-5:/var/lib/juju/agents/unit-keystone-0/charm# relation-get -r certificates:106 - vault/1
egress-subnets: 10.232.7.234/32
ingress-address: 10.232.7.234
private-address: 10.232.7.234

The unit log reported "Unit is ready" after certificates-relation-changed:

2019-05-03 17:46:09 DEBUG certificates-relation-changed active
2019-05-03 17:46:09 DEBUG certificates-relation-changed active
2019-05-03 17:46:09 DEBUG certificates-relation-changed active
2019-05-03 17:46:09 INFO juju-log certificates:106: Unit is ready

root@juju-d252c7-0-lxd-5:~# ls -al /etc/apache2/ssl/keystone/
total 8
dr-xr-xr-x 2 root root 4096 May 3 17:46 .
dr-xr-xr-x 3 root root 4096 May 3 17:46 ..

The second keystone unit has everything (certs, a WSGI daemon listening on the correct port, etc.).

While trying to replay what went wrong, I can see that processed_requests is not exposed for keystone/0:

ipdb> certs = data.get('{}.processed_requests'.format(name))
ipdb> certs

It is not clear to me how this unit got into that state.
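
For context, a minimal sketch of how that lookup is meant to work (not the charm's exact code; the key name pattern follows the ipdb session above and is loosely modeled on charm-helpers' cert_utils conventions):

    import json

    from charmhelpers.core import hookenv

    def get_processed_requests(rid, unit):
        """Return the server certs vault published for this unit, if any."""
        # e.g. 'keystone/0' -> 'keystone_0', matching the relation key format
        local_name = hookenv.local_unit().replace('/', '_')
        data = hookenv.relation_get(rid=rid, unit=unit) or {}
        raw = data.get('{}.processed_requests'.format(local_name))
        if raw is None:
            # The failure mode seen here: vault/0 published 'ca',
            # 'client.cert' and 'client.key', but no processed_requests
            # entry for keystone/0, so no server certs get written to disk.
            return None
        return json.loads(raw)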

It is worth noting that the lab deployment used vault + totally-insecure-unlock with 2 Vault units, one of which (a non-leader) could not unseal itself without manual intervention.

I will leave this in the 'Incomplete' state for reference, because it may be quite difficult to reproduce or debug.

Tags: cpe-onsite
Changed in charm-keystone:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Reproduced it on a clean 19.04 + Queens deployment without totally-insecure-unlock (performed the unseal procedure manually):

VAULT_UNIT_IP=$(juju run --unit vault/0 "network-get access --ingress-address=true"); export VAULT_ADDR="http://$VAULT_UNIT_IP:8200"

vault operator init -key-shares=1 -key-threshold=1 > bundles/vault.txt
vault operator unseal <key-from-vault-txt>

VAULT_UNIT_IP=$(juju run --unit vault/1 "network-get access --ingress-address=true"); export VAULT_ADDR="http://$VAULT_UNIT_IP:8200"
vault operator unseal <key>

export VAULT_TOKEN=<initial-root-token-from-vault.txt>
vault token create --ttl=10m

juju export-bundle
https://pastebin.canonical.com/p/cCBN5PnYRx/

juju show-status-log keystone/0 --days 1
https://paste.ubuntu.com/p/36fJJpXJ4Q/

keystone/0* error idle 0/lxd/5 10.232.46.164 5000/tcp hook failed: "identity-service-relation-changed"
  hacluster-keystone/1 active idle 10.232.46.164 Unit is ready and clustered
  keystone-saml-mellon/1 active idle 10.232.46.164 Unit is ready
keystone/1 active idle 1/lxd/5 10.232.46.157 5000/tcp Unit is ready
  hacluster-keystone/0* active idle 10.232.46.157 Unit is ready and clustered
  keystone-saml-mellon/0* active idle 10.232.46.157 Unit is ready

https://private-fileshare.canonical.com/~dima/charm-dumps/10-05-2019-keystone-0-var-log-etc.tar.gz

https://private-fileshare.canonical.com/~dima/charm-dumps/10-05-2019-var-lib-juju-agents-keystone-0-hacluster.tar.gz

sqlite3 /var/lib/juju/agents/unit-keystone-0/charm/.unit-state.db
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite> select * from kv;
charm_revisions|["0"]
env|{"CHARM_DIR": "/var/lib/juju/agents/unit-keystone-0/charm", "JUJU_CHARM_DIR": "/var/lib/juju/agents/unit-keystone-0/charm", "JUJU_CONTEXT_ID": "keystone/0-identity-service-relation-changed-866489089749738405", "JUJU_AGENT_SOCKET": "@/var/lib/juju/agents/unit-keystone-0/agent.socket", "JUJU_UNIT_NAME": "keystone/0", "JUJU_MODEL_UUID": "bfd89e39-f481-4e7f-8a1c-fd9f4a35b898", "JUJU_MODEL_NAME": "openstack", "JUJU_API_ADDRESSES": "10.232.1.60:17070", "JUJU_SLA": "unsupported", "JUJU_MACHINE_ID": "0/lxd/5", "JUJU_PRINCIPAL_UNIT": "", "JUJU_AVAILABILITY_ZONE": "default", "JUJU_VERSION": "2.6-rc2", "CLOUD_API_VERSION": "", "JUJU_CHARM_HTTP_PROXY": "http://10.232.0.1:3128", "JUJU_CHARM_HTTPS_PROXY": "http://10.232.0.1:3128", "JUJU_CHARM_FTP_PROXY": "", "JUJU_CHARM_NO_PROXY": "10.0.0.0/8,192.168.0.0/16,172.16.0.0/12", "JUJU_METER_STATUS": "AMBER", "JUJU_METER_INFO": "not set", "JUJU_RELATION": "identity-service", "JUJU_RELATION_ID": "identity-service:38", "JUJU_REMOTE_UNIT": "cinder/1", "APT_LISTCHANGES_FRONTEND": "none", "DEBIAN_FRONTEND": "noninteractive", "PATH": "/var/lib/juju/tools/unit-keystone-0:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"}
unit|"keystone/0"
relid|"identity-service:38"
stat-password|"<redacted>"
fid-restart-nonce-keystone-fid-service-provider:92|...


Changed in charm-keystone:
status: Incomplete → Confirmed
summary: - [19.04] a keystone unit may end up using an incorrect localhost port
- after certificates relation is added
+ [19.04] keystone leader unit sometimes fails to add endpoints when
+ certificates relation is present
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote : Re: [19.04] keystone leader unit sometimes fails to add endpoints when certificates relation is present

Subscribed ~field-high.

The issue seems to be a race condition that depends on the relative ordering of keystone's relation-scoped events, and is reproducible in two cases:

1) set up a cloud without https and add "certificates" relation to vault later;
2) set up a cloud with "certificates" relations right away and unlock vault after the model mostly settles.

summary: [19.04] keystone leader unit sometimes fails to add endpoints when
- certificates relation is present
+ "certificates" relation is present
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote : Re: [19.04] keystone leader unit sometimes fails to add endpoints when "certificates" relation is present

The difference between the deployments where things go well and where they do not is when the database migration is triggered (shared-db-relation-changed).

1) there is no error if the database migration is done after the certificates relation is handled and TLS is set up for Keystone;
2) the error occurs when the database migration is done before TLS is set up (equivalent to setting up keystone without TLS and adding TLS later).

The issue can be fixed by (1) introducing use of the goal-state hook tool for the new-deployment case (see the sketch below) and (2) properly handling the conversion from running without TLS to running with TLS.
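
A minimal sketch of idea (1), assuming charm-helpers' hookenv.goal_state() wrapper (available with Juju >= 2.4); illustrative, not a proposed patch:

    from charmhelpers.core import hookenv

    def certificates_relation_expected():
        """True if the model's goal state includes a certificates relation."""
        # goal-state lists relations the model *will* have, even if the
        # corresponding hooks have not run on this unit yet.
        return 'certificates' in hookenv.goal_state().get('relations', {})

    def should_defer_endpoint_updates(tls_configured):
        # tls_configured is a hypothetical flag meaning certs are on disk
        # and Apache has been reconfigured for TLS.
        return certificates_relation_expected() and not tls_configured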

show-status-log for the lead keystone unit with a reproducer:
https://paste.ubuntu.com/p/36fJJpXJ4Q/ (full log)

10 May 2019 10:13:48Z juju-unit executing running start hook
10 May 2019 10:13:54Z workload blocked Missing relations: database <-----------------
# ...
10 May 2019 10:17:27Z juju-unit executing running shared-db-relation-changed hook
# ...
10 May 2019 10:19:45Z juju-unit executing running shared-db-relation-joined hook <-----------------
# ...
10 May 2019 10:25:52Z juju-unit executing running shared-db-relation-changed hook
10 May 2019 10:26:05Z workload maintenance Migrating the keystone database <-----------------
10 May 2019 10:27:06Z juju-unit executing running identity-service-relation-changed hook
10 May 2019 10:27:22Z juju-unit executing running shared-db-relation-changed hook
# ...

10 May 2019 10:38:52Z juju-unit executing running certificates-relation-joined hook <-----------------
10 May 2019 10:39:08Z juju-unit executing running identity-service-relation-joined hook
10 May 2019 10:39:21Z workload active Unit is ready
10 May 2019 10:39:23Z juju-unit executing running identity-service-relation-changed hook
10 May 2019 10:40:19Z juju-unit error hook failed: "identity-service-relation-changed" <-----------------

without a reproducer (different deployment):

https://paste.ubuntu.com/p/qX8csvqRWt/ (full log)

10 May 2019 21:32:23Z juju-unit executing running identity-service-relation-joined hook
10 May 2019 21:34:44Z juju-unit executing running certificates-relation-joined hook <-----------------
10 May 2019 21:35:03Z juju-unit executing running identity-service-relation-changed hook
10 May 2019 21:36:09Z juju-unit executing running certificates-relation-changed hook <-----------------
10 May 2019 21:36:33Z juju-unit executing running identity-service-relation-joined hook
10 May 2019 21:37:09Z juju-unit executing running keystone-fid-service-provider-relation-joined hook
10 May 2019 21:37:40Z juju-unit executing running identity-service-relation-joined hook
10 May 2019 21:38:35Z juju-unit executing running certificates-relation-joined hook <-----------------

# ...

10 May 2019 21:42:57Z juju-unit executing running certificates-relation-changed hook <-----------------

# ...

10 May 2019 21:51:07Z workload waiting Incomplete relations: database
10 May 2019...


Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Subscribed ~field-critical.

Raising to field-critical because I'm facing this bug on an ongoing customer deployment.

Ryan Beisner (1chb1n)
Changed in charm-keystone:
milestone: none → 19.07
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for your work on this.

I think there are two separate issues intertwined here.

A) Expectation for non-TLS to TLS migration:

1. A cloud exists.
2. Deploy vault.
3. Relate vault.
4. Use juju-wait to let everything except for vault settle.
5. Then unseal vault.

If it fails with that approach, please re-raise the bug.

B) Expectation for fresh deployments:

1. Deploy bundle which defines all of the applications, all of the relations.
2. Use juju-wait to let everything except for vault settle.
3. Unseal vault.
4. Then proceed.

If it fails with that approach, please re-raise the bug.

Ryan Beisner (1chb1n)
Changed in charm-keystone:
assignee: nobody → Ryan Beisner (1chb1n)
status: Confirmed → Incomplete
Revision history for this message
Przemyslaw Hausman (phausman) wrote :

Unsubscribing field-critical. Customer decided not to go with Vault.

Revision history for this message
James Page (james-page) wrote :

There is still a race condition in the keystone charm relating to when it bumps the internal port to 35337, which it does before the certificates have been written to disk and the TLS endpoints have been configured in Apache.

We should fix this irrespective of the workaround documented in #5.
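
A minimal sketch of the kind of guard this implies (illustrative, not the actual fix; the ssl directory path is taken from the earlier ls output): only account for the TLS port shift once certificate material actually exists on disk.

    import os

    SSL_DIR = '/etc/apache2/ssl/keystone'  # path from the ls output above

    def tls_material_present():
        """True once vault-issued certs have been written out for Apache."""
        try:
            return bool(os.listdir(SSL_DIR))
        except FileNotFoundError:
            return False

    def internal_api_port(public_port, clustered):
        offset = 10 if clustered else 0
        if tls_material_present():
            offset += 10  # only bump for the TLS layer once it is real
        return public_port - offset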

summary: [19.04] keystone leader unit sometimes fails to add endpoints when
- "certificates" relation is present
+ "certificates" relation is present but certs not installed and TLS not
+ configured.
David Ames (thedac)
Changed in charm-keystone:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-keystone:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-keystone:
milestone: 20.01 → 20.05
Revision history for this message
James Page (james-page) wrote :

I just hit this issue again, so I'm going to work on a fix.

Changed in charm-keystone:
status: Incomplete → Confirmed
importance: Undecided → High
assignee: Ryan Beisner (1chb1n) → James Page (james-page)
status: Confirmed → In Progress
James Page (james-page)
Changed in charm-keystone:
assignee: James Page (james-page) → nobody
David Ames (thedac)
Changed in charm-keystone:
milestone: 20.05 → 20.08
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Something related seems to be happening a lot on OSCI [0] at the moment. The test thinks keystone is ready and tries to download the certificates from a keystone unit via scp. However, the certificates haven't been written to the unit's filesystem yet.

juju.errors.JujuError: command failed: ['scp', '-i', '/var/lib/jenkins/.local/share/juju/ssh/juju_id_rsa', '-o', 'StrictHostKeyChecking=no', '-q', '-B', 'ubuntu@172.17.107.48:/usr/local/share/ca-certificates/keystone_juju_ca_cert.crt', 'tests/keystone_juju_ca_cert.crt']

[0] http://osci:8080/view/MojoMatrix/job/mojo_runner/22973/console

Changed in charm-keystone:
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

As Aurelien has noted, this is now a consistent bug in the mojo tests for designate-ha. I've run the test manually and noticed that the mojo test starts the vault unlocking script *before* keystone has finished its final set of relation changes. This sort of confirms what Aurelien is saying.

Therefore, the strategy for the mojo tests (at least) is to work out why they think the model is ready when keystone hasn't actually reached the stage of writing certificates; it's probably a change in keystone that delays writing certificates (because they weren't available yet), while the mojo code still thinks it can copy them.
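
On the test side, a guard like the following (an illustration using the plain juju CLI, not the mojo code) could poll for the CA file before attempting the scp:

    import subprocess
    import time

    def wait_for_remote_file(unit, path, timeout=300, interval=10):
        """Poll until `path` exists on `unit`, or the timeout expires."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            # 'test -f' exits 0 once the file exists on the unit
            rc = subprocess.call(['juju', 'ssh', unit, 'test', '-f', path])
            if rc == 0:
                return True
            time.sleep(interval)
        return False

    wait_for_remote_file(
        'keystone/0',
        '/usr/local/share/ca-certificates/keystone_juju_ca_cert.crt')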

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Comments https://bugs.launchpad.net/charm-keystone/+bug/1827668/comments/10 and https://bugs.launchpad.net/charm-keystone/+bug/1827668/comments/9 are unrelated to this bug. That failure was due to [1], which broke mojo because the relative 'tests/' directory did not exist to put the keystone_juju_ca_cert.crt file in.

[1] https://github.com/openstack-charmers/zaza-openstack-tests/commit/537473ad3addeaa3d1e4e2d0fd556aeaa4018eb2

Changed in charm-keystone:
assignee: Alex Kavanagh (ajkavanagh) → nobody
James Page (james-page)
Changed in charm-keystone:
milestone: 20.08 → none
Changed in charm-keystone:
status: In Progress → Triaged
Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

I'm getting this exact problem with focal/wallaby and the latest charms today.

One keystone unit goes into an error state on the identity-service-relation-changed hook, and executing the hook by hand shows it failing to connect to localhost:35337.

Because of specific project needs I'm doing things in this order:

- install ceph + vault first
- wait for them to settle
- unlock vault
- init CA (using auto option)
- wait for it to settle
- deploy the usual bundle for openstack with all the rest (using certificate relations for ssl)
- wait for it to settle

At this moment, the error happens.

Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

Sometimes it fails at the same point (in only one unit) in the "config-changed" hook, with logs showing:

unit-keystone-3: 23:04:51 ERROR unit.keystone/3.juju-log The call within manager.py failed with the error: 'Internal Server Error (HTTP 500)'. The call was: path=['resolve_domain_id'], args=('default',), kwargs={}, api_version=None
unit-keystone-3: 23:04:51 INFO unit.keystone/3.juju-log Retrying '_proxy_manager_call' 5 more times (delay=3)
unit-keystone-3: 23:04:55 ERROR unit.keystone/3.juju-log The call within manager.py failed with the error: 'Internal Server Error (HTTP 500)'. The call was: path=['resolve_domain_id'], args=('default',), kwargs={}, api_version=None

Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

It's always the leader that fails. Adding more units while the one in error is still there does not help (the new units will deploy, but the broken one will still be the one processing relation data). Deleting the unit in the error state leaves no leader.

With all other charms deployed, removing all keystone units and adding 3 units again leads to the same state: the leader unit goes into error as soon as it finishes installing (the other two go to active/idle).

Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

I tested deploying a very simple bundle containing only keystone instead of the whole of OpenStack (after the first step with ceph + vault + mysql-innodb-cluster, as described earlier).

So, to be more specific:

- install ceph + vault + mysql-innodb-cluster
- wait for them to settle
- unlock vault
- init CA (using auto option)
- wait for it to settle
- deploy a very small bundle with only keystone + hacluster-keystone + keystone-mysql-router
- wait for it to settle
- deploy rest of openstack

Interestingly, just after deploying this small keystone bundle, all three keystone units go active/idle.

Then, after I deploy the rest of OpenStack and other charms start relating to keystone, one of the keystone units fails.

Unit Workload Agent Machine Public address Ports Message
keystone/0* error idle 3/lxd/3 100.126.1.30 5000/tcp hook failed: "identity-service-relation-changed"

It cycles between "updating endpoint for xxx" and "hook failed: identity-service-relation-changed".

Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

When deploying only one unit, without hacluster, it initiates the database migration:

keystone/3* maintenance executing 3/lxd/24 100.126.1.98 5000/tcp Migrating the keystone database

And immediately afterwards it goes into an error state:

keystone/3* error idle 3/lxd/24 100.126.1.98 5000/tcp hook failed: "domain-backend-relation-changed"

Basically the same behavior as when using 3 units.

Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

I tested unsealing vault but _not_ starting the CA. Keystone started updating endpoints and then went into an error state, this time in "cluster-relation-changed". No SSL certificates had been emitted yet.

keystone/0* error idle 4/lxd/13 100.126.0.181 5000/tcp hook failed: "cluster-relation-changed"

Running the hook by hand inside a debug-hooks session, the result was very similar to the error described in comment #13.

root@juju-f8838a-4-lxd-13:/var/lib/juju/agents/unit-keystone-0/charm# ./hooks/cluster-relation-changed
Internal Server Error (HTTP 500)
Internal Server Error (HTTP 500)
Internal Server Error (HTTP 500)
Internal Server Error (HTTP 500)
Internal Server Error (HTTP 500)
Internal Server Error (HTTP 500)
Traceback (most recent call last):
  File "./hooks/cluster-relation-changed", line 937, in <module>
    main()
  File "./hooks/cluster-relation-changed", line 930, in main
    hooks.execute(sys.argv)
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/hookenv.py", line 956, in execute
    self._hooks[hook_name]()
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1903, in wrapped_f
    return restart_on_change_helper(
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/host.py", line 837, in restart_on_change_helper
    r = lambda_f()
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1904, in <lambda>
    (lambda: f(*args, **kwargs)),
  File "./hooks/cluster-relation-changed", line 547, in cluster_changed
    update_all_identity_relation_units()
  File "./hooks/cluster-relation-changed", line 340, in update_all_identity_relation_units
    identity_changed(relation_id=rid, remote_unit=unit)
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1903, in wrapped_f
    return restart_on_change_helper(
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/host.py", line 837, in restart_on_change_helper
    r = lambda_f()
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/contrib/openstack/utils.py", line 1904, in <lambda>
    (lambda: f(*args, **kwargs)),
  File "./hooks/cluster-relation-changed", line 445, in identity_changed
    add_service_to_keystone(relation_id, remote_unit)
  File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/keystone_utils.py", line 1895, in add_service_to_keystone
    add_endpoint(region=settings['region'],
  File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/keystone_utils.py", line 2116, in add_endpoint
    create_service_entry(service, service_type, desc,
  File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/keystone_utils.py", line 1001, in create_service_entry
    list_services = manager.list_services()
  File "/var/lib/juju/agents/unit-keystone-0/charm/hooks/keystone_utils.py", line 1226, in __call__
    return _proxy_manager_call(self._path, self.api_version, args, kwargs)
  File "/var/lib/juju/agents/unit-keystone-0/charm/charmhelpers/core/decorators.py", line 40, in _retry_on_exception_inner_2
    return f(*args, **kwargs)
  File "/var/lib/juju/agents/unit-keystone-0/charm/...


Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Subscribing field critical because this is affecting/blocking a customer deployment.

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Unsubscribing field-critical. The problem is solved. It turned out to be a syntax error in the LDAP config, which was making keystone itself crash at a specific point.

Revision history for this message
Ksawery Dziekoński (ksdziekonski) wrote :

Just throwing in a potential resolution to this: once all non-leader/follower keystone units have settled into an idle state, failing over the current leader in the blocked state (by stopping its machine's juju agent service and starting it again once leadership has been handed over to a different unit) seems to mitigate the issue.

Revision history for this message
Paul Goins (vultaire) wrote :

I also hit the above issue on a single keystone unit (the leader), and the above workaround worked for me as well.

Revision history for this message
Samuel Walladge (swalladge) wrote :

Still an issue today on keystone ussuri/stable channel, rev 599.
