vault fails to start when MySQL backend down

Bug #1818973 reported by James Page
This bug affects 4 people
Affects      Status   Importance  Assigned to  Milestone
vault-charm  Triaged  High        Unassigned

Bug Description

Whilst performing some full outage reboot testing, I noticed that vault daemons sometimes fail to start if the backend MySQL database is not contactable.

We should probably tune the systemd unit to cope with this, as vault itself does not seem to offer a way to configure retries or the like.

Tags: cold-start seg
James Page (james-page)
Changed in vault-charm:
status: New → Triaged
importance: Undecided → Medium
Trent Lloyd (lathiat)
tags: added: seg
Aurelien Lourot (aurelien-lourot) wrote :

This can be reproduced with cs:vault-35 :

1. Deploy, initialize and unseal a vault:

juju deploy cs:vault-35
juju deploy mysql
juju relate vault mysql
export VAULT_ADDR="http://$(juju run --unit vault/0 'unit-get private-address'):8200"
vault operator init -key-shares=5 -key-threshold=2
vault operator unseal A1I4gVtqqFoDBEoQznosX+kmCnFRqOiNhq4Xq5GZtR9y
vault operator unseal K/ASAolWEA1ngDidJ+yEsUP1q6mK4I5tK2GRH+RrRQsv
export VAULT_TOKEN=s.sNyy3wrtrNoVUDj0NCWmMyUd
vault token create -ttl=10m
juju run-action --wait vault/0 authorize-charm token=s.vNrCZl8c9qUXueQhgqrr0kQP

2. Pause the vault and stop mysql:

juju run-action --wait vault/0 pause # -> blocked: Vault service not running
juju run --unit mysql/0 -- systemctl stop mysql

3. Resume the vault:

juju run-action --wait vault/0 resume # -> still blocked: Vault service not running

Logs show:

2020-03-09 14:34:17 ERROR juju-log Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/urllib3/connection.py", line 157, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/var/lib/juju/agents/unit-vault-0/.venv/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

4. Resurrect mysql:

juju run --unit mysql/0 -- systemctl start mysql

Expected: the vault service eventually recovers on its own.
Actual: the vault service remains stopped until the operator runs the 'resume' action again.

Changed in vault-charm:
status: Triaged → Confirmed
Aurelien Lourot (aurelien-lourot) wrote :

It is also possible to reach "blocked: Vault health check failed" instead of "blocked: Vault service not running".

Aurelien Lourot (aurelien-lourot) wrote :

In the case of "blocked: Vault health check failed", the vault service is still running and recovers once MySQL comes back up. In the case of "blocked: Vault service not running", it does not recover on its own, but systemd has marked the service as failed, so it should indeed be possible to solve the issue by tuning the Restart*= settings on the systemd unit.
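
A minimal sketch of that kind of Restart*= tuning, not the charm's actual fix. It assumes the unit is named vault.service and that a drop-in directory is an acceptable place for the override; neither is confirmed in this report. The helper name write_restart_dropin, the file name restart.conf and the 10-second delay are hypothetical choices made for illustration.

```shell
# Hypothetical helper: write a systemd drop-in that makes a failed service
# restart indefinitely. The argument is the drop-in directory, e.g.
# /etc/systemd/system/vault.service.d (unit name is an assumption).
write_restart_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Unit]
# Disable start rate limiting so systemd never gives up on the unit.
StartLimitIntervalSec=0

[Service]
# Restart whenever the daemon exits with an error (e.g. MySQL unreachable).
Restart=on-failure
# Wait between attempts; 10 seconds is an arbitrary value.
RestartSec=10
EOF
}
```

It could be invoked as `write_restart_dropin /etc/systemd/system/vault.service.d` followed by `systemctl daemon-reload`. A charm-side fix would more likely set the same options directly in the unit file it installs.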

tags: added: cold-start
Changed in vault-charm:
status: Confirmed → Triaged
importance: Medium → High
Trent Lloyd (lathiat) wrote :

I filed Bug #1871539 to handle the problem where the charm gets stuck in the blocked state and cannot be brought out of it (even with the resume action or any other juju incantation) due to faults in the flag logic.

This bug can then cover the general problem that vault does not keep retrying the MySQL connection for a while.
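
As a rough illustration of what "retrying for a while" could look like outside of vault itself, here is a generic bounded retry loop in POSIX shell. The function name retry and its argument order are hypothetical; nothing like it exists in the charm as far as this report shows.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most MAX_ATTEMPTS times, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success and 1 if every
# attempt failed. Note: POSIX sh has no local variables, so the names
# below are global.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```

For example, `retry 30 10 nc -z "$MYSQL_HOST" 3306` would wait up to roughly five minutes for the database port to accept connections before giving up; MYSQL_HOST is a placeholder, and port 3306 assumes the MySQL default.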
