Bug #1775864 “kolla extend_start.sh doesn't check the start of M...” : Bugs : kolla

Sergii Golovatiuk (sgolovatiuk) on 2018-06-08

Changed in kolla:
status:	New → Confirmed

Sergii Golovatiuk (sgolovatiuk) on 2018-06-08

Changed in tripleo:
assignee:	nobody → Sergii Golovatiuk (sgolovatiuk)
importance:	Undecided → Medium
importance:	Medium → High
status:	New → Confirmed
milestone:	none → rocky-3

Bogdan Dobrelya (bogdando) on 2018-06-11

no longer affects:	tripleo/pike
no longer affects:	tripleo/queens
Changed in tripleo:
status:	Confirmed → Triaged

Damien Ciabrini (dciabrin) wrote on 2018-06-13:

#1

[Current analysis of what failed in the CI job - sorry for the long post]

TL;DR using "mysqladmin ping" is probably a better way to ensure mysqld started properly during kolla bootstrap.

There's a small typo in the link in the description; the failure log is available at [1].

Quick tripleo refresher: during undercloud install, we bootstrap the mysql db by running the kolla bootstrap script [2] in a transient container (no DB exists before that point).

The bootstrapping happens in two steps: run mysql_install_db, then run some kolla-specific commands to set up the root password.

extend_start.sh is sourced by kolla_start, which runs with the -x and -e flags, so we trace all shell commands and stop on the first error.
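
To make the effect of those two flags concrete, here is a tiny illustration (not taken from kolla; `run_traced` and the commands inside it are made up for the example). With errexit and xtrace both enabled, each command is echoed with a `+` prefix before it runs, and the script stops at the first failing command:

```shell
#!/bin/bash
# Illustration only: run a throwaway script with errexit and xtrace enabled.
# run_traced is a hypothetical helper name, not part of kolla.
run_traced() {
    # 2>&1 merges the xtrace output (written to stderr) with stdout.
    bash -c 'set -o errexit
             set -o xtrace
             echo first
             false
             echo never-reached' 2>&1
}
```

Calling `run_traced` should show the trace of `echo first` and `false`, but nothing for the last `echo`, because the shell exits as soon as `false` fails. That is why reaching `++ bootstrap_db` in the install log implies the previous command succeeded.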

1) From the install log, we see that mysql_install_db went ok, because we continue and call bootstrap_db:

2018-06-05 16:24:34 |         "++ mysql_install_db",
2018-06-05 16:24:34 |         "2018-06-05 16:24:14 139787632818368 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 |         "2018-06-05 16:24:14 139787632818368 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 46 ...",
2018-06-05 16:24:34 |         "2018-06-05 16:24:18 139859719940288 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 |         "2018-06-05 16:24:18 139859719940288 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 75 ...",
2018-06-05 16:24:34 |         "2018-06-05 16:24:21 139917192714432 [Warning] option 'open_files_limit': unsigned value 18446744073709551615 adjusted to 4294967295",
2018-06-05 16:24:34 |         "2018-06-05 16:24:21 139917192714432 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 104 ...",
2018-06-05 16:24:34 |         "++ bootstrap_db",

(internally, this caused three transient mysqld runs, as seen in [3])
2018-06-08  7:33:11 139673585006784 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 0
2018-06-08  7:33:15 139700819671232 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622818
2018-06-08  7:33:19 139684735236288 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622828

Finishing OK means any pidfile and sock file on disk were deleted when mysqld_safe stopped.

2) Then comes the kolla-specific initialization.
 . In that step, we start a mysql server _without network connectivity_ and _without galera replication_.
 . A while loop [4] waits until it sees that both the mysql{d}.sock unix socket _and_ /var/lib/mysql/mariadb.pid are created.

The while loop is a fairly good guarantee that mysql is up, because mysqld_safe force-deletes any lingering /var/lib/mysql/mariadb.pid [5] before starting the mysqld server. Also, mysql{d}.sock is always deleted and recreated because of the way named UNIX sockets are initialized.

I've tested it manually, and it seems that the while loop only finishes when both the socket and the pidfile are created.
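
For readers who don't want to open [4], the kind of wait described above can be sketched roughly as follows. This is a simplified approximation, not kolla's actual code: the function name, its arguments and the retry defaults are assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical helper, not kolla's real loop: wait until a UNIX socket and a
# pidfile both exist, or give up once the retry budget is exhausted.
wait_for_socket_and_pidfile() {
    local sock="$1" pidfile="$2" attempts="${3:-60}" delay="${4:-1}"
    # -S is true only for an existing socket, -f for an existing regular file.
    while [[ ! -S "$sock" || ! -f "$pidfile" ]]; do
        if (( attempts <= 0 )); then
            return 1
        fi
        attempts=$(( attempts - 1 ))
        sleep "$delay"
    done
    return 0
}
```

It could be called as `wait_for_socket_and_pidfile /var/lib/mysql/mysql.sock /var/lib/mysql/mariadb.pid`. Note that a check like this only proves that the two files exist; it says nothing about whether the server can already answer queries.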

From the mariadb logs [3], we see that the server properly started at 7:33:22:

2018-06-08  7:33:22 140670143490240 [Note] /usr/libexec/mysqld: ready for connections.
Version: '10.1.20-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MariaDB Server

But for some reason, kolla_security_reset [6] (a script based on 'expect') failed to run to completion.
It passed the first expect string (because that one doesn't require a connection to mysqld), but failed at the second (which requires sending SQL commands to mysqld):

2018-06-08 07:33:45 |         "Failed to get 'Set root password?' prompt",

---

It's still unclear to me what stops the mysqld server right before the kolla script failure:

2018-06-08  7:33:42 140670142007040 [Note] /usr/libexec/mysqld: Normal shutdown

Nor what starts another mysql after the whole bootstrap container exits in error:

2018-06-08 07:33:45 |         "Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 'config_id=tripleo_step2', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 1, \"image\": \"docker.io/tripleomaster/centos-binary-

180608 07:33:45 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

(it's not the init container though, because we only start mysqld without network and the last start has network enabled.)

From this point, I can only assume that the mysql server could not process SQL commands for whatever reason, even though it seems up and ready from the logs, and that's what confused expect.

Maybe a stronger test in the while loop, such as "mysqladmin ping", would ensure that we can talk to the DB, or fail cleanly, before running the next kolla steps.
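
As a rough sketch of what such a stronger test could look like (an illustration, not a proposed patch): the helper below retries an arbitrary probe command until it succeeds or the retry budget runs out. The name `wait_until_ready`, its argument order and the socket path in the usage line are assumptions made for the example.

```shell
#!/bin/bash
# Sketch only: retry a probe command up to <attempts> times, sleeping <delay>
# seconds between tries. Returns 0 as soon as the probe succeeds, 1 otherwise.
wait_until_ready() {
    local attempts="$1" delay="$2"
    shift 2
    # The remaining arguments are the probe command, for example:
    #   mysqladmin --socket=/var/lib/mysql/mysql.sock ping
    while ! "$@" >/dev/null 2>&1; do
        if (( attempts <= 1 )); then
            return 1
        fi
        attempts=$(( attempts - 1 ))
        sleep "$delay"
    done
    return 0
}
```

A bootstrap script could then run something like `wait_until_ready 60 1 mysqladmin --socket=/var/lib/mysql/mysql.sock ping || exit 1` before the security reset. Unlike the file-existence check, `mysqladmin ping` actually connects to the server, so it only succeeds once mysqld accepts connections.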

[1] http://logs.openstack.org/19/572319/3/gate/tripleo-ci-centos-7-undercloud-containers/8099829/logs/undercloud/home/zuul/undercloud_install.log.txt.gz

[2] https://github.com/openstack/kolla/blob/master/docker/mariadb/extend_start.sh

[3] http://logs.openstack.org/19/572319/3/gate/tripleo-ci-centos-7-undercloud-containers/8099829/logs/undercloud/var/log/containers/mysql/mariadb.log.txt.gz

[4] https://github.com/openstack/kolla/blob/master/docker/mariadb/extend_start.sh#L9-L18

[5] https://github.com/MariaDB/server/blob/10.3/scripts/mysqld_safe.sh#L960

[6] https://github.com/openstack/kolla/blob/master/docker/mariadb/security_reset.expect

OpenStack Infra (hudson-openstack) wrote on 2018-06-13: Change abandoned on kolla (master)

#2

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: master
Review: https://review.openstack.org/573742
Reason: it doesn't solve the initial issue

Emilien Macchi (emilienm) on 2018-07-26

Changed in tripleo:
milestone:	rocky-3 → rocky-rc1

Alex Schultz (alex-schultz) on 2018-08-14

Changed in tripleo:
milestone:	rocky-rc1 → stein-1

Damien Ciabrini (dciabrin) wrote on 2018-10-23:

#3

OK I revisited this bug and I think I understood all the missing bits and strange logs. I'm fairly confident the issue is understood at this point.

TL;DR: under heavy load, the setup of the mariadb container may fail because the mysql user configuration times out before the server has a chance to evaluate kolla's SQL script.

If you want the full story, brace yourself and read the absurdly long explanation below...

In my previous comment I had one point wrong, and I couldn't figure out why mariadb was stopped a couple of seconds _after_ the bootstrap script failed. Let me clear up all the doubts here.

Logs used to analyze the failed CI jobs are available here:
 . undercloud installation log [1]
 . undercloud's journal [2]
 . mariadb log [3]

0) First a refresher:
---------------------

In a tripleo-deployed stack, the mysql server runs three times throughout deployment:

* during deployment step 2 - it runs twice in a transient container 'mysql_bootstrap'
  1. mysql server is first started for the kolla bootstrap [4], without network, without wsrep replication
     it's started to set root password, delete default users, and then stopped
  2. another mysql server is started by t-h-t [5], without network, without wsrep replication
     it's started to create a healthcheck user, and then stopped

* during deployment step 3 - it runs the database service in container 'mysql'
  3. it's started by t-h-t, with network. This one is never stopped

undercloud_install.log [1] errors out at 7:33:45 and dumps all the stdout and stderr of the container since it started at 7:33:07.

Now let me update the incomplete/incorrect explanations I gave in comment #1. The complete sequence of events is the following:

1) Kolla prepares the mysql db on disk
--------------------------------------

Around 7:33:08, during the kolla bootstrap, the first thing that runs is mysql_install_db, which internally invokes InnoDB three times to create all the files on disk, but it _doesn't start_ mysqld just yet.

Jun 08 07:33:08 undercloud.localdomain dockerd-current[28532]: ++ mysql_install_db
Jun 08 07:33:08 undercloud.localdomain dockerd-current[28532]: Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...

The three InnoDB initializations (not to be confused with the 3 runs of mysqld) show up in the mariadb.log:

2018-06-08 7:33:11 139673585006784 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 0
2018-06-08 7:33:15 139700819671232 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622818
2018-06-08 7:33:19 139684735236288 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.34-79.1 started; log sequence number 1622828

At this point, the DB is created on disk, mysqld hasn't started yet, and neither a pidfile nor mysqld.sock has been created.

That part completes successfully.

2) Kolla spawns mysql server and _correctly_ waits for it to be available
-------------------------------------------------------------------------

Kolla's bash function 'bootstrap_db' [4] starts at 7:33:21, and schedules a background bash command to start the mysql server

the next...
