percona-cluster crashes on artful deploys

Bug #1728132 reported by Ryan Beisner
This bug affects 2 people
Affects                              Status         Importance  Assigned to  Milestone
OpenStack Percona Cluster Charm      Invalid        High        Unassigned
percona-xtradb-cluster-5.6 (Ubuntu)  Fix Released   High        James Page
Artful                               Fix Committed  High        James Page
Bionic                               Fix Released   High        James Page

Bug Description

>> Impact <<
percona-xtradb-cluster 5.6 is unusable on Ubuntu >= artful

>> Test Case <<
Deploy the base OpenStack bundle using the OpenStack charms (this includes percona-xtradb-cluster-5.6); the deployment will fail as OpenStack services attempt to create database schemas, producing the stack trace in the original bug report.

>> Regression Potential <<
minimal; the changes to the packaging force use of gcc-6 (rather than gcc-7), which was known to work at zesty.

>> Original Bug Report <<
On Artful x86_64, percona-cluster shared-db-relation-changed hooks frequently, but not always, error out. Various db migrate operations fail (sometimes keystone, sometimes glance).

Juju unit logs indicate db migrate operations were underway on the api unit, and system logs indicate mysql (percona-cluster) crashed on the db unit.

Artifacts:
https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_amulet_full/openstack/charm-ceph/508041/4/743/consoleText.test_charm_amulet_full_1111.txt

percona-cluster/0* error idle 8 172.17.106.18 3306/tcp hook failed: "shared-db-relation-changed" for glance:shared-db

# A simple percona-cluster exercise on Artful; can readily reproduce with this on serverstack
series: artful
relations:
- - keystone:shared-db
  - percona-cluster:shared-db
- - glance:identity-service
  - keystone:identity-service
- - glance:amqp
  - rabbitmq-server:amqp
services:
  percona-cluster:
    charm: cs:~openstack-charmers-next/percona-cluster
    num_units: 1
    constraints: "cpu-cores=4 mem=4G"
    options:
      max-connections: 1000
      innodb-buffer-pool-size: 256M
  keystone:
    charm: cs:~openstack-charmers-next/keystone
    num_units: 1
    constraints: "cpu-cores=2 mem=2G"
    options:
      admin-password: openstack
      worker-multiplier: 0.25
  glance:
    charm: cs:~openstack-charmers-next/glance
    num_units: 1
    constraints: "cpu-cores=2 mem=2G"
    options:
      worker-multiplier: 0.25
  rabbitmq-server:
    charm: cs:~openstack-charmers-next/rabbitmq-server
    num_units: 1
    constraints: "cpu-cores=4 mem=4G"

DEBUG:bundletester.utils:Updating JUJU_MODEL: "auto-osci-sv06:admin/auto-osci-sv06" -> ""
DEBUG:bundletester.fetchers:git rev-parse HEAD: d0d41964227bc1dc46887fa77ba4be3bc5065fae

ERROR: InvocationError: '/var/lib/jenkins/checkout/0/ceph/.tox/func27/bin/bundletester -vl DEBUG -r json -o func-results.json --test-pattern gate-* --no-destroy'
___________________________________ summary ____________________________________
ERROR: func27: commands failed
 ! Amulet test failed.
Model Controller Cloud/Region Version SLA
auto-osci-sv06 auto-osci-sv06 serverstack/serverstack 2.2.4 unsupported

App Version Status Scale Charm Store Rev OS Notes
ceph 12.2.0 active 3 ceph local 105 ubuntu
ceph-osd 12.2.0 active 1 ceph-osd jujucharms 297 ubuntu
cinder 11.0.0 waiting 1 cinder jujucharms 298 ubuntu
cinder-ceph waiting 0/1 cinder-ceph jujucharms 247 ubuntu
glance 15.0.0 waiting 1 glance jujucharms 292 ubuntu
keystone 12.0.0 waiting 1 keystone jujucharms 322 ubuntu
nova-compute 16.0.1 active 1 nova-compute jujucharms 336 ubuntu
percona-cluster 5.6.34-26.19 error 1 percona-cluster jujucharms 274 ubuntu
rabbitmq-server 3.6.10 active 1 rabbitmq-server jujucharms 280 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-osd/0* active idle 3 172.17.106.17 Unit is ready (2 OSD)
ceph/0* active idle 0 172.17.106.10 Unit is ready and clustered
ceph/1 active idle 1 172.17.106.5 Unit is ready and clustered
ceph/2 active idle 2 172.17.106.3 Unit is ready and clustered
cinder/0* waiting executing 4 172.17.106.20 8776/tcp Incomplete relations: messaging, identity, database
  cinder-ceph/0* waiting allocating 172.17.106.20 agent initializing
glance/0* waiting executing 5 172.17.106.22 9292/tcp Incomplete relations: identity, database
keystone/0* waiting executing 6 172.17.106.21 5000/tcp Incomplete relations: database
nova-compute/0* active executing 7 172.17.106.4 Unit is ready
percona-cluster/0* error idle 8 172.17.106.18 3306/tcp hook failed: "shared-db-relation-changed" for glance:shared-db
rabbitmq-server/0* active idle 9 172.17.106.26 5672/tcp Unit is ready

Machine State DNS Inst id Series AZ Message
0 started 172.17.106.10 cc716b8d-f4ff-450d-b088-a3742a5ffffd artful nova ACTIVE
1 started 172.17.106.5 e0498107-15fe-4a5e-93a2-a1f6cfd9523b artful nova ACTIVE
2 started 172.17.106.3 6a02e6cc-2fdb-4b81-b4ab-0467c5e5705d artful nova ACTIVE
3 started 172.17.106.17 cc5a2f76-73a6-4f50-a97e-ae391ed96e63 artful nova ACTIVE
4 started 172.17.106.20 4d42c847-9dd3-4f1a-a668-0b5eb7819b9a artful nova ACTIVE
5 started 172.17.106.22 8dd50778-396b-4c27-b063-34ff63ddb3f2 artful nova ACTIVE
6 started 172.17.106.21 58eac09b-88d5-4f1b-a97d-b22176cc231f artful nova ACTIVE
7 started 172.17.106.4 534d23a0-5c30-451a-8963-68057bac27f8 artful nova ACTIVE
8 started 172.17.106.18 091cb17d-5898-44b4-9f0a-3bffbe808d05 artful nova ACTIVE
9 started 172.17.106.26 eb03b662-332e-48d5-8b5d-c39e3deadb4a artful nova ACTIVE

Relation provider Requirer Interface Type
ceph:client cinder-ceph:ceph ceph-client regular
ceph:client glance:ceph ceph-client regular
ceph:client nova-compute:ceph ceph-client regular
ceph:mon ceph:mon ceph peer
ceph:osd ceph-osd:mon ceph-osd regular
cinder-ceph:storage-backend cinder:storage-backend cinder-backend subordinate
cinder:cluster cinder:cluster cinder-ha peer
glance:cluster glance:cluster glance-ha peer
glance:image-service cinder:image-service glance regular
glance:image-service nova-compute:image-service glance regular
keystone:cluster keystone:cluster keystone-ha peer
keystone:identity-service cinder:identity-service keystone regular
keystone:identity-service glance:identity-service keystone regular
nova-compute:compute-peer nova-compute:compute-peer nova peer
percona-cluster:cluster percona-cluster:cluster percona-cluster peer
percona-cluster:shared-db cinder:shared-db mysql-shared regular
percona-cluster:shared-db glance:shared-db mysql-shared regular
percona-cluster:shared-db keystone:shared-db mysql-shared regular
percona-cluster:shared-db nova-compute:shared-db mysql-shared regular
rabbitmq-server:amqp cinder:amqp rabbitmq regular
rabbitmq-server:amqp glance:amqp rabbitmq regular
rabbitmq-server:amqp nova-compute:amqp rabbitmq regular
rabbitmq-server:cluster rabbitmq-server:cluster rabbitmq-ha peer

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
David Ames (thedac) wrote :

Definitely a hard crash:

16:03:03 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=2
max_threads=1002
thread_count=4
connection_count=2
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 431972 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x55bcb9bf91b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

An initial look at the memory settings seems sane. We need to figure out whether this is charm- or package-related.

Changed in charm-percona-cluster:
importance: Undecided → Critical
Revision history for this message
James Page (james-page) wrote :
Revision history for this message
James Page (james-page) wrote :

Looking at the backtrace, signal is being generated in this code block:

    if (UNIV_UNLIKELY(thr && thr_get_trx(thr)->fake_changes)) {
        /* skip CHANGE, LOG */
        *big_rec = big_rec_vec;
        return(err); /* == DB_SUCCESS */
    }

thr is 0x0 (NULL) in this case, so the && should short-circuit and thr_get_trx should not be evaluated at all, but it is.
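
For illustration, a minimal standalone sketch of the short-circuit behaviour the expression relies on (hypothetical stand-in types and field names, not the actual que0que.ic definitions): with thr == NULL the right-hand side of && must never be evaluated, so a crash inside thr_get_trx(thr=0x0) means the guard was effectively lost in the generated code.

    /* sketch only: stand-ins, not the PXC 5.6 sources */
    #include <assert.h>
    #include <stddef.h>

    struct trx_t     { int fake_changes; };
    struct que_thr_t { struct trx_t *trx; };   /* hypothetical layout */

    static struct trx_t *thr_get_trx(struct que_thr_t *thr)
    {
        assert(thr != NULL);   /* frame #0 of the backtrace lands here with thr=0x0 */
        return thr->trx;
    }

    static int skip_change_log(struct que_thr_t *thr)
    {
        /* && short-circuits: with thr == NULL, thr_get_trx() must never be called */
        return thr && thr_get_trx(thr)->fake_changes;
    }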

Revision history for this message
James Page (james-page) wrote :

Reference:

#0 0x0000563d4629c8be in thr_get_trx (thr=0x0) at /build/percona-xtradb-cluster-5.6-EJAmtv/percona-xtradb-cluster-5.6-5.6.34-26.19/storage/innobase/include/que0que.ic:38

..

#1 btr_cur_optimistic_insert (flags=flags@entry=7, cursor=cursor@entry=0x7fc078270a70, offsets=offsets@entry=0x7fc078270a60, heap=heap@entry=0x7fc078270a68, entry=entry@entry=0x7fc04403c938, rec=rec@entry=0x7fc078270a58, big_rec=0x7fc078270a50, n_ext=<optimized out>, thr=0x0, mtr=0x7fc078271310) at /build/percona-xtradb-cluster-5.6-EJAmtv/percona-xtradb-cluster-5.6-5.6.34-26.19/storage/innobase/btr/btr0cur.cc:1510

Revision history for this message
James Page (james-page) wrote :

UNIV_UNLIKELY is a macro that hints to the compiler - I wonder whether that causes us pain with gcc-7
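
For reference, a sketch of what UNIV_UNLIKELY conventionally expands to in InnoDB-derived code (assumed, not copied from the PXC 5.6 tree, where the exact definition may differ): it wraps GCC's __builtin_expect, which only influences branch prediction and code layout and must not change which operands of && get evaluated.

    /* sketch of the conventional definition (assumption, not the PXC source) */
    #define UNIV_EXPECT(expr, constant)  __builtin_expect((expr), (constant))
    #define UNIV_UNLIKELY(cond)          UNIV_EXPECT((cond) != 0, 0)

    /* the hint affects code layout only; semantically the guarded condition
       if (UNIV_UNLIKELY(thr && thr_get_trx(thr)->fake_changes)) { ... }
       must still short-circuit when thr is NULL */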

Revision history for this message
James Page (james-page) wrote :
Changed in charm-percona-cluster:
status: New → Invalid
Revision history for this message
James Page (james-page) wrote :

Marking the charm bug task as invalid; this appears to be a gcc-7-related issue.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in percona-xtradb-cluster-5.6 (Ubuntu):
status: New → Confirmed
Ryan Beisner (1chb1n)
Changed in charm-percona-cluster:
importance: Critical → High
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Right now, OpenStack Charm Engineering are unable to test Bionic and Artful OpenStack clouds due to this issue, which puts us at a standstill in preparing for the next LTS.

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
Ryan Beisner (1chb1n) wrote :

As a workaround to enable testing of OpenStack on >= Artful, we can force the percona-cluster Juju units to series: xenial. This will allow everything in the deployment, except for percona-cluster, to be validated on the later series.
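
For example, a sketch of how the bundle above could pin just percona-cluster back to xenial (assuming the per-application series key is honoured by this Juju version; not verified against this deployment):

    series: artful
    services:
      percona-cluster:
        charm: cs:~openstack-charmers-next/percona-cluster
        series: xenial        # workaround: pin only this application to xenial
        num_units: 1
      # keystone, glance, rabbitmq-server, etc. stay on the default artful series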

Revision history for this message
James Page (james-page) wrote :

I think we need to unblock this in the short term while work happens on 5.7; so I've built PXC packages using gcc-5:

 https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3060

and gcc-6 for testing purposes:

 https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3061

If gcc-6-built binaries prove stable, I'd propose we force PXC 5.6 builds to use gcc-6 (which is trivial to do); I checked with doko and both gcc-5 and gcc-6 will remain in the archive (albeit gcc-5 will be in universe) for bionic, so we can 'fix' things this way.

Changed in percona-xtradb-cluster-5.6 (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Changed in percona-xtradb-cluster-5.6 (Ubuntu Artful):
status: New → Triaged
importance: Undecided → High
assignee: nobody → James Page (james-page)
Changed in percona-xtradb-cluster-5.6 (Ubuntu Bionic):
assignee: nobody → James Page (james-page)
James Page (james-page)
description: updated
Revision history for this message
James Page (james-page) wrote :

gcc-6-built binaries pass testing (deployment and Tempest OpenStack tests); proposing binaries built with gcc-6 as a solution to this problem. This does not remove the requirement to move forward with PXC 5.7 built using gcc-7.

Changed in percona-xtradb-cluster-5.6 (Ubuntu Artful):
status: Triaged → In Progress
Changed in percona-xtradb-cluster-5.6 (Ubuntu Bionic):
status: Triaged → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package percona-xtradb-cluster-5.6 - 5.6.34-26.19-0ubuntu5

---------------
percona-xtradb-cluster-5.6 (5.6.34-26.19-0ubuntu5) bionic; urgency=medium

  * Switch back to using gcc-6 as gcc-7 results in a
    broken PXC (LP: #1728132):
    - d/rules,control: BD on g{cc,++}-6, force use during build.
    - d/p/series: Disable gcc-7 related patches.
    - d/p/ibuf-uses-full-memory-barrier-ppc64.patch: Revert gcc-7
      specific changes.
    - d/rules: Drop gcc-7 related overrides for errors.
  * d/control: Update Vcs fields for Ubuntu.

 -- James Page <email address hidden> Thu, 30 Nov 2017 13:55:40 +0000

Changed in percona-xtradb-cluster-5.6 (Ubuntu Bionic):
status: In Progress → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Ryan, or anyone else affected,

Accepted percona-xtradb-cluster-5.6 into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/percona-xtradb-cluster-5.6/5.6.34-26.19-0ubuntu4.17.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in percona-xtradb-cluster-5.6 (Ubuntu Artful):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-artful