(latest/edge) prometheus-relation-joined hook can cause a mysql error during deployment

Bug #2018385 reported by Alex Kavanagh
This bug affects 1 person
Affects                      Status         Importance  Assigned to  Milestone
MySQL InnoDB Cluster Charm   Fix Committed  Undecided   Unassigned
Jammy                        New            Undecided   Unassigned

Bug Description

During the deployment of a cluster in the gate, there is a race hazard in which the prometheus-relation-joined hook can trigger a MySQL error in the `create_user` method.

The issue is that if a transaction (with a commit) is attempted whilst Group Replication is recovering on the cluster, the commit fails hard with the following error:

    MySQLdb.OperationalError: (3100, "Error on observer while running replication hook 'before_commit'.")

The trace from the error.log file provides more details:

2023-04-22T07:59:58.834202Z 0 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-04-22T07:59:58.834268Z 0 [Warning] [MY-013469] [Repl] Plugin group_replication reported: 'This member will start distributed recovery using clone. It is due to the number of missing transactions being higher than the configured threshold of 1.'
2023-04-22T07:59:59.836244Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Cloning from a remote group donor.'
2023-04-22T07:59:59.837981Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 172.16.0.174:3306, 172.16.0.245:3306, 172.16.0.101:3306 on view 16821495149807797:5.'
2023-04-22T07:59:59.840913Z 38 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
2023-04-22T08:00:00.285589Z 39 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Started
2023-04-22T08:00:00.550852Z 39 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Finished
2023-04-22T08:00:01.404748Z 41 [ERROR] [MY-011600] [Repl] Plugin group_replication reported: 'Transaction cannot be executed while Group Replication is recovering. Try again when the server is ONLINE.'

Essentially, what seems to be happening is that the prometheus-relation-joined hook fires shortly after the vault-relation-joined hook. The vault-relation-joined hook had switched the instance from non-TLS to TLS (using the cert from vault), which caused Group Replication to be restarted, so the prometheus hook ran while recovery was still in progress.
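
The last log line says to try again when the server is ONLINE. As an illustration only, the member state could be checked before attempting the write. This is a minimal sketch and not what the charm does: `member_is_online` is a hypothetical helper, and it assumes a DB-API style cursor (as provided by MySQLdb) connected to the local instance. The query reads the `performance_schema.replication_group_members` table that MySQL 8 exposes for Group Replication.

```python
# Hypothetical helper, not part of the charm: checks the local member's
# Group Replication state through a DB-API cursor.
MEMBER_STATE_SQL = (
    "SELECT MEMBER_STATE FROM performance_schema.replication_group_members "
    "WHERE MEMBER_ID = @@server_uuid"
)


def member_is_online(cursor):
    """Return True only if the local member reports MEMBER_STATE 'ONLINE'."""
    cursor.execute(MEMBER_STATE_SQL)
    row = cursor.fetchone()
    # No row means the instance is not (yet) a member of any group.
    return row is not None and row[0] == "ONLINE"
```

Such a check narrows the race window but cannot close it, since the state can change between the check and the commit, so the commit itself still needs to tolerate the 3100 error.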

Possible solution:
------------------

The solution is to retry the commit several times when the 3100 error occurs (to allow Group Replication to finish recovering) and, if it still fails, to return False so that the handler will try again on the next hook execution. This would allow the unit to recover gracefully from the error.
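
The retry idea above can be sketched roughly as follows. This is not the charm's actual code: `commit_with_retry` is a hypothetical name, the `OperationalError` class is a stand-in for `MySQLdb.OperationalError` so that the example runs without a database, and the defaults of 6 attempts at 10 second intervals are taken from the commit message of the merged fix.

```python
import time

GR_BEFORE_COMMIT_ERROR = 3100  # error code from the traceback in this report


class OperationalError(Exception):
    """Stand-in for MySQLdb.OperationalError (assumption for this sketch)."""


def commit_with_retry(commit, attempts=6, delay=10, sleep=time.sleep):
    """Call commit(); retry on error 3100 and return False if it never succeeds.

    Any other OperationalError is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            commit()
            return True
        except OperationalError as e:
            code = e.args[0] if e.args else None
            if code != GR_BEFORE_COMMIT_ERROR:
                raise
            # Wait for Group Replication to recover, except after the last try.
            if attempt < attempts:
                sleep(delay)
    return False
```

A caller would pass the connection's commit method, for example `commit_with_retry(connection.commit)`, and treat a False result as "not done yet" so that the handler runs again on the next hook.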

Tags: sts
Seyeong Kim (seyeongkim)
tags: added: sts
Changed in charm-mysql-innodb-cluster:
assignee: nobody → Alex Kavanagh (ajkavanagh)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (master)
Changed in charm-mysql-innodb-cluster:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/883300
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/5fb5a05be128a2ec6a912394aacd14b44eb82998
Submitter: "Zuul (22348)"
Branch: master

commit 5fb5a05be128a2ec6a912394aacd14b44eb82998
Author: Alex Kavanagh <email address hidden>
Date: Tue May 16 20:21:43 2023 +0100

    Wait for Group Replication to finish; 3100 before commit error

    The bug is triggered, as a race, usually by the
    prometheus-relation-joined hook, when it tries to create a user whilst
    Group Replication is recovering during a rolling restart. This patch
    alters the create_user() method so that it detects the failure condition
    and then retries for up to a minute (6 times, every 10 seconds) for the
    Group Replication to recover before giving up and returning False
    (indicating that the user was not created). This will usually result in
    the handler not completing during the hook, and then retrying on the
    next hook.

    Change-Id: I5df4fd5ecbdd2b7bce525a9930dcffbc5868cbb8
    Closes-Bug: #2018385

Changed in charm-mysql-innodb-cluster:
status: In Progress → Fix Committed
Changed in charm-mysql-innodb-cluster:
assignee: Alex Kavanagh (ajkavanagh) → nobody