Stuck on "Cluster has no quorum as visible from <leader_ip> and cannot process write transactions. 2 members are not active"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL InnoDB Cluster Charm | Status tracked in Trunk | | |
Jammy | New | Undecided | Unassigned |
Trunk | Triaged | Medium | Unassigned |
Bug Description
All mysql-innodb-
This happened over the weekend, with no heavy load and no other operations running in parallel.
ubuntu@
Model Controller Cloud/Region Version SLA Timestamp
neutron-work przemeklal-
App Version Status Scale Charm Store Rev OS Notes
mysql-innodb-
Unit Workload Agent Machine Public address Ports Message
mysql-innodb-
mysql-innodb-
mysql-innodb-
Machine State DNS Inst id Series AZ Message
0 started 10.5.0.7 7268ef34-
1 started 10.5.0.18 489ae28a-
2 started 10.5.0.9 6e9d1f71-
The juju debug-log shows the same error repeated every 5 minutes for units /1 and /2:
unit-mysql-
Traceback (most recent call last):
File "<string>", line 2, in <module>
SystemError: RuntimeError: Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)
I suspected that it could be the same issue as described in bug reports lp:1889792, lp:1881735 or lp:1901771. However, the error message in the traceback is different, and the issue did not occur during deployment or under heavy load. Still, there is a good chance that these are related. Also, running steps from this comment: https:/
unit-mysql-
unit-mysql-
unit-mysql-
unit-mysql-
unit-mysql-
unit-mysql-
unit-mysql-
unit-mysql-
I attached the mysql daemon error logs from all instances. They show transient connectivity errors; however, leader instance 0 can reach units 1 and 2 on port 3306:
root@juju-
Starting Nmap 7.80 ( https:/
Nmap scan report for juju-221534-
Host is up (0.00096s latency).
PORT STATE SERVICE
3306/tcp open mysql
MAC Address: FA:16:3E:76:78:2D (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.31 seconds
root@juju-
Starting Nmap 7.80 ( https:/
Nmap scan report for juju-221534-
Host is up (0.0018s latency).
PORT STATE SERVICE
3306/tcp open mysql
MAC Address: FA:16:3E:8D:F1:5B (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.28 seconds
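The nmap check above can be repeated from Python with a plain TCP connect test, which is handy when nmap is not installed on the unit. This is a minimal sketch; `port_open` is a hypothetical helper, not part of the charm. A successful connect only shows that the port accepts connections. It does not exercise Group Replication's group communication endpoint, which in many InnoDB Cluster setups listens on a separate port (MySQL Shell defaults it to the server port times 10 plus 1, i.e. 33061), so that port may be worth checking as well.

```python
import socket


def port_open(host: str, port: int = 3306, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: answers the same question as the nmap check
    (does the port accept connections?), without service detection.
    """
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

For example, calling `port_open("10.5.0.18")` and `port_open("10.5.0.18", 33061)` on the leader machine would test both the MySQL port and the assumed group communication port of unit 1.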
The journal logs for the mysql systemd service on units 1 and 2 show that the service was restarted 3 times but did not recover correctly afterwards. Restarting it manually does not help.
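With units 1 and 2 both down, only one of the three members is active, which matches the "no quorum" message in the title: Group Replication only processes writes while a majority of members is available. The sketch below is a simplified model of that rule; the `has_quorum` helper is hypothetical, and it assumes a member counts towards the majority only when its state (as named in `performance_schema.replication_group_members`) is ONLINE or RECOVERING.

```python
from typing import Iterable

# Assumption: states treated as "active" for the majority calculation.
ACTIVE_STATES = {"ONLINE", "RECOVERING"}


def has_quorum(member_states: Iterable[str]) -> bool:
    """Return True if a strict majority of the members is active."""
    states = list(member_states)
    active = sum(1 for s in states if s in ACTIVE_STATES)
    # Strict majority: more than half of all members.
    return active > len(states) // 2
```

For the three-unit cluster in this report, `has_quorum(["ONLINE", "OFFLINE", "OFFLINE"])` is False, whereas losing a single unit would still leave two of three members active.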
Changed in charm-mysql-innodb-cluster:
status: New → Confirmed
Changed in charm-mysql-innodb-cluster:
assignee: nobody → Dariusz Smigiel (smigiel-dariusz)
Changed in charm-mysql-innodb-cluster:
status: Triaged → In Progress
tags: added: good-first-bug removed: onboarding
Changed in charm-mysql-innodb-cluster:
assignee: Dariusz Smigiel (smigiel-dariusz) → nobody
Changed in charm-mysql-innodb-cluster:
assignee: nobody → Paulo Machado (paulomachado)
Przemysław,
Hi, the logs seem to indicate network connectivity problems. MySQL InnoDB Cluster is fairly sensitive to connectivity failures, and here it eventually gave up.
2021-02-27T22:25:12.339883Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:25:14.808863Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:25:34.802640Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:25:55.080743Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:25.070488Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:26:27.034761Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:47.028794Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:26:49.132067Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:55.134889Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:27:00.961542Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-
2021-02-
2021-02-
2021-02-
2021-02-
2021-02-
2021-02-
2021-02-
2021-02-
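To gauge how often a member flapped, the group_replication warnings quoted in this comment can be counted per address. This is a minimal sketch with a hypothetical `count_flaps` helper; it assumes the warning text has exactly the form shown in these error logs ("Member with address HOST:PORT has become unreachable."). Using `finditer` means it also copes with several log entries run together on one line.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the Group Replication warnings quoted above, e.g.
# "Member with address 10.5.0.18:3306 has become unreachable."
_EVENT = re.compile(
    r"Member with address (?P<addr>[\d.]+:\d+) "
    r"(?P<event>has become unreachable|is reachable again)"
)


def count_flaps(lines: Iterable[str]) -> Counter:
    """Count 'has become unreachable' events per member address."""
    flaps: Counter = Counter()
    for line in lines:
        for m in _EVENT.finditer(line):
            if m.group("event") == "has become unreachable":
                flaps[m.group("addr")] += 1
    return flaps
```

Applied to the ten warnings above, it would report 5 unreachable events for 10.5.0.18:3306 in under two minutes.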
To recover this cluster, you can run the `reboot-cluster-from-complete-outage` action [0]. Note: if the output suggests that the instance you ran the action on does not have the latest GTID state, run it on another instance until it succeeds.
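The "try each instance until one succeeds" procedure can be expressed as a small loop. This is a sketch under stated assumptions: `run_action` is a hypothetical caller-supplied function that runs `reboot-cluster-from-complete-outage` on one unit and returns True on success (and False, for example, when that unit lacks the latest GTID state). How it invokes the action depends on the Juju version (for example `juju run-action --wait` on Juju 2.9, or `juju run` on Juju 3.x).

```python
from typing import Callable, Iterable, Optional


def reboot_from_outage(
    units: Iterable[str],
    run_action: Callable[[str], bool],
) -> Optional[str]:
    """Run the reboot action on each unit in turn.

    Returns the first unit on which `run_action` reported success,
    or None if it failed on every unit.
    """
    for unit in units:
        # Stop at the first unit that holds the latest GTID state.
        if run_action(unit):
            return unit
    return None
```

Stopping at the first success matters: once the cluster is rebooted from one instance, the remaining members should rejoin it rather than having the action run on them as well.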
Clearly, we have some documentation bugs. I have already filed one about the ambiguity of "MySQL InnoDB Cluster not healthy: None" [1]. We may turn this bug into a documentation bug covering the need to run `reboot-cluster-from-complete-outage` when the cluster is fully stopped.
[0] https://github.com/openstack/charm-mysql-innodb-cluster/blob/master/src/actions.yaml#L28
[1] https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1917337