rabbitmq-server died abruptly

Bug #1747347 reported by Tejeev Patel
This bug affects 4 people
Affects: OpenStack RabbitMQ Server Charm
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none listed

Bug Description

We saw a rabbitmq-server unit, the leader in a three-unit cluster, die abruptly. Cluster status on the other two units remained OK, apart from their inability to communicate with the downed node. We were unable to determine a root cause. We have seen this happen on 3 other occasions; on one of them we observed that the jujud agent was in an error state before the rabbit service was started, and in an executing state afterwards. I do not know whether that was the case this time.

== LOGS ETC. ==

Service status after failure:

https://pastebin.canonical.com/p/nbcGBmr2b5/

================================

From the juju unit log, I see repetitions similar to the first section (02:05:42-02:05:44) going back for days. At the time of the failure (02:05:45), the log abruptly switches to reporting an inability to connect. This section of entries (02:05:45-02:10:49) essentially repeats until the service is manually restarted around 02:46:18, after which the log returns to the WARNING and to entries resembling the first section:

https://pastebin.canonical.com/p/ZGjtM647JZ/

================================

In the rabbit logs (/<email address hidden>) we see logging stop abruptly at the time of death and then resume when the unit starts up at recovery:

https://pastebin.canonical.com/p/vd5kj3CvDf/

================================

syslog from time of failure to time of recovery:

https://pastebin.canonical.com/p/pVpFTVHXC3/

Tejeev Patel (tejeevpatel) wrote :

This cloud is running juju 2.2.4-xenial-amd64

Alvaro Uria (aluria)
tags: added: canonical-bootstack
Paul Gear (paulgear) wrote :

Moved logs to pastebins for easier perusal.

@openstack-charmers, any chance of triage and suggestions for next steps?

description: updated
Paul Gear (paulgear) wrote :

Further to the above, this problem is still occurring and causing rabbitmq crashes. Juju unit logs show this on the leader: https://pastebin.canonical.com/p/m4WcRrnhDr/ and this on the non-leaders: https://pastebin.canonical.com/p/hs9NKfZ59z/

Chris MacNaughton (chris.macnaughton) wrote :

Is it possible to get a reproducer bundle? Alternatively, would a bundle like:

```
applications:
  rabbit:
    charm: cs:openstack-charmers-next/rabbitmq-server
    num_units: 3
```

be enough of a reproducer? Without a reproducer, it'll be difficult to track down what this issue could be, and whether it is a potential charm or upstream issue.

Looking briefly through the Juju unit logs, I notice the line:

min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.

In a clustered deployment, the min-cluster-size configuration option is fairly important, as it tells the charms what to expect with regard to cluster sizing. I don't think that it would cause random failures as seen here, but I'm not 100% sure of the implications of having it set incorrectly post-bootstrap.
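
For illustration, a bundle that sets min-cluster-size explicitly might look like the sketch below. The value 3 is an assumption based on the three-unit cluster described in this report, and the application name `rabbit` is carried over from the example bundle earlier in this comment; neither has been confirmed against the affected deployment.

```yaml
applications:
  rabbit:
    charm: cs:openstack-charmers-next/rabbitmq-server
    num_units: 3
    options:
      # Assumption: matches the number of units expected in the cluster.
      min-cluster-size: 3
```

On an already deployed application the same option could presumably be set with `juju config <application> min-cluster-size=3`, where `<application>` is whatever name the rabbitmq-server charm was deployed under.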

Changed in charm-rabbitmq-server:
importance: Undecided → High
status: New → Triaged