juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction ObjectIdHex("stuff")

Bug #1643795 reported by Junien Fridrick
This bug affects 9 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hi,

Today a controller jujud agent started going crazy, using a whole lot of CPU and taking mongodb down with it.

The first signs of madness in the machine-0 logs are:

2016-11-22 01:36:34 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: failed to reload transaction: not found
2016-11-22 01:36:37 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find transaction ObjectIdHex("5833a0fb9e26051a7b966e52")

with the last line repeating quite a bit.

We restarted jujud-machine-0 and juju-db on the controller, and apparently they're back up and working, even though jujud is still consuming between 150 and 200% CPU, and complaining about the missing transaction.

This is not an HA-enabled controller, so there is a single, non-replicated mongodb database.

We haven't run mgopurge.

I tried looking for the transaction, and indeed it's gone. The following displays nothing:
juju:PRIMARY> db.txns.find({ "_id" : ObjectId("5833a0fb9e26051a7b966e52") })

While looking at syslog, I couldn't find where this transaction got removed. I found references to it in the "leases" collection, and the first entry in this collection is quite MASSIVE:

{ "_id" : "3076959e-34a7-4400-8a66-d8e00c070486:clock#application-leadership#", "type" : "clock", "namespace" : "application-leadership", "writers" : { "machine-0" : NumberLong("1479778593916131304") }, "model-uuid" : "3076959e-34a7-4400-8a66-d8e00c070486", "txn-revno" : NumberLong(582949), "txn-queue" : [ "580774929e26053ec3d5565b_dd10f0cb", "5808d0749e26053ec3d69e15_882d8cac", <= and more that 55000 (that's right, 55k) txn ids follow. Is that normal ?

Note that this bug appears to be similar to https://bugs.launchpad.net/juju-core/+bug/1528261

This is with juju version 2.0.2

Revision history for this message
Junien Fridrick (axino) wrote :

Relevant logs at : https://private-fileshare.canonical.com/~axino/lp1643795

If you need anything, please let me know, thanks!

tags: added: canonical-is
Changed in juju:
importance: Undecided → High
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

This means MongoDB has managed to lose data, and some of the lost data is mgo/txn's transaction bookkeeping data. As a result, no transactions can run.

The reason for the 55,000 txn-queue entries on that document is that a transaction touches it every 30s or so, and because no transactions can currently complete, the pending ones keep building up.
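
Purely as an illustration (not something you need to run), a quick way to see this from the mongo shell is to take a handful of the queued tokens and check whether their transaction documents still exist; the token format is "<txn id>_<nonce>", so something like:

var lease = db.leases.findOne({ "_id" : "3076959e-34a7-4400-8a66-d8e00c070486:clock#application-leadership#" })
lease["txn-queue"].slice(0, 20).forEach(function (token) {
    var id = token.split("_")[0]
    if (db.txns.count({ "_id" : ObjectId(id) }) === 0) print("missing txn: " + id)
})

Any ids printed are transactions that are referenced by the queue but gone from the txns collection, which is exactly what keeps the resumer from making progress.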

The only solution is to shut down the controller machine agent and run mgopurge. Please make sure you use the latest version from https://github.com/juju/mgopurge. Some important fixes were made to mgopurge last week.

If you need someone to build mgopurge for you and send you the binary let me know.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Junien,
There is a detailed step-by-step guide on how to run mgopurge on https://github.com/juju/juju/wiki/Incomplete-Transactions-and-MgoPurge.

Please reach out if you have further questions!

Changed in juju:
status: New → Won't Fix
Revision history for this message
Junien Fridrick (axino) wrote :

Hi,

I'll take a look at mgopurge. In the meantime, shouldn't juju be able to handle these errors more gracefully? If you have a wiki page describing how to recover from a lost transaction, I guess this is not a rare problem.

Having an unresponsive controller that is overloading the CPU is not a good experience IMO, even if the root cause is not juju itself.

Junien Fridrick (axino)
Changed in juju:
status: Won't Fix → New
Revision history for this message
Junien Fridrick (axino) wrote :

So I ran mgopurge (which I built from github), and it worked. Then I started the machine agent, and it's still using ~150-200% CPU for no good reason. I can't get an answer to "juju status". I can see the other agents attempting to connect, only to be met with:

{"request-id":1,"error":"login for \"unit-nrpe-0\" blocked because upgrade in progress","response":"'body redacted'"} Admin[""].Login

The first document in the "leases" collection now has around 155k IDs in txn-queue. I stopped jujud-machine-0 as I don't want it to grow more.

Full logs are at https://private-fileshare.canonical.com/~axino/lp1643795/2/

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Did mgopurge indicate that it fixed some issues?

The high CPU and lack of responsiveness is possibly because the machine agent is busy processing all those backed up transactions. If left for a while things might return to normal.

Can you start up the machine agent and monitor the counts of documents in the txns collection with an "s" field (state) of 5 or 6 (aborted or completed) vs those with other state values? You should see transactions steadily moving to aborted or completed (mostly completed).
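
Something along these lines in the mongo shell should show the split (using the same state values as above):

juju:PRIMARY> db.txns.count({ "s" : { "$in" : [5, 6] } })     // aborted or completed
juju:PRIMARY> db.txns.count({ "s" : { "$nin" : [5, 6] } })    // still pending

Re-running the second count every few minutes should show it trending down if the agent is working through the backlog.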

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Never mind my question above. I see from the mgopurge output you linked to that mgopurge did indeed correct some transaction data for the leases collection. That's good.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Based on the tail of the mongodb and machine agent logs you've provided, the machine agent is looking reasonably healthy and it appears to be churning through all those backed up transactions. Giving the machine agent some time to work through them seems like the right approach.

Revision history for this message
Junien Fridrick (axino) wrote :

OK, it's back up. I'll let you know how it goes.

Revision history for this message
Junien Fridrick (axino) wrote :

Another argument for stopping the machine agent (or "soft stopping" it, i.e. putting it in a frozen mode or something) is that you then don't have transactions getting queued up and needing to be processed once the DB is back to normal.

Revision history for this message
Junien Fridrick (axino) wrote :

Is there any way I can measure the progress of the recovery? It's been nearly a day...

Revision history for this message
Junien Fridrick (axino) wrote :

So after "just" maxxing out the CPUs for a while, juju starts taking up all the memory, and the machine becomes basically useless.

I can see the classic "2016-11-24 10:27:18 ERROR juju.worker.dependency engine.go:539 "disk-manager" manifold worker returned unexpected error: cannot list block devices: lsblk failed: fork/exec /bin/lsblk: cannot allocate memory" in the logs.

I have TRACE level logs of this happening in https://private-fileshare.canonical.com/~axino/lp1643795/3/

I'm going to have to tear down the environment tomorrow though. I guess the disappearance of a mongo document is easy enough to replicate if you need further testing.

Changed in juju:
status: New → Triaged
milestone: none → 2.1.1
Revision history for this message
Anastasia (anastasia-macmood) wrote :

This work, whilst desirable, will not happen until a later stage, post-2.3. I am triaging this back to Wishlist.

The current workaround, when these log messages appear, is to run mgopurge.

Changed in juju:
importance: High → Wishlist
milestone: 2.1.1 → none
Revision history for this message
James Troup (elmo) wrote :

Hey, so this bug is regularly causing outages for us and waking people up at weekends. Wishlist is a very hostile status for people trying to use Juju in production.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

We have changes in mgopurge v1.4 (released today) and Juju 2.2 (out soon) which make significant improvements to the way txns are pruned. This massively reduces the amount of txn data that Juju has to track, reducing the scope for txn corruption.

tags: added: cpe-onsite
tags: added: 4010
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Wishlist → Low
tags: added: expirebugs-bot