Failure during initial branch scan increases scan time enormously

Bug #1018460 reported by Aaron Bentley
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
Critical
Unassigned

Bug Description

Excerpted from https://bugs.launchpad.net/launchpad/+bug/808930/comments/20 by William Grant:

...here's some diagnosis from experiments with lp:~wallyworld/launchpad/create-aag-privacy-transitions-1009364, a Launchpad branch which is now permanently cursed and unable to scan within the timeout.

wallyworld pushed the branch at around 01:04. It starts to scan, but times out 5 minutes later.

2012-06-18 01:05:11 INFO Running <SCAN_BRANCH branch job (4134693) for ~wallyworld/launchpad/create-aag-privacy-transitions-1009364> (ID 13838195).
2012-06-18 01:10:15 INFO Job resulted in OOPS: OOPS-415b4f54e93cb4a66345cf3807221142

Another scan is attempted at 01:11:10, timing out at 01:16:23 with OOPS-d214e0439183a7ec5ccb2bf73c34efec. wallyworld pushed it to a new name at 01:22, and the new branch scanned successfully in 109s.

I noticed the accursed branch and coordinated with wallyworld and spm to debug it a little. A scan was forced by pushing a new rev at 03:33, and the scan timed out as expected with OOPS-d83500d8b2cd4f2355127bb914508982. The branch state after this was https://pastebin.canonical.com/68306/ -- the most notable takeaway being that the branch has a full set of BranchRevisions, but no last_scanned_id. This was likely the state after the initial failed scan.

https://pastebin.canonical.com/68305/ is a series of snapshots of the relevant bits of pg_stat_activity throughout the scan. They reveal that it spent more than 4 minutes in the existing_branchrevisions query of planDatabaseChanges, checking for existing BranchRevisions for more than 122000 revids. While that query should in principle be satisfiable in just a second or so, it's not inconceivable that the parsing and planning overhead could be somewhat ridiculous. It may also generate a terrible plan when run with 122000 revids on a branch which already has 122000 BranchRevisions, as that's a situation that would not normally be encountered.

Aaron Bentley (abentley)
summary: - Failure to scan branch increases scan time enormously
+ Failure during initial branch scan increases scan time enormously
Revision history for this message
William Grant (wgrant) wrote :

Branches hit by this bug can now be unstuck by running 'scripts/unscan-branch.py --rescan lp:~PATH/TO/BRANCH' as bzrsyncd@ackee (or anywhere with access to the branchscanner DB user). It erases the BranchRevisions, last_scanned_id, and some other stuff, and requests a rescan from the normal branch scanner. For really big branches the DB cache may be too cold for the first scan to complete, so it might need to be retried a couple of times.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.