internal service calls cannot handle interrupted requests

Bug #740674 reported by Robert Collins
This bug affects 2 people
Affects: Launchpad itself
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

When an internal service fails, its load balancer (haproxy today, lvs+haproxy in the future) cannot resurrect already-dispatched requests. We will see a cluster of OOPSes at the time that occurs - those OOPSes are genuine and we should see them (we may need to take some remedial action, for instance). Where it's safe, we may want to automate the remedial action, which for idempotent services (like GET requests) could well be to re-request and not OOPS about the initial error.

We expect a vanishingly small error rate here - something like 1 failure per 20 million front-end requests.
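
As a rough illustration of the automated remedial action described above, here is a minimal sketch (hypothetical helper names, not Launchpad's actual code) that retries an idempotent internal GET once and only lets the error surface, as an OOPS would, if the retry also fails:

    # Hypothetical sketch, not Launchpad code: retry an idempotent internal
    # GET once before letting the original failure surface as an OOPS.
    import time
    import urllib.error
    import urllib.request


    def fetch_idempotent(url, retries=1, delay=1.0, timeout=10):
        """Fetch an idempotent internal URL, retrying on connection failure.

        Non-idempotent calls must not use this helper; their failures should
        surface as OOPSes so remedial action can be taken.
        """
        last_error = None
        for attempt in range(retries + 1):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as response:
                    return response.read()
            except (urllib.error.URLError, ConnectionError) as error:
                last_error = error
                if attempt < retries:
                    time.sleep(delay)
        # The retry failed too: re-raise so the caller records the OOPS.
        raise last_error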

tags: added: branch-puller
description: updated
Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 740674] Re: ConnectionRefusedError from code puller

I believe branches have a field for showing connection errors already;
there's no particular need to log this if the data is put into that
field.

Revision history for this message
Aaron Bentley (abentley) wrote : Re: ConnectionRefusedError from code puller

This appears to be a problem communicating with our own XMLRPC server, not the remote branch. It seems OOPS-worthy, because we cannot complete a pull correctly if we cannot write the results to LP.

Revision history for this message
Aaron Bentley (abentley) wrote :

This is a common problem for tellurium, which is a staging server. It is rare for crowberry, the production version; the most recent instance was OOPS-1948SMP25, and OOPS-1908SMP33 before that.

Revision history for this message
Aaron Bentley (abentley) wrote :

According to the log, this is happening when trying to record the outcome of a mirror operation.

Aaron Bentley (abentley)
Changed in launchpad:
assignee: nobody → Aaron Bentley (abentley)
Revision history for this message
Robert Collins (lifeless) wrote :

I agree that a problem connecting to the internal API should be an OOPS. It's a bit worrying that some of the OOPSes are coming from [qa]staging with a production prefix; perhaps that's historic and fixed, but let's check for it anyhow.

Revision history for this message
Aaron Bentley (abentley) wrote :

Of the actual production examples, OOPS-1948SMP25 appears to be due to the banana meltdown: https://wiki.canonical.com/IncidentReports/2011-05-02-LP-Frontend-server-banana-died

Aaron Bentley (abentley)
summary: - ConnectionRefusedError from code puller
+ Code puller can't access XMLRPC server
Aaron Bentley (abentley)
description: updated
Aaron Bentley (abentley)
Changed in launchpad:
assignee: Aaron Bentley (abentley) → nobody
Revision history for this message
Robert Collins (lifeless) wrote :

There are two sorts of failure modes here - cannot talk to haproxy, and the server melted down mid-request. For the former, IS are going to implement LVS around the middle of June. That leaves dealing with requests that were in progress when the backend blows up; for those we may want to do a retry mechanism, but meltdowns are so rare that the number of in-flight requests affected will be tiny - and it's appropriate that we find out that it happened. Accordingly I think this is as fixed as it sensibly can be, short of re-requesting those requests. I'm going to leave it open, narrowed in focus to that situation.
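
To make the distinction concrete, here is a rough, hypothetical sketch (not Launchpad's actual code; the endpoint and method name in the usage comment are made up): a refused connection was never dispatched and can always be retried, while a mid-request failure is retried only for idempotent methods and otherwise re-raised so the OOPS is still recorded.

    # Hypothetical sketch, not Launchpad code: tell the two failure modes apart.
    import socket
    import xmlrpc.client


    def call_backend(proxy, method, *args, idempotent=False):
        call = getattr(proxy, method)
        try:
            return call(*args)
        except ConnectionRefusedError:
            # Never reached the backend: safe to retry once regardless.
            return call(*args)
        except (socket.timeout, xmlrpc.client.ProtocolError):
            if idempotent:
                # Dispatched but interrupted; only idempotent calls retry.
                return call(*args)
            raise  # Surface the error so an OOPS is recorded.

    # Hypothetical usage against a made-up internal endpoint and method name:
    # proxy = xmlrpc.client.ServerProxy("http://internal-xmlrpc.example/puller")
    # call_backend(proxy, "recordMirrorOutcome", branch_id, status, idempotent=False)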

summary: - Code puller can't access XMLRPC server
+ internal service calls cannot handle interrupted requests
Changed in launchpad:
importance: Critical → Low
tags: removed: branch-puller oops
description: updated