internal service calls cannot handle interrupted requests

Bug #740674 reported by Robert Collins
This bug affects 2 people
Affects: Launchpad itself
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

When an internal service fails, its load balancer (haproxy today, lvs+haproxy in the future) cannot resurrect already-dispatched requests. We will see a cluster of OOPSes at the time that occurs - those OOPSes are genuine and we should see them (we may need to take some remedial action, for instance). Where it's safe, we may want to automate the remedial action, which for idempotent services (like GET requests) could well be to re-request and not OOPS about the initial error.

We expect a vanishingly small error rate here - something like 1 failure per 20 million front-end requests.
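
As a rough illustration of the automated remedial action described above, here is a minimal sketch (hypothetical helper names, not Launchpad's actual code) that retries an idempotent internal GET once and only lets the error surface, as an OOPS would, if the retry also fails:

    # Hypothetical sketch, not Launchpad code: retry an idempotent internal
    # GET once before letting the original failure surface as an OOPS.
    import time
    import urllib.error
    import urllib.request


    def fetch_idempotent(url, retries=1, delay=1.0, timeout=10):
        """Fetch an idempotent internal URL, retrying on connection failure.

        Non-idempotent calls must not use this helper; their failures should
        surface as OOPSes so remedial action can be taken.
        """
        last_error = None
        for attempt in range(retries + 1):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as response:
                    return response.read()
            except (urllib.error.URLError, ConnectionError) as error:
                last_error = error
                if attempt < retries:
                    time.sleep(delay)
        # The retry failed too: re-raise so the caller records the OOPS.
        raise last_error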

tags: added: branch-puller
description: updated
Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 740674] Re: ConnectionRefusedError from code puller

I believe branches have a field for showing connection errors already;
there's no particular need to log this if the data is put into that
field.

Revision history for this message
Aaron Bentley (abentley) wrote : Re: ConnectionRefusedError from code puller

This appears to be a problem communicating with our own XMLRPC server, not the remote branch. It seems OOPS-worthy, because we cannot complete a pull correctly if we cannot write the results to LP.

Revision history for this message
Aaron Bentley (abentley) wrote :

This is a common problem for tellurium, which is a staging server. It is rare for crowberry, the production version; the most recent instance was OOPS-1948SMP25, and OOPS-1908SMP33 before that.

Revision history for this message
Aaron Bentley (abentley) wrote :

According to the log, this is happening when trying to record the outcome of a mirror operation.

Aaron Bentley (abentley)
Changed in launchpad:
assignee: nobody → Aaron Bentley (abentley)
Revision history for this message
Robert Collins (lifeless) wrote :

I agree that a problem connecting to the internal API should be an OOPS. It's a bit worrying that some of the OOPSes are coming from [qa]staging with a production prefix; perhaps that's historic and fixed, but let's check for it anyhow.

Revision history for this message
Aaron Bentley (abentley) wrote :

Of the actual production examples, OOPS-1948SMP25 appears to be due to the banana meltdown: https://wiki.canonical.com/IncidentReports/2011-05-02-LP-Frontend-server-banana-died

Aaron Bentley (abentley)
summary: - ConnectionRefusedError from code puller
+ Code puller can't access XMLRPC server
Aaron Bentley (abentley)
description: updated
Aaron Bentley (abentley)
Changed in launchpad:
assignee: Aaron Bentley (abentley) → nobody
Revision history for this message
Robert Collins (lifeless) wrote :

There are two sorts of failure modes here - cannot talk to haproxy, and the server melted down mid-request. For the former, IS are going to implement LVS around the middle of June. That leaves dealing with requests that were in progress when the backend blows up; for those we may want to do a retry mechanism, but meltdowns are so rare that the number of in-flight requests affected will be tiny - and it's appropriate that we find out that it happened. Accordingly I think this is as fixed as it sensibly can be, short of re-requesting those requests. I'm going to leave it open, narrowed in focus to that situation.
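
To make the distinction concrete, here is a rough, hypothetical sketch (not Launchpad's actual code; the endpoint and method name in the usage comment are made up): a refused connection was never dispatched and can always be retried, while a mid-request failure is retried only for idempotent methods and otherwise re-raised so the OOPS is still recorded.

    # Hypothetical sketch, not Launchpad code: tell the two failure modes apart.
    import socket
    import xmlrpc.client


    def call_backend(proxy, method, *args, idempotent=False):
        call = getattr(proxy, method)
        try:
            return call(*args)
        except ConnectionRefusedError:
            # Never reached the backend: safe to retry once regardless.
            return call(*args)
        except (socket.timeout, xmlrpc.client.ProtocolError):
            if idempotent:
                # Dispatched but interrupted; only idempotent calls retry.
                return call(*args)
            raise  # Surface the error so an OOPS is recorded.

    # Hypothetical usage against a made-up internal endpoint and method name:
    # proxy = xmlrpc.client.ServerProxy("http://internal-xmlrpc.example/puller")
    # call_backend(proxy, "recordMirrorOutcome", branch_id, status, idempotent=False)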

summary: - Code puller can't access XMLRPC server
+ internal service calls cannot handle interrupted requests
Changed in launchpad:
importance: Critical → Low
tags: removed: branch-puller oops
description: updated