webapp includes 5XX errors generated by robots in /+opstats reports

Bug #426416 reported by Michael Barnett
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: Low
Assigned to: Stuart Bishop

Bug Description

Launchpad should report, as part of +opstats, the 5xx error responses returned to real users. This would be in addition to the two existing metrics: all 5xx error responses from Launchpad (+opstats), and all 5xx error responses found by parsing the Apache logs, which also include the various error states from the Apache load balancer when requests never get as far as Launchpad.

The LOSAs feel this metric will be better for Nagios alerts, triggering when real users are being affected rather than when a robot is scouring an obscure part of the site and generating large numbers of errors.

Michael Barnett (mbarnett) wrote :

This is blocking rt 35533

Gary Poster (gary)
Changed in launchpad:
importance: Undecided → Low
assignee: nobody → Stuart Bishop (stub)
status: New → Triaged
Stuart Bishop (stub) wrote :

Why are errors created by robots not important? Any 5xx error is an issue that needs fixing.

If you want to track errors returned to humans rather than just errors, we should add a new metric to opstats. Someone will need to define how we detect a robot (this should be common code - we might need it elsewhere?)

Stuart Bishop (stub) wrote :

Sorry - I just noticed that this was Apache, not the 5xx error counts reported by +opstats. I'm not familiar with this code or what it is used for.

Gary Poster (gary)
Changed in launchpad:
assignee: Stuart Bishop (stub) → Gary Poster (gary)
Michael Barnett (mbarnett) wrote :

I stand corrected. We should keep the original metric and add a new one that excludes robots. We will continue to graph the total 5XXs, but we can graph and alert on the new metric. We are trying to avoid having Nagios alert on problems that we have no ability to resolve; the new metric would far better reflect issues that affect the user experience and therefore require an immediate response.

I need some clarification on Stuart's second comment about Apache vs. +opstats before I can respond to it. I am unclear on exactly what is being said there.
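
To make the alerting idea concrete, here is a minimal sketch of how a Nagios-style check might map a robot-free 5xx count to an alert state. The function name, the thresholds, and the idea of passing in a per-interval count are all assumptions for illustration; only the exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_browser_5xx(count, warn=5, crit=20):
    """Map a robot-free 5xx count for one polling interval to a Nagios state.

    `warn` and `crit` are hypothetical thresholds, not values from this bug.
    Returns an (exit_code, message) pair.
    """
    if count >= crit:
        return CRITICAL, "CRITICAL - %d 5xx responses sent to users" % count
    if count >= warn:
        return WARNING, "WARNING - %d 5xx responses sent to users" % count
    return OK, "OK - %d 5xx responses sent to users" % count
```

The total 5XXs count could still be graphed separately; only the robot-free count would feed a check like this one.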

Gary Poster (gary)
Changed in launchpad:
assignee: Gary Poster (gary) → Stuart Bishop (stub)
Gary Poster (gary) wrote :

Stuart, I think your original take (this is something to be changed in +opstats) is correct. Michael agrees that we should add a new metric, not change the current one.

With regard to filtering robots, the oops tools have a simple regex for this that might be a start. See oopstools/oops/models.py in bzr+ssh://bazaar.launchpad.net/~launchpad-pqm/oops-tools/trunk/ . Maybe we have code elsewhere that's usable.
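
As a rough illustration of what regex-based robot detection can look like, here is a minimal sketch. The pattern list and the `is_robot` helper are hypothetical; this is not the regex from oops-tools, and a real list would need tuning against actual traffic.

```python
import re

# Hypothetical markers commonly found in crawler User-Agent strings.
# This is NOT the pattern used by oops-tools.
ROBOT_RE = re.compile(r"bot|crawl|spider|slurp|archiver", re.IGNORECASE)


def is_robot(user_agent):
    """Return True if the User-Agent header looks like a crawler."""
    if not user_agent:
        # In this sketch a missing User-Agent is treated as non-human.
        return True
    return ROBOT_RE.search(user_agent) is not None
```

A substring match like this is cheap but imprecise: it will miss crawlers that imitate a browser and may misclassify unusual but legitimate clients.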

Gary

Stuart Bishop (stub)
description: updated
Stuart Bishop (stub)
Changed in launchpad:
status: Triaged → In Progress
Stuart Bishop (stub) wrote :

I'm landing a new metric on the +opstats page: 5XXs_b. This is the number of 500-series response codes sent to web browsers. It should ignore robots and tools using the Launchpad APIs. There will be false positives and false negatives, but I think it meets our needs.
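
The counting logic described above might be sketched as follows. This is an illustration under stated assumptions, not the code that landed: the browser allow-list, the robot markers, the `is_browser` and `record_response` helpers, and the `5XXs` key for the total are all hypothetical. Only the `5XXs_b` name comes from this report.

```python
import re
from collections import Counter

# Assumed allow-list of browser User-Agent prefixes (illustrative only).
BROWSER_RE = re.compile(r"^(Mozilla|Opera|Lynx|Links|w3m)")
# Assumed robot markers; many crawlers also send a Mozilla-style prefix.
ROBOT_RE = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def is_browser(user_agent):
    """Return True if the User-Agent looks like an interactive web browser."""
    if not user_agent:
        return False
    return bool(BROWSER_RE.search(user_agent)) and not ROBOT_RE.search(user_agent)


def record_response(stats, status, user_agent):
    """Update the total and browser-only 5xx counters for one response."""
    if 500 <= status <= 599:
        stats["5XXs"] += 1  # every 5xx response, robots included
        if is_browser(user_agent):
            stats["5XXs_b"] += 1  # only 5xx responses sent to browsers


stats = Counter()
record_response(stats, 500, "Mozilla/5.0 (X11; Linux) Firefox/3.5")     # browser
record_response(stats, 503, "Mozilla/5.0 (compatible; Googlebot/2.1)")  # robot
record_response(stats, 500, "Python-httplib2/0.5")                      # API client
record_response(stats, 200, "Mozilla/5.0 (X11; Linux) Firefox/3.5")     # not an error
```

Requiring a browser-like prefix as well as the absence of robot markers is what excludes API clients, which usually identify themselves by library name. It is also the source of the false positives and negatives mentioned above, since any client can send any User-Agent string.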

affects: launchpad → launchpad-foundations
Changed in launchpad-foundations:
milestone: none → 3.0
Diogo Matsubara (matsubara) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Stuart Bishop (stub)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released