lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeout_short fails rarely/intermittently in parallel tests

Bug #994777 reported by Gary Poster
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
Critical
Unassigned

Bug Description

I was unable to duplicate this locally with a --load-list of this and the previous tests in the layer, but here's the worker list if you want to give it a try yourself: http://pastebin.ubuntu.com/967608/. I was unable to find any reports in the syslog of an OOM error or other system-level explanation for why this might have happened. Since the instance was ephemeral, I could not look for application logs that might explain the error, because they were already gone.

======================================================================
FAILURE: lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeout_short
worker ID: worker-7
----------------------------------------------------------------------
Traceback (most recent call last):
_StringException: librarian-log: {{{
2012-05-05 00:05:54+0530 [-] Log opened.
2012-05-05 00:05:54+0530 [-] twistd 11.1.0 (/usr/bin/python2.6 2.6.5) starting up.
2012-05-05 00:05:54+0530 [-] reactor class: twisted.internet.pollreactor.PollReactor.
2012-05-05 00:05:55+0530 [-] FileUploadFactory starting on 43924
2012-05-05 00:05:55+0530 [-] Starting factory <lp.services.librarianserver.libraryprotocol.FileUploadFactory instance at 0xad2f3cc>
2012-05-05 00:05:55+0530 [-] Site starting on 56593
2012-05-05 00:05:55+0530 [-] Starting factory <twisted.web.server.Site instance at 0xad2f2cc>
2012-05-05 00:05:55+0530 [-] FileUploadFactory starting on 56049
2012-05-05 00:05:55+0530 [-] Starting factory <lp.services.librarianserver.libraryprotocol.FileUploadFactory instance at 0xad2f92c>
2012-05-05 00:05:55+0530 [-] Site starting on 46226
2012-05-05 00:05:55+0530 [-] Starting factory <twisted.web.server.Site instance at 0xad2fd6c>
2012-05-05 00:05:55+0530 [-] Not using upstream librarian
2012-05-05 00:05:55+0530 [-] daemon ready!
}}}

oops-0: {{{
Oops-Id: OOPS-20e22c8641b789c42c94239681446a7d
Exception-Type: ProcessTerminated
Exception-Value: A process has ended with a probable error condition: process ended with exit code 255.
Date: 2012-05-04T18:39:14.080063+00:00
Branch: devel
Revision: 15208
Duration: -1
Oops-Reporter: T

00000-00003@memcache-set MemcachedLayer__live_test
00083-00084@memcache-set MemcachedLayer__live_test
00185-00187@memcache-set MemcachedLayer__live_test
06118-06131@librarian-connection http://localhost:56593/93/filename-100193
06131-06131@librarian-read http://localhost:56593/93/filename-100193
06146-06152@librarian-connection http://localhost:56593/93/filename-100193
06153-06153@librarian-read http://localhost:56593/93/filename-100193
06161-06167@librarian-connection http://localhost:56593/93/filename-100193
06167-06167@librarian-read http://localhost:56593/93/filename-100193
22007-22016@librarian-connection http://localhost:56593/93/filename-100438
22017-22017@librarian-read http://localhost:56593/93/filename-100438
53935-53961@librarian-connection http://localhost:56593/103/bar_1.0.orig.tar.gz
53962-53963@librarian-read http://localhost:56593/103/bar_1.0.orig.tar.gz
53963-53963@librarian-read http://localhost:56593/103/bar_1.0.orig.tar.gz
61663-61673@librarian-connection http://localhost:56593/93/pmount.changes
61674-61674@librarian-read http://localhost:56593/93/pmount.changes
62153-62158@librarian-connection http://localhost:56593/93/pmount.changes
62159-62159@librarian-read http://localhost:56593/93/pmount.changes
62879-62884@librarian-connection http://localhost:56593/97/cdrkit.changes
62884-62884@librarian-read http://localhost:56593/97/cdrkit.changes
63685-63691@librarian-connection http://localhost:56593/98/pmount.changes
63691-63691@librarian-read http://localhost:56593/98/pmount.changes
177113-177114@memcache-set somekey
177114-177114@memcache-get somekey
177870-177880@librarian-connection http://localhost:56593/93/filename-102625
177881-177881@librarian-read http://localhost:56593/93/filename-102625
177883-177889@librarian-connection http://localhost:56593/94/filename-102651
177890-177890@librarian-read http://localhost:56593/94/filename-102651
178248-178254@librarian-connection http://localhost:56593/95/filename-102677
178254-178254@librarian-read http://localhost:56593/95/filename-102677
178257-178263@librarian-connection http://localhost:56593/94/filename-102651
178263-178263@librarian-read http://localhost:56593/94/filename-102651
184606-184617@librarian-connection http://localhost:56593/95/foo_1.dsc
184617-184617@librarian-read http://localhost:56593/95/foo_1.dsc
184618-184618@librarian-read http://localhost:56593/95/foo_1.dsc
185847-185859@librarian-connection http://localhost:56593/94/foo_1.0-2_source.changes
185859-185859@librarian-read http://localhost:56593/94/foo_1.0-2_source.changes
187117-187132@librarian-connection http://localhost:56593/63/hwsubmission1.xml
187133-187133@librarian-read http://localhost:56593/63/hwsubmission1.xml
187295-187301@librarian-connection http://localhost:56593/93/hwinfo.xml
187302-187302@librarian-read http://localhost:56593/93/hwinfo.xml
187413-187422@librarian-connection http://localhost:56593/94/hwinfo.xml
187423-187423@librarian-read http://localhost:56593/94/hwinfo.xml
187528-187534@librarian-connection http://localhost:56593/95/hwinfo.xml
187534-187534@librarian-read http://localhost:56593/95/hwinfo.xml

Traceback (most recent call last):
ProcessTerminated: A process has ended with a probable error condition: process ended with exit code 255.
}}}

Traceback (most recent call last):
  File "/var/lib/buildbot/slaves/slave/lucid-devel/build/lib/lp/services/job/tests/test_runner.py", line 593, in test_timeout_short
    (oops['type'], oops['value']))
MismatchError: !=:
reference = ('TimeoutError', 'Job ran too long.')
actual = ('ProcessTerminated',
 u'A process has ended with a probable error condition: process ended with exit code 255.')

Brad Crittenden (bac)
Changed in launchpad:
assignee: nobody → Brad Crittenden (bac)
status: Triaged → In Progress
Revision history for this message
Brad Crittenden (bac) wrote :

When running the set of tests that originally caused this failure, I can reproduce it about 1 in every 20 runs.

When I just run tests of test_timeout_short I can also reproduce it about 1/20.

This is a spurious test failure that has nothing to do with paralleltest.

Gary Poster (gary)
tags: removed: paralleltest
Revision history for this message
Brad Crittenden (bac) wrote :

I can reliably reproduce this failure 5% of the time when the test is run by itself. So I am confused as to why we don't see periodic failures of this test in buildbot.

As a final test I ran test_timeout_long and test_timeout_short together to see if there was some interaction. Again I saw 1 failure out of 20.

After discussion with Gary I am classifying test_timeout_short as a spurious failure, marking it with an XXX referencing this bug, and disabling it for now.

Brad Crittenden (bac)
tags: added: spurious-test-failure
Changed in launchpad:
status: In Progress → Triaged
assignee: Brad Crittenden (bac) → nobody
importance: High → Critical
assignee: nobody → Jonathan Lange (jml)
Jonathan Lange (jml)
Changed in launchpad:
assignee: Jonathan Lange (jml) → nobody
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.