PCJ race between process-job-source.py and celery can generate OOPS

Bug #1314569 reported by Colin Watson
This bug affects 5 people
Affects           Status   Importance  Assigned to  Milestone
Launchpad itself  Triaged  Critical    Unassigned

Bug Description

I got OOPS-7d0f700be19191e98139cdab67a81ea7, which is:

  InvalidTransition: Transition from Running to Running is invalid.

  Traceback (most recent call last):
    Module lazr.jobrunner.jobrunner, line 194, in runJobHandleError
      self.runJob(job, fallback)
    Module lp.services.job.runner, line 289, in runJob
      super(BaseJobRunner, self).runJob(IRunnableJob(job), fallback)
    Module lazr.jobrunner.jobrunner, line 159, in runJob
      job.start(manage_transaction=True)
    Module lp.services.job.model.job, line 169, in start
      self._set_status(JobStatus.RUNNING)
    Module lp.services.job.model.job, line 120, in _set_status
      raise InvalidTransition(self._status, status)
  InvalidTransition: Transition from Running to Running is invalid.

  <oops-message-0>: {'target_archive_id': 1, 'package_copy_job_type': 'Copy packages between archives.', 'job_id': 23532039, 'target_distroseries_id': 108, 'package_copy_job_id': 279234, 'source_archive_id': 1}

This was because the job had been picked up by celery at almost exactly the same time:

[2014-04-30 09:23:13,769: DEBUG3/PoolWorker-3] new transaction
[2014-04-30 09:23:13,881: INFO/PoolWorker-3] Running <PlainPackageCopyJob to copy package gnome-settings-daemon from ubuntu/primary to ubuntu/primary, UPDATES pocket, in ubuntu precise, including binaries> (ID 23532039) in status Waiting

2014-04-30 09:23:13 DEBUG Trying to acquire lease for job in state Waiting
2014-04-30 09:23:13 INFO Running <PlainPackageCopyJob to copy package gnome-settings-daemon from ubuntu/primary to ubuntu/primary, UPDATES pocket, in ubuntu precise, including binaries> (ID 23532039) in status Running
2014-04-30 09:23:14 INFO Job resulted in OOPS: OOPS-7d0f700be19191e98139cdab67a81ea7

So this is harmless in that the copy happened anyway, but Critical by Launchpad bug policy since it shouldn't generate an OOPS.
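To make the failure mode concrete, here is a minimal sketch of how a second start() on an already-running job ends in this error. It is not the code from lp.services.job.model.job: the Job class, the VALID_TRANSITIONS table and simulate_race() are hypothetical stand-ins, and only the names JobStatus, InvalidTransition, start and _set_status are taken from the traceback above.

```python
from enum import Enum


class JobStatus(Enum):
    WAITING = "Waiting"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


class InvalidTransition(Exception):
    def __init__(self, current, requested):
        super().__init__(
            "Transition from %s to %s is invalid."
            % (current.value, requested.value)
        )


# Hypothetical transition table; the real one in Launchpad may differ.
VALID_TRANSITIONS = {
    JobStatus.WAITING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


class Job:
    """Toy job with a guarded status field (illustration only)."""

    def __init__(self):
        self.status = JobStatus.WAITING

    def start(self):
        self._set_status(JobStatus.RUNNING)

    def _set_status(self, status):
        # Reject any transition that the table does not allow.
        if status not in VALID_TRANSITIONS[self.status]:
            raise InvalidTransition(self.status, status)
        self.status = status


def simulate_race():
    """Two runners start the same job; the second one fails."""
    job = Job()
    job.start()  # first runner (celery in the logs above) sees Waiting
    try:
        job.start()  # second runner sees Running and tries to start again
    except InvalidTransition as error:
        return str(error)
    return None
```

In this sketch the second runner only fails because nothing stopped it from picking up the job in the first place, which is the job the lease is supposed to do.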

I thought the point of acquiring a lease for the job was that it couldn't be picked up by another job runner. Does celery not honour that?

Colin Watson (cjwatson) wrote :

I think the problem may be in lazr.jobrunner. RunJob.run does indeed do a job.acquireLease(), but it doesn't commit the transaction at that point (unlike JobRunner.runAll) so other processes won't see it.
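The visibility problem can be reproduced in isolation. The sketch below is an illustration under stated assumptions, not lazr.jobrunner code: it uses SQLite from the Python standard library instead of Launchpad's database, and the job table, the lease_holder column and both function names are hypothetical. It shows only the general point that a lease written inside an open transaction is invisible to other connections until it is committed.

```python
import os
import sqlite3
import tempfile


def lease_seen_by_other_process(path):
    """Read the lease holder through a fresh, independent connection."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT lease_holder FROM job WHERE id = 1"
        ).fetchall()
        return rows[0][0]
    finally:
        conn.close()


def demo_uncommitted_lease():
    """Return what a second connection sees before and after the commit."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        # Create one job row with no lease, and commit it.
        setup = sqlite3.connect(path)
        setup.execute(
            "CREATE TABLE job (id INTEGER PRIMARY KEY, lease_holder TEXT)"
        )
        setup.execute("INSERT INTO job VALUES (1, NULL)")
        setup.commit()
        setup.close()

        # Runner A takes the lease but does not commit yet.
        runner_a = sqlite3.connect(path)
        runner_a.execute(
            "UPDATE job SET lease_holder = 'runner-a' WHERE id = 1"
        )
        before_commit = lease_seen_by_other_process(path)

        # Only after the commit can another connection see the lease.
        runner_a.commit()
        after_commit = lease_seen_by_other_process(path)
        runner_a.close()
        return before_commit, after_commit
    finally:
        os.remove(path)
```

Calling demo_uncommitted_lease() should show no lease holder before the commit and 'runner-a' after it. If RunJob.run behaves as described above, a second runner checking the lease in that window would likewise find the job unleased.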

William Grant (wgrant)
Changed in launchpad:
status: New → Triaged