"testflinger-cli artifacts" should verify the checksum after the artifacts are downloaded

Bug #1816970 reported by Alex Tu
This bug affects 1 person
Affects      Status  Importance  Assigned to  Milestone
Testflinger  New     Medium      Paul Larson

Bug Description

We hit an intermittent failure, which I suspect is caused by timing:

https://taipei-jenkins-docker.redirectme.net/view/Daily-Auto-Sanity-Staging/job/sanity-3-testflinger-dell-bto-bionic-wasp-n5-v5-201812-26762/38/consoleFull
    + testflinger-cli artifacts 766584d4-2d14-4059-a3d9-04124332d0b8
    Downloading artifacts tarball...
    Artifacts downloaded to artifacts.tgz
    + tar -xzf artifacts.tgz

    gzip: stdin: unexpected end of file

But it passes most of the time, for example in the previous build:
    https://taipei-jenkins-docker.redirectme.net/view/Daily-Auto-Sanity-Staging/job/sanity-3-testflinger-dell-bto-bionic-wasp-n5-v5-201812-26762/37/consoleFull
    + cp artifacts/submission.json submission.json.previous.
    cp: cannot stat 'artifacts/submission.json': No such file or directory
    + /bin/true
    + rm -rf 'artifacts*'
    + testflinger-cli artifacts bf8aef7a-05ab-412c-b1d3-91e459e7c3e3
    Downloading artifacts tarball...
    No artifacts tarball found for that job id.
    + sleep 30
    + testflinger-cli artifacts bf8aef7a-05ab-412c-b1d3-91e459e7c3e3
    Downloading artifacts tarball...

In build #38, "testflinger-cli artifacts" downloaded artifacts.tgz and returned a success code, but artifacts.tgz was truncated.
According to the Jenkins script, it did wait for the job to finish.

https://taipei-jenkins-docker.redirectme.net/view/Daily-Auto-Sanity-Staging/job/sanity-3-testflinger-dell-bto-bionic-wasp-n5-v5-201812-26762/configure
    "
    #testflinger-cli poll ${JOB_ID}

    TEST_STATUS=$(testflinger-cli results ${JOB_ID} |jq -r .test_status)

    cp artifacts/submission.json submission.json.previous.$DISTRO_IMAGE || /bin/true
    rm -rf artifacts*
    #retry getting the artifacts after a delay if it fails
    testflinger-cli artifacts ${JOB_ID} || (sleep 30 && testflinger-cli artifacts ${JOB_ID})
    "

So "testflinger-cli artifacts" should verify the checksum of artifacts.tgz before returning success, to make sure the downloaded tarball is intact.

Until this issue is fixed, a workaround is to check manually whether artifacts.tgz is as expected and, if not, download it again.
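
A minimal sketch of that workaround, assuming a truncated download shows up as a gzip stream that fails its integrity test. It does not compare against any server-side checksum; it relies only on the CRC and length trailer that gzip already carries. The helper names tarball_ok and fetch_artifacts and the attempt/delay parameters are hypothetical; only "testflinger-cli artifacts" and artifacts.tgz come from this report.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file exists, is non-empty,
# passes the gzip integrity test, and its tar listing reads to the end.
tarball_ok() {
    [ -s "$1" ] && gzip -t "$1" 2>/dev/null && tar -tzf "$1" >/dev/null 2>&1
}

# Hypothetical helper: download, verify, and retry.
#   $1 = job id, $2 = max attempts (default 5), $3 = delay in seconds (default 30)
fetch_artifacts() {
    job_id=$1
    attempts=${2:-5}
    delay=${3:-30}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Remove any stale or partial tarball before each attempt.
        rm -f artifacts.tgz
        if testflinger-cli artifacts "$job_id" && tarball_ok artifacts.tgz; then
            return 0
        fi
        echo "attempt $i: artifacts.tgz missing or truncated" >&2
        i=$((i + 1))
        # Sleep only if another attempt will follow.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

In the Jenkins script this could replace the retry line, for example: fetch_artifacts "${JOB_ID}" && tar -xzf artifacts.tgz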

Paul Larson (pwlars) wrote :

Yes, there's a possible window where your job can be finished executing all the commands, but the artifacts tarball is not yet available on the server. This is because the agent automatically takes care of that part for you. In fact, it's also smart enough to retry it if the upload fails for any reason. At the time, I was thinking it would be better to return control to the user as soon as the test run is complete, and let the agent take care of all this in the background. But if you have something monitoring the test run and trying to collect the results (like jenkins, etc), you want to get those artifacts before finishing. The way I deal with this in our jenkins jobs is to do something like this:

testflinger-cli artifacts ${{JOB_ID}} || (sleep 30 && testflinger-cli artifacts ${{JOB_ID}})

I'm not sure about making poll hang indefinitely until the artifacts are confirmed to be uploaded - it's actually possible that the agent has moved on to processing other test jobs if that happens, but maybe we can try once and if it fails, warn the user that they may need to wait for their artifacts to be available for download. I'm not sure if that really solves your problem though, because I think you may still need to handle the possibility that your artifacts are not there yet. But it might be ok for 99% of the cases, because in reality, the upload rarely if ever fails on the first try.
What do you think?

Paul Larson (pwlars)
Changed in testflinger:
assignee: nobody → Paul Larson (pwlars)
Alex Tu (alextu) wrote :

Hi Paul,
"testflinger-cli artifacts ${{JOB_ID}} || (sleep 30 && testflinger-cli artifacts ${{JOB_ID}})"

That does not help with this issue, because it is already what we do, as I wrote in the description.

The issue here is that testflinger-cli artifacts ${{JOB_ID}} succeeds but produces a truncated artifacts.tgz.

So I think the real fix should be for testflinger-cli artifacts ${{JOB_ID}} to verify the checksum of artifacts.tgz after the download finishes.
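
To illustrate what such a check could look like on the client side, here is a rough sketch. It assumes the server could publish a SHA-256 digest for the tarball, which nothing in this report confirms; the helper name verify_sha256 and the way the expected digest reaches the client are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
#   $1 = file to check, $2 = expected digest as lowercase hex
# Assumption: the expected digest is obtained from the server by some
# means that does not exist today.
verify_sha256() {
    file=$1
    expected=$2
    [ -f "$file" ] || return 1
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}
```

With something like this, the download would report success only when verify_sha256 artifacts.tgz "$EXPECTED_SHA256" passes, where EXPECTED_SHA256 is also a hypothetical name.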

Paul Larson (pwlars)
Changed in testflinger:
importance: Undecided → Medium
Paul Larson (pwlars) wrote :

So this might be fixed for your purposes, but it doesn't really check the checksum yet. I'd still like to add that, so I'm going to leave this open. There were some changes a while back on the testflinger-agent side which should mean that testflinger poll doesn't exit until the artifact is uploaded completely. If anything goes wrong during the artifact upload, the agent falls out of that step and puts the job into a results queue to retry. It's not until later, when the results json gets uploaded, that the job is marked "complete", and at that point polling stops listening. By then the artifact *must* be uploaded, which means you shouldn't get a truncated artifact unless some other network problem interferes.
