Android builds are failing due to android.git.linaro.org timeouts

Bug #1052502 reported by Deepti B. Kalakeri
Affects                        Status        Importance  Assigned to      Milestone
Linaro Android                 Fix Released  Critical    Paul Sokolovsky
Linaro Android Infrastructure  Fix Released  Critical    Paul Sokolovsky

Bug Description

Android builds are failing due to android.git.linaro.org timeouts.

A snippet of the failure is pasted below:

 * [new branch] tools_r7 -> aosp/tools_r7
 * [new branch] tools_r8 -> aosp/tools_r8
 * [new branch] tools_r9 -> aosp/tools_r9
fatal: The remote end hung up unexpectedly
fatal: protocol error: bad pack header
From git://android.git.linaro.org/platform/prebuilts/qemu-kernel
 * [new branch] ics-plus-aosp -> aosp/ics-plus-aosp
 * [new branch] jb-dev -> aosp/jb-dev
 * [new branch] jb-release -> aosp/jb-release
 * [new branch] master -> aosp/master
 * [new branch] tools_r20 -> aosp/tools_r20
 * [new tag] android-4.1.1_r1 -> android-4.1.1_r1
 * [new tag] android-4.1.1_r1.1 -> android-4.1.1_r1.1
 * [new tag] android-4.1.1_r2 -> android-4.1.1_r2
 * [new tag] android-4.1.1_r3 -> android-4.1.1_r3
 * [new tag] android-4.1.1_r4 -> android-4.1.1_r4
 * [new tag] android-cts-4.1_r1 -> android-cts-4.1_r1
 * [new tag] android-sdk-adt_r20 -> android-sdk-adt_r20
error: Cannot fetch kernel/panda
fatal: The remote end hung up unexpectedly
From git://android.git.linaro.org/platform/prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.6
 * [new branch] ics-plus-aosp -> aosp/ics-plus-aosp
 * [new branch] jb-dev -> aosp/jb-dev
 * [new branch] jb-release -> aosp/jb-release
 * [new branch] master -> aosp/master
 * [new branch] tools_r20 -> aosp/tools_r20
 * [new tag] android-4.1.1_r1 -> android-4.1.1_r1
 * [new tag] android-4.1.1_r1.1 -> android-4.1.1_r1.1
 * [new tag] android-4.1.1_r2 -> android-4.1.1_r2
 * [new tag] android-4.1.1_r3 -> android-4.1.1_r3
 * [new tag] android-4.1.1_r4 -> android-4.1.1_r4
 * [new tag] android-cts-4.1_r1 -> android-cts-4.1_r1
 * [new tag] android-sdk-adt_r20 -> android-sdk-adt_r20
fatal: The remote end hung up unexpectedly
error: Cannot fetch platform/sdk

error: Exited sync due to fetch errors
++ infrastructure_error
++ echo 'Caught infrastructure error - finishing build with '\''Not Built'\'' status'
Caught infrastructure error - finishing build with 'Not Built' status
++ exit 123
Build step 'Execute shell and set build status' changed build result to NOT_BUILT
Build step 'Execute shell and set build status' marked build as failure
Archiving artifacts
Recording fingerprints
Finished: NOT_BUILT
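
The log above ends with the build wrapper treating the fetch failures as an infrastructure error (exit status 123, build marked NOT_BUILT). A minimal Python sketch of that kind of classification follows, assuming it is enough to look for the git/repo messages seen in this log. The function names and the pattern list are hypothetical and are not taken from the actual build scripts.

```python
import re

# Messages seen in the failing log above; illustrative, not exhaustive.
FETCH_ERROR_PATTERNS = [
    re.compile(r"^fatal: The remote end hung up unexpectedly$"),
    re.compile(r"^fatal: protocol error: bad pack header$"),
    re.compile(r"^error: Cannot fetch \S+$"),
    re.compile(r"^error: Exited sync due to fetch errors$"),
]

# Exit status used for infrastructure errors in the log above.
INFRASTRUCTURE_ERROR_STATUS = 123


def count_fetch_errors(log_text):
    """Count log lines that match one of the known fetch-failure messages."""
    count = 0
    for line in log_text.splitlines():
        stripped = line.strip()
        if any(p.match(stripped) for p in FETCH_ERROR_PATTERNS):
            count += 1
    return count


def exit_status_for(log_text):
    """Return 123 when the sync log shows fetch failures, otherwise 0."""
    if count_fetch_errors(log_text) > 0:
        return INFRASTRUCTURE_ERROR_STATUS
    return 0
```

Separating infrastructure failures from genuine build failures this way keeps server-side timeouts from being reported as broken source trees.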

Here are links to the logs of the failed jobs:
https://android-build.linaro.org/jenkins/job/linaro-android_galaxynexus-jb-gcc47-aosp-blob/63/consoleText
https://android-build.linaro.org/jenkins/job/linaro-android_panda-jb-gcc47-tilt-tracking-blob-tests/37/consoleText

Changed in linaro-android-infrastructure:
importance: Undecided → Critical
milestone: none → 2012.09
status: New → Triaged
Changed in linaro-android-infrastructure:
importance: Critical → High
Changed in linaro-android-infrastructure:
importance: High → Critical
Данило Шеган (danilo) wrote :

IS tells me we hit peak loads of 30 at around 03:00 and 05:00 UTC, and of 50 at around 08:00-09:00 UTC. I am trying to get someone to watch for potential culprits tomorrow.

Данило Шеган (danilo) wrote :

Zach also started https://android-build.linaro.org/builds/~linaro-android/galaxynexus-jb-gcc47-aosp-blob/#build=64 now that the load is lower, to confirm that it is the slowness/load that's biting us.

Deepti B. Kalakeri (deeptik) wrote :

Recording the jobs I came across that are failing due to android.git.linaro.org timeouts; they should be helpful for testing once the problem is fixed.

linaro-android_panda-jb-gcc47-tilt-tracking-blob #58 (not built)
linaro-android_panda-jb-gcc47-tilt-stable-blob
linaro-android_panda-ics-gcc47-omapzoom-stable-blob
linaro-android_panda-ics-gcc47-tilt-stable-blob
linaro-android_galaxynexus-jb-gcc47-aosp-blob
linaro-android_snowball-ics-gcc47-igloo-stable-blob
linaro-android_snowball-ics-gcc47-igloo-tracking-blob
linaro-android_panda-jb-gcc47-tilt-tracking-blob-tiny
linaro-android_snowball-ics-gcc46-igloo-tracking-blob
linaro-android_vexpress-jb-gcc47-armlt-tracking-open
liuyq0307_jb-panda-stable
linaro-android_panda-jb-gcctrunk-tilt-tracking-blob

Georgy Redkozubov (gesha) wrote :

The last build, https://android-build.linaro.org/builds/~linaro-android/galaxynexus-jb-gcc47-aosp-blob/#build=69, failed for the same reason, but the peak load of android.git.linaro.org was only 9.3, at 5:21am, right before the build failed.

Alexander Sack (asac) wrote :

The linaro-android project should consider moving its manifests to http://, which is a much more scalable hosting solution that we provide.

Alexander Sack (asac) wrote :

Please try applying this manifest to all builds.

Fathi Boudra (fboudra)
Changed in linaro-android:
importance: Undecided → Critical
milestone: none → 12.09
Alexander Sack (asac) wrote :

It seems there is no real solution. We could manually update the seed tarball to include more, or we could manually reschedule the builds so that they don't interfere with each other.

Данило Шеган (danilo) wrote :

As mentioned in the email thread, we don't have "non-smart" HTTP for android.git.linaro.org yet. I've asked for it to be set up (see https://rt.linaro.org/Ticket/Display.html?id=608). For "non-smart" HTTP git access on git.linaro.org, one should use http://git.linaro.org/git-ro/ as the base URL (https://rt.linaro.org/Ticket/Display.html?id=607 tracks updating the links on gitweb).
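
For illustration, switching a manifest to the HTTP base URL mentioned above could be sketched as follows. This assumes the manifest uses the standard repo `<remote fetch="...">` attribute and that repository paths under http://git.linaro.org/git-ro/ mirror the git:// paths; both are assumptions, and the function name is hypothetical. Remotes on android.git.linaro.org are left untouched, since HTTP access is not available there yet.

```python
import xml.etree.ElementTree as ET

GIT_BASE = "git://git.linaro.org/"
# Base URL for "non-smart" HTTP access, as given in the comment above.
HTTP_BASE = "http://git.linaro.org/git-ro/"


def rewrite_manifest_remotes(manifest_xml):
    """Point git://git.linaro.org/ remotes at the HTTP base URL.

    Assumption: the path below the base URL is the same for both protocols.
    """
    root = ET.fromstring(manifest_xml)
    for remote in root.iter("remote"):
        fetch = remote.get("fetch", "")
        if fetch.startswith(GIT_BASE):
            remote.set("fetch", HTTP_BASE + fetch[len(GIT_BASE):])
    return ET.tostring(root, encoding="unicode")
```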

Paul Sokolovsky (pfalcon) wrote :

During today's investigation, the following issues were found and adjustments made:

1. Some builds were found to use very high SYNC_JOBS values, so SYNC_JOBS can no longer be overridden from build configs. The override was useful to build engineers during the initial setup of the service, but it is no longer needed and causes problems in production.

2. The default SYNC_JOBS value was lowered from 8 to 5.

3. Build timeouts were increased to 3.5 hours.

So far, the results of these changes look promising (except that we hit lp:1055546, which broke the initial run of builds with the new settings).
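
As a rough sketch of the adjustments listed above (not the actual build scripts), a fixed job count could be enforced like this. `repo sync -j` is the real option for parallel fetches; the function name, the dict-based build config, and the constant names are hypothetical.

```python
# Values taken from the comment above.
DEFAULT_SYNC_JOBS = 5                      # lowered from 8
BUILD_TIMEOUT_SECONDS = int(3.5 * 3600)    # 3.5 hours


def repo_sync_command(build_config=None):
    """Build a `repo sync` command line with a fixed job count.

    Any SYNC_JOBS entry in the per-build config is deliberately ignored,
    so individual builds cannot raise the fetch parallelism.
    """
    build_config = build_config or {}
    requested = build_config.get("SYNC_JOBS")
    if requested is not None:
        print("ignoring SYNC_JOBS=%s from build config" % requested)
    return ["repo", "sync", "-j%d" % DEFAULT_SYNC_JOBS]
```

Capping the job count on the client side limits how many simultaneous fetches each build opens against android.git.linaro.org.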

Changed in linaro-android-infrastructure:
milestone: 2012.09 → 2012.10
assignee: nobody → Paul Sokolovsky (pfalcon)
Zach Pfeffer (pfefferz)
Changed in linaro-android:
assignee: nobody → Paul Sokolovsky (pfalcon)
status: New → Triaged
Fathi Boudra (fboudra)
Changed in linaro-android:
milestone: 12.09 → 12.10
Paul Sokolovsky (pfalcon) wrote :

Even with the 3.5-hour build timeout, a few builds still fail because of it:

https://android-build.linaro.org/jenkins/view/All/job/berolinux_nexus7-jb-gcc47-aosp-blob/9/

That build spent 23min in seed download + 14min in git checkout, 37min in total, which is not *that* much higher than the baseline of 20min. So it seems that JB builds are the ones that require more time.

Still, we have https://android-build.linaro.org/jenkins/view/All/job/linaro-android_origen-jb-gcc47-samsunglt-tracking-blob/18/, which built in a whopping 2h12m.

So, the variance in build times is too large for some reason.

Paul Sokolovsky (pfalcon) wrote :

OK, comparing two builds:

https://android-build.linaro.org/jenkins/view/All/job/linaro-android_origen-jb-gcc47-samsunglt-tracking-blob/18/
jenkins time 2h12m (132m)
real 7m36.007s
real 7m2.529s
real 6039.54 (101m)
rest 16.5m

https://android-build.linaro.org/jenkins/view/All/job/linaro-android_origen-jb-gcc47-samsunglt-tracking-blob/22/
jenkins time 3h17m (197m)
real 26m2.595s
real 10m43.538s
real 6740.36 (112m)
rest 48m

So, the time varies across all four build stages: seed download, git checkout, compilation, and publishing. Of these, publishing shows the largest absolute variation: about 30 minutes!
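
The arithmetic behind this comparison can be reproduced with a short Python sketch. It assumes that the three "real" timings of each build correspond, in order, to seed download, git checkout, and compilation, and that "rest" is the Jenkins total minus those three (taken here to be publishing). That mapping is inferred from the stage list above, and the helper names are hypothetical.

```python
def to_minutes(h=0, m=0, s=0.0):
    """Convert hours/minutes/seconds to minutes."""
    return h * 60 + m + s / 60.0


# Timings copied from the two builds above (stage labels are assumed).
BUILD_18 = {
    "total": to_minutes(h=2, m=12),
    "seed download": to_minutes(m=7, s=36.007),
    "git checkout": to_minutes(m=7, s=2.529),
    "compilation": to_minutes(s=6039.54),
}
BUILD_22 = {
    "total": to_minutes(h=3, m=17),
    "seed download": to_minutes(m=26, s=2.595),
    "git checkout": to_minutes(m=10, s=43.538),
    "compilation": to_minutes(s=6740.36),
}


def with_rest(build):
    """Return per-stage minutes, adding the unaccounted remainder."""
    stages = {k: v for k, v in build.items() if k != "total"}
    stages["publishing (rest)"] = build["total"] - sum(stages.values())
    return stages


def stage_deltas(first, second):
    """Per-stage difference in minutes between two builds."""
    a, b = with_rest(first), with_rest(second)
    return {stage: b[stage] - a[stage] for stage in a}


if __name__ == "__main__":
    for stage, delta in stage_deltas(BUILD_18, BUILD_22).items():
        print("%-18s %+6.1f min" % (stage, delta))
```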

Paul Sokolovsky (pfalcon) wrote :

Considering fixed.

Changed in linaro-android-infrastructure:
status: Triaged → Fix Committed
Changed in linaro-android:
status: Triaged → Fix Committed
Paul Sokolovsky (pfalcon) wrote :

Status update: there are still cases of build timeouts. It appears that the seed has gone stale/sideways (i.e. it is simply missing a few big projects), so I'm currently regenerating it from scratch and testing.

Paul Sokolovsky (pfalcon) wrote :

From mail:

Further status update on this matter:

Even after the adjustments described above, we had a few occasions of
builds failing due to git errors on repo sync. So, my idea was that the
seed had grown too stale and "sidetracked" compared to what the latest
manifests we're working with contain. So, I generated a new seed over the
last week, made some test builds, and made it the default just before the
weekend. As some stats, the new seed is 12 GB, while the old one was 16 GB.
So, some "cruft" was dropped, but the new seed may not be ideal in case old
releases need to be rebuilt, so the old seed is available as
http://android-build.linaro.org/seed/uniseed1.tar.gz just in case.

Otherwise, I'm closing
https://bugs.launchpad.net/linaro-android-infrastructure/+bug/1052502

Changed in linaro-android:
status: Fix Committed → Fix Released
Changed in linaro-android-infrastructure:
status: Fix Committed → Fix Released