ARM builders connectivity issues cause the build to fail

Bug #2024181 reported by Nathan Teodosio
This bug affects 2 people
Affects           Status  Importance  Assigned to  Milestone
Launchpad itself  New     Undecided   Unassigned

Bug Description

Pull phases that fetch from third parties (i.e. non-archive sources) very often fail on ARM Launchpad builders. I don't have exact numbers, but it is certainly around 50% of the builds. Occasionally the builds fail without logs, and more rarely they are restarted without any intervention: the logs are wiped anew, but the build time isn't reset when this happens. These last cases don't really allow me to affirm that it was a connectivity issue, though; I am just mentioning them for what it's worth.

I have been observing this problem for more than a year. Each time, I have to open the failure log, check whether it was a connection failure, and retry the build if so.

Example errors are:

--->
Downloading 'chromium-114.0.5735.90.tar.xz' 0%
Sorry, an error occurred in Snapcraft:
("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
<---

--->
Cloning into '/build/chromium/parts/chromium/build/third_party/llvm'...
error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
<---

--->
Running git clone https://chromium.googlesource.com/external/github.com/llvm/llvm-project /build/chromium/parts/chromium/build/third_party/llvm
Cloning into '/build/chromium/parts/chromium/build/third_party/llvm'...
error: RPC failed; curl 56 GnuTLS recv error (-110): The TLS connection was non-properly terminated.
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
Failed.
CheckoutGitRepo failed.
<---

--->
2023-06-26 06:49:24 (140 KB/s) - Connection closed at byte 5501411. Retrying.

--2023-06-26 06:49:25-- (try: 2) https://nodejs.org/dist/v16.13.0/node-v16.13.0-linux-arm64.tar.gz
Connecting to 10.10.10.1:8222... connected.
OpenSSL: error:0A000126:SSL routines::unexpected eof while reading
Unable to establish SSL connection.
<---

--->
Failed to fetch package: The item '/root/.cache/snapcraft/download/libopus0_1.3.1-0.1build2_arm64.deb' could not be fetched: 407 Proxy Authentication Required [IP: 10.10.10.1 8222].
<---

and the most common one:

--->
:: + craftctl default
None
Full execution log: '/root/.local/state/snapcraft/log/snapcraft-20230720-103153.207821.log'
Build failed
<---

The 49 KiB logs in https://launchpad.net/~nteodosio/+snap/arm-fix-buildgn or the small logs in https://launchpad.net/~nteodosio/+snap/arm-dont-optimize are some more examples.
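
The manual triage described above (open the failure log, decide whether it was a connection failure, then retry) could be partly automated. The following is only a minimal sketch: the signature list is copied from the error excerpts in this report and is surely incomplete, and the function names are hypothetical, not part of Launchpad or Snapcraft.

```python
import re

# Signatures taken from the error excerpts in this report. The list is
# illustrative, not exhaustive, and the names below are hypothetical.
CONNECTION_SIGNATURES = [
    r"Connection reset by peer",
    r"GnuTLS recv error",
    r"unexpected eof while reading",
    r"Unable to establish SSL connection",
    r"407 Proxy Authentication Required",
    r"Received HTTP code 407 from proxy",
    r"the remote end hung up unexpectedly",
]
_PATTERN = re.compile("|".join(CONNECTION_SIGNATURES), re.IGNORECASE)


def connection_errors(log_text: str) -> list[str]:
    """Return the log lines that match a known connection-failure signature."""
    return [line for line in log_text.splitlines() if _PATTERN.search(line)]


def looks_like_connection_failure(log_text: str) -> bool:
    """True if at least one line matches; such a build is a retry candidate."""
    return bool(connection_errors(log_text))
```

Note that the most common excerpt ("craftctl default" followed by "Build failed") contains no network signature, so a check like this would leave it unclassified rather than flag it as a connection failure.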

Ines Almeida (ines-almeida) wrote :

Note: Potentially related in some way to https://bugs.launchpad.net/launchpad/+bug/2023961?

Nathan Teodosio (nteodosio) wrote :

I don't think they are related, because this one affects the Launchpad
builders themselves and is only observable for snap builds, as only
those access the internet. In LP:2023961 the problem described is
uploading an orig tarball to a PPA, the failure of which prevents any
build whatsoever from starting. Even when I didn't have that issue in
the past and managed to get builds there, the problem described in this
LP:2024181 never affected any builder, because such builds do not access
the internet.

Ines Almeida (ines-almeida) wrote :

A few notes on this issue:
I have been trying to reproduce it for a while now, and don't seem to be able to. I tried to stress our testing environments, and tried a few things in production, without success.

Our current best guess is that it might have been related to the load of the builders. I don't see particular peaks on the dates when the builds mentioned in this ticket failed, except for the 1st of June, when there was a firewall issue. But it's hard to rule out load as the reason.

Ideally we would set up performance tests for the builders.

Nathan Teodosio (nteodosio) wrote :

Thank you very much for taking a closer look into this issue, Ines.

> Our current best guess is that it might have been related to the load
> of the builders.

 From my side I've been observing this happening more or less uniformly
(i.e., the chance of failure per ARM build doesn't noticeably change)
for a long time.

> I have been trying to reproduce it for a while now, and don't seem to
> be able to. I tried to stress our testing environments, and tried a few
> things in production, without success.

To be clear, I never observe this when downloading from Ubuntu archives.
Did your test involve requests to third-parties? Maybe it's actually
Google servers that are misbehaving, not the builders (and then I should
direct my report elsewhere)? Amin (in the Mozilla team) reports similar
observations for Firefox builds, but for him the rate of failure is much
smaller, so I'm unsure.

If you try testing with this one, can you reproduce the failure?

--->
wget https://commondatastorage.googleapis.com/chromium-browser-official/chromium-114.0.5735.45.tar.xz
<---
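
Since a single download may succeed by chance, repeating it and counting the failures would give a number to compare between environments. The following is only a sketch under assumptions: `fetch` and `failure_rate` are hypothetical helpers, the 60-second timeout is arbitrary, and the download is injectable so the counting logic can be exercised without network access. Python's default `urllib` opener picks up the `http_proxy`/`https_proxy` environment variables, so inside a builder the requests would go through the same proxy.

```python
import http.client
import urllib.request
from typing import Callable

# The artifact from the wget command above.
URL = ("https://commondatastorage.googleapis.com/"
       "chromium-browser-official/chromium-114.0.5735.45.tar.xz")


def fetch(url: str, chunk_size: int = 1 << 20) -> int:
    """Download url, discard the bytes, and return how many were read."""
    total = 0
    with urllib.request.urlopen(url, timeout=60) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def failure_rate(url: str, attempts: int,
                 download: Callable[[str], int] = fetch) -> float:
    """Fraction of attempts in which the download raised a network error."""
    failures = 0
    for _ in range(attempts):
        try:
            download(url)
        except (OSError, http.client.HTTPException):
            # Covers connection resets, TLS errors, URLError and truncated reads.
            failures += 1
    return failures / attempts
```

Running `failure_rate(URL, 20)` on an ARM builder and on another machine would show whether the failure rate differs between them.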

Ines Almeida (ines-almeida) wrote :

I did try specifically with that chromium artifact after seeing that all the "Connection Broken" errors occurred during that download.

I also tried building your artifact in our test environment a few times, and it never failed that or other downloads.

This is an odd one.

We still haven't completely ruled out that it might be related to builder load, but it might also be the case that Google servers were misbehaving.

Let us know if you keep getting that error.

Nathan Teodosio (nteodosio) wrote :

Actually, I got some more failures, but this could be general instability, as they fail to pull from the archive and I also got a failure on x86 at about the same time.

--[x86]-->
Cloning into 'chromium'...
[26/Jun/2023:07:03:10 +0000] "CONNECT git.launchpad.net:443 HTTP/1.1" 407 1913 "-" "git/2.34.1"
fatal: unable to access 'https://git.launchpad.net/~nteodosio/chromium-browser/': Received HTTP code 407 from proxy after CONNECT
<---

--[arm64]-->
Cannot install all requested build packages: chrpath, cmake, default-jre-headless, elfutils, g++, gcc, git, gperf, gzip, libasound2-dev, libcap-dev, libcups2-dev, libcurl4-openssl-dev, libevdev-dev, libffi-dev, libicu-dev, libkrb5-dev, libnss3-dev, libpam0g-dev, libpci-dev, libpipewire-0.3-dev, libssl-dev, libsystemd-dev, libva-dev, libxshmfence-dev, libxss-dev, lsb-release, make, mesa-common-dev, ninja-build, pkg-config, python3-pkg-resources, python3-xcbgen, qtbase5-dev, quilt, sed, subversion, tar, wget, xcb-proto, yasm
<---

[1] https://launchpadlibrarian.net/674077435/buildlog_snap_ubuntu_jammy_amd64_unstage-linters_BUILDING.txt.gz
[2] https://launchpadlibrarian.net/674077632/buildlog_snap_ubuntu_jammy_arm64_chromium-snap-from-source-dev_BUILDING.txt.gz

Nathan Teodosio (nteodosio) wrote :

> more rarely they are restarted without any intervention — the logs are wiped anew, but the build time isn't reset when this happens.

As an example of this I just got https://launchpad.net/~chromium-team/+snap/chromium-snap-from-source-dev/+build/2150101, which took 318 min.

Nathan Teodosio (nteodosio) wrote :

This one took 45 hours. Certainly it is also a case of the build being silently reset, because there is absolutely no way the run captured in the log would take that long.

https://launchpad.net/~chromium-team/+snap/chromium-snap-from-source-dev/+build/2150102

By the way let me know if there are enough data points so I stop posting them here.

Ines Almeida (ines-almeida) wrote :

Hi Nathan!

Thanks for keeping us up to date with information.

The error that happened the most consistently when you first reported these was

--->
Downloading 'chromium-114.0.5735.90.tar.xz' 0%
Sorry, an error occurred in Snapcraft:
("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
<---

I recall seeing it in 4 different logs. Has this one particularly happened again?

I suspect that there is more than one issue going on here, and it's hard to separate what is what currently.

Regarding the job that took 45 hours, I would expect any job that takes that long to fail with connection errors, because the proxy token only lasts about 6 hours, meaning any connection made after that token expires would fail...
But also, we had a few issues (now fixed) in the past 3-4 days that could potentially have led to builders being busy retrying builds, so I wonder whether generally busy builders lead to your builds' connection issues. Since those issues have been fixed, it would be interesting to know whether there is indeed a correlation here.
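
To make the token arithmetic concrete, here is a small sketch. It assumes a single proxy token issued at the start of the build and a lifetime of exactly 6 hours (the comment above only says "about 6 hours"); the names are hypothetical and this is not how the proxy itself is implemented.

```python
from datetime import timedelta

# "About 6 hours" per the comment above; the exact lifetime is an assumption.
PROXY_TOKEN_LIFETIME = timedelta(hours=6)


def token_valid_at(elapsed: timedelta,
                   lifetime: timedelta = PROXY_TOKEN_LIFETIME) -> bool:
    """True if a connection opened `elapsed` after the token was issued
    still falls within the token's lifetime."""
    return elapsed < lifetime


# The 318-minute build mentioned earlier stays inside the window...
print(token_valid_at(timedelta(minutes=318)))
# ...while a connection attempted 45 hours in would not.
print(token_valid_at(timedelta(hours=45)))
```

Under these assumptions, a long build only hits the expiry if it opens a new connection more than 6 hours after it started.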

Nathan Teodosio (nteodosio) wrote :

Hi Inês!

 > I recall seeing it in 4 different logs. Has this one particularly
 > happened again?

This particular one I don't think so.

> Regarding the job that took 45 hours, I would expect that any job
> that takes that long would fail with connection errors because the proxy
> token only lasts about 6 hours, meaning any connections after that token
> expire would fail...

Although Chromium builds never take less than 6 h (on ARM they take at
least 1 day), as far as I can tell no connection is attempted after
this step is completed:

:: + python3 tools/clang/scripts/build.py --skip-checkout --bootstrap --disable-asserts --pgo --without-android --without-fuchsia --use-system-cmake --with-ml-inliner-model=

Which is rather early in a successful build[1].

So I believe you are right and there is more than one issue going on.

[1] https://launchpadlibrarian.net/674561689/buildlog_snap_ubuntu_jammy_arm64_chromium-snap-from-source-stable_BUILDING.txt.gz

Nathan Teodosio (nteodosio) wrote :

Still getting several of

--->
:: + craftctl default
None
Full execution log: '/root/.local/state/snapcraft/log/snapcraft-20230720-103153.207821.log'
Build failed
<---

accompanied by the mysterious silent resets.

https://launchpadlibrarian.net/678116449/buildlog_snap_ubuntu_jammy_arm64_chromium-snap-from-source-beta_BUILDING.txt.gz
https://launchpad.net/~chromium-team/+snap/chromium-snap-from-source-dev/+build/2173457/+files/buildlog_snap_ubuntu_jammy_armhf_chromium-snap-from-source-dev_BUILDING.txt.gz

Michał Sawicz (saviq) wrote :

I just had this happen with:

```
Cloning into 'snapcraft-mir-libs-e0d88dc3e098256c7080c32e7a45680d'...
error: RPC failed; curl 56 GnuTLS recv error (-110): The TLS connection was non-properly terminated.
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
Build failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lpbuildd/target/build_snap.py", line 264, in run
    self.repo()
  File "/usr/lib/python3/dist-packages/lpbuildd/target/build_snap.py", line 181, in repo
    self.vcs_fetch(self.args.name, cwd="/build", env=env)
  File "/usr/lib/python3/dist-packages/lpbuildd/target/vcs.py", line 86, in vcs_fetch
    self.backend.run(cmd, cwd=cwd, env=full_env)
  File "/usr/lib/python3/dist-packages/lpbuildd/target/lxd.py", line 716, in run
    subprocess.check_call(cmd, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['lxc', 'exec', 'lp-jammy-arm64', '--env', 'LANG=C.UTF-8', '--env', 'SHELL=/bin/sh', '--env', 'http_proxy=http://10.10.10.1:8222/', '--env', 'https_proxy=http://10.10.10.1:8222/', '--env', 'GIT_PROXY_COMMAND=/usr/local/bin/lpbuildd-git-proxy', '--env', 'SNAPPY_STORE_NO_CDN=1', '--', '/bin/sh', '-c', "cd /build && linux64 git clone -n 'https://<email address hidden>/~mir-ci-bot/+git/snapcraft-mir-libs-e0d88dc3e098256c7080c32e7a45680d/' snapcraft-mir-libs-e0d88dc3e098256c7080c32e7a45680d"]' returned non-zero exit status 128.
```

While the logs are available, they are visible here:

https://github.com/MirServer/mir/actions/runs/6650829766/job/18071651595?pr=3095#step:3:1531
