Buildd-manager should deal with transient communication failures with builders
Bug #369109 reported by Celso Providelo
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Launchpad itself | Fix Released | High | Julian Edwards | |
Bug Description
As described in bug #343683 and bug #31546, builders are subject to transient communication failures and shouldn't be immediately marked as NOT_OK when one happens.
Instead, we should allow an acceptable period of unavailability before excluding the builder from the poll. This way, builders that fall victim to network hiccups but have continued with their jobs won't mistakenly be reset.
On the flip side, builders excluded from the poll should be re-probed periodically, so they are automatically made available again once whatever was preventing them from building has been fixed.
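The behaviour requested above could be sketched roughly as follows. This is only an illustration, not the Launchpad implementation: the `BuilderHealth` class, its method names, and the `grace_period` and `reprobe_interval` values are all hypothetical.

```python
import time


class BuilderHealth:
    """Hypothetical per-builder failure tracker (not Launchpad code)."""

    def __init__(self, grace_period=300.0, reprobe_interval=600.0,
                 clock=time.monotonic):
        self.grace_period = grace_period          # tolerated unavailability, seconds
        self.reprobe_interval = reprobe_interval  # wait between re-probes, seconds
        self.clock = clock
        self.first_failure = None  # start of the current failure streak
        self.excluded = False      # True once the grace period has run out
        self.last_probe = None     # time of the last failed contact while excluded

    @property
    def available(self):
        return not self.excluded

    def record_success(self):
        # Any successful contact ends the streak and re-admits the builder.
        self.first_failure = None
        self.excluded = False
        self.last_probe = None

    def record_failure(self):
        now = self.clock()
        if self.first_failure is None:
            self.first_failure = now
        # Exclude only after failures have persisted for the whole grace period.
        if now - self.first_failure >= self.grace_period:
            self.excluded = True
            self.last_probe = now

    def should_reprobe(self):
        # Excluded builders are retried once per reprobe_interval.
        if not self.excluded:
            return False
        return self.clock() - self.last_probe >= self.reprobe_interval
```

With this shape, a scanning cycle would call `record_failure()` or `record_success()` after each contact attempt, reset a builder's job only when `available` becomes false, and contact excluded builders only when `should_reprobe()` returns true.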
Related branches
lp://qastaging/~julian-edwards/launchpad/buildd-failure-counting
- Jonathan Lange (community): Approve
- Stuart Bishop (community): Approve (db)
- Robert Collins: Pending (db) requested
- Diff: 0 lines
tags: | added: buildd-manager |
Changed in soyuz: | |
assignee: | Celso Providelo (cprov) → nobody |
Changed in soyuz: | |
status: | Triaged → In Progress |
assignee: | nobody → Julian Edwards (julian-edwards) |
milestone: | pending → none |
Changed in soyuz: | |
status: | Fix Committed → Fix Released |
Today we had this in the log:
2010-04-15 23:53:23+0100 [-] Starting scanning cycle.
2010-04-15 23:56:46+0100 [-] Disabling builder: http://gourd.buildd:8221/ -- timed out
2010-04-15 23:56:46+0100 [-] Traceback (most recent call last):
2010-04-15 23:56:46+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9191/lib/lp/buildmaster/model/builder.py", line 205, in updateBuilderStatus
2010-04-15 23:56:46+0100 [-] builder.checkSlaveAlive()
2010-04-15 23:56:46+0100 [-] File "/srv/launchpad.net/codelines/soyuz-production-rev-9191/lib/lp/buildmaster/model/builder.py", line 320, in checkSlaveAlive
2010-04-15 23:56:46+0100 [-] if self.slave.echo("Test")[0] != "Test":
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/xmlrpclib.py", line 1147, in __call__
2010-04-15 23:56:46+0100 [-] return self.__send(self.__name, args)
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/xmlrpclib.py", line 1437, in __request
2010-04-15 23:56:46+0100 [-] verbose=self.__verbose
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/xmlrpclib.py", line 1185, in request
2010-04-15 23:56:46+0100 [-] errcode, errmsg, headers = h.getreply()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 1199, in getreply
2010-04-15 23:56:46+0100 [-] response = self._conn.getresponse()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 928, in getresponse
2010-04-15 23:56:46+0100 [-] response.begin()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 385, in begin
2010-04-15 23:56:46+0100 [-] version, status, reason = self._read_status()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/httplib.py", line 343, in _read_status
2010-04-15 23:56:46+0100 [-] line = self.fp.readline()
2010-04-15 23:56:46+0100 [-] File "/usr/lib/python2.5/socket.py", line 331, in readline
2010-04-15 23:56:46+0100 [-] data = recv(1)
2010-04-15 23:56:46+0100 [-] timeout: timed out
It seems as though there are two competing ways of timing stuff out.
1. the code in lib/lp/buildmaster/manager.py (QueryWithTimeoutProtocol)
2. lib/lp/buildmaster/model/builder.py (TimeoutTransport)
Different actions seem to cause timeouts in each of these. This is crap.
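For comparison, a single transport-level timeout might look something like the sketch below. It uses the Python 3 standard library and is not the `TimeoutTransport` from builder.py; the `SocketTimeoutTransport` class, the `make_slave_proxy` helper, and the 30-second default are assumptions made for illustration.

```python
import xmlrpc.client


class SocketTimeoutTransport(xmlrpc.client.Transport):
    """Hypothetical transport applying one socket timeout to every request."""

    def __init__(self, timeout, **kwargs):
        super().__init__(**kwargs)
        self.timeout = timeout

    def make_connection(self, host):
        # The base class builds (and caches) an http.client.HTTPConnection;
        # its timeout attribute is used when the socket is actually opened.
        conn = super().make_connection(host)
        conn.timeout = self.timeout
        return conn


def make_slave_proxy(url, timeout=30.0):
    """Build an XML-RPC proxy whose calls all share the same timeout."""
    return xmlrpc.client.ServerProxy(
        url, transport=SocketTimeoutTransport(timeout))
```

Routing every slave call through one proxy like this would mean a slow builder always surfaces as the same `socket.timeout` exception, whichever action triggered it.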
It also seems as though updateBuilderStatus() should catch the above timeout exception. When it doesn't, it produces the traceback above and leaves the builder disabled but with the build still on it.
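A minimal sketch of catching that exception, assuming the echo check can be isolated in a helper. The `check_slave_alive` function below is hypothetical and takes the slave's `echo` call as a parameter; it only mirrors the `echo("Test")[0] != "Test"` comparison visible in the traceback.

```python
import socket


def check_slave_alive(echo):
    """Return (alive, reason) instead of letting a timeout propagate."""
    try:
        reply = echo("Test")
    except socket.timeout:
        # Treat a timeout as a transient failure the caller can count,
        # rather than an unhandled error that disables the builder.
        return False, "timed out"
    if reply[0] != "Test":
        return False, "unexpected echo reply"
    return True, "ok"
```

The caller can then decide what to do with a `(False, "timed out")` result, for example counting it against a grace period before touching the build that is still on the builder.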