bootstrap failing with gomaasapi 400 when updating MAAS images

Bug #1419041 reported by Larry Michel
This bug affects 4 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

All bootstraps failed for builds between ~8:32 UTC and ~9:47 UTC with the error below. Looking in the MAAS log, I see that cluster registration was triggered three times around the time this happened. The bootstrap errors continued for an hour after the last "registered" message, so it's not clear what state the MAAS server was in when these happened. I am attaching the MAAS logs.

==========================================================================
2015-02-06 09:47:39,075 [INFO] oil_ci.deploy.oil_deployer: Bootstrapping new juju environment
2015-02-06 09:47:39,076 [DEBUG] oil_ci.deploy.oil_deployer: Using environment constraints: tags=hw-ok
2015-02-06 09:47:40,093 [INFO] oil_ci.juju.client: Boostrapping new environment
Bootstrapping environment "maas"
Starting new instance for initial state server
Launching instance
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
Bootstrap failed, destroying environment
ERROR failed to bootstrap environment: cannot start bootstrap instance: gomaasapi: got error back from server: 400 BAD REQUEST ({"distro_series": ["'precise' is not a valid distro_series. It should be one of: '', 'windows/win2012hvr2', 'ubuntu/trusty'."]})
2015-02-06 09:48:10,367 [ERROR] oil_ci.juju.client: Calling "juju bootstrap" failed!
2015-02-06 09:48:10,369 [ERROR] oil_ci.cli: Deployment failed:
+ rc=1
+ echo 'Deployment returned: 1'
Deployment returned: 1
+ [[ 1 == 0 ]]
==========================================================================

On the MAAS server, I see these messages about the cluster registering at the time of the first failure.
==========================================================================
Feb 6 08:37:24 maas-trusty-back-may22 maas.api: [INFO] Request from user oil-slave-1 to acquire a node with constraints <QueryDict: {u'agent_name': [u'ad571136-afee-4950-8870-7892b9482e91'], u'zone': [u'default'], u'tags': [u'hw-ok']}>
Feb 6 08:37:24 maas-trusty-back-may22 maas.calls: [INFO] Starting task 'rndc_command' with args: ([u'reload'], False) {}
Feb 6 08:37:24 maas-trusty-back-may22 maas.calls: [INFO] Finished task 'rndc_command' with args: ([u'reload'], False) {}
Feb 6 08:37:24 maas-trusty-back-may22 maas.calls: [INFO] Finished task 'write_full_dns_config' with args: () {'zones': [<provisioningserver.dns.zoneconfig.DNSForwardZoneConfig object at 0x7f5df8263f50>, <provisioningserver.dns.zoneconfig.DNSReverseZoneConfig object at 0x7f5df82473d0>], 'callback': provisioningserver.tasks.rndc_command([u'reload'], False), 'upstream_dns': u'10.245.0.1', 'trusted_networks': u'10.245.0.0/18;'}
Feb 6 08:37:24 maas-trusty-back-may22 maas.rpc.clusters: [INFO] Cluster registered: OIL Cluster (037c960b-5b9f-4701-8366-eeda2c09d14e)
Feb 6 08:37:24 maas-trusty-back-may22 maas.calls: [INFO] Starting task 'write_full_dns_config' with args: () {'zones': [<provisioningserver.dns.zoneconfig.DNSForwardZoneConfig object at 0x7f5e0014b1d0>, <provisioningserver.dns.zoneconfig.DNSReverseZoneConfig object at 0x7f5e0014b790>], 'callback': provisioningserver.tasks.rndc_command([u'reload'], False), 'upstream_dns': u'10.245.0.1', 'trusted_networks': u'10.245.0.0/18;'}

I don't see any errors around that time. Over the last 21 days, I do see that these registrations happen periodically:

ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Feb 6"|wc -l
5
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Feb 5"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Feb 4"|wc -l
10
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Feb 3"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Feb 2"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Feb 1"|wc -l
5
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 31"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 30"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 29"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 28"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 27"|wc -l
15
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 26"|wc -l
5
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 25"|wc -l
1
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 24"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 23"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 22"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 21"|wc -l
9
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 20"|wc -l
13
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 19"|wc -l
0
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 18"|wc -l
5
ubuntu@maas-trusty-back-may22:~$ grep "registered" /var/log/maas/maas.log |grep "Jan 17"|wc -l
==========================================================================
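
For the record, the per-day counts above can be gathered in one pass; a minimal Python sketch, assuming the syslog-style date prefix shown in the excerpts:

from collections import Counter

counts = Counter()
with open("/var/log/maas/maas.log") as log:
    for line in log:
        if "registered" in line:
            # Syslog lines start with e.g. "Feb  6 08:37:24"; the first
            # two whitespace-separated fields are month and day.
            counts[" ".join(line.split()[:2])] += 1

for day, total in sorted(counts.items()):
    print(day, total)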

Revision history for this message
Larry Michel (lmic) wrote :
description: updated
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Does the MAAS server have the precise images imported?

Does 12.04 show up on the MAAS page?

Changed in maas:
status: New → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

Yes, it does. This is from the UI:

Ubuntu
Release:

    15.04
    14.10
    14.04 LTS
    13.10
    12.04 LTS

Architecture:

    amd64
    arm64
    armhf
    i386
    ppc64el

 Release Architecture Size Nodes deployed Last update
 14.10 i386 369.2 MB 0 Fri Feb 6 15:31:51 2015
 14.10 amd64 378.7 MB 0 Fri Feb 6 15:31:51 2015
 14.10 ppc64el 389.6 MB 0 Fri Feb 6 15:31:50 2015
 14.04 LTS i386 399.6 MB 0 Fri Feb 6 15:31:51 2015
 14.04 LTS ppc64el 435.0 MB 0 Fri Feb 6 15:31:52 2015
 14.04 LTS amd64 421.1 MB 134 Fri Feb 6 15:31:53 2015
 12.04 LTS i386 469.3 MB 0 Fri Feb 6 15:31:53 2015
 12.04 LTS amd64 507.8 MB 65 Fri Feb 6 15:31:53 2015
Generated Images
 Name Architecture Size Nodes deployed Last update
 Windows "Hyper-V Server 2012 R2" delete amd64 1.6 GB 1 Jan. 29, 2015, 5:01 p.m.

Changed in maas:
status: Incomplete → New
Revision history for this message
Greg Lutostanski (lutostag) wrote :

Yes, it does.

My guess is that it has to do with updating the precise images. MAAS updates the images every 20 minutes (or at least checks for updates); one import finished right at the time the failures stopped:
Feb 6 09:38:02 maas-trusty-back-may22 maas.bootresources: [INFO] Finished importing of boot images from 1 source(s).

and it looks like the import that broke things finished at:
Feb 6 08:35:50 maas-trusty-back-may22 maas.import-images: [INFO] Finished importing boot images.

I'm not sure what is happening internally with the images, or whether there was a mix-up with the simplestreams data, but we need to make sure that image updates do not bring installs to a halt (unless this was a one-time streams.canonical.com update mess-up).

If there was a mismatch between the boot resources and the images, could that perhaps cause this?

Revision history for this message
Larry Michel (lmic) wrote :

This is the MAAS version:
ubuntu@maas-trusty-back-may22:~$ dpkg -l |grep maas
ii maas 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS server all-in-one metapackage
ii maas-cli 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS command line API tool
ii maas-cluster-controller 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS server cluster controller
ii maas-common 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS server common files
ii maas-dhcp 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS DHCP server
ii maas-dns 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS DNS server
ii maas-proxy 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS Caching Proxy
ii maas-region-controller 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS server complete region controller
ii maas-region-controller-min 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS Server minimum region controller
ii python-django-maas 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS server Django web framework
ii python-maas-client 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS python API client
ii python-maas-provisioningserver 1.7.1+bzr3341-0ubuntu1~trusty1 all MAAS server provisioning libraries

Larry Michel (lmic)
description: updated
Changed in maas:
importance: Undecided → Critical
tags: added: oil-bug-1372407
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

There are some tracebacks in pserv.log from around the same time - Feb 6 08:37

http://paste.ubuntu.com/10178354/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

And tracebacks in maas-django.log:

http://paste.ubuntu.com/10178391/

Revision history for this message
Raphaël Badin (rvb) wrote :

Any chance you can share the oops reports that were created around the same time (if any)?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's the oops from that day.

Changed in maas:
status: New → In Progress
assignee: nobody → Raphaël Badin (rvb)
Revision history for this message
Raphaël Badin (rvb) wrote :

At this stage, my only hypothesis is that the boot resources "disappeared" for a moment while the images were being imported. I'm trying to reproduce.

Could you please give me the result of running: http://paste.ubuntu.com/10408031/

Revision history for this message
Greg Lutostanski (lutostag) wrote :

$ sudo maas-region-admin shell
[sudo] password for ubuntu:
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from maasserver.models.bootresource import BootResource
>>> sorted(BootResource.objects.get_usable_architectures())
[u'amd64/generic', u'amd64/hwe-p', u'amd64/hwe-q', u'amd64/hwe-r', u'amd64/hwe-s', u'amd64/hwe-t', u'amd64/hwe-u', u'i386/generic', u'i386/hwe-p', u'i386/hwe-q', u'i386/hwe-r', u'i386/hwe-s', u'i386/hwe-t', u'i386/hwe-u', u'ppc64el/generic', u'ppc64el/hwe-p', u'ppc64el/hwe-q', u'ppc64el/hwe-r', u'ppc64el/hwe-s', u'ppc64el/hwe-t', u'ppc64el/hwe-u']
>>>

Raphaël Badin (rvb)
Changed in maas:
status: In Progress → Triaged
assignee: Raphaël Badin (rvb) → nobody
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I've been collecting the output of 'maas admin boot-images read' and 'maas admin boot-resources read' every five minutes for a week and a half now, and both have always included all of the expected entries.
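
A minimal sketch of that kind of polling loop, assuming Python 3.7+ and the "admin" CLI profile used in the commands quoted above:

import subprocess
import time

COMMANDS = [
    ["maas", "admin", "boot-images", "read"],
    ["maas", "admin", "boot-resources", "read"],
]

while True:
    for cmd in COMMANDS:
        # Capture the CLI output and tag it with a timestamp.
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(time.strftime("%F %T"), " ".join(cmd))
        print(result.stdout)
    time.sleep(300)  # five minutes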

Changed in maas:
milestone: none → next
Revision history for this message
Larry Michel (lmic) wrote :

Seeing a recreate of this one, with trusty listed as invalid this time:

'3':
    agent-state-info: 'gomaasapi: got error back from server: 400 BAD REQUEST ({"osystem":
      ["''ubuntu'' is not a valid osystem. It should be one of: ''''."], "distro_series":
      ["''ubuntu/trusty'' is not a valid distro_series. It should be one of: ''''."]})'
    containers:

Revision history for this message
Larry Michel (lmic) wrote :

The list of releases is also empty.

Revision history for this message
Larry Michel (lmic) wrote :

Seeing this with MAAS 1.8b1 as well:

  '6':
    agent-state-info: 'gomaasapi: got error back from server: 400 BAD REQUEST ({"osystem":
      ["''ubuntu'' is not a valid osystem. It should be one of: ''''."], "distro_series":
      ["''ubuntu/trusty'' is not a valid distro_series. It should be one of: ''''."]})'
    containers:
      6/lxc/0:
        instance-id: pending
        series: trusty
    instance-id: pending
    series: trusty

Changed in maas:
milestone: next → 1.8.0
Revision history for this message
Andres Rodriguez (andreserl) wrote :

17:36 < Beret> roaksoax, we're seeing this - seen it? http://paste.ubuntu.com/11085319/
17:37 < Beret> I've seen multiple people mention it
17:38 < dpb> roaksoax: what I get on the gui: https://www.dropbox.com/s/6fk2lfplsndr378/maas-importing.png?dl=0

summary: - boostrap failing with gomaasapi 400
+ boostrap failing with gomaasapi 400 when updating MAAS images
Changed in maas:
status: Triaged → Confirmed
David Britton (dpb)
tags: added: landscape
Revision history for this message
Blake Rouse (blake-rouse) wrote : Re: boostrap failing with gomaasapi 400 when updating MAAS images

The supported images come from all of the clusters, not the region. An RPC call is made to all clusters to get information about the available releases, based on the images each cluster has at that time. That information is also cached on the cluster side, and the cache is only updated once an import finishes or when the cluster is started.

So for this to occur, the cache on the cluster side would have to change before an import is done, which is not the case: it is only changed once the import on the cluster has finished. This entire process also runs under a lock, so the cluster can only update the images and then update the cache, exactly in that order, one at a time.

So a quick fix of disabling the auto-syncing of images from the cluster to the region cannot be done, as the cluster needs to stay in sync with the region. The cluster only updates the images on disk if the content in the region differs from what it currently has, so the auto-update is very fast when nothing has changed; the cache is also updated at the end of that check. The cache is replaced in one operation, never modified in place, and its contents are completely replaced when an import has finished, so it should not be possible for the RPC call requesting the available images to get partial information.

Now, there are other things that could cause an image not to show up even though it is actually present. A call is made to UbuntuOS.is_release_supported in provisioningserver/drivers/osystems/ubuntu.py. This uses distro_info to check that the release is a supported Ubuntu distribution; if it returns False, that release will not show up over the API as a valid option. We should check that this method always returns True for imported releases, and make sure that its logic and the library it uses are not doing something unexpected.
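
To make that concrete, here is a minimal sketch of such a check, assuming the distro_info Python package; the function body is illustrative, not MAAS's actual implementation:

from distro_info import UbuntuDistroInfo

def is_release_supported(release):
    # distro_info reads static data from /usr/share/distro-info/ubuntu.csv,
    # so the answer should not vary between calls on the same machine.
    return release in UbuntuDistroInfo().all

print(is_release_supported("trusty"))  # expected: True
print(is_release_supported("bogus"))   # expected: False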

There is also the case where the cluster is down: it will fail to return any releases. That only fits the error cases where no releases at all are listed as supported.

We could make a change to the architecture and get the available releases from the region database instead of the clusters. Overall I think this is a better idea. I don't know whether it would fully eliminate this issue, as the API should still check that the cluster has the image before deploying; if it does not, the deploy should fail, since the machine would fail to boot and install.

Revision history for this message
Gavin Panella (allenap) wrote :

> We could make a change to the architecture and get the available
> releases from the region database instead of the cluster. I overall
> think this is a better idea. I don't know if that would fully
> eliminate this issue as the API should still check that the cluster
> has that image before deploying. If it does not then it should fail as
> the machine will fail to boot and install.

Getting image information from the region database might be a valid
approach, but we need to be mindful because it has *no* bearing on what
the overall MAAS installation can actually provision at that moment in
time.

Asking the clusters gives us that answer.

Some more general thoughts:

- The PostgreSQL database and the clusters are all data sources for the
  running of this distributed system. PostgreSQL is very reliable and
  the clusters have not yet proven themselves as reliable. However, we
  need the clusters to be more reliable. Working around a reliability
  issue instead of addressing it does not serve MAAS as a whole.

- Asking the clusters and finding that they're unavailable or saying
  that they've got nothing is *useful information*. It means we can't
  provision right now. Again, working around that by going to the
  region's database does not serve MAAS as a whole; it only means we'll
  have to report later on that we can't provision. Even if it's only a
  transient failure, we should try to fix *that* instead of papering
  over it.

Revision history for this message
Raphaël Badin (rvb) wrote :

> We could make a change to the architecture and get the available releases from the region database instead of the cluster.
> I overall think this is a better idea. I don't know if that would fully eliminate this issue as the API should still check that the
> cluster has that image before deploying. If it does not then it should fail as the machine will fail to boot and install.

I agree with this. The architecture we have now, where the "truth" is stored sometimes in the region and sometimes in the cluster, is confusing and generates a lot of friction in practice (even if it seems to make sense from an architectural point of view). The clusters are complete slaves to the region: they just import what they are told to import, so they should simply do that and send an ack to the region saying "I'm up-to-date now".

Now, it's probably too late to do this in the current cycle… so we need to continue looking for the root cause of this bug.

Revision history for this message
Raphaël Badin (rvb) wrote :

This code in clusterrpc/osystems.py, http://paste.ubuntu.com/11094886/, looks wrong to me: it *ignores* all the errors that can happen when querying the clusters for the supported OSes. We should instrument it to see whether any errors are being suppressed there.
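
The pattern being criticized looks roughly like this; a hypothetical illustration with invented names (the actual code is behind the paste link above):

def gather_supported_osystems(clusters):
    osystems = {}
    for cluster in clusters:
        try:
            osystems.update(cluster.list_supported_osystems())
        except Exception:
            # The error is swallowed: a failing cluster silently
            # contributes nothing, so the caller cannot distinguish
            # "no images" from "cluster error".
            continue
    return osystems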

Revision history for this message
Blake Rouse (blake-rouse) wrote :

The result of the RPC call is a dictionary of all the operating systems and their releases; an error in clusterrpc/osystems.py would not cause a single release to be missing, it would cause all of the information to be missing. That could explain the errors where no osystems or releases are shown, but in the case where a single release is missing I don't think it has any effect.

I think is_release_supported on UbuntuOS might be the cause, but that was just from a quick glance last night.

Revision history for this message
Raphaël Badin (rvb) wrote :

> I think the is_release_supported on UbuntuOS might be the cause, but this was just me having a glance last night.

As far as I can see, this uses distro_info which, in turn, uses static data from /usr/share/distro-info/ubuntu.csv. I don't really see how this can vary from one call to the next.

Revision history for this message
Gavin Panella (allenap) wrote :

> I agree with this. The architecture we have now where the "truth" is
> stored sometimes in the region, sometimes in the cluster is confusing
> and generates a lot of friction in practice (even if it seems to make
> sense from an architectural point of view). The clusters are complete
> slaves to the region: they just import what they are told to import:
> they should just do that and send an ack to the region to say: "I'm
> up-to-date now".

No code is needed to get the effect you describe.

* Consulting the region's database gives you one "truth". It's useful
  information, but it's not the whole truth.

* Consulting the clusters gives you another "truth": what can be
  provisioned right now.

The clusters don't send an ack to the region because they don't need to;
we can ask the cluster what it's got and compare it to the region's
knowledge to see if it's up-to-date.

Compare that to a flag in the database:

1. A boolean. Nodes should only be provisioned via this cluster if it's
   TRUE; the cluster cannot be used to provision when the flag is FALSE.

2. A timestamp, say a copy of the most recent modification date of
   records in the boot images table at the moment that the last
   synchronisation began. This would allow us to make provisioning
   decisions based on which images are definitely available and those
   which may not be.

3. A JSON field (or other structure-like field) that records all the
   images that are available on a cluster. This would then be, in
   essence, a cache of the RPC call.

#1, a boolean, is too coarse-grained, and therefore not really a goer.

#2 seems adequate, but it's extra work to calculate what's available
compared to what we have now.

#3 gives us all the information we need, but it too is extra work, and
it is also a cache, with all the attendant problems that implies. The
cost of going to the cluster is meant to be comparable to asking the
region's database - it has been designed that way - so the only
advantage here is that we can see what images were copied at some point
in the past onto a cluster *that may not be available*, i.e. for a
cluster that cannot be used for provisioning.
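
To make option #3 concrete, a hypothetical Django-style sketch; the model and field names are illustrative, not MAAS's actual schema:

from django.db import models

class ClusterRecord(models.Model):
    uuid = models.CharField(max_length=36, unique=True)
    # A cache of the RPC answer: the images this cluster last reported.
    # models.JSONField needs Django 3.1+; MAAS at the time would have
    # used a Postgres-specific JSON field instead.
    available_images = models.JSONField(default=list)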

Revision history for this message
David Britton (dpb) wrote :

MAAS log from +/- 1h around the incident

Revision history for this message
Raphaël Badin (rvb) wrote :

As David said in his original bug report, the failure happened while the *region* was importing the images:

May 11 22:25:36 OrangeBox13 maas.bootresources: [INFO] Started importing of boot images from 1 source(s).
May 11 22:27:45 OrangeBox13 maas.bootresources: [INFO] Importing images from source: http://maas.ubuntu.com/images/ephemeral-v2/daily/
May 11 22:27:45 OrangeBox13 maas.bootsources: [INFO] Updated boot sources cache.

May 11 22:29:53 (corrected timestamp: MAAS server is BST and Juju client is on UTC, original timestamp 21:29:53) job-handler-1 ERROR failed to bootstrap environment: cannot start bootstrap instance: gomaasapi: got error back from server: 400 Bad Request ({"distro_series": ["'trusty' is not a valid distro_series. It should be one of: '', 'ubuntu/precise'."]})

May 11 22:49:26 OrangeBox13 maas.bootresources: [INFO] Finished importing of boot images from 1 source(s).

Now, the region downloading the images shouldn't affect what's available on the cluster in any way… this needs more investigation…

Revision history for this message
Raphaël Badin (rvb) wrote :

Our latest hypothesis is this (allenap is working on reproducing the problem in a unit test right now):
While the region is syncing the images and committing stuff into the database step by step (not in a transactional fashion), the cluster syncs and gets a partial result from the region simplestreams endpoint (because the region import is still going on). Then, and up until the next cluster sync, the image list on the cluster will be broken.

This is consistent with what we've seen in the report from landscape: region syncing, cluster syncing (while the region is still syncing), cluster sync finishes, bug happens, region finishes syncing.

This is also consistent with the fact that the cluster seems to have a *partial* list of images at the time the bug happens.
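
If that hypothesis holds, the shape of a fix would be to make the region's import transactional; a minimal sketch, assuming Django's ORM (which MAAS uses), with hypothetical stand-ins for the import pipeline:

from django.db import transaction

def import_boot_resources(resources, save_resource):
    # Committing the whole import in one transaction means the
    # simplestreams endpoint never serves a half-written resource set.
    with transaction.atomic():
        for resource in resources:
            save_resource(resource)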

Changed in maas:
assignee: nobody → Gavin Panella (allenap)
summary: - boostrap failing with gomaasapi 400 when updating MAAS images
+ bootstrap failing with gomaasapi 400 when updating MAAS images
Raphaël Badin (rvb)
Changed in maas:
assignee: Gavin Panella (allenap) → nobody
assignee: nobody → Raphaël Badin (rvb)
Revision history for this message
Raphaël Badin (rvb) wrote :

We haven't been able to fix this bug so far. We have a suspicion (a bad interaction between the image download code and the simplestreams endpoint) but no definitive evidence. As of 1.8 beta7, we've added log statements to the code that will hopefully help us diagnose and fix the problem when it happens again. If it does, please collect the logs (/var/log/maas/*) and upload them to this bug.

Changed in maas:
assignee: Raphaël Badin (rvb) → nobody
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We haven't seen this in a while, fwiw.

Gavin Panella (allenap)
Changed in maas:
status: Confirmed → Triaged
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Since we haven't seen this in a while, I think it may have been fixed indirectly. Marking it Incomplete for now.

Changed in maas:
status: Triaged → Incomplete
David Britton (dpb)
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
Gavin Panella (allenap) wrote :

I'm assuming that this has been fixed by another change.

Changed in maas:
status: Incomplete → Fix Released