newton/ha: tripleo cluster fails to build

Bug #1660331 reported by Emilien Macchi
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   High         Unassigned

Bug Description

It happens on Newton CI jobs for HA scenario:
http://logs.openstack.org/16/426716/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-newton/bd89ac8/logs/postci.txt.gz#_2017-01-30_12_52_18_000

2017-01-30 12:52:18.000 | Error: /sbin/pcs cluster setup --name tripleo_cluster controller-0-tripleo-ci-a-foo controller-1-tripleo-ci-b-bar controller-2-tripleo-ci-c-baz --token 10000 returned 1 instead of one of [0]
2017-01-30 12:52:18.000 | Error: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster setup --name tripleo_cluster controller-0-tripleo-ci-a-foo controller-1-tripleo-ci-b-bar controller-2-tripleo-ci-c-baz --token 10000 returned 1 instead of one of [0]

Tags: ci
Emilien Macchi (emilienm) wrote :

https://www.diffchecker.com/xa89o6Nm is the packaging diff between failing and working jobs. I see nothing related to Pacemaker. It may be a random failure; let's see if it's consistent.

Michele Baldessari (michele) wrote :

So this seems to be something that happens rather rarely (I checked some other ovb-ha logs and have not yet seen an occurrence of this one), but it does seem like a real problem/race.
What seems to happen is the following (note that the 12:46:39 timestamp is slightly misleading: the action happened a little earlier, but os-collect-config logs everything in one big chunk):
Jan 30 12:46:39 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Triggered 'refresh' from 2 events
Jan 30 12:46:39 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: unable to destroy cluster#033[0m
Jan 30 12:46:39 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: controller-0-tripleo-ci-a-foo: Unable to authenticate to controller-0-tripleo-ci-a-foo - (HTTP error: 401), try running 'pcs cluster auth'#033[0m

So basically when puppet-pacemaker does this:
    ->
    exec {"Create Cluster ${cluster_name}":
      creates => '/etc/cluster/cluster.conf',
      command => "${::pacemaker::pcs_bin} cluster setup --name ${cluster_name} ${cluster_members_rrp_real} ${cluster_setup_extras_real}",
      unless => '/usr/bin/test -f /etc/corosync/corosync.conf',
      require => Class['::pacemaker::install'],
    }

pcs actually calls destroy_cluster first (in case a cluster existed before), but it gets a 401 on the controller-0 node:
Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: controller-0-tripleo-ci-a-foo: Unable to authenticate to controller-0-tripleo-ci-a-foo - (HTTP error: 401), try running 'pcs cluster auth'

The corresponding pcsd log shows the following:
I, [2017-01-30T12:46:36.570051 #28804] INFO -- : Return Value: 0
I, [2017-01-30T12:46:36.570128 #28804] INFO -- : Successful login by 'hacluster'
::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1176
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1145
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1147
controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36
::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1188
controller-2-tripleo-ci-c-baz.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36
- -> /remote/auth
- -> /remote/auth
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0042
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0044
controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "GET /remote/cluster_destroy HTTP/1.1" 401 24
- -> /remote/cluster_destroy
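
To spot this pattern in other ovb-ha runs, the pcsd access lines above can be scanned for clients that got a 200 on /remote/auth and then a 401 on a later request. The following is only a rough sketch: the function names and the regular expression are hypothetical helpers written for this comment, and it assumes the access-log format is exactly the one quoted above.

```python
import re
from collections import defaultdict

# Assumed format, taken from the pcsd lines quoted in this report, e.g.
# ::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0044
ACCESS_RE = re.compile(
    r'^(?P<client>\S+) - - \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) '
)


def parse_access_lines(lines):
    """Yield (client, timestamp, path, status) for every access-log line."""
    for line in lines:
        match = ACCESS_RE.match(line.strip())
        if match:  # non-access lines (e.g. "- -> /remote/auth") are skipped
            yield (match["client"], match["ts"],
                   match["path"], int(match["status"]))


def auth_then_denied(lines):
    """Map each client that authenticated (200) to paths later refused (401)."""
    authed = set()
    denied = defaultdict(list)
    for client, _ts, path, status in parse_access_lines(lines):
        if path == "/remote/auth" and status == 200:
            authed.add(client)
        elif status == 401 and client in authed:
            denied[client].append(path)
    return dict(denied)


# A subset of the log lines quoted in this report.
SAMPLE = [
    '::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1176',
    '::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1145',
    'controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36',
    '- -> /remote/auth',
    '::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0042',
    '::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0044',
    'controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "GET /remote/cluster_destroy HTTP/1.1" 401 24',
]

if __name__ == "__main__":
    for client, paths in auth_then_denied(SAMPLE).items():
        print(client, "authenticated but was then refused on:", paths)
```

On the sample lines this flags the ::ffff:172.17.0.251 and controller-0-tripleo-ci-a-foo.localdomain clients, both refused on /remote/cluster_destroy in the same second as a successful auth. It only matches by client field, so it cannot tell which pcs invocation a request belongs to.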

So even though we correctly did an auth ...


Emilien Macchi (emilienm) wrote :
tags: removed: alert
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Changed in tripleo:
status: Triaged → Fix Released