[swarm] custom graph doesn't start

Bug #1626072 reported by Andrey Lavrentyev
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
High
Alexander Kurenyshev

Bug Description

Detailed bug description:
Custom graph doesn't fire

Swarm failure: https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.custom_graph/68/testReport/(root)/two_custom_graphs_interfere/

So we got a test failure:
DevopsCalledProcessError: Command 'ls /tmp/yaql_task_on_all_nodes' returned exit code 2 while expected [0]
 STDERR:
ls: cannot access /tmp/yaql_task_on_all_nodes: No such file or directory

Steps to reproduce:
Execute 'two_custom_graphs_interfere' test with the following steps:
1. Revert snapshot 'pre_provision_ubuntu_slaves_3'
2. Create cluster
3. Add 1 node with controller role
4. Add 1 node with compute role
5. Add 1 node with storage role
6. Provision cluster
7. Upload 'custom_graph' tasks to release
8. Upload 'yaql_graph' tasks to release
9. Run 'custom_graph' deployment.
10. Run 'yaql_graph' deployment.
11. Verify that 'yaql_graph' tasks are called on controller <<< FAILURE
12. Verify that 'yaql_graph' tasks are called on compute
13. Verify that 'yaql_graph' tasks are called on cinder
14. Create snapshot `two_custom_graphs_interfere`

Expected results:
yaql_graph is executed on the node

Actual result:
yaql_graph doesn't start

Impact:
Swarm failure

Description of the environment:

9.1 snapshot #287

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.1
cat /etc/fuel_openstack_version:
 mitaka-9.0

UBUNTU_MIRROR_ID=ubuntu-2016-09-14-213640
CENTOS_MIRROR_ID=centos-7.2.1511-2016-05-31-083834
MOS_UBUNTU_MIRROR_ID=9.0-2016-09-20-102322
MOS_CENTOS_OS_MIRROR_ID=os-2016-06-23-135731
MOS_CENTOS_PROPOSED_MIRROR_ID=proposed-2016-09-20-102322
MOS_CENTOS_UPDATES_MIRROR_ID=updates-2016-06-23-135916
MOS_CENTOS_HOLDBACK_MIRROR_ID=holdback-2016-06-23-140047
MOS_CENTOS_HOTFIX_MIRROR_ID=hotfix-2016-09-16-084322
MOS_CENTOS_SECURITY_MIRROR_ID=security-2016-06-23-140002

Logs: https://drive.google.com/open?id=0B5HPBFb7K7gXdndiVG5uTWdpa3c

Swarm notes: The first identical failure is found on Monday (19-Sep). No similar failures are found before for this particular test.

Changed in fuel:
assignee: nobody → Ilya Kutukov (ikutukov)
milestone: none → 9.1
description: updated
description: updated
Revision history for this message
Ilya Kutukov (ikutukov) wrote :

There is output in app.log that is telling that the task was skipped based on YAQL expression added($.nodes).uid

2016-09-21 13:45:22.742 DEBUG [7fc4b10a3880] (task_serializer) evaluate yaql expression: $.uid in added($.nodes).uid
2016-09-21 13:45:22.745 DEBUG [7fc4b10a3880] (task_serializer) Task custom_task_on_all_nodes is skipped by condition.
2016-09-21 13:45:22.755 DEBUG [7fc4b10a3880] (task_serializer) Task custom_task_on_all_nodes is skipped by condition.

Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :

Well, took a look at the previous runs of this test. Looks like a test issue now and previous success test executions were due to the following check skips which didn't fail the test:

<<< 11. Verify that 'yaql_graph' tasks are called on controller >>>
2016-09-14 00:08:49 - INFO ssh_manager.py:137 -- SSHManager: Check for current connection fails. Try to reconnect
2016-09-14 00:08:50 - INFO base_test_case.py:135 --
<<< 12. Verify that 'yaql_graph' tasks are called on compute >>>
2016-09-14 00:08:50 - INFO ssh_manager.py:137 -- SSHManager: Check for current connection fails. Try to reconnect
2016-09-14 00:08:50 - INFO base_test_case.py:135 --
<<< 13. Verify that 'yaql_graph' tasks are called on cinder >>>
2016-09-14 00:08:50 - INFO ssh_manager.py:137 -- SSHManager: Check for current connection fails. Try to reconnect

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

fuel2 graph execute -t yaql_graph -e 1 --force

| 18 | ready | deploy | None | 1 | {} | False | 100 |
| 19 | ready | deployment | yaql_graph | 1 | {} | False | 100 |
+----+--------+------------+--------------+---------+--------+---------+----------+

Everything fine with --force

+--------------------------+---------+--------+----------------------------+----------------------------+
| task_name | node_id | status | time_start | time_end |
+--------------------------+---------+--------+----------------------------+----------------------------+
| custom_task_on_all_nodes | 1 | ready | 2016-09-21T14:10:10.662786 | 2016-09-21T14:10:12.473825 |
| custom_task_on_all_nodes | 3 | ready | 2016-09-21T14:10:10.812715 | 2016-09-21T14:10:12.582099 |
| custom_task_on_all_nodes | 2 | ready | 2016-09-21T14:10:10.963538 | 2016-09-21T14:10:12.693458 |
+--------------------------+---------+--------+----------------------------+----------------------------+

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

[root@nailgun ~]# fuel2 graph execute -t yaql_graph -e 1 --force
Deployment task with id 36 for the environment 1 has been started.
[root@nailgun ~]# fuel2 task history show 37
+--------------------------+---------+--------+----------------------------+----------------------------+
| task_name | node_id | status | time_start | time_end |
+--------------------------+---------+--------+----------------------------+----------------------------+
| custom_task_on_all_nodes | 1 | ready | 2016-09-21T14:23:48.492795 | 2016-09-21T14:23:50.443490 |
| custom_task_on_all_nodes | 3 | ready | 2016-09-21T14:23:48.641984 | 2016-09-21T14:23:50.384106 |
| custom_task_on_all_nodes | 2 | ready | 2016-09-21T14:23:48.871219 | 2016-09-21T14:23:50.626829 |
+--------------------------+---------+--------+----------------------------+----------------------------+
[root@nailgun ~]# fuel2 graph execute -t yaql_graph -e 1
Deployment task with id 38 for the environment 1 has been started.
[root@nailgun ~]# fuel2 task history show 39
+--------------------------+---------+---------+----------------------------+----------------------------+
| task_name | node_id | status | time_start | time_end |
+--------------------------+---------+---------+----------------------------+----------------------------+
| custom_task_on_all_nodes | 1 | skipped | 2016-09-21T14:23:58.551713 | 2016-09-21T14:23:58.551727 |
| custom_task_on_all_nodes | 3 | skipped | 2016-09-21T14:23:58.576595 | 2016-09-21T14:23:58.576608 |
| custom_task_on_all_nodes | 2 | skipped | 2016-09-21T14:23:58.601158 | 2016-09-21T14:23:58.601171 |
+--------------------------+---------+---------+----------------------------+----------------------------+

Revision history for this message
Ilya Kutukov (ikutukov) wrote :
Download full text (3.7 KiB)

Having clean nodes set-up:

[root@nailgun ~]# fuel2 env add nodes -e 1 -n 4 -r controller
[root@nailgun ~]# fuel2 env add nodes -e 1 -n 5 -r compute
[root@nailgun ~]# fuel2 env add nodes -e 1 -n 6 -r cinder
[root@nailgun ~]# fuel2 env nodes provision -e 1 -n 4 5 6
Provisioning task with id 43 for the nodes 4, 5, 6 within the environment 1 has been started.

| 43 | ready | provision | None | 1 | {} | False | 100 |

[root@nailgun ~]# fuel2 graph execute -t yaql_graph -e 1
Deployment task with id 44 for the environment 1 has been started.

[root@nailgun ~]# fuel2 task list
+----+--------+---------------+--------------+---------+--------+---------+----------+
| id | status | name | graph_type | cluster | result | dry_run | progress |
+----+--------+---------------+--------------+---------+--------+---------+----------+
....
| 43 | ready | provision | None | 1 | {} | False | 100 |
| 44 | ready | deploy | None | 1 | {} | False | 100 |
| 45 | ready | deployment | yaql_graph | 1 | {} | False | 100 |
+----+--------+---------------+--------------+---------+--------+---------+----------+

[root@nailgun ~]# fuel2 task history show 45
+--------------------------+---------+--------+----------------------------+----------------------------+
| task_name | node_id | status | time_start | time_end |
+--------------------------+---------+--------+----------------------------+----------------------------+
| custom_task_on_all_nodes | 5 | ready | 2016-09-21T14:45:53.816717 | 2016-09-21T14:45:55.607118 |
| custom_task_on_all_nodes | 4 | ready | 2016-09-21T14:45:53.926970 | 2016-09-21T14:45:55.836567 |
| custom_task_on_all_nodes | 6 | ready | 2016-09-21T14:45:54.046535 | 2016-09-21T14:45:55.870040 |
+--------------------------+---------+--------+----------------------------+----------------------------+

And repeat:

[root@nailgun ~]# fuel2 graph execute -t yaql_graph -e 1

| 43 | ready | provision | None | 1 | {} | False | 100 |
| 44 | ready | deploy | None | 1 | {} | False | 100 |
| 45 | ready | deployment | yaql_graph | 1 | {} | False | 100 |
| 46 | ready | deploy | None | 1 | {} | False | 100 |
| 47 | ready | deployment | yaql_graph | 1 | {} | False | 100 |
+----+--------+---------------+--------------+---------+--------+---------+----------+

Result is skipped

[root@nailgun ~]# fuel2 task history show 47
+--------------------------+---------+---------+----------------------------+----------------------------+
| task_name | node_id | status | time_start | time_end |
+--------------------------+---------+---------+----------------------------+----------------------------+
| custom_task_on_all_nodes | 4 | skipped | 2016-09-21T14:47:07.160248 | 2016-09-21T14:47:07.160260 |
| custom_task_on_all_nodes | 6 | skipped | 2016-09-21T14:47:07.186977 | 2016-09-21T14:47:07.1869...

Read more...

Revision history for this message
Ilya Kutukov (ikutukov) wrote :

UPD from bgaifulling: at the moment when graph was running all slaves was offline, maybe smth with wait timers.

Changed in fuel:
importance: Undecided → Medium
status: New → Confirmed
importance: Medium → High
Revision history for this message
Ilya Kutukov (ikutukov) wrote :

Summary: we have an assumption based on logs that all fuel nodes was offline when yaql_graph was run. YAQL expressions in task and graph itself seems to be valid. Experiments with local env demonstrated us the proper behaviour.
It's recommended to pay attention to the testsuite timings and operations order.

Changed in fuel:
assignee: Ilya Kutukov (ikutukov) → Fuel QA Team (fuel-qa)
Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

Moved to invalid, lets wait for a reproduce here [1]. For now job is green.
Feel free to reopen if it appears again.

[1] https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.custom_graph/

Changed in fuel:
status: Confirmed → Incomplete
status: Incomplete → Invalid
Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :

@Alexander, please bear in mind that there is a suspicious output from tests on those checks that make sure that graphs are executed.

From that output: http://paste.openstack.org/show/582596/
It doesn't make sense whether the check was a success or it failed silently

Revision history for this message
Bulat Gaifullin (bulat.gaifullin) wrote :

Some nodes was offline state when graph execution started.
Please add one additional step, which will check that nodes in online state before execute custom_graph.

Revision history for this message
ElenaRossokhina (esolomina) wrote :

it seems this issue was reproduced again on snapshot #299.
Case custom_yaql_expression_tasks failed with Traceback
<...>
  File "/home/jenkins/workspace/9.x.system_test.ubuntu.custom_graph/fuelweb_test/tests/tests_custom_graph/test_custom_graph.py", line 790, in custom_yaql_expression_tasks
    cluster_id, ['compute'])[0]
IndexError: list index out of range
Graph was applied only on controller node, other nodes were still unavailable after provisioning.
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.custom_graph/70/consoleFull

Changed in fuel:
status: Invalid → Confirmed
Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

We should forcibly start nailgun-agent after reboot on the nodes under provisioning stage

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Nailgun agent runs each minute by cron. It is not a daemon at all, so we cannot 'start' it

Changed in fuel:
status: Confirmed → Invalid
assignee: Vladimir Sharshov (vsharshov) → Fuel QA Team (fuel-qa)
Revision history for this message
Bulat Gaifullin (bulat.gaifullin) wrote :

The problem is that node is 'offline' state in nailgun after reboot, because assessid switch node to offline state during reboot, but railgun-agent does not report new state of node exactly after reboot.
The workaround is:
for case of using sequences of graphs:
  added task which will manually start nailgun_agent after reboot task
for tests:
  after provisioning manually wait until all nodes became online.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/375578

Changed in fuel:
status: Invalid → Confirmed
assignee: Fuel QA Team (fuel-qa) → Alexander Kurenyshev (akurenyshev)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/376355

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/mitaka)

Reviewed: https://review.openstack.org/375520
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=2f1f8bc0f5367e48835f7800456eeadac94aa260
Submitter: Jenkins
Branch: stable/mitaka

commit 2f1f8bc0f5367e48835f7800456eeadac94aa260
Author: Alexander Kurenyshev <email address hidden>
Date: Fri Sep 23 17:25:43 2016 +0300

    Add waiting for nodes become online for custom graph tests

    This patch adds waiting for nodes become online
    to avoiding problem with reboot task.

    Closes-Bug:1626072
    Change-Id: I05785eb134544fe76c08c8ff64e4e2524ae4a8e7

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/376355
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=a942a284756e72ba241bce0f2ff64fcd4ba5f463
Submitter: Jenkins
Branch: master

commit a942a284756e72ba241bce0f2ff64fcd4ba5f463
Author: Alexander Kurenyshev <email address hidden>
Date: Fri Sep 23 17:25:43 2016 +0300

    Add waiting for nodes become online for custom graph tests

    This patch adds waiting for nodes become online
    to avoiding problem with reboot task.

    Closes-Bug:1626072
    Change-Id: I05785eb134544fe76c08c8ff64e4e2524ae4a8e7

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Revision history for this message
ElenaRossokhina (esolomina) wrote :
Revision history for this message
ElenaRossokhina (esolomina) wrote :

verified: 9.1 snapshot #319

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/386004

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/mitaka)

Reviewed: https://review.openstack.org/375578
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=161d971861fed20b1469aec4c3f55ac2323f0664
Submitter: Jenkins
Branch: stable/mitaka

commit 161d971861fed20b1469aec4c3f55ac2323f0664
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Sep 23 18:36:21 2016 +0300

    Run nailgun-agent on rebooted nodes

    Without this change some node can be marked wrongly
    as offline in Nailgun side.

    Change-Id: I4a89ac101867effe6f277c2dcaa93e9b67b65875
    Closes-Bug: #1626072

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/396563

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/386004
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=de9c731fc290b9bf7bafc88fa2e247e7d2ae01fd
Submitter: Jenkins
Branch: master

commit de9c731fc290b9bf7bafc88fa2e247e7d2ae01fd
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Sep 23 18:36:21 2016 +0300

    Run nailgun-agent on rebooted nodes

    Without this change some node can be marked wrongly
    as offline in Nailgun side.

    Change-Id: I4a89ac101867effe6f277c2dcaa93e9b67b65875
    Closes-Bug: #1626072

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/newton)

Reviewed: https://review.openstack.org/396563
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=03b914edf275ac62d59dc90153fef9d532250c8c
Submitter: Jenkins
Branch: stable/newton

commit 03b914edf275ac62d59dc90153fef9d532250c8c
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Fri Sep 23 18:36:21 2016 +0300

    Run nailgun-agent on rebooted nodes

    Without this change some node can be marked wrongly
    as offline in Nailgun side.

    Change-Id: I4a89ac101867effe6f277c2dcaa93e9b67b65875
    Closes-Bug: #1626072

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-astute 11.0.0.0rc1

This issue was fixed in the openstack/fuel-astute 11.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.