[CI] create_diagnostic_snapshot() timed out due to disk problems

Bug #1471846 reported by Roman Podoliaka
Affects: Fuel for OpenStack
Status: Won't Fix
Importance: High
Assigned to: Fuel CI

Bug Description

One of the recent staging jobs (http://jenkins-product.srt.mirantis.net:8080/job/5.0.3.staging.ubuntu.bvt_2/214/consoleFull) failed with:

======================================================================
ERROR: Deploy cluster in HA mode with VLAN Manager
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/5.0.3.staging.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 59, in wrapper
    "fail", func.__name__)
  File "/home/jenkins/workspace/5.0.3.staging.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 155, in create_diagnostic_snapshot
    task = env.fuel_web.task_wait(env.fuel_web.client.generate_logs(), 60 * 5)
  File "/home/jenkins/workspace/5.0.3.staging.ubuntu.bvt_2/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/5.0.3.staging.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 566, in task_wait
    "was exceeded: ".format(task=task["name"], timeout=timeout))
TimeoutError: Waiting task "dump" timeout 300 sec was exceeded:
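
For illustration only, the kind of polling wait that produces the error above can be sketched as follows. This is not the fuelweb_test implementation: `wait_for_task`, `TaskTimeoutError`, and the `get_task` callable (assumed to return a dict with `name`, `status`, and `progress` keys) are hypothetical names. Unlike the message in the traceback, which ends in an empty "was exceeded: ", this sketch appends the last observed task state, which may make such failures easier to triage.

```python
import time


class TaskTimeoutError(Exception):
    """Raised when a task is still running after the allowed time."""


def wait_for_task(get_task, timeout, interval=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll get_task() until the task is no longer 'running'.

    get_task is a hypothetical callable returning a dict with
    'name', 'status' and (optionally) 'progress' keys.
    clock and sleep are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        task = get_task()
        if task["status"] != "running":
            return task
        if clock() >= deadline:
            # Include the last known state instead of an empty suffix.
            raise TaskTimeoutError(
                'Waiting task "{name}" timeout {timeout} sec was exceeded: '
                'status={status}, progress={progress}'.format(
                    name=task["name"],
                    timeout=timeout,
                    status=task["status"],
                    progress=task.get("progress"),
                )
            )
        sleep(interval)
```

With a 5-minute limit, as in the traceback, a call would look like `wait_for_task(fetch_dump_task, 60 * 5, interval=10)`, where `fetch_dump_task` is again a placeholder for whatever queries the task status.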

The astute logs show that the shotgun task was started but never finished:

2015-07-02T12:54:00 debug: [421] Try to execute command: shotgun -c /tmp/dump_config >> /var/log/dump.log 2>&1 && cat /var/www/nailgun/dump/last

dump.log:

2015-07-02 12:54:35 DEBUG 442 (utils) Trying to execute command: mkdir -p "/var/www/nailgun/dump/fuel-snapshot-2015-07-02_12-54-01/node-5.test.domain.local/commands"
2015-07-02 12:54:35 DEBUG 442 (manager) Dumping: {u'path': u'/etc/astute.yaml', 'host': {u'ssh-key': u'/root/.ssh/id_rsa', u'address': u'node-4.test.domain.local'}, u'type': u'file'}
2015-07-02 12:54:35 DEBUG 442 (driver) Initializing driver File: host={u'ssh-key': u'/root/.ssh/id_rsa', u'address': u'node-4.test.domain.local'}
2015-07-02 12:54:35 DEBUG 442 (driver) File to get: /etc/astute.yaml
2015-07-02 12:54:35 DEBUG 442 (driver) File to save: /var/www/nailgun/dump/fuel-snapshot-2015-07-02_12-54-01/node-4.test.domain.local/etc
2015-07-02 12:54:35 DEBUG 442 (driver) Getting remote file: /etc/astute.yaml /var/www/nailgun/dump/fuel-snapshot-2015-07-02_12-54-01/node-4.test.domain.local/etc
2015-07-02 12:54:35 DEBUG 442 (utils) Trying to execute command: mkdir -p "/var/www/nailgun/dump/fuel-snapshot-2015-07-02_12-54-01/node-4.test.domain.local/etc"

^ shotgun hung here

kern.log on node-4 contains a traceback:

2015-07-02T12:53:33.176157+00:00 err: [ 3240.832040] INFO: task kworker/u2:0:6 blocked for more than 120 seconds.
2015-07-02T12:53:33.177386+00:00 err: [ 3240.840231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2015-07-02T12:53:33.177386+00:00 info: [ 3240.847423] kworker/u2:0 D ffff8800361900e0 0 6 2 0x00000000
2015-07-02T12:53:33.177386+00:00 info: [ 3240.847431] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847433] ffff88007c7ff3d8 0000000000000046 ffff880000000002 ffff88007fc14580
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847435] ffff88007c7fffd8 ffff88007c7fffd8 ffff88007c7fffd8 0000000000014580
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847436] ffffffff81c14440 ffff88007c000000 ffff88007c7ff3a8 ffff88007fc14e28
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847438] Call Trace:
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847443] [<ffffffff811e6620>] ? __wait_on_buffer+0x30/0x30
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847447] [<ffffffff817479b9>] schedule+0x29/0x70
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847449] [<ffffffff81747a8f>] io_schedule+0x8f/0xd0
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847451] [<ffffffff811e662e>] sleep_on_buffer+0xe/0x20
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847453] [<ffffffff81744f6a>] __wait_on_bit_lock+0x5a/0xc0
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847455] [<ffffffff811e6620>] ? __wait_on_buffer+0x30/0x30
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847457] [<ffffffff8174504c>] out_of_line_wait_on_bit_lock+0x7c/0x90
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847460] [<ffffffff81089cc0>] ? wake_atomic_t_function+0x40/0x40
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847463] [<ffffffff811e6676>] __lock_buffer+0x36/0x40
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847466] [<ffffffff812a2aff>] do_get_write_access+0x49f/0x540
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847469] [<ffffffff812a2cf0>] jbd2_journal_get_write_access+0x30/0x50
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847472] [<ffffffff8127f033>] __ext4_journal_get_write_access+0x43/0x90
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847475] [<ffffffff8124b11a>] ? ext4_read_block_bitmap+0x3a/0x60
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847477] [<ffffffff812865ff>] ext4_mb_mark_diskspace_used+0x7f/0x500
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847480] [<ffffffff8119c95c>] ? kmem_cache_alloc+0x12c/0x150
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847482] [<ffffffff81287de1>] ext4_mb_new_blocks+0x2c1/0x4b0
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847484] [<ffffffff8127856c>] ? ext4_ext_check_overlap.isra.21+0xbc/0xd0
2015-07-02T12:53:33.177386+00:00 warning: [ 3240.847487] [<ffffffff8127de27>] ext4_ext_map_blocks+0x4d7/0xa70
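
When triaging similar failures, it can help to pull the hung-task messages out of kern.log and compare their timestamps with the dump timeline (here the message was logged at 12:53:33, shortly before shotgun was started at 12:54:00). The following is a minimal sketch, assuming the log line format shown above; `find_hung_tasks` and `HUNG_TASK_RE` are hypothetical names, not part of any Fuel tool.

```python
import re

# Matches the kernel hung-task message, e.g.
# "INFO: task kworker/u2:0:6 blocked for more than 120 seconds."
# The task name may itself contain ':' so the PID is the last
# colon-separated number before " blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<task>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of dicts describing hung-task messages.

    Assumes each log line starts with a timestamp followed by a
    space, as in the kern.log excerpt above.
    """
    found = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match is None:
            continue
        found.append({
            "timestamp": line.split(" ", 1)[0],
            "task": match.group("task"),
            "pid": int(match.group("pid")),
            "blocked_secs": int(match.group("secs")),
        })
    return found
```

Usage would be along the lines of `find_hung_tasks(open("kern.log"))`, with the path adjusted to wherever the node's logs were collected.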

Changed in fuel:
milestone: none → 5.0.3
importance: Undecided → High
status: New → Confirmed
Kairat Kushaev (kkushaev) wrote :
Igor Shishkin (teran) wrote :

Hello Roman, Kairat.

It's not a traceback, just a message that the hung task timeout was reached for pdflush.
It's a normal situation and can't be critical.

Moving to the CI team for investigation of the timeouts. From the HW side everything was fine at that step.
We probably need to reschedule that job.

Changed in fuel:
assignee: Fuel DevOps (fuel-devops) → Fuel CI team (fuel-ci)
Changed in fuel:
status: Confirmed → Won't Fix
tags: added: fuel-ci