[Reduced footprint] Deploy fails with Task[primary-openstack-controller/8]

Bug #1585596 reported by Ksenia Svechnikova
Affects            | Status  | Importance | Assigned to     | Milestone
Fuel for OpenStack | Invalid | High       | Dmitriy Kruglov |
Mitaka             | Invalid | High       | Dmitriy Kruglov |

Bug Description

Fuel 9.0 376

Detailed bug description: Deployment with the compute and virt roles fails with:
"All nodes are finished. Failed tasks: Task[primary-openstack-controller/8] Stopping the deployment process!"
Reporting the failure itself then fails with: "Task status provided '' is not supported; Task name is not provided".

Steps to reproduce:

0) Prepare a HW environment with 1 node
1) Add FEATURE_GROUPS: ["advanced"] to /etc/nailgun/settings.yaml (see the sketch after this list)
2) service nailgun restart
3) Add the virt and compute roles to the node
4) Add the following JSON config to the virt role

[{u'mem': 4, u'id': 1, u'cpu': 2}, {u'mem': 4, u'id': 2, u'cpu': 2}, {u'mem': 4, u'id': 3, u'cpu': 2}]

5) Run spawn_vm task
6) Assign controllers to the new KVM nodes (3 nodes)
7) Deploy cluster
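
A minimal sketch of steps 1 and 4, assuming PyYAML is available on the Fuel master node. The file path and VM parameters come from the steps above; everything else, including how the resulting VM config is uploaded with the Fuel CLI, is illustrative and not taken from this report:

# Hedged sketch: enable the "advanced" feature group in Nailgun settings (step 1)
# and render the per-VM config for the virt role (step 4) as proper JSON
# (the report shows its Python repr). Assumes PyYAML on the Fuel master node.
import json
import yaml

SETTINGS = "/etc/nailgun/settings.yaml"

with open(SETTINGS) as f:
    settings = yaml.safe_load(f) or {}

groups = settings.setdefault("FEATURE_GROUPS", [])
if "advanced" not in groups:
    groups.append("advanced")

with open(SETTINGS, "w") as f:
    yaml.safe_dump(settings, f, default_flow_style=False)

# VM config for the virt role (step 4): three VMs, 4 GB RAM and 2 CPUs each.
vms_conf = [
    {"id": 1, "mem": 4, "cpu": 2},
    {"id": 2, "mem": 4, "cpu": 2},
    {"id": 3, "mem": 4, "cpu": 2},
]
print(json.dumps(vms_conf))

Step 2 (service nailgun restart) still has to be executed afterwards so that Nailgun picks up the new feature group.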

Expected results: cluster deployed successfully
Actual result: Deploy fails

[root@lab5-fuel9 ~]# fuel node --env 1
id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
---|--------|-----------------------------|---------|--------------|-------------------|---------------|---------------|--------|---------
10 | error | Untitled (80:cb) | 1 | 172.16.40.78 | 52:54:00:4c:80:cb | controller | | True | 1
9 | error | Untitled (40:20) | 1 | 172.16.40.77 | 52:54:00:f4:40:20 | controller | | True | 1
8 | error | Untitled (b2:9a) | 1 | 172.16.40.79 | 52:54:00:1a:b2:9a | controller | | True | 1
1 | error | cz7370-kvm.host-telecom.com | 1 | 172.16.40.71 | 00:25:90:ed:1e:04 | compute, virt | | True | 1
[root@lab5-fuel9 ~]#

MODULAR log from node-8: http://paste.openstack.org/show/505356/

Puppet-apply errors: http://paste.openstack.org/show/498578/

oslo_service errors: http://paste.openstack.org/show/505357/

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :
tags: added: area-python
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

The issue is not reproduced on ISO 9.0 MOS 404

no longer affects: fuel/future
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

The scenario is reproduced on 9.x snapshot #151.
Snapshot details:
UBUNTU_MIRROR_ID=ubuntu-2016-08-03-174238
CENTOS_MIRROR_ID=centos-7.2.1511-2016-05-31-083834
MOS_UBUNTU_MIRROR_ID=9.0-2016-08-17-164322
MOS_CENTOS_OS_MIRROR_ID=os-2016-06-23-135731
MOS_CENTOS_PROPOSED_MIRROR_ID=proposed-2016-08-18-044322
MOS_CENTOS_UPDATES_MIRROR_ID=updates-2016-06-23-135916
MOS_CENTOS_HOLDBACK_MIRROR_ID=holdback-2016-06-23-140047
MOS_CENTOS_HOTFIX_MIRROR_ID=hotfix-2016-07-18-162958
MOS_CENTOS_SECURITY_MIRROR_ID=security-2016-06-23-140002

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :
Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

The whole feature is blocked

tags: added: blocker-for-qa
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

The issue causes the 2 main scenarios (3-controller deployments) in the feature test suite to fail.
This blocks the feature from being used in production.

The tests are executed on baremetal environments (as virtual controllers cannot be deployed by devops) and are not part of Swarm, so feature health is only verified on acceptance test runs.

no longer affects: fuel/newton
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

Nova fails with

2016-09-17 20:34:59.645 24680 ERROR oslo_service.service OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column 'services.last_seen_up' in 'field list'") [SQL: u'SELECT services.created_at AS services_created_at, services.updated_at AS services_updated_at, services.deleted_at AS services_deleted_at, services.deleted AS services_deleted, services.id AS services_id, services.host AS services_host, services.`binary` AS services_binary, services.topic AS services_topic, services.report_count AS services_report_count, services.disabled AS services_disabled, services.disabled_reason AS services_disabled_reason, services.last_seen_up AS services_last_seen_up, services.forced_down AS services_forced_down, services.version AS services_version \nFROM services \nWHERE services.deleted = %s AND services.host = %s AND services.`binary` = %s \n LIMIT %s'] [parameters: (0, '10.109.1.3', 'nova-metadata', 1)]

due to mysql SIGSEGV

19:15:34 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=0
max_threads=151
thread_count=1
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 76505 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fe11be794e0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fe11a0d5e88 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0x7fe11a90fb7c]
/usr/sbin/mysqld(handle_fatal_signal+0x3c2)[0x7fe11a6605c2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fe119355330]
/usr/sbin/mysqld(thd_get_ha_data+0xc)[0x7fe11a6ad54c]
/usr/sbin/mysqld(_Z20thd_binlog_trx_resetP3THD+0x2e)[0x7fe11a8ba79e]
/usr/sbin/mysqld(_Z17wsrep_post_commitP3THDb+0xcc)[0x7fe11a79a32c]
/usr/sbin/mysqld(_Z12trans_commitP3THD+0x6f)[0x7fe11a780dcf]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x38d1)[0x7fe11a6ee851]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x3c8)[0x7fe11a6f29d8]
/usr/sbin/mysqld(+0x508c24)[0x7fe11a6f2c24]
/usr/sbin/mysqld(_Z19do_handle_bootstrapP3THD+0x111)[0x7fe11a6f2ff1]
/usr/sbin/mysqld(+0x509060)[0x7fe11a6f3060]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7fe11934d184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe118a7037d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fe0f00009f0): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

Looks like an issue with mysql
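
The "Unknown column 'services.last_seen_up'" error suggests that the Nova database schema migrations had not completed by the time the service queried the table. A small, hypothetical check for that column, assuming SQLAlchemy and the MySQL-Python driver are present on the node; the connection URL is a placeholder:

# Hedged sketch: verify whether nova.services has the last_seen_up column that
# the failing query expects. The connection URL is a placeholder and must be
# replaced with the environment's real credentials.
from sqlalchemy import create_engine, inspect

engine = create_engine("mysql://nova:PASSWORD@10.109.1.3/nova")  # placeholder
inspector = inspect(engine)

columns = [col["name"] for col in inspector.get_columns("services")]
if "last_seen_up" in columns:
    print("services.last_seen_up exists; the schema looks migrated")
else:
    print("services.last_seen_up is missing; the Nova DB migrations did not complete")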

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The SIGSEGV has nothing to do with the issue: it happens an hour before the failure, only on the first mysql start, and never again. Please see https://bugs.launchpad.net/fuel/+bug/1541338 for details. What really happens is that, according to the atop logs, the load average is above 12; that is why nova-db-sync fails by timeout, running only 6 migrations in 40(!) seconds on an empty database. So this issue is purely environmental and should be handled by the team that sets up the testing environment.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

A few snippets from the logs:

Node has 2 vCPUs

Migrations in nova-manage.log:

2016-09-17 20:00:50.346 477 INFO migrate.versioning.api [-] 0 -> 1...
2016-09-17 20:00:55.156 477 INFO migrate.versioning.api [-] done
2016-09-17 20:00:55.157 477 INFO migrate.versioning.api [-] 1 -> 2...
2016-09-17 20:01:00.541 477 INFO migrate.versioning.api [-] done
2016-09-17 20:01:00.542 477 INFO migrate.versioning.api [-] 2 -> 3...
2016-09-17 20:01:06.271 477 INFO migrate.versioning.api [-] done
2016-09-17 20:01:06.272 477 INFO migrate.versioning.api [-] 3 -> 4...
2016-09-17 20:01:10.814 477 INFO migrate.versioning.api [-] done
2016-09-17 20:01:10.815 477 INFO migrate.versioning.api [-] 4 -> 5...
2016-09-17 20:01:24.629 477 INFO migrate.versioning.api [-] done
2016-09-17 20:01:24.630 477 INFO migrate.versioning.api [-] 5 -> 6...
2016-09-17 20:01:32.337 477 INFO migrate.versioning.api [-] done
2016-09-17 20:01:32.338 477 INFO migrate.versioning.api [-] 6 -> 7...
2016-09-17 20:01:36.982 477 INFO migrate.versioning.api [-] done

atop snippet from node-2

CPU | sys 19% | user 180% | irq 1% | | idle 0% | wait 0% | | steal 0% | guest 0% | curf 2.10GHz | curscal ?% |
cpu | sys 10% | user 89% | irq 0% | | idle 0% | cpu000 w 0% | | steal 0% | guest 0% | curf 2.10GHz | curscal ?% |
cpu | sys 9% | user 90% | irq 0% | | idle 0% | cpu001 w 0% | | steal 0% | guest 0% | curf 2.10GHz | curscal ?% |
CPL | avg1 11.93 | avg5 9.92 | | avg15 6.36 | | | csw 182594 | intr 37171 | | | numcpu 2 |
MEM | tot 3.9G | free 192.0M | cache 1.4G | dirty 1.6M | buff 184.3M | | slab 179.1M | | | |
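
For reference, a hypothetical helper (not part of the deployment) that derives per-migration durations from nova-manage.log lines like the ones above; this is how the "6 migrations in 40(!) seconds" figure can be verified:

# Hedged sketch: compute per-migration durations from nova-manage.log entries of
# the form "<timestamp> <pid> INFO migrate.versioning.api [-] 0 -> 1..." followed
# by a matching "done" line. Purely illustrative log arithmetic.
import re
from datetime import datetime

LINE = re.compile(r"^(\S+ \S+) \d+ INFO migrate\.versioning\.api \[-\] (.+)$")

def migration_durations(lines):
    start_ts, step = None, None
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        msg = match.group(2).strip()
        if msg == "done" and start_ts is not None:
            yield step, (ts - start_ts).total_seconds()
            start_ts, step = None, None
        elif msg != "done":
            start_ts, step = ts, msg.rstrip(".")

with open("nova-manage.log") as log:
    for step, seconds in migration_durations(log):
        print("%-10s %.1f s" % (step, seconds))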

Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :

The QA team is testing this scenario with 4 vCPUs per virtual controller node

Changed in fuel:
status: Invalid → Incomplete
Andrey Maximov (maximov)
Changed in fuel:
assignee: Fuel Toolbox (fuel-toolbox) → MOS Linux (mos-linux)
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

The test has been re-executed with an increased number of vCPUs per node (4 this time) and still fails with the same symptoms: the deployment fails with the 'Task[primary-openstack-controller/8]' error.

However, this time the issue occurred after a MySQL split-brain. It is being investigated further.

tags: removed: feature-reduced-footprint
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Dmitry, could you point out where you've seen the split-brain? I couldn't find any split-brain related entries in this snapshot.

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Ivan, sorry for the misunderstanding. The split-brain occurred on the environment that is currently active and being investigated.
The corresponding snapshot will be added later.

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Why did you change the status to 'Confirmed' without attaching an appropriate snapshot or providing access to the failed environment? How am I supposed to investigate?

Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

Please find attached the snapshot of the latest run, with a split-brain on node-2 (the virtual primary controller).

Revision history for this message
Ivan Suzdal (isuzdal) wrote :

Well, according to log entries from this snapshot I'd say the root cause is very intensive IO.

First, let's take a look at syslog:
<27>Sep 22 19:49:32 node-2 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:30444, this is a split-brain!
Note this timestamp.

At this time we can see in pacemaker.log:
Sep 22 19:49:32 [20448] node-2.test.domain.local crmd: info: process_lrm_event: Operation p_mysqld_monitor_60000: unknown error (node=node-2.test.domain.local, call=52, rc=1, cib-update=350, confirmed=false)

Finally, let's see what atop shows for this time.
At 19:48:55 and 19:49:35 we can see that the vda disk was busy:
CPU | sys 16% | user 57% | irq 0% | idle 249% | wait 77% | guest 0% | curf 2.10GHz | curscal ?% |

DSK | vda | busy 98% | read 6 | write 1853 | KiB/w 14 | MBr/s 0.00 | MBw/s 1.28 | avio 10.4 ms |

and

CPU | sys 30% | user 187% | irq 1% | idle 128% | wait 55% | guest 0% | curf 2.10GHz | curscal ?% |

DSK | vda | busy 78% | read 448 | write 719 | KiB/w 51 | MBr/s 0.25 | MBw/s 1.80 | avio 13.2 ms |

Pay attention to the "wait" and "busy" values.

So my conclusion is: Pacemaker could not get an answer from mysqld (the p_mysqld monitor operation) within a reasonable time and decided to restart the resource.

The split-brain message in syslog is the result of the following situation: while the OCF script was executing, the old mysqld PID still existed. It occurred only once and never happened again.
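
As an illustration of "pay attention to the busy values", a hypothetical helper that scans atop text output (such as the snippets above) for DSK lines and flags samples where the disk busy percentage exceeds a threshold:

# Hedged sketch: flag DSK lines in atop text output whose busy percentage is at
# or above a threshold. Field layout follows the snippet above; illustrative only.
import re
import sys

DSK = re.compile(r"^DSK\s*\|\s*(\S+)\s*\|\s*busy\s+(\d+)%")

def busy_disks(lines, threshold=70):
    for line in lines:
        match = DSK.match(line)
        if match and int(match.group(2)) >= threshold:
            yield match.group(1), int(match.group(2))

if __name__ == "__main__":
    for disk, busy in busy_disks(sys.stdin):
        print("disk %s is %d%% busy" % (disk, busy))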

Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

@isuzdal

I do not agree with the Invalid state; the bug is 100% reproducible, so I am moving it back to Confirmed.

The split-brain occurred at least twice; the last snapshot is from an additional run. The previous occurrence was observed by Vladimir Kuklin.

I want to point out that, from the performance side, this test/feature is similar to our Swarm tests:
we have 1 HW node with the "compute" role and 3 VMs with the "controller" role on it.

So where could the performance issue be? And what do you propose?

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

As has been said many times before, the issue occurs because of the following sequence:
* the test was executed on a system without the needed resources (not enough CPU / slow IO), which is a performance issue
* one instance of mysql was not able to respond to pacemaker and was restarted, which is normal behavior
* the restart of that instance produced *temporary* split-brain error messages, because slow performance meant the restart did not complete *immediately*, again a performance issue
* once mysql was (finally) restarted and the cluster re-synced, everything returned to normal operation. If any queries to mysql failed, this can be fixed in two ways: improve lab performance or implement retries (a generic sketch follows below).

The issue is slow IO on the VMs, so please try to improve this on the lab.
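
A minimal, generic illustration of the "implement retries" option mentioned above, assuming SQLAlchemy's OperationalError is the transient failure class seen in the logs; this is not Fuel's or Nova's actual retry code:

# Hedged sketch: retry a database operation with linear backoff when a transient
# OperationalError occurs (e.g. while mysql is being restarted and re-synced).
# Generic illustration only.
import time

from sqlalchemy.exc import OperationalError

def with_retries(operation, attempts=5, delay=2.0):
    """Run operation(), retrying on OperationalError with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OperationalError:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)

# Example usage (placeholder session and model):
# services = with_retries(lambda: session.query(Service).all())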

Changed in fuel:
assignee: MOS Linux (mos-linux) → Fuel CI (fuel-ci)
Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

Removing Fuel CI; that is our HW lab.
OK, so this bug is a duplicate of https://bugs.launchpad.net/fuel/+bug/1620268 and we will rerun the tests with fixes for that bug.

Changed in fuel:
assignee: Fuel CI (fuel-ci) → nobody
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :
Revision history for this message
Dmitry Kalashnik (dkalashnik) wrote :

Reduced to High; requires additional investigation from the QA side

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: nobody → Dmitry Kalashnik (dkalashnik)
Changed in fuel:
assignee: Dmitry Kalashnik (dkalashnik) → Dmitriy Kruglov (dkruglov)
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote :

The reported issue is not a functional one; it relates to the underlying test environment and should be solved by using appropriate, more capable hardware (presumably IO utilization is the cause here and should be improved).
