Random nova-compute services go down across the cluster when testing with Rally

Bug #1463604 reported by Hugh Ma
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Unassigned
Milestone: 6.0.1

Bug Description

Fuel 2014.2-6.0 (Juno), Ubuntu 12.04.4
1 controller / 40 compute nodes
over 900 vCPUs and 3 TB of memory

Deployment is fine. The problem crops up when using Rally to spawn a large number of m1.medium VMs (2 vCPU / 4 GB RAM).
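
(For reference, the Rally run is essentially a boot/delete scenario against the m1.medium flavor. A minimal task of that kind might look roughly like the sketch below; the scenario name, image name, and the times/concurrency values are illustrative assumptions, not the exact ones from this run.)

    ---
    NovaServers.boot_and_delete_server:
      - args:
          flavor:
            name: "m1.medium"
          image:
            name: "TestVM"          # image name is an assumption (Fuel's default test image)
        runner:
          type: "constant"
          times: 500                # illustrative; failures were seen in the 300-500 VM range
          concurrency: 50           # illustrative
        context:
          users:
            tenants: 1
            users_per_tenant: 1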

When spawning a small number of VMs (around 50-200) there is usually no problem. The issue starts to crop up around 300-500 VMs: nova-compute services start to go down, usually 1-3 of them, and I see the nodes go offline in Fuel. Sometimes when I connect via IPMI the system appears completely hung; other times it requires a forced reboot. Restarting nova-compute does NOT help. Fuel shows this error:

Failed to publish message to topic 'conductor': [Errno 110] Connection timed out
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 681, in ensure
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit return method()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 784, in _publish
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic=topic, **kwargs)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 395, in __init__
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit **options)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 340, in __init__
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 348, in reconnect
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 84, in __init__
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.revive(self._channel)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 218, in revive
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.declare()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 104, in declare
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.exchange.declare()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/entity.py", line 166, in declare
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit nowait=nowait, passive=passive,
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 620, in exchange_declare
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit (40, 11), # Channel.exchange_declare_ok
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 67, in wait
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.channel_id, allowed_methods)
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 237, in _wait_method
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit self.method_reader.read_method()
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/amqp/method_framing.py", line 189, in read_method
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit raise m
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit error: [Errno 110] Connection timed out
2015-06-09 22:12:12.913 18502 TRACE oslo.messaging._drivers.impl_rabbit

Another node shows a different error:

Exception during message handling: Failed to terminate process 15995 with SIGKILL: Device or resource busy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 137, in _dispatch_and_reply
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 180, in _dispatch
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 126, in _do_dispatch
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 88, in wrapped
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher payload)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 71, in wrapped
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return f(self, context, *args, **kw)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 298, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher pass
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 284, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 348, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 326, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher kwargs['instance'], e, sys.exc_info())
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 314, in decorated_function
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return function(self, context, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2510, in terminate_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher do_terminate_instance(instance, bdms)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py", line 272, in inner
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher return f(*args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2508, in do_terminate_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._set_instance_error_state(context, instance)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2498, in do_terminate_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._delete_instance(context, instance, bdms, quotas)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/hooks.py", line 131, in inner
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher rv = f(*args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2467, in _delete_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher quotas.rollback()
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2444, in _delete_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._shutdown_instance(context, instance, bdms)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2374, in _shutdown_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher requested_networks)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2363, in _shutdown_instance
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher block_device_info)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1054, in destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher self._destroy(instance)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1010, in _destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher instance=instance)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(self.type_, self.value, self.tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 979, in _destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher virt_dom.destroy()
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher rv = execute(f, *args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher six.reraise(c, e, tb)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher rv = meth(*args, **kwargs)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/libvirt.py", line 918, in destroy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher if ret == -1: raise libvirtError ('virDomainDestroy() failed', dom=self)
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher libvirtError: Failed to terminate process 15995 with SIGKILL: Device or resource busy
2015-06-09 20:48:26.679 18667 TRACE oslo.messaging.rpc.dispatcher

The following also appears very often; I am not sure if it means anything:

2015-06-09 23:02:18 INFO     Unable to import guestfs, falling back to VFSLocalFS
2015-06-09 23:02:17 INFO     Compute_service record updated for ('node-193', 'node-193.domain.tld')
2015-06-09 23:02:17 INFO     Creating image
2015-06-09 23:02:17 WARNING  No network configured!
2015-06-09 23:02:17 INFO     Compute_service record updated for ('node-193', 'node-193.domain.tld')
2015-06-09 23:02:17 INFO     Compute_service record updated for ('node-193', 'node-193.domain.tld')
2015-06-09 23:02:17 WARNING  Claim successful
2015-06-09 23:02:17 WARNING  disk limit: 5529.00 GB, free: 5329.00 GB
2015-06-09 23:02:17 WARNING  Total disk: 5529 GB, used: 200.00 GB
2015-06-09 23:02:17 WARNING  memory limit: 96677.00 MB, free: 75685.00 MB
2015-06-09 23:02:17 WARNING  Total memory: 96677 MB, used: 20992.00 MB
2015-06-09 23:02:17 WARNING  Attempting claim: memory 4096 MB, disk 40 GB
2015-06-09 23:02:17 WARNING  Starting instance...
2015-06-09 23:02:17 INFO     Compute_service record updated for node-193:node-193.domain.tld
2015-06-09 23:02:17 INFO     Compute_service record updated for ('node-193', 'node-193.domain.tld')

2015-06-09 23:52:14 ERR

Error from last host: node-198 (node node-198.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance f52e67b9-720c-4438-b665-08c634f7fdcd was re-scheduled: Cannot write data: Broken pipe\n']

2015-06-09 23:37:37 ERR

Error from last host: node-158 (node node-158.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance 4b7c3a2f-8592-4910-aa94-ac0f093535b3 was re-scheduled: [Errno 5] Input/output error: '/var/lib/nova/instances/4b7c3a2f-8592-4910-aa94-ac0f093535b3'\n"]

2015-06-09 23:37:35 ERR

Error from last host: node-158 (node node-158.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance 71993a19-a08c-4f39-91db-0913e302b0ff was re-scheduled: [Errno 5] Input/output error: '/var/lib/nova/instances/71993a19-a08c-4f39-91db-0913e302b0ff'\n"]

2015-06-09 23:30:14 ERR

Error from last host: node-191 (node node-191.domain.tld): [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2014, in do_build_and_run_instance\n filter_properties)\n', u' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2149, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance c8237bb2-34a0-4ed9-aac3-b86ee68e019a was re-scheduled: Unexpected error while running command.\nCommand: qemu-img create -f qcow2 -o backing_file=/var/lib/nova/instances/_base/168203d2c86483edf54feb5a31fa52931fd0afeb /var/lib/nova/instances/c8237bb2-34a0-4ed9-aac3-b86ee68e019a/disk\nExit code: 1\nStdout: u"Formatting \'/var/lib/nova/instances/c8237bb2-34a0-4ed9-aac3-b86ee68e019a/disk\', fmt=qcow2 size=2361393152 backing_file=\'/var/lib/nova/instances/_base/168203d2c86483edf54feb5a31fa52931fd0afeb\' encryption=off cluster_size=65536 lazy_refcounts=off \\n"\nStderr: u\'qemu-img: /var/lib/nova/instances/c8237bb2-34a0-4ed9-aac3-b86ee68e019a/disk: Could not write qcow2 header: Input/output error\\n\'\n']

Hugh Ma (hughkma)
description: updated
Hugh Ma (hughkma)
description: updated
Changed in fuel:
milestone: none → 6.0-updates
assignee: nobody → MOS Core Components (mos-core-components)
importance: Undecided → High
Dmitry Mescheryakov (dmitrymex) wrote:

I see two possible reasons for the nodes going down:
 * either you are overestimating the total capacity of your cloud,
 * or the default overcommit ratios are too high in your case, and the scheduler ends up allocating too many instances on some compute nodes.

To check the latter case, try reducing the default overcommit ratios and rerunning the test. The parameters are named cpu_allocation_ratio and ram_allocation_ratio and can be found in /etc/nova/nova.conf on the controllers (see the sketch after the links below). You can read more about overcommitting and Nova configuration parameters in these docs:
http://docs.openstack.org/openstack-ops/content/compute_nodes.html#overcommit
http://docs.openstack.org/juno/config-reference/content/list-of-compute-config-options.html
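
For example, a minimal sketch of lowering the ratios (the values below are illustrative; the Juno defaults are cpu_allocation_ratio=16.0 and ram_allocation_ratio=1.5):

    # /etc/nova/nova.conf on the controllers
    [DEFAULT]
    cpu_allocation_ratio = 4.0     # illustrative value, well below the 16.0 default
    ram_allocation_ratio = 1.0     # no RAM overcommit; the default is 1.5

After changing these, restart the nova-scheduler service on the controllers so the scheduler filters pick up the new ratios, then rerun the Rally test.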

If that does not help, try monitoring the compute nodes during the tests to understand what causes them to hang (CPU/memory exhaustion, something else?). The errors you observe in the OpenStack logs seem to be a consequence of the nodes hanging, not the cause.
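
For instance, a rough way to capture what a compute node is doing right before it hangs (assuming the sysstat package is installed; the log paths are just examples):

    # start lightweight collectors on each compute node before the Rally run
    sar -u -r 5 > /var/tmp/sar_$(hostname).log 2>&1 &
    vmstat 5 > /var/tmp/vmstat_$(hostname).log 2>&1 &
    # after a node hangs or is force-rebooted, look for OOM-killer or hung-task messages
    grep -iE 'out of memory|killed process|hung task' /var/log/kern.log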

Changed in fuel:
status: New → Incomplete
Vitaly Sedelnik (vsedelnik) wrote:

Setting to Invalid as there has been no response from the reporter for a month.

Changed in fuel:
status: Incomplete → Invalid
Changed in fuel:
milestone: 6.0-updates → 6.0.1
Colin Watson (cjwatson)
Changed in fuel:
assignee: Registry Administrators (registry) → nobody