Nova scheduler randomly fails to schedule CPU-pinned instance flavors with hugepages; failures increase as the running instance count grows

Bug #1738501 reported by Trygve Vea
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

Description
===========
Isolated to a single hypervisor.
The Nova scheduler randomly fails to schedule CPU-pinned instance flavors with hugepages; failures increase as the running instance count grows.

Steps to reproduce
==================

1) Hypervisor with two NUMA nodes: 2x Intel Xeon Gold 6126, 256 GB RAM (128 GB per NUMA node), 61440 2M hugepages per node. The hypervisor runs nothing other than OpenStack. One possible way to preallocate the pages is sketched below.
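
For reference, one way to preallocate 61440 2 MiB pages on each NUMA node is via sysfs; the reporter's exact method (kernel command line vs. sysfs) is not stated, so treat this as an assumption:

  # Assumed setup, not from the report: preallocate 2M hugepages per node
  echo 61440 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  echo 61440 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages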

2) Flavor specified with (a CLI sketch follows the list):
 - 4 vCPUs
 - 20480 MB RAM
 - hw:cpu_policy dedicated
 - hw:cpu_thread_policy require
 - hw:mem_page_size 2MB
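
A CLI sketch of such a flavor; the flavor name and disk size are placeholders, not taken from the report:

  # Hypothetical flavor name and disk size
  openstack flavor create --vcpus 4 --ram 20480 --disk 20 m1.pinned.hugepages
  openstack flavor set m1.pinned.hugepages \
    --property hw:cpu_policy=dedicated \
    --property hw:cpu_thread_policy=require \
    --property hw:mem_page_size=2MB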

3) Try to schedule 12 instances of the mentioned flavor (example command below).
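
For example, with the legacy nova client (image and network are placeholders):

  # Boot all 12 instances in one request; <image> and <net-uuid> are placeholders
  nova boot --flavor m1.pinned.hugepages --image <image> \
    --nic net-id=<net-uuid> --min-count 12 --max-count 12 hp-test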

Expected result
===============

12 instances running on the hypervisor, neatly packed, using up all hugepages.
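
(Arithmetic check: 12 instances × 20480 MB = 245760 MB = 122880 pages × 2 MiB = 2 × 61440 pages, exactly the hugepages configured across both NUMA nodes.)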

Actual result
=============

NUMA node 0 is full, while NUMA node 1 ends up with only 2-3 instances. The exact count varies from attempt to attempt.
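
One way to confirm that node 1 still has free hugepages while the scheduler refuses to place instances there (an assumed diagnostic, not part of the original report):

  # Per-node free 2M hugepages; node 1 should still show plenty free
  cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/free_hugepages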

Workaround
==========

Leave all running instances as they are and keep scheduling more until the desired number of instances has been created successfully. (It took 32 create attempts to fill all 12 slots for me.)
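
A sketch of that retry loop, assuming the hypothetical flavor name from above and placeholder image/network values:

  # Keep creating servers until 12 of this flavor are ACTIVE;
  # failed builds land in ERROR and can be cleaned up between attempts.
  while [ "$(openstack server list --flavor m1.pinned.hugepages \
             -f value -c Status | grep -c ACTIVE)" -lt 12 ]; do
    openstack server create --flavor m1.pinned.hugepages --image <image> \
      --nic net-id=<net-uuid> --wait "hp-$(date +%s)" || true
  done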

The problem does not occur if hugepages are disabled on both the flavor and the hypervisor.
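
For instance (again using the hypothetical flavor name):

  # Drop the hugepage request from the flavor...
  openstack flavor unset --property hw:mem_page_size m1.pinned.hugepages
  # ...and release the hypervisor's preallocated pages
  echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  echo 0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages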

Environment
===========
Running OpenStack Ocata, RDO packages on CentOS 7.4.
Linux 3.10.0-514.10.2.el7.x86_64
nova 15.0.7

Compute:
openstack-nova-compute-15.0.7-1.el7.noarch

Ctrl:
openstack-nova-conductor-15.0.7-1.el7.noarch
python2-novaclient-7.1.2-1.el7.noarch
python-nova-15.0.7-1.el7.noarch
openstack-nova-novncproxy-15.0.7-1.el7.noarch
openstack-nova-placement-api-15.0.7-1.el7.noarch
openstack-nova-common-15.0.7-1.el7.noarch
openstack-nova-api-15.0.7-1.el7.noarch
openstack-nova-scheduler-15.0.7-1.el7.noarch
openstack-nova-console-15.0.7-1.el7.noarch

Using libvirt + KVM

libvirt 3.2.0-14.el7_4 (ev)
qemu 2.9.0-16.el7_4 (ev)

Storage is pure qcow2 on /var/lib/nova

Neutron with linuxbridge-agent for networking.

Tags: libvirt numa
David Hill (david-hill-ubisoft) wrote:

Quick question here: does it solve the problem if you set a count lower than or equal to the number of computes that have available resources? I.e., you have 6 computes with available resources and you use "--max-count 6" or "--count 6"?

Trygve Vea (trygve-vea-gmail) wrote:

No, that didn't help at the time. I remember that when there was a single slot left, I had to retry multiple times; eventually it did work.

To the extent I'm unsure about this: I know I tested on a 2x6126 system, which has 48 threads spread over two sockets. If at least one VM had built successfully on every attempt, I would have needed only 12 tries to fill all the slots, not the 32 stated above.

I don't have any servers available for testing right now, but we're considering using VPP for networking, which does require hugepages on the VM flavors.
