Driving in the Fast Lane – CPU Pinning and NUMA Topology Awareness in OpenStack Compute

The OpenStack Kilo release, extending upon efforts that commenced during the Juno cycle, includes a number of key enhancements aimed at improving guest performance. These enhancements allow OpenStack Compute (Nova) to have greater knowledge of compute host layout and as a result make smarter scheduling and placement decisions when launching instances. Administrators wishing to take advantage of these features can now create customized performance flavors to target specialized workloads including Network Function Virtualization (NFV) and High Performance Computing (HPC).

What is NUMA topology?

Historically, all memory on x86 systems was equally accessible to all CPUs in the system. This resulted in memory access times that were the same regardless of which CPU in the system was performing the operation and was referred to as Uniform Memory Access (UMA).

In modern multi-socket x86 systems, system memory is divided into zones (called cells or nodes) and associated with particular CPUs. This type of division has been key to the increasing performance of modern systems as focus has shifted from increasing clock speeds to adding more CPU sockets, cores, and – where available – threads. An interconnect bus provides connections between nodes, so that all CPUs can still access all memory. While the interconnect provides substantial bandwidth, it can still be overwhelmed by concurrent cross-node traffic from many nodes. The end result is that while NUMA facilitates faster memory access for CPUs local to the memory being accessed, memory access for remote CPUs is slower.

Newer motherboard chipsets expand on this concept by also providing NUMA style division of PCIe I/O lanes between CPUs. On such systems workloads receive a performance boost not only when their memory is local to the CPU on which they are running but when the I/O devices they use are too, and (relative) degradation where this is not the case. We’ll be coming back to this topic in a later post in this series.

By way of example, running numactl --hardware on a Red Hat Enterprise Linux 7 system allows me to examine the NUMA layout of its hardware:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 8191 MB
node 0 free: 6435 MB
node 1 cpus: 4 5 6 7
node 1 size: 8192 MB
node 1 free: 6634 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

The output tells me that this system has two NUMA nodes, node 0 and node 1. Each node has 4 CPU cores and 8 GB of RAM associated with it. The output also shows the relative “distances” between nodes; this becomes important in more complex NUMA topologies where different interconnect layouts connect the nodes together.
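For scripting purposes the same information can be pulled out of the numactl output programmatically. The following is a rough sketch (the parse_numactl helper is my own, not part of any standard tooling) of how the node and distance data might be extracted in Python:

```python
import re

def parse_numactl(output):
    """Parse `numactl --hardware` output into per-node CPU lists and a distance matrix."""
    nodes = {}
    distances = {}
    lines = output.strip().splitlines()
    for i, line in enumerate(lines):
        m = re.match(r'node (\d+) cpus: (.*)', line)
        if m:
            nodes[int(m.group(1))] = [int(c) for c in m.group(2).split()]
        if line.startswith('node distances:'):
            # The next line is the header row of node ids; the rows after
            # it hold the distance values for each source node.
            for row in lines[i + 2:]:
                parts = row.replace(':', '').split()
                distances[int(parts[0])] = [int(d) for d in parts[1:]]
    return nodes, distances

sample = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 8191 MB
node 1 cpus: 4 5 6 7
node 1 size: 8192 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
"""

nodes, distances = parse_numactl(sample)
print(nodes[0])         # CPUs in node 0
print(distances[0][1])  # relative cost of node 0 accessing node 1 memory
```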

Modern operating systems endeavour to take the NUMA topology of the system into account, providing additional services like numad that monitor system resource usage and dynamically adjust process and memory placement to ensure both are located optimally for best performance.

How does this apply to virtualization?

When running a guest operating system in a virtual machine there are actually two NUMA topologies involved, that of the physical hardware of the host and that of the virtual hardware exposed to the guest operating system. The host operating system and associated utilities are aware of the host’s NUMA topology and will optimize accordingly, but by exposing a NUMA topology to the guest that aligns with that of the physical hardware it is running on we can also assist the guest operating system to do the same.

Libvirt provides extensive options for tuning guests to take advantage of the host's NUMA topology: among other things, pinning virtual CPUs to physical CPUs, pinning the emulator threads associated with the guest to physical CPUs, and tuning guest memory allocation policies both for normal memory (4 KB pages) and huge pages (2 MB or 1 GB pages). Running the virsh capabilities command, which displays the capabilities of the host, on the same host used in the earlier example yields a wide range of information, but in particular we’re interested in the <topology> section:

# virsh capabilities
          <cells num='2'>
            <cell id='0'>
              <memory unit='KiB'>4193872</memory>
              <pages unit='KiB' size='4'>1048468</pages>
              <pages unit='KiB' size='2048'>0</pages>
              <distances>
                <sibling id='0' value='10'/>
                <sibling id='1' value='20'/>
              </distances>
              <cpus num='4'>
                <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
                <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
                <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
                <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
              </cpus>
            </cell>
            <cell id='1'>
              <memory unit='KiB'>4194304</memory>
              <pages unit='KiB' size='4'>1048576</pages>
              <pages unit='KiB' size='2048'>0</pages>
              <distances>
                <sibling id='0' value='20'/>
                <sibling id='1' value='10'/>
              </distances>
              <cpus num='4'>
                <cpu id='4' socket_id='1' core_id='0' siblings='4'/>
                <cpu id='5' socket_id='1' core_id='1' siblings='5'/>
                <cpu id='6' socket_id='1' core_id='2' siblings='6'/>
                <cpu id='7' socket_id='1' core_id='3' siblings='7'/>
              </cpus>
            </cell>
          </cells>

The NUMA nodes are each represented by a <cell> entry, which lists the CPUs available within the node, the memory available within the node (including the page sizes), and the distance between the node and its siblings. This is all crucial information for OpenStack Compute to have access to when scheduling and building guest virtual machine instances for optimal placement.
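Nova consumes this topology information via the Libvirt API, but the same data can be inspected by hand. As a rough sketch (the abbreviated CAPS excerpt and the cell_summary helper are purely illustrative, not OpenStack code), the cells can be parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of `virsh capabilities` output; the real document
# contains many more elements.
CAPS = """
<topology>
  <cells num='2'>
    <cell id='0'>
      <memory unit='KiB'>4193872</memory>
      <cpus num='4'>
        <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
        <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
        <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
        <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
      </cpus>
    </cell>
    <cell id='1'>
      <memory unit='KiB'>4194304</memory>
      <cpus num='4'>
        <cpu id='4' socket_id='1' core_id='0' siblings='4'/>
        <cpu id='5' socket_id='1' core_id='1' siblings='5'/>
        <cpu id='6' socket_id='1' core_id='2' siblings='6'/>
        <cpu id='7' socket_id='1' core_id='3' siblings='7'/>
      </cpus>
    </cell>
  </cells>
</topology>
"""

def cell_summary(caps_xml):
    """Return {cell id: (memory in KiB, [cpu ids])} from a capabilities excerpt."""
    root = ET.fromstring(caps_xml)
    summary = {}
    for cell in root.iter('cell'):
        cpus = [int(c.get('id')) for c in cell.iter('cpu')]
        mem = int(cell.find('memory').text)
        summary[int(cell.get('id'))] = (mem, cpus)
    return summary

print(cell_summary(CAPS))
```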

CPU Pinning in OpenStack

Today we will be configuring an OpenStack Compute environment to support the pinning of virtual machine instances to dedicated physical CPU cores. To facilitate this we will walk through the process of:

  • Reserving dedicated cores on the compute host(s) for host processes, preventing host processes and guest virtual machine instances from contending for the same CPU cores;
  • Reserving dedicated cores on the compute host(s) for the virtual machine instances themselves;
  • Enabling the required scheduler filters;
  • Creating a host aggregate to add all hosts configured for CPU pinning to;
  • Creating a performance focused flavor to target this host aggregate; and
  • Launching an instance with CPU pinning!

Finally we will take a look at the Libvirt XML of the resulting guest to examine how the changes made impact the way the guest is created on the host.

For my demonstration platform I will be using Red Hat Enterprise Linux OpenStack Platform 6 which, while itself based on the OpenStack “Juno” code base, includes backports to add the features referred to in this post. You can obtain an evaluation copy, or try out the Kilo-based packages currently being released by the RDO community project.

Compute Node Configuration

For the purposes of this deployment I am using a small environment with a single controller node and two compute nodes, set up using PackStack. The controller node hosts the OpenStack API services, databases, message queues, and the scheduler. The compute nodes run the Compute agent, Libvirt, and other components required to actually launch KVM virtual machines.

The hosts being used for my demonstration have eight CPU cores, numbered 0-7, spread across two NUMA nodes. NUMA node 0 contains CPU cores 0-3 while NUMA node 1 contains CPU cores 4-7. For the purposes of demonstration I am going to reserve two cores for host processes on each NUMA node – cores 0, 1, 4, and 5.

In a real deployment the number of processor cores to reserve for host processes will vary depending on the observed performance of the host under the typical workloads present in the environment, and should be adjusted accordingly.

The remaining four CPU cores – cores 2, 3, 6, and 7 – will be removed from the pool used by the general kernel balancing and scheduling algorithms to place processes and isolated specifically for use when placing guest virtual machine instances. This is done by using the isolcpus kernel argument.

In this example I will be using all of these isolated cores for guests; in some deployments it may be desirable to instead dedicate one or more of these cores to an intensive host process, for example a virtual switch, by manually pinning it to an isolated CPU core as well.

                   Node 0            Node 1
  Host Processes   Core 0, Core 1    Core 4, Core 5
  Guests           Core 2, Core 3    Core 6, Core 7
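The split described above can also be expressed as a small helper; this is a sketch under the assumption that the first cores of each node are reserved for the host (split_cores is illustrative only, not part of OpenStack):

```python
def split_cores(numa_nodes, reserve_per_node=2):
    """Given {node: [core ids]}, reserve the first N cores of each node for
    host processes and dedicate the remainder to guest instances.
    Returns (host_cores, guest_cores) as sorted lists."""
    host, guest = [], []
    for cores in numa_nodes.values():
        host.extend(cores[:reserve_per_node])
        guest.extend(cores[reserve_per_node:])
    return sorted(host), sorted(guest)

# The example host: node 0 holds cores 0-3, node 1 holds cores 4-7.
numa_nodes = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
host_cores, guest_cores = split_cores(numa_nodes)
print(','.join(map(str, guest_cores)))  # value for vcpu_pin_set and isolcpus
```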

On each compute node where pinning of virtual machines is to be permitted, open the /etc/nova/nova.conf file and make the following modifications:

  • Set the vcpu_pin_set value to a list or range of physical CPU cores to reserve for virtual machine processes. OpenStack Compute will ensure that guest virtual machine instances run only on these CPU cores. Using my example host I will reserve two cores in each NUMA node – note that you can also specify ranges, e.g. 2-3,6-7:
    • vcpu_pin_set=2,3,6,7
  • Set the reserved_host_memory_mb to reserve RAM for host processes. For the purposes of testing I am going to use the default of 512 MB:
    • reserved_host_memory_mb=512
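Taken together, the additions land in the [DEFAULT] section of /etc/nova/nova.conf on each pinning-capable compute node; this fragment is a sketch of my example values, not a complete configuration file:

```ini
[DEFAULT]
# Physical CPU cores available for guest vCPUs
vcpu_pin_set=2,3,6,7
# RAM (MB) held back from guests for host processes
reserved_host_memory_mb=512
```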

Once these changes to the Compute configuration have been made, restart the Compute agent on each host:

# systemctl restart openstack-nova-compute.service

At this point if we created a guest we would already see some changes in the XML, restricting the guest vCPU(s) to the cores listed in vcpu_pin_set:

<vcpu placement='static' cpuset='2-3,6-7'>1</vcpu>

Now that we have set up the guest virtual machine instances so that they will only be allowed to run on cores 2, 3, 6, and 7, we must also set up the host processes so that they will not run on these cores, restricting them instead to cores 0, 1, 4, and 5. To do this we must set the isolcpus kernel argument; adding this requires editing the system’s boot configuration.

On the Red Hat Enterprise Linux 7 systems used in this example this is done using grubby to edit the configuration:

# grubby --update-kernel=ALL --args="isolcpus=2,3,6,7"

We must then run grub2-install <device> to update the boot record. Be sure to specify the correct boot device for your system! In my case the correct device is /dev/sda:

# grub2-install /dev/sda

The resulting kernel command line used for future boots of the system to isolate cores 2, 3, 6, and 7 will look similar to this:

linux16 /vmlinuz-3.10.0-229.1.2.el7.x86_64 root=/dev/mapper/rhel-root ro rd.lvm.lv=rhel/root crashkernel=auto  rd.lvm.lv=rhel/swap vconsole.font=latarcyrheb-sun16 vconsole.keymap=us rhgb quiet LANG=en_US.UTF-8 isolcpus=2,3,6,7

Remember, these are the cores we want reserved for guest virtual machine instances; isolating them keeps general host processes off them. After running grub2-install, reboot the system to pick up the configuration changes.

Scheduler Configuration

On each node where the OpenStack Compute scheduler (openstack-nova-scheduler) runs, edit /etc/nova/nova.conf. Add the AggregateInstanceExtraSpecsFilter and NUMATopologyFilter values to the list of scheduler_default_filters. These filters are used to segregate the compute nodes that can be used for CPU pinning from those that can not, and to apply NUMA-aware scheduling rules when launching instances:

scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,AggregateInstanceExtraSpecsFilter,NUMATopologyFilter
Once the change has been applied, restart the openstack-nova-scheduler service:

# systemctl restart openstack-nova-scheduler.service

This ensures the configuration changes are applied and the newly added scheduler filters take effect.

Final Preparation

We are now very close to being able to launch virtual machine instances marked for dedicated compute resources and pinned to physical resources accordingly. Perform the following steps on a system with the OpenStack Compute command-line interface installed and with your OpenStack credentials loaded.

Create the performance host aggregate for hosts that will receive pinning requests:

$ nova aggregate-create performance
| Id | Name        | Availability Zone | Hosts | Metadata |
| 1  | performance | -                 |       |          |

Set metadata on the performance aggregate; this will be used to match the flavor we create shortly. Here we are using the arbitrary key pinned and setting it to true:

$ nova aggregate-set-metadata 1 pinned=true
Metadata has been successfully updated for aggregate 1.
| Id | Name        | Availability Zone | Hosts | Metadata      |
| 1  | performance | -                 |       | 'pinned=true' |

Create the normal aggregate for all other hosts:

$ nova aggregate-create normal
| Id | Name   | Availability Zone | Hosts | Metadata |
| 2  | normal | -                 |       |          |

Set metadata on the normal aggregate; this will be used to match all existing ‘normal’ flavors. Here we are using the same key as before and setting it to false:

$ nova aggregate-set-metadata 2 pinned=false
Metadata has been successfully updated for aggregate 2.
| Id | Name   | Availability Zone | Hosts | Metadata       |
| 2  | normal | -                 |       | 'pinned=false' |

Before creating the new flavor for performance intensive instances update all existing flavors so that their extra specifications match them to the compute hosts in the normal aggregate:

$ for FLAVOR in `nova flavor-list | cut -f 2 -d ' ' | grep -o "[0-9]*"`; \
     do nova flavor-key ${FLAVOR} set \
             "aggregate_instance_extra_specs:pinned"="false"; \
   done

Create a new flavor for performance intensive instances. Here we are creating the m1.small.performance flavor, based on the values used in the existing m1.small flavor. The differences in behaviour between the two will be the result of the metadata we add to the new flavor shortly.

$ nova flavor-create m1.small.performance 6 2048 20 2
| ID | Name                 | Memory_MB | Disk | Ephemeral | Swap | VCPUs |
| 6  | m1.small.performance | 2048      | 20   | 0         |      | 2     |

Set the hw:cpu_policy flavor extra specification to dedicated. This denotes that all instances created using this flavor will require dedicated compute resources and be pinned accordingly.

$ nova flavor-key 6 set hw:cpu_policy=dedicated

Set the aggregate_instance_extra_specs:pinned flavor extra specification to true. This denotes that all instances created using this flavor will be sent to hosts in host aggregates with pinned=true in their aggregate metadata:

$ nova flavor-key 6 set aggregate_instance_extra_specs:pinned=true
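Conceptually, the AggregateInstanceExtraSpecsFilter admits a host only when the flavor's aggregate-scoped extra specs match the metadata of an aggregate the host belongs to. This sketch is my own simplification of the matching idea, not Nova's actual implementation (which also supports operators and other subtleties):

```python
def aggregate_matches(flavor_extra_specs, aggregate_metadata):
    """Every aggregate-scoped extra spec on the flavor must match the
    candidate host's aggregate metadata for the host to pass the filter."""
    prefix = 'aggregate_instance_extra_specs:'
    for key, value in flavor_extra_specs.items():
        if key.startswith(prefix):
            if aggregate_metadata.get(key[len(prefix):]) != value:
                return False
    return True

# Extra specs as set on m1.small.performance in this post.
perf_flavor = {'hw:cpu_policy': 'dedicated',
               'aggregate_instance_extra_specs:pinned': 'true'}
print(aggregate_matches(perf_flavor, {'pinned': 'true'}))   # performance aggregate
print(aggregate_matches(perf_flavor, {'pinned': 'false'}))  # normal aggregate
```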

Finally, we must add some hosts to our performance host aggregate. Hosts that are not intended to be targets for pinned instances should be added to the normal host aggregate:

$ nova aggregate-add-host 1 compute1.nova
Host compute1.nova has been successfully added for aggregate 1 
| Id | Name        | Availability Zone | Hosts          | Metadata      |
| 1  | performance | -                 | 'compute1.nova'| 'pinned=true' |
$ nova aggregate-add-host 2 compute2.nova
Host compute2.nova has been successfully added for aggregate 2
| Id | Name        | Availability Zone | Hosts          | Metadata      |
| 2  | normal      | -                 | 'compute2.nova'| 'pinned=false'|

Verifying the Configuration

Now that we’ve completed all the configuration, we need to verify that all is well with the world. First, we launch a guest using our newly created flavor:

$ nova boot --image rhel-guest-image-7.1-20150224 \
            --flavor m1.small.performance test-instance

Assuming the instance launches, we can verify where it was placed by checking the OS-EXT-SRV-ATTR:hypervisor_hostname attribute in the output of the nova show test-instance command. After logging into the returned hypervisor directly using SSH we can use the virsh tool, which is part of Libvirt, to extract the XML of the running guest:

# virsh list
 Id        Name                               State
 1         instance-00000001                  running
# virsh dumpxml instance-00000001

The resultant output will be quite long, but there are some key elements related to NUMA layout and vCPU pinning to focus on:

  • As you might expect, the vCPU placement for the 2 vCPUs remains static, though a cpuset range is no longer specified alongside it; instead the more specific placement definitions that appear later in the XML are used:
<vcpu placement='static'>2</vcpu>
  • The vcpupin and emulatorpin elements have been added. These pin the virtual machine instance’s vCPU cores and the associated emulator threads, respectively, to physical host CPU cores. In the current implementation the emulator threads are pinned to the union of all physical CPU cores associated with the guest (physical CPU cores 2-3).
<vcpupin vcpu='0' cpuset='2'/>
<vcpupin vcpu='1' cpuset='3'/>
<emulatorpin cpuset='2-3'/>
  • The numatune element, and the associated memory and memnode elements have been added – in this case resulting in the guest memory being strictly taken from node 0.
        <memory mode='strict' nodeset='0'/>
        <memnode cellid='0' mode='strict' nodeset='0'/>
  • The cpu element contains updated information about the NUMA topology exposed to the guest itself, the topology that the guest operating system will see:
        <topology sockets='2' cores='1' threads='1'/>
        <numa>
          <cell id='0' cpus='0-1' memory='2097152'/>
        </numa>

In this case the new flavor, and as a result the example guest, contains only two vCPUs, so the NUMA topology exposed is relatively simple. Nonetheless the guest will still benefit from the performance improvements available through the pinning of its virtual CPU and memory resources to dedicated physical ones. This of course comes at the cost of implicitly disabling overcommitting of these same resources; the scheduler handles this transparently when CPU pinning is being applied. This is a trade-off that needs to be carefully balanced depending on workload characteristics.

In future blog posts in this series we will use this same example installation to look at the how OpenStack Compute works when dealing with larger and more complex guest topologies, the use of large pages to back guest memory, and the impact of PCIe device locality for guests using SR-IOV networking functionality.

Want to learn more about OpenStack Compute or the Libvirt/KVM driver for it? Catch my OpenStack Compute 101 and Kilo Libvirt/KVM Driver Update presentations at OpenStack Summit Vancouver – May 18-22.

Update 2015-08-04: Eagle-eyed readers have asked how the CPU overcommit ratio, which defaults to 16.0 (that is, the scheduler treats each pCPU core as 16 vCPUs), intersects with the CPU pinning functionality described in this post. When dedicated resourcing (CPU pinning) is requested for a guest it is assumed there is *no* overcommit (or, more accurately, an overcommit ratio of 1.0). When dedicated resourcing is not requested for a guest the normal overcommit ratio set for the environment is applied. This is why it is currently recommended that host aggregates are used to separate guests with dedicated resourcing requirements from those without. This removes the chance of a guest that expects dedicated resources instead sharing resources via overcommit.
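To make the interaction concrete, a little arithmetic shows what the scheduler sees for four physical cores under each ratio. This is a sketch; schedulable_vcpus is an illustrative helper, and the ratio corresponds to the cpu_allocation_ratio setting in nova.conf:

```python
def schedulable_vcpus(physical_cores, cpu_allocation_ratio):
    """vCPU capacity the scheduler sees for a pool of physical cores."""
    return int(physical_cores * cpu_allocation_ratio)

# Four guest cores on a "normal" host with the default 16.0 overcommit ratio:
print(schedulable_vcpus(4, 16.0))  # 64
# The same four cores on a host dedicated to pinned guests (ratio 1.0):
print(schedulable_vcpus(4, 1.0))   # 4
```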