Optimization of VMs by NUMA

Introduction to CPU & NUMA architecture

The main task of every CPU is to process data. A common misconception is that the faster the CPU (or the more CPUs I allocate), the faster the data will be processed. Unfortunately it is not quite that simple: before the CPU can process the data, it must be read from the much slower system RAM, and that latency can slow the CPU down. To minimize the time the CPU spends waiting for data, CPU architectures include on-chip memory caches (local RAM) that are much faster than system RAM (access is up to 95% faster).


When the CPU reads from system RAM, the data is transferred along a bus shared by all the CPUs in the system. As the number of CPUs in a system increases, the traffic along that bus increases as well, and CPUs can end up contending with each other for access to RAM. This is where NUMA comes in: NUMA is designed to minimize system bus contention by increasing the number of paths between CPU and RAM.

NUMA (Non-Uniform Memory Access) breaks a system up into nodes of associated CPUs and local RAM. NUMA nodes are optimized so that the CPUs in a node preferentially use the local RAM within that node. The result is that CPUs typically contend only with the other CPUs in their NUMA node for access to RAM, rather than with all the CPUs in the system.

As an example, consider a system with 4 processor sockets, each with 4 cores, and 128 GB RAM in total. Without NUMA that comes to 16 physical processors that could potentially be queued up on the same system bus to access 128 GB RAM. If this same system were broken up into 4 NUMA nodes, each node would have 4 CPUs with local access to 32 GB RAM.

16 pCPU / 4 NUMA nodes = 4 pCPU per NUMA node

128 GB RAM / 4 NUMA nodes = 32 GB RAM per NUMA node
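
For readers who prefer code to mental arithmetic, the same per-node maths can be written as a few lines of Python. This is only an illustrative sketch using the example figures from above (4 sockets, 4 cores each, 128 GB RAM, assumed to expose one NUMA node per socket); nothing here queries real hardware.

```python
# Example system from above: 4 sockets x 4 cores, 128 GB RAM,
# assumed to expose one NUMA node per socket.
sockets = 4
cores_per_socket = 4
total_ram_gb = 128
numa_nodes = sockets  # assumption: one NUMA node per socket

pcpus = sockets * cores_per_socket            # 16 pCPU
pcpus_per_node = pcpus // numa_nodes          # 4 pCPU per NUMA node
ram_per_node_gb = total_ram_gb // numa_nodes  # 32 GB RAM per NUMA node

print(f"{pcpus_per_node} pCPU and {ram_per_node_gb} GB RAM per NUMA node")
```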


NUMA in virtual environments

The NUMA node size should be considered when large server workloads such as Exchange, SQL, or Citrix workers are virtualized. There are several vendor studies on NUMA awareness; in connection with XenApp, for example, user density could be increased by about 25% through NUMA awareness.

The following rules apply to NUMA in virtual environments:

  1. The number of virtual CPUs (vCPUs) for a VM is less than or equal to the number of physical CPUs (pCPUs) in the NUMA node.
    The hypervisor assigns the VM to a home NUMA node, whose memory and pCPUs are preferentially used. Best practice in this case is to keep the allocated VM memory below the NUMA node memory.
  2. The number of vCPUs for a VM is greater than the number of pCPUs in the NUMA node (“Wide VMs”).
    Wide VMs are split into multiple NUMA clients, with each client assigned a different home NUMA node. For example, if a system has multiple NUMA nodes of 1 socket with 4 cores each (4 pCPUs/node) and a wide VM has 8 vCPUs, the hypervisor can divide the VM into 2 NUMA clients with 4 vCPUs each, assigned to 2 different home NUMA nodes (see the sketch below). The problem with dividing a wide VM into multiple NUMA clients is that one client may need to access memory that belongs to a different NUMA client's node.
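
As a rough illustration of how a wide VM ends up with several home nodes, here is a small Python sketch of the splitting idea. The function is hypothetical and does not correspond to any hypervisor API; real schedulers (ESXi, XenServer, Hyper-V) apply their own placement and balancing policies on top of this basic idea.

```python
import math

def split_wide_vm(vcpus: int, pcpus_per_node: int) -> list[int]:
    """Split a VM's vCPUs into NUMA clients no larger than one NUMA node.

    Hypothetical illustration only; not a real hypervisor function.
    """
    clients = math.ceil(vcpus / pcpus_per_node)
    base = vcpus // clients
    remainder = vcpus % clients
    # Distribute the vCPUs as evenly as possible across the NUMA clients.
    return [base + (1 if i < remainder else 0) for i in range(clients)]

# Example from the text: an 8-vCPU VM on 4-pCPU NUMA nodes -> two 4-vCPU clients.
print(split_wide_vm(8, 4))   # [4, 4]
print(split_wide_vm(10, 4))  # [4, 3, 3]
```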

Above I wrote “NUMA node size”, so what did I mean by that? Believe it or not, not all Intel chips are created equal, and not all sockets have only one underlying NUMA node. So the obvious question is which NUMA configuration the purchased hardware actually uses. There are tools such as Coreinfo for Windows operating systems, or commands that can be executed directly on the hypervisor.
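
If you have a Linux-based host or guest at hand, the same topology information can also be read from the standard Linux sysfs interface. This is only a minimal sketch assuming that interface is present; it is not a replacement for Coreinfo or the hypervisor's own commands.

```python
from pathlib import Path

# Minimal sketch: list NUMA nodes and their CPUs via Linux sysfs
# (assumes /sys/devices/system/node exists, i.e. a Linux host or guest).
for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    print(f"{node.name}: CPUs {cpulist}")
```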

Or, if you don't like the CLI, this information can also be obtained directly from the hardware manufacturer (white papers). From experience, however, I can say that older sockets are almost always split into multiple NUMA nodes, while newer Intel chips are split less often or not at all. This is one of the reasons why 2 vCPUs used to be recommended for XenApp workers.

2 sockets with 4 cores = 8 pCPU

8 pCPU / 4 NUMA nodes = 2 pCPU per NUMA node

The calculation above shows the optimal size to prevent NUMA thrashing (accessing resources of a foreign NUMA node). Newer hardware has fewer NUMA nodes per socket, so the sweet spot (4-8 vCPUs) has moved up.

2 sockets with 4 cores = 8 pCPU

8 pCPU / 1 NUMA node = 8 pCPU per NUMA node

In order to size a Citrix worker correctly, CPU over-subscription must also be considered. A few years ago I always calculated with a 1.5x over-subscription ratio, but I have had to adapt this to more modern hardware (over roughly the last 2-3 years). For some time now my ratio has therefore been 2.0x over-subscription. In several internal tests with LoginVSI and with real workloads in production environments, I have found this to be the optimal sweet spot in terms of user density per host.

Here is an example of sizing a Windows Server 2016 VDA worker environment with all the important factors. The hypervisor hosts are equipped with 2 sockets, each with 20 cores, and present 2 NUMA nodes (Hyper-Threading enabled).

2 sockets * 20 cores = 40 pCPU

40 pCPU / 2 NUMA nodes = 20 pCPU per NUMA node

20 pCPU * 2.0 CPU over-subscription = 40 vCPU

40 vCPU / 5 workers = 8 vCPU per worker

5 workers * 2 NUMA nodes = 10 workers per host
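
Putting the whole calculation into one small helper makes it easy to re-run the sizing for other hardware. The 2.0x over-subscription ratio and the 8 vCPUs per worker are simply the assumptions from the example above, not fixed constants.

```python
def workers_per_host(sockets: int, cores_per_socket: int, numa_nodes: int,
                     oversubscription: float, vcpus_per_worker: int) -> int:
    """Sizing sketch following the example above (not a general-purpose tool)."""
    pcpus = sockets * cores_per_socket                       # 2 * 20   = 40 pCPU
    pcpus_per_node = pcpus // numa_nodes                     # 40 / 2   = 20 pCPU per node
    vcpus_per_node = int(pcpus_per_node * oversubscription)  # 20 * 2.0 = 40 vCPU per node
    workers_per_node = vcpus_per_node // vcpus_per_worker    # 40 / 8   = 5 workers per node
    return workers_per_node * numa_nodes                     # 5 * 2    = 10 workers per host

print(workers_per_host(sockets=2, cores_per_socket=20, numa_nodes=2,
                       oversubscription=2.0, vcpus_per_worker=8))  # -> 10
```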