John Tucker

Jul 5, 2021

6 min read

Kubernetes CPU Resource Requests at Runtime

While it is well documented how CPU resource requests impact the scheduling of Pods onto Nodes, it is less clear what impact they have once Pods (and their Containers) are running on a Node.

When thinking about resource requests, one tends to focus on their impact on scheduling Pods onto Nodes.

When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.

— Kubernetes — Managing Resources for Containers
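As a quick way to see what the scheduler is working with (the node name below is a placeholder), one can inspect a Node's capacity, its allocatable resources, and the sum of requests already scheduled onto it:

    # Show the Node's Capacity, Allocatable, and "Allocated resources" sections
    # (node name is a placeholder)
    kubectl describe node my-node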

When thinking about resource limits, one tends to focus on their impact on Pods while they are running on a Node.

If the node where a Pod is running has enough of a resource available, it’s possible (and allowed) for a container to use more resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.

— Kubernetes — Managing Resources for Containers
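For reference, a minimal Pod spec declaring both a CPU request and a CPU limit might look like the following (the name, image, and values are placeholders):

    # Hypothetical Pod: requests 200m of CPU and is capped at 400m of CPU
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: example
    spec:
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: 200m
          limits:
            cpu: 400m
    EOF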

What about the impact of resource requests on Pods once they are running on a Node? In particular, what is the impact of CPU resource requests?

Please note: Up until now, I had been running under the assumption that resource requests are only relevant during scheduling; aside from limits, resources on a Node are the “wild west”.

The documentation on this is fairly sparse; the most that one will find is that it has something to do with Linux cgroups and CPUShares.

The CPUShares value provides tasks in a cgroup with a relative amount of CPU time. Once the system has mounted the cpu cgroup controller, you can use the file cpu.shares to define the number of shares allocated to the cgroup. CPU time is determined by dividing the cgroup’s CPUShares by the total number of defined CPUShares on the system.

— RedHat — How to manage cgroups with CPUShares
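As a small sketch of the mechanics (assuming a cgroup v1 host with the cpu controller mounted at /sys/fs/cgroup/cpu,cpuacct, and a hypothetical cgroup named demo):

    # Create a test cgroup and assign it 204 shares (cgroup v1, cpu controller)
    sudo mkdir -p /sys/fs/cgroup/cpu,cpuacct/demo
    echo 204 | sudo tee /sys/fs/cgroup/cpu,cpuacct/demo/cpu.shares
    # Under contention, this cgroup's CPU time is roughly
    # 204 / (sum of the sibling cgroups' cpu.shares)
    cat /sys/fs/cgroup/cpu,cpuacct/demo/cpu.shares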

By logging into a Node and examining its Linux cgroup configuration, one will indeed find that CPU resource requests are used to allocate CPU time to Pods’ Containers.

Please note: While it should not matter, the examples were run on GKE clusters running Kubernetes 1.19.

The particular scenario considered is a Node with 2 CPUs running three Pods (in addition to those from DaemonSets), each configured with a single Container with the following resource requests and limits:

  • besteffort: no resource requests or limits
  • burstable: a CPU resource request of 200m and no limits
  • guaranteed: CPU (and memory) resource requests and limits set equal, with a CPU request of 200m

Please note: The Pod names are consistent with the Pod’s quality of service (QoS) classes.
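The QoS class that Kubernetes assigned to each Pod can be confirmed directly (Pod names as above; the jsonpath reads the status.qosClass field):

    # Print each workload Pod's QoS class (BestEffort, Burstable, Guaranteed)
    for p in besteffort burstable guaranteed; do
      kubectl get pod "$p" -o jsonpath='{.status.qosClass}{"\n"}'
    done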

One can find the Node’s cgroup configuration in files (and subdirectories) in /sys/fs/cgroup/cpu,cpuacct/.
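For example (the paths below assume cgroup v1 and the cgroupfs cgroup driver and may differ on other configurations; the pod UID and container ID are placeholders), one can walk the kubepods hierarchy from a shell on the Node:

    # Pods are grouped by QoS class under the kubepods cgroup;
    # guaranteed Pods sit directly under kubepods/ as pod<uid>/ directories
    ls /sys/fs/cgroup/cpu,cpuacct/kubepods/
    # Read a Container's CPUShares (pod UID and container ID are placeholders)
    cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<uid>/<container-id>/cpu.shares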

Things to observe:

  • The root cgroup has 1024 CPUShares; regardless of its value, it represents 100% of the available CPU time
  • Each Container’s CPUShares value (its cgroup is named after the Container’s containerId) is derived from its CPU resource request; e.g., the Container in the burstable Pod has 204 CPUShares [204 = 200 * 1024 / 1000], a conversion sanity-checked below. Containers without a CPU resource request get the minimum of 2 CPUShares
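The millicores-to-shares conversion above can be checked with a bit of arithmetic:

    # 200m CPU request -> CPUShares, matching the observed value of 204
    echo $((200 * 1024 / 1000))   # prints 204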

We can draw some broad conclusions from this allocation of CPU time:

  • Containers of Pods with the besteffort QoS class are always allocated less than 1% of the available CPU time (see the rough arithmetic below)
  • Containers of the burstable and guaranteed Pods, each requesting 200m of CPU, are allocated roughly 8% of the available CPU time
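As a rough, purely illustrative calculation (the exact denominator depends on the full set of sibling cgroups and their shares): a besteffort grouping carries the minimum 2 CPUShares, so even against siblings totaling only about 1024 shares its slice is tiny:

    # Illustrative only: 2 shares competing with siblings totaling ~1024 shares
    awk 'BEGIN { printf "%.4f\n", 2 / (2 + 1024) }'   # ~0.0019, well under 1%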

To illustrate the impact of allocated CPU time, we can observe the CPU utilization of each of the three workloads when a load is applied to them on a Node with 2 CPUs.

Please note: In this example, the workloads and load Pods use spec.nodeName values to ensure the workload Pods are all scheduled to the same Node and the load Pods are on different Nodes.
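A brief way to confirm that placement (the Node name in the comment is a placeholder):

    # Each workload Pod sets spec.nodeName directly in its manifest, e.g.:
    #   spec:
    #     nodeName: my-workload-node   # placeholder Node name
    # Verify where the Pods actually landed (NODE column)
    kubectl get pods -o wide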

Things to observe:

  • While the workload’s Containers were allocated only a small share of CPU time (besteffort <1%, burstable / guaranteed 8% each), their actual CPU utilization exceeded their allocation; this is because, once each cgroup’s allocation has been satisfied, the remaining CPU time is available to be shared across the cgroups

Please note: One might have assumed that the remaining CPU time would be distributed amongst the cgroups based on their CPUShares; I thought I saw this documented somewhere, but the evidence does not support it.
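One way to watch this sharing in practice (assuming metrics-server is installed for kubectl top; the cgroup path and pod UID are placeholders):

    # Pod-level CPU usage as reported by metrics-server
    kubectl top pods
    # Or, on the Node, cumulative CPU time (in nanoseconds) per cgroup
    cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<uid>/cpuacct.usage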

That is about it; whew!