Author(s): Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar) Last Updated: 5/17/2016
Status: Implemented
This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.
This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they request. Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit which is the maximum amount that the system will allow the container to use. The system computes pod level requests and limits by summing up per-resource requests and limits across all containers. When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers. This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees. Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes.
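The pod-level summation described above can be sketched as follows. This is a minimal illustration in Python, not Kubernetes code: `Container` and `pod_totals` are hypothetical names, and quantities are assumed to be plain integers (CPU in millicores, memory in bytes) rather than Kubernetes quantity strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Container:
    """Hypothetical container spec: per-resource requests and limits."""
    name: str
    requests: Dict[str, int] = field(default_factory=dict)
    limits: Dict[str, int] = field(default_factory=dict)


def pod_totals(containers: List[Container]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Sum per-resource requests and limits across all containers of a pod.

    Only values that are explicitly specified are summed; a container with
    no limit for a resource is simply skipped here (an assumption of this
    sketch, since an unset limit is effectively unbounded).
    """
    requests: Dict[str, int] = {}
    limits: Dict[str, int] = {}
    for container in containers:
        for resource, amount in container.requests.items():
            requests[resource] = requests.get(resource, 0) + amount
        for resource, amount in container.limits.items():
            limits[resource] = limits.get(resource, 0) + amount
    return requests, limits
```

For a pod whose containers all set both values, the gap between the summed limit and the summed request for a resource is the amount the pod may scavenge opportunistically.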
For each resource, containers can specify a resource request and limit, `0 <= request <= Node Allocatable` & `request <= limit <= Infinity`. If a pod is successfully scheduled, the container is guaranteed the amount of resources requested. Scheduling is based on `requests` and not `limits`. The pod and its containers will not be allowed to exceed the specified limit.
How the request and limit are enforced depends on whether the resource is compressible or incompressible.
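As a rough illustration of the constraints above, the sketch below checks a single request/limit pair and then decides whether a pod fits on a node using requests only. The function names `is_valid` and `fits_on_node` are hypothetical, `None` stands in for an unspecified (infinite) limit, and quantities are assumed to be plain integers; the real scheduler considers much more than this.

```python
from typing import Dict, Optional


def is_valid(request: int, limit: Optional[int], allocatable: int) -> bool:
    """Check 0 <= request <= allocatable and request <= limit.

    A limit of None is treated as infinity (assumption of this sketch).
    """
    if not 0 <= request <= allocatable:
        return False
    return limit is None or request <= limit


def fits_on_node(
    pod_requests: Dict[str, int],
    node_allocatable: Dict[str, int],
    already_requested: Dict[str, int],
) -> bool:
    """Return True if every requested resource still fits on the node.

    Limits are deliberately ignored: scheduling is based on requests.
    """
    for resource, request in pod_requests.items():
        free = node_allocatable.get(resource, 0) - already_requested.get(resource, 0)
        if request > free:
            return False
    return True
```

Because limits play no part in `fits_on_node`, the sum of limits on a node can exceed its capacity, which is exactly the overcommitted case discussed next.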
In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority.
The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container. Hypothetically, users could use a (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
Pods can be of one of 3 different classes:
If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are equal, then the pod is classified as Guaranteed. Examples:
containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi

containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 100m
        memory: 100Mi
If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are not equal, then the pod is classified as Burstable. When `limits` are not specified, they default to the node capacity. Examples:
Container `bar` has no resources specified.
containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
Containers `foo` and `bar` have limits set for different resources.
containers:
  - name: foo
    resources:
      limits:
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
Container `foo` has no limits set, and `bar` has neither requests nor limits specified.
containers:
  - name: foo
    resources:
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
If `requests` and `limits` are not set for any of the resources, across all containers, then the pod is classified as Best-Effort. Examples:
containers:
  - name: foo
    resources:
  - name: bar
    resources:
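The three classification rules can be summarized in a short sketch. This is an illustrative Python function, not the Kubelet's implementation: `qos_class` is a hypothetical name, containers are assumed to be dictionaries shaped like the examples above, only `cpu` and `memory` are considered, and quantities are compared as literal strings (so `1Gi` and `1024Mi` would wrongly count as different).

```python
from typing import Any, Dict, List, Optional

# Resources considered by this sketch (assumption: only cpu and memory).
RESOURCES = ("cpu", "memory")


def _non_zero(values: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Drop unset and zero-valued entries, which count as 'not set'."""
    return {k: v for k, v in (values or {}).items() if v not in (0, "0")}


def qos_class(containers: List[Dict[str, Any]]) -> str:
    """Classify a pod as Guaranteed, Burstable, or Best-Effort."""
    anything_set = False
    all_guaranteed = True
    for container in containers:
        resources = container.get("resources") or {}
        requests = _non_zero(resources.get("requests"))
        limits = _non_zero(resources.get("limits"))
        if requests or limits:
            anything_set = True
        for resource in RESOURCES:
            limit = limits.get(resource)
            # An unset request is treated as equal to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not anything_set:
        return "Best-Effort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

Applied to the examples in this section, a pod where every container sets equal limits and requests for both resources comes out as Guaranteed, a pod with any partially specified container as Burstable, and a pod with empty `resources` everywhere as Best-Effort.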
Pods will not be killed if CPU guarantees cannot be met (for example, if system tasks or daemons take up lots of CPU); instead, they will be temporarily throttled.
Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
Best-Effort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.
Guaranteed pods are considered top-priority and are guaranteed not to be killed unless they exceed their limits, or the system is under memory pressure and there are no lower-priority containers that can be evicted.
Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.
Pod OOM score configuration
Best-effort
- Set OOM_SCORE_ADJ: 1000
- So processes in best-effort containers will have an OOM_SCORE of 1000
Guaranteed
- Set OOM_SCORE_ADJ: -998
- So processes in guaranteed containers will have an OOM_SCORE of 0 or 1
Burstable
- If total memory request > 99.8% of available memory, OOM_SCORE_ADJ: 2
- Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested)
- This ensures that the OOM_SCORE of a burstable pod is > 1
- If memory request is `0`, OOM_SCORE_ADJ is set to `999`.
- So burstable pods will be killed if they conflict with guaranteed pods
- If a burstable pod uses less memory than requested, its OOM_SCORE < 1000
- So best-effort pods will be killed if they conflict with burstable pods using less than requested memory
- If a process in a burstable pod's container uses more memory than the container requested, its OOM_SCORE will be 1000; otherwise its OOM_SCORE will be < 1000
- Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed
- If burstable pods' containers with multiple processes conflict, then the formula for OOM scores is only a heuristic; it will not ensure "Request and Limit" guarantees.
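The OOM_SCORE_ADJ rules listed above can be written as a small function. This is a sketch under stated assumptions rather than the Kubelet's code: `oom_score_adj` is a hypothetical name, memory values are integers in bytes, integer arithmetic stands in for "10 * (% of memory requested)", and capping tiny non-zero requests at 999 is an assumption made here so that a burstable pod always scores below a best-effort one.

```python
def oom_score_adj(qos: str, memory_request: int, memory_capacity: int) -> int:
    """Return the OOM_SCORE_ADJ for a container, given its pod's QoS class."""
    if qos == "Guaranteed":
        return -998
    if qos == "Best-Effort":
        return 1000
    if qos != "Burstable":
        raise ValueError(f"unknown QoS class: {qos}")
    if memory_request == 0:
        return 999
    # 1000 - 10 * (% of memory requested), computed with integers.
    adj = 1000 - (1000 * memory_request) // memory_capacity
    # Requests above 99.8% of memory are pinned to 2, keeping the score
    # above that of guaranteed pods. The 999 cap is an assumption of this
    # sketch for very small requests.
    return max(2, min(adj, 999))
```

For example, a burstable container requesting half of the node's memory gets an adjustment of 500, so its OOM_SCORE reaches 1000 only once it uses more than it requested.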
Pod infra containers or Special Pod init process
The above implementation provides for basic oversubscription with protection, but there are a few known limitations.
An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed). A strict hierarchy of user-specified numerical priorities is not desirable because: