Issue: https://github.com/kubernetes/kubernetes/issues/13984
Currently `Node.Status` has `Capacity`, but no concept of node `Allocatable`. Additional parameters are needed to serve several purposes.
This proposal deals with resource reporting through the `Allocatable` field for more reliable scheduling and minimized resource overcommitment. This proposal does not cover resource usage enforcement (e.g. limiting kubernetes component usage), pod eviction (e.g. when reservation grows), or running multiple Kubelets on a single node.
`NodeStatus.Capacity` is the total capacity read from the node instance, and is assumed to be constant.

Add `Allocatable` (4) to `NodeStatus`:
```go
type NodeStatus struct {
	...
	// Allocatable represents schedulable resources of a node.
	Allocatable ResourceList `json:"allocatable,omitempty"`
	...
}
```
Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be:
```
[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
```
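Expressed as code, the formula can be sketched as follows. The `Resources` struct and the milli-CPU/byte units here are illustrative stand-ins for the API's `ResourceList`, not actual Kubernetes types:

```go
package main

import "fmt"

// Resources holds compute resources in milli-CPU and bytes of memory.
// These field names are illustrative, not part of the Kubernetes API.
type Resources struct {
	MilliCPU    int64
	MemoryBytes int64
}

// allocatable applies the proposal's formula:
// [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
func allocatable(capacity, kubeReserved, systemReserved Resources) Resources {
	return Resources{
		MilliCPU:    capacity.MilliCPU - kubeReserved.MilliCPU - systemReserved.MilliCPU,
		MemoryBytes: capacity.MemoryBytes - kubeReserved.MemoryBytes - systemReserved.MemoryBytes,
	}
}

func main() {
	// A 4-core node with 8 GiB of memory, reserving 500m CPU / 512 MiB for
	// kubernetes components and 200m CPU / 256 MiB for the rest of the system.
	capacity := Resources{MilliCPU: 4000, MemoryBytes: 8 << 30}
	kube := Resources{MilliCPU: 500, MemoryBytes: 512 << 20}
	system := Resources{MilliCPU: 200, MemoryBytes: 256 << 20}
	fmt.Println(allocatable(capacity, kube, system)) // prints {3300 7784628224}
}
```

With both reservations set to zero, this reduces to `Allocatable == Capacity`, matching the fallback behavior described later in the proposal.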
The scheduler will use `Allocatable` in place of `Capacity` when scheduling pods, and the Kubelet will use it when performing admission checks.
Note: Since kernel usage can fluctuate and is out of kubernetes control, it will be reported as a separate value (probably via the metrics API). Reporting kernel usage is out-of-scope for this proposal.
`KubeReserved` is the parameter specifying resources reserved for kubernetes components (4). It is provided as a command-line flag to the Kubelet at startup, and therefore cannot be changed during normal Kubelet operation (this may change in the future).

The flag will be specified as a serialized `ResourceList`, with resources defined by the API `ResourceName` and values specified in `resource.Quantity` format, e.g.:

```
--kube-reserved=cpu=500m,memory=5Mi
```
Initially we will only support CPU and memory, but will eventually support more resources. See #16889 for disk accounting.
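As a rough sketch of how the flag value could be decoded, the following splits `name=quantity` pairs into a map. `parseReservedFlag` is a hypothetical helper; the real Kubelet would additionally parse each value as an API `resource.Quantity` rather than leaving it as a string:

```go
package main

import (
	"fmt"
	"strings"
)

// parseReservedFlag splits a flag value like "cpu=500m,memory=5Mi" into a
// map of resource name to quantity string. This is a simplified sketch:
// quantities are kept as raw strings instead of parsed resource.Quantity
// values, and resource names are not validated against the API.
func parseReservedFlag(v string) (map[string]string, error) {
	out := map[string]string{}
	if v == "" {
		return out, nil
	}
	for _, pair := range strings.Split(v, ",") {
		name, qty, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("malformed resource %q, expected name=quantity", pair)
		}
		out[name] = qty
	}
	return out, nil
}

func main() {
	r, err := parseReservedFlag("cpu=500m,memory=5Mi")
	if err != nil {
		panic(err)
	}
	fmt.Println(r["cpu"], r["memory"]) // prints: 500m 5Mi
}
```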
If `KubeReserved` is not set, it defaults to a sane value (TBD) calculated from machine capacity. If it is explicitly set to 0 (along with `SystemReserved`), then `Allocatable == Capacity`, and the system behavior is equivalent to the 1.1 behavior, with scheduling based on `Capacity`.
In the initial implementation, `SystemReserved` will be functionally equivalent to `KubeReserved`, but with a different semantic meaning. While `KubeReserved` designates resources set aside for kubernetes components, `SystemReserved` designates resources set aside for non-kubernetes components (currently this is reported as all the processes lumped together in the `/system` raw container).
Solution: Initially, do nothing (best effort). Let the kubernetes daemons overflow the reserved resources and hope for the best. If the node usage is less than Allocatable, there will be some room for overflow and the node should continue to function. If the node has been scheduled to capacity (worst-case scenario) it may enter an unstable state, which is the current behavior in this situation.
In the future we may set a parent cgroup for kubernetes components, with limits set according to `KubeReserved`.
API server / scheduler is not allocatable-resources aware: If the Kubelet rejects a Pod but the scheduler expects the Kubelet to accept it, the system could get stuck in an infinite loop scheduling a Pod onto the node only to have the Kubelet repeatedly reject it. To avoid this situation, we will do a 2-stage rollout of `Allocatable`. In stage 1 (targeted for 1.2), `Allocatable` will be reported by the Kubelet and the scheduler will be updated to use it, but the Kubelet will continue to do admission checks based on `Capacity` (same as today). In stage 2 of the rollout (targeted for 1.3 or later), the Kubelet will start doing admission checks based on `Allocatable`.
API server expects `Allocatable` but does not receive it: If the Kubelet is older and does not provide `Allocatable` in the `NodeStatus`, then `Allocatable` will be defaulted to `Capacity` (which will yield today's behavior of scheduling based on capacity).
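A minimal sketch of this defaulting step, using simplified stand-in types rather than the real API structs (in the actual API, quantities are `resource.Quantity` values, not strings):

```go
package main

import "fmt"

// ResourceList mirrors the API type in spirit: a map of resource name to
// quantity. Quantities are plain strings here for illustration only.
type ResourceList map[string]string

type NodeStatus struct {
	Capacity    ResourceList
	Allocatable ResourceList
}

// defaultAllocatable fills in Allocatable from Capacity when an older
// Kubelet did not report it, matching the proposal's fallback behavior.
func defaultAllocatable(s *NodeStatus) {
	if s.Allocatable == nil {
		s.Allocatable = s.Capacity
	}
}

func main() {
	// A status from an older Kubelet: Capacity set, Allocatable absent.
	old := NodeStatus{Capacity: ResourceList{"cpu": "4", "memory": "8Gi"}}
	defaultAllocatable(&old)
	fmt.Println(old.Allocatable["cpu"]) // prints: 4
}
```

After defaulting, a scheduler that consumes `Allocatable` sees the same values as one scheduling on `Capacity`, which is exactly today's behavior.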
The community should be notified that an update to schedulers is recommended, but if a scheduler is not updated it falls under the above case of "scheduler is not allocatable-resources aware".