Status: Design & Implementation in progress.
Contact @HaiyangDING for questions & suggestions.
In the current Kubernetes design, there is only one default scheduler in a cluster. However, it is common for multiple types of workload, such as traditional batch, DAG batch, streaming, and user-facing production services, to run in the same cluster, and they need to be scheduled in different ways. For example, in Omega, batch and service workloads are scheduled by two types of schedulers: the batch workload is scheduled by a scheduler that looks at the current usage of the cluster to improve the resource usage rate, and the service workload is scheduled by another one that considers the reserved resources in the cluster and many other constraints, since its performance must meet higher SLOs. Mesos has done great work to support multiple schedulers by building a two-level scheduling structure.
This proposal describes how Kubernetes is going to support multiple schedulers so that users are able to run their own user-provided scheduler(s) to enable customized scheduling behavior as they need. As previously discussed in #11793, #9920 and #11470, the design of the multiple scheduler should be generic and include adding a scheduler name annotation to separate the pods. It is worth mentioning that the proposal does not address the question of how the scheduler name annotation gets set, although it is reasonable to anticipate that it would be set by a component like an admission controller/initializer, as the doc currently does.
Before going into the details of this proposal, below is a list of methods to extend the scheduler:
Separating the pods
Each pod should be scheduled by only one scheduler. As for implementation, a pod should have an additional field to tell by which scheduler it wants to be scheduled. Besides, each scheduler, including the default one, should have a unique logic of how to add unscheduled pods to its to-be-scheduled pod queue. Details will be explained in later sections.
Dealing with conflicts
Different schedulers are essentially separate processes. When all schedulers try to schedule their pods onto the nodes, there might be conflicts.
One example of such a conflict is a resource race. Suppose pod1, scheduled by my-scheduler, requests 1 CPU, and pod2, scheduled by kube-scheduler (the k8s native scheduler, acting as the default scheduler), requests 2 CPUs, while node-a has only 2.5 free CPUs. If both schedulers try to put their pods on node-a, one of them will eventually fail when the Kubelet on node-a performs the create action, due to insufficient CPU resources.
This conflict is complex to deal with in the api-server and etcd. Our current solution is to let the Kubelet do the conflict check; if a conflict happens, the affected pods are put back to the scheduler and wait to be scheduled again. Implementation details are in later sections.
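
The resource race and the Kubelet-side conflict check described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Node` class, its `admit` method, and `bind_all` are hypothetical names invented for illustration and do not correspond to actual Kubelet or scheduler code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node plus its Kubelet admission check."""
    name: str
    free_cpu: float
    running: list = field(default_factory=list)

    def admit(self, pod_name: str, cpu_request: float) -> bool:
        # Kubelet-side conflict check: accept only if the request still fits.
        if cpu_request > self.free_cpu:
            return False
        self.free_cpu -= cpu_request
        self.running.append(pod_name)
        return True


def bind_all(node, decisions):
    """Apply scheduler decisions in arrival order; return the rejected pods."""
    rejected = []
    for pod_name, cpu_request in decisions:
        if not node.admit(pod_name, cpu_request):
            # A rejected pod goes back to its scheduler to be scheduled again.
            rejected.append(pod_name)
    return rejected


node_a = Node("node-a", free_cpu=2.5)
# Both schedulers saw 2.5 free CPUs and independently chose node-a.
rejected = bind_all(node_a, [("pod1", 1.0), ("pod2", 2.0)])
print(rejected)  # ['pod2']
```

Which pod loses depends only on arrival order at the node; if pod2 arrived first, pod1 would still fit in the remaining 0.5 CPU only if it requested less than that, so here pod1 would be the one rejected.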
We definitely want the multi-scheduler design to be a generic mechanism. The following lists the changes we want to make in the first step.
Add an annotation in the pod template: scheduler.alpha.kubernetes.io/name: scheduler-name. This is used to separate pods between schedulers. scheduler-name should match one of the schedulers' scheduler-name.
Add a scheduler-name to each scheduler. This is done by hardcoding it or by passing it as a command-line argument. The Kubernetes native scheduler (now the kube-scheduler process) would have the name kube-scheduler.
The scheduler-name plays an important part in separating the pods between different schedulers.
Pods are statically dispatched to different schedulers based on the scheduler.alpha.kubernetes.io/name: scheduler-name annotation, and there should not be any conflicts between different schedulers handling their pods, i.e. one pod must NOT be claimed by more than one scheduler. To be specific, a scheduler can add a pod to its queue if and only if:
The scheduler-name specified in the pod's annotation scheduler.alpha.kubernetes.io/name: scheduler-name matches the scheduler-name of the scheduler.
The only exception is the default scheduler. Any pod that has no scheduler.alpha.kubernetes.io/name: scheduler-name annotation is assumed to be handled by the "default scheduler". In the first version of the multi-scheduler feature, the default scheduler would be the Kubernetes built-in scheduler, with kube-scheduler as its scheduler-name.
The Kubernetes built-in scheduler will claim any pod which has no scheduler.alpha.kubernetes.io/name: scheduler-name annotation or which has scheduler.alpha.kubernetes.io/name: kube-scheduler. In the future, it may be possible to change which scheduler is the default for a given cluster.
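
The claiming rule above can be written as a single predicate. The following is a minimal sketch, not the actual scheduler code: the function name `is_responsible_for` and the representation of annotations as a plain dictionary are assumptions made for illustration. Only the annotation key and the kube-scheduler default come from the text.

```python
SCHEDULER_ANNOTATION = "scheduler.alpha.kubernetes.io/name"
DEFAULT_SCHEDULER = "kube-scheduler"


def is_responsible_for(annotations, scheduler_name):
    """Return True if the named scheduler may add the pod to its queue.

    `annotations` is the pod's annotation map (hypothetical plain dict).
    """
    requested = annotations.get(SCHEDULER_ANNOTATION)
    if requested is None:
        # No annotation: only the default scheduler claims the pod.
        return scheduler_name == DEFAULT_SCHEDULER
    # Annotation present: only the scheduler with the matching name claims it.
    return requested == scheduler_name


schedulers = ["kube-scheduler", "my-scheduler"]
pods = {
    "unannotated": {},
    "explicit-default": {SCHEDULER_ANNOTATION: "kube-scheduler"},
    "custom": {SCHEDULER_ANNOTATION: "my-scheduler"},
    "orphan": {SCHEDULER_ANNOTATION: "no-such-scheduler"},
}
for pod_name, annotations in pods.items():
    claimed_by = [s for s in schedulers if is_responsible_for(annotations, s)]
    print(pod_name, claimed_by)
```

Because the check is a pure function of the annotation and the scheduler's own name, each pod is claimed by at most one scheduler, and a pod naming a scheduler that is not running is claimed by none.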
Dealing with conflicts
All schedulers must use predicate functions that are at least as strict as the ones that Kubelet applies when deciding whether to accept a pod; otherwise Kubelet and the scheduler may get into an infinite loop where Kubelet keeps rejecting a pod and the scheduler keeps scheduling it back onto the same node. To make it easier for people who write new schedulers to obey this rule, we will create a library containing the predicates Kubelet uses. (See issue #12744.)
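
One way a custom scheduler could guarantee the "at least as strict" property is to always run the shared Kubelet predicates and only ever add checks on top of them. The sketch below assumes this composition approach; the predicate names, the dictionary shapes for pods and nodes, and `make_node_filter` are all hypothetical, and the two resource checks are placeholders rather than the actual set of predicates Kubelet applies.

```python
def fits_cpu(pod, node):
    """Placeholder check: the pod's CPU request fits the node's free CPU."""
    return pod["cpu"] <= node["free_cpu"]


def fits_memory(pod, node):
    """Placeholder check: the pod's memory request fits the node's free memory."""
    return pod["memory"] <= node["free_memory"]


# Stand-in for the shared library of predicates that Kubelet uses.
KUBELET_PREDICATES = (fits_cpu, fits_memory)


def make_node_filter(extra_predicates=()):
    """Build a scheduler filter that always includes the Kubelet predicates."""
    predicates = KUBELET_PREDICATES + tuple(extra_predicates)

    def node_fits(pod, node):
        # A node passes only if every predicate accepts it, so this filter
        # can never accept a node that the Kubelet predicates would reject.
        return all(check(pod, node) for check in predicates)

    return node_fits


def requires_ssd(pod, node):
    """Hypothetical scheduler-specific constraint."""
    return node.get("ssd", False)


node = {"free_cpu": 2.5, "free_memory": 4096, "ssd": False}
pod = {"cpu": 1.0, "memory": 512}
print(make_node_filter()(pod, node))                # True
print(make_node_filter([requires_ssd])(pod, node))  # False
```

Since extra predicates can only remove nodes from consideration, any node the scheduler picks also passes the Kubelet checks, which avoids the reject-and-reschedule loop described above.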
In summary, in the initial version of this multi-scheduler design, we will achieve the following:
If a pod has the annotation scheduler.alpha.kubernetes.io/name: kube-scheduler, or the user does not explicitly set this annotation in the template, it will be picked up by the default scheduler.
If the annotation is set and refers to a valid scheduler-name, the pod will be picked up by the scheduler with the specified scheduler-name.
If the annotation is set but refers to an invalid scheduler-name, the pod will not be picked up by any scheduler. The pod will stay PENDING.
kind: Pod
apiVersion: v1
metadata:
  name: pod-abc
  labels:
    foo: bar
  annotations:
    scheduler.alpha.kubernetes.io/name: my-scheduler
This pod will be scheduled by "my-scheduler" and ignored by "kube-scheduler". If there is no running scheduler of name "my-scheduler", the pod will never be scheduled.
Add a --randomize-node-selection=N flag to the scheduler; setting this flag would cause the scheduler to pick randomly among the top N nodes instead of the one with the highest score.
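
The selection behavior that the proposed flag describes can be sketched as follows. This is an illustration of the idea only: `pick_node`, its parameters, and the score map are hypothetical, and nothing here reflects how the flag would actually be implemented in the scheduler.

```python
import random


def pick_node(scores, n=1, rng=random):
    """Pick uniformly at random among the n highest-scoring nodes.

    `scores` maps node name to score. With n=1 this reduces to picking
    the single highest-scoring node.
    """
    if not scores:
        raise ValueError("no candidate nodes")
    # Rank node names by score, highest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:max(1, n)]
    return rng.choice(top)


scores = {"node-a": 9, "node-b": 7, "node-c": 3}
print(pick_node(scores, n=1))  # node-a
print(pick_node(scores, n=2, rng=random.Random(0)))  # node-a or node-b
```

Spreading choices across the top N nodes makes it less likely that several schedulers, each seeing the same cluster state, all pick the same highest-scoring node at once.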