@davidopp, @erictune, @briangrant
July 2015
A rescheduler is an agent that proactively causes currently-running Pods to be moved, so as to optimize some objective function for goodness of the layout of Pods in the cluster. (The objective function doesn't have to be expressed mathematically; it may just be a collection of ad-hoc rules, but in principle there is an objective function. Implicitly an objective function is described by the scheduler's predicate and priority functions.) It might be triggered to run every N minutes, or whenever some event happens that is known to make the objective function worse (for example, whenever any Pod stays PENDING for a long time).
A rescheduler is useful because without one, scheduling decisions are made only at the time Pods are created. Later on, the state of the cluster may have changed in such a way that it would be better to move a Pod to another node.
There are two categories of movements a rescheduler might trigger: coalescing and spreading.
The most common use case for coalescing is making room for a PENDING Pod. Cluster layout changes over time. For example, run-to-completion Pods terminate, producing free space in their wake, but that space is fragmented. This fragmentation might prevent a PENDING Pod from scheduling (there are enough free resources for the Pod in aggregate across the cluster, but not on any single node). A rescheduler can coalesce free space like a disk defragmenter, thereby producing enough free space on a node for a PENDING Pod to schedule. In some cases it can do this just by moving Pods into existing holes, but often it will need to evict (and reschedule) running Pods in order to create a large enough hole.
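The fragmentation case can be illustrated with a small sketch. It assumes a single resource dimension (CPU) and a hypothetical two-node cluster; the names `fits_on_some_node` and `move_pod` are invented for illustration and are not part of any Kubernetes API.

```python
def fits_on_some_node(pod_cpu, free_cpu_by_node):
    """True if at least one node has enough free CPU for the Pod."""
    return any(free >= pod_cpu for free in free_cpu_by_node.values())


def move_pod(free_cpu_by_node, pod_cpu, src, dst):
    """Return the free-CPU map after moving a running Pod from src to dst."""
    if free_cpu_by_node[dst] < pod_cpu:
        raise ValueError("destination has no hole large enough for the Pod")
    updated = dict(free_cpu_by_node)
    updated[src] += pod_cpu  # the Pod's CPU is released on the source node
    updated[dst] -= pod_cpu  # and consumed on the destination node
    return updated


# Hypothetical cluster: 4 free CPUs in aggregate, but only 2 on each node.
free = {"node-a": 2, "node-b": 2}
pending_pod_cpu = 3
fragmented = (sum(free.values()) >= pending_pod_cpu
              and not fits_on_some_node(pending_pod_cpu, free))

# Moving a 1-CPU Pod from node-a into the hole on node-b coalesces free space.
after = move_pod(free, pod_cpu=1, src="node-a", dst="node-b")
```

Here the move goes into an existing hole; a real rescheduler would also have to account for the disruption the move causes to the Pod being relocated.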
A second use case for coalescing Pods arises when it becomes possible to support the running Pods on fewer nodes. The rescheduler can gradually move Pods off of some set of nodes to make those nodes empty so that they can then be shut down or removed. More specifically, the system could run a simulation to see whether, after removing a node from the cluster, the Pods that were on that node would be able to reschedule, either directly or with the help of the rescheduler; if the answer is yes, then you can safely auto-scale down (assuming services will still be meeting their application-level SLOs).
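A minimal sketch of such a scale-down simulation follows. It assumes one resource dimension and a first-fit-decreasing placement, which only stands in for whatever policy the real scheduler uses; `can_drain` is a hypothetical helper. It covers only the "directly" case, so a negative answer is conservative: the rescheduler might still make the Pods fit by moving others.

```python
def can_drain(node, pod_cpus_by_node, free_cpu_by_node):
    """Simulate removing `node`: can its Pods be placed on the other nodes?

    Uses first-fit decreasing on a copy of the free-CPU map, so the
    caller's view of the cluster is left unchanged.
    """
    free = {n: f for n, f in free_cpu_by_node.items() if n != node}
    for cpu in sorted(pod_cpus_by_node[node], reverse=True):
        target = next((n for n, f in free.items() if f >= cpu), None)
        if target is None:
            return False  # this Pod would stay PENDING after the node is removed
        free[target] -= cpu
    return True
```

For example, with Pods of 2 and 1 CPUs on `n1` and free capacity of 2 on `n2` and 1 on `n3`, draining `n1` succeeds in this model; with no free capacity on `n3` it does not.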
The main use cases for spreading Pods revolve around relieving congestion on one or more highly utilized nodes. For example, some process might suddenly start receiving a significantly above-normal number of external requests, leading to starvation of best-effort Pods on the node. We can use the rescheduler to move the best-effort Pods off of the node. (They are likely to have generous eviction SLOs, so are more likely to be movable than the Pod that is experiencing the higher load, but in principle we might move either.) Or even before any node becomes overloaded, we might proactively re-spread Pods from nodes with high utilization, to give them some buffer against future utilization spikes. In either case, the nodes we move the Pods onto might have been in the system for a long time or might have been added by the cluster auto-scaler specifically to allow the rescheduler to rebalance utilization.
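As a rough sketch of how a rescheduler might pick candidates in the congestion case, the following selects best-effort Pods on nodes whose utilization exceeds a threshold. The `Pod` record, the `best_effort` flag, and the 0.9 threshold are all assumptions made for illustration, not fields or defaults from Kubernetes.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    node: str
    best_effort: bool  # hypothetical flag: Pod has a generous eviction SLO


def spreading_candidates(pods, utilization_by_node, threshold=0.9):
    """Names of best-effort Pods on nodes whose utilization exceeds threshold."""
    hot = {n for n, u in utilization_by_node.items() if u > threshold}
    return [p.name for p in pods if p.node in hot and p.best_effort]
```

Lowering the threshold turns the same selection into the proactive variant, where Pods are re-spread before any node is actually overloaded.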
A second spreading use case is to separate antagonists. Sometimes the processes running in two different Pods on the same node may have unexpected antagonistic behavior towards one another. A system component might monitor for such antagonism and ask the rescheduler to move one of the antagonists to a new node.
The vast majority of users probably only care about rescheduling for three scenarios:
Because rescheduling is disruptive--it causes one or more already-running Pods to die when they otherwise wouldn't--a key constraint on rescheduling is that it must be done subject to disruption SLOs. There are a number of ways to specify these SLOs--a global rate limit across all Pods, a rate limit across a set of Pods defined by some particular label selector, a maximum number of Pods that can be down at any one time among a set defined by some particular label selector, etc. These policies are presumably part of the Rescheduler's configuration.
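One of these SLO forms, a maximum number of Pods that can be down at any one time among a set defined by a label selector, could be checked along the following lines. This is a sketch under assumptions: Pods are modeled as plain dictionaries with `labels` and `running` keys, the selector is an equality-only label map, and `may_evict` is a hypothetical helper rather than an existing API.

```python
def matches(selector, labels):
    """Equality-based label selector: every key/value pair must match."""
    return all(labels.get(key) == value for key, value in selector.items())


def may_evict(pod, pods, selector, max_down):
    """Would evicting the running `pod` keep the selected set within max_down?"""
    if not matches(selector, pod["labels"]):
        return True  # this SLO does not cover the Pod
    already_down = sum(
        1 for p in pods
        if matches(selector, p["labels"]) and not p["running"]
    )
    # The eviction itself takes one more Pod down.
    return already_down + 1 <= max_down
```

A rate-limit SLO would need a time window in addition to the current count, so it is not covered by this check.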
There are a lot of design possibilities for a rescheduler. To explain them, it's easiest to start with the description of a baseline rescheduler, and then describe possible modifications. The Baseline rescheduler
Possible variations on this Baseline rescheduler are
A key design question for a Rescheduler is how much knowledge it needs about the scheduling policies used by the cluster's scheduler(s).
For scaling up the cluster, a reasonable workflow might be: