Huawei PaaS Team
In this document we propose a design for the “Control Plane” of Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background on this work please refer to this proposal. The document is arranged as follows. First we briefly list the scenarios and use cases that motivate the K8S federation work; these use cases drive the design and also serve to validate it. We then summarize the functionality requirements derived from these use cases, and define the “in scope” functionalities that will be covered by this design (phase one). After that we give an overview of the proposed architecture, API and building blocks, and walk through several activity flows to show how these building blocks work together to support the use cases.
There are many reasons why customers may want to build a K8S federation:
Here are the functionality requirements derived from the above use cases:
It’s difficult to have a perfect design up front that implements all of the above requirements. Therefore we will take an iterative approach to designing and building the system. This document describes phase one of the overall work. In phase one we will cover only the following objectives:
The following parts are NOT covered in phase one:
The overall architecture of the control plane is shown below:
Some design principles we are following in this architecture:
The API Server in the Ubernetes control plane works just like the API Server in K8S. It talks to a distributed key-value store to persist, retrieve and watch API objects. This store is completely distinct from the key-value stores (etcd) in the underlying Kubernetes clusters. We still use etcd as the distributed storage so customers don’t need to learn and manage a different storage system, although it is envisaged that other storage systems (Consul, ZooKeeper) will probably be developed and supported over time.
The Ubernetes Scheduler schedules resources onto the underlying Kubernetes clusters. For example, it watches for unscheduled Ubernetes replication controllers (those that have not yet been scheduled onto underlying Kubernetes clusters) and performs the global scheduling work. For each unscheduled replication controller, it calls the policy engine to decide how to split the workload among clusters. It then creates one Kubernetes Replication Controller for each of the chosen underlying clusters, and posts them back to the etcd storage.
One subtlety worth noting here is that the scheduling decision is arrived at by combining the application-specific request from the user (which might include, for example, placement constraints) with the global policy specified by the federation administrator (for example, "prefer on-premise clusters over AWS clusters" or "spread load equally across clusters").
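To make this flow concrete, the sketch below shows roughly what the scheduler’s handling of one unscheduled replication controller could look like. It is illustrative only: the type and interface names (`FederatedRC`, `PolicyEngine`, `Store`) are assumptions made for this sketch, not the actual implementation.

```go
package scheduler

// FederatedRC is an illustrative stand-in for an Ubernetes replication
// controller that has not yet been scheduled onto underlying clusters.
type FederatedRC struct {
	Name            string
	Replicas        int
	ClusterSelector []string // names of acceptable clusters, e.g. Foo, Bar
}

// PolicyEngine combines the user's request (e.g. placement constraints) with
// the federation-wide policy (e.g. "prefer on-premise clusters") and returns
// how many replicas each cluster should run.
type PolicyEngine interface {
	Split(rc FederatedRC, clusters []string) map[string]int
}

// Store persists the resulting per-cluster replication controllers back into
// the control plane's etcd storage, from where they are pushed to clusters.
type Store interface {
	SaveSubRC(cluster, name string, replicas int) error
}

// scheduleOne handles a single unscheduled federated replication controller:
// ask the policy engine for a split, then record the per-cluster RCs.
func scheduleOne(rc FederatedRC, policy PolicyEngine, store Store) error {
	split := policy.Split(rc, rc.ClusterSelector)
	for cluster, replicas := range split {
		if replicas == 0 {
			continue
		}
		if err := store.SaveSubRC(cluster, rc.Name, replicas); err != nil {
			return err // in practice the scheduler would retry and record status
		}
	}
	return nil
}
```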
The cluster controller performs the following two kinds of work:
1. It watches the registered cluster API objects and manages their lifecycle (e.g. the phase transitions described in the Cluster API section below).
2. It periodically polls each underlying cluster API Server, collects status and capacity metrics, and updates the status of the corresponding cluster API object. An alternative design might be to run a pod in each underlying cluster that reports metrics for that cluster to the Ubernetes control plane. Which approach is better remains an open topic of discussion.

The Ubernetes service controller is a federation-level implementation of the K8S service controller. It watches service resources created on the control plane and creates corresponding K8S services on each of the involved K8S clusters. Besides interacting with the service resources on each individual K8S cluster, the Ubernetes service controller also performs some global DNS registration work.
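Returning to the cluster controller’s second kind of work, a rough sketch of the periodic polling loop is shown below. It is only illustrative and assumes hypothetical `ClusterClient` and `StatusWriter` interfaces; the simple node-level aggregation mirrors the phase-one approach described in the Cluster API section below.

```go
package clustercontroller

import "time"

// ClusterClient reads node capacity from one underlying cluster's API server.
// The interface is an assumption made for this sketch.
type ClusterClient interface {
	// NodeCapacities returns per-node available CPU (millicores) and memory (bytes).
	NodeCapacities() (cpu []int64, memory []int64, err error)
}

// StatusWriter updates the status section of a cluster API object.
type StatusWriter interface {
	UpdateCapacity(cluster string, cpuMilli, memoryBytes int64) error
	MarkUnreachable(cluster string) error
}

// pollClusters periodically aggregates node capacity for every registered
// cluster and writes the result back into the cluster object's status.
func pollClusters(clients map[string]ClusterClient, status StatusWriter, interval time.Duration) {
	for range time.Tick(interval) {
		for name, client := range clients {
			cpus, mems, err := client.NodeCapacities()
			if err != nil {
				// Cluster probing failed; reflect that in the cluster status.
				_ = status.MarkUnreachable(name)
				continue
			}
			var totalCPU, totalMem int64
			for _, c := range cpus {
				totalCPU += c
			}
			for _, m := range mems {
				totalMem += m
			}
			_ = status.UpdateCapacity(name, totalCPU, totalMem)
		}
	}
}
```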
Cluster is a new first-class API object introduced in this design. For each registered K8S cluster there will be such an API resource in the control plane. Clients register or deregister a cluster by sending corresponding REST requests to the following URL: /api/{$version}/clusters. Because the control plane behaves like a regular K8S client to the underlying clusters, the spec of a cluster object contains the necessary properties, such as the K8S cluster address and credentials. The status of a cluster API object will contain the following information:
$version.clusterSpec

| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate, etc.) and data of the credential used to access the cluster. It is used for system routines (not on behalf of users) | yes | string | |
$version.clusterStatus

| Name | Description | Required | Schema | Default |
|------|-------------|----------|--------|---------|
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | other cluster metadata, like the version | yes | ClusterMeta | |
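For illustration, the fields above might map onto Go type definitions roughly like the following. This is only a sketch: the concrete field types (and the shape of ClusterMeta) are assumptions based on the tables, not the final API definition.

```go
package federation

// ClusterSpec mirrors $version.clusterSpec above (illustrative only).
type ClusterSpec struct {
	// Address of the underlying K8S cluster's API server.
	Address string `json:"address"`
	// Credential carries the type (e.g. bearer token, client certificate) and
	// data of the credential used by control-plane routines to access the cluster.
	Credential string `json:"credential"`
}

// ClusterPhase is the recently observed lifecycle phase of the cluster.
type ClusterPhase string

// ClusterMeta carries other cluster metadata, such as the version.
type ClusterMeta struct {
	Version string `json:"version"`
}

// ClusterStatus mirrors $version.clusterStatus above.
type ClusterStatus struct {
	Phase ClusterPhase `json:"phase"`
	// Capacity represents the available resources of the cluster; in phase one
	// this is just aggregate CPU and memory.
	Capacity map[string]string `json:"capacity"`
	ClusterMeta ClusterMeta `json:"clusterMeta"`
}
```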
For simplicity we didn’t introduce a separate “cluster metrics” API object here. The cluster resource metrics are stored in the cluster status section, just as we do for nodes in K8S. In phase one it only contains available CPU and memory resources. The cluster controller will periodically poll the underlying cluster API Server to get the cluster capacity. In phase one it gets the metrics by simply aggregating the metrics from all nodes. In the future we will improve this with more efficient approaches, such as leveraging Heapster, and more metrics will be supported. Similar to node phases in K8S, the “phase” field includes the following values:
Below is the state transition diagram.
A global workload submitted to the control plane is represented as a replication controller in the Cluster Federation control plane. When a replication controller is submitted to the control plane, clients need a way to express its requirements or preferences regarding clusters. Depending on the use case these may be complex. For example:
Below is a sample of the YAML to create such a replication controller.
```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
  clusterSelector:
    name in (Foo, Bar)
```
Currently clusterSelector (implemented as a LabelSelector) only supports a simple list of acceptable clusters. Workloads will be evenly distributed across these acceptable clusters in phase one. After phase one we will define syntax to represent more advanced constraints, like cluster preference ordering, the desired number of workload splits, the desired ratio of workloads spread across different clusters, etc.
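For instance, with the sample manifest above (5 replicas, a clusterSelector matching Foo and Bar), an even split would place 3 replicas in one cluster and 2 in the other. The exact rounding behaviour is not specified in this design; the sketch below is just one plausible way to implement the phase-one even distribution (the function name and remainder handling are assumptions).

```go
package scheduler

// evenSplit distributes the desired replica count as evenly as possible
// across the acceptable clusters, handing out the remainder one by one.
// With total=5 and clusters=["Foo", "Bar"] it returns {"Foo": 3, "Bar": 2}.
func evenSplit(total int, clusters []string) map[string]int {
	split := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return split
	}
	base := total / len(clusters)
	remainder := total % len(clusters)
	for i, c := range clusters {
		split[c] = base
		if i < remainder {
			split[c]++
		}
	}
	return split
}
```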
Besides this explicit “clusterSelector” filter, a workload may have some implicit scheduling restrictions. For example, it may define a “nodeSelector” that can only be satisfied on some particular clusters. How to handle this will be addressed after phase one.
The Service API object exposed by the Cluster Federation is similar to service objects on Kubernetes. It defines the access to a group of pods. The federation service controller will create corresponding Kubernetes service objects on underlying clusters. These are detailed in a separate design document: Federated Services.
In phase one we only support scheduling replication controllers. Pod scheduling will be supported in a later phase. This is primarily in order to keep the Cluster Federation API compatible with the Kubernetes API.
The below diagram shows how workloads are scheduled on the Cluster Federation control plane:
There is a potential race condition here. Say at time T1 the control plane learns that there are m available resources in a K8S cluster. Because the cluster works independently, it still accepts workload requests from other K8S clients, or even from another Cluster Federation control plane. The Cluster Federation scheduling decision is based on this possibly stale view of available resources. However, when the actual RC creation happens on the cluster at time T2, the cluster may no longer have enough resources. We will address this problem in later phases with proposed solutions such as a resource reservation mechanism.
This part has been included in the section “Federated Service” of the document “Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design”. Please refer to that document for details.