The umbrella issue: #22964
Kubelet employs a declarative pod-level interface, which acts as the sole
integration point for container runtimes (e.g., docker and rkt). The
high-level, declarative interface has caused higher integration and maintenance
cost, and also slowed down feature velocity for the following reasons.
docker).
SyncPod() function, with the Pod Spec as the input argument. The runtime implements
logic to determine how to achieve the desired state based on the current
status, (re-)starts pods/containers and manages lifecycle hooks
accordingly.
The goals of defining the interface are to
The non-goals include
docker and rkt
hyper.
The existing pod-level interface will remain as it is in the near future to ensure that support for all existing runtimes continues. Meanwhile, we will work with all parties involved to switch to the proposed interface.
The main idea of this proposal is to adopt an imperative container-level interface, which allows Kubelet to directly control the lifecycles of the containers.
A pod is composed of a group of containers in an isolated environment with resource constraints. In Kubernetes, a pod is also the smallest schedulable unit. After a pod has been scheduled to the node, Kubelet will create the environment for the pod, and add/update/remove containers in that environment to meet the Pod Spec. To distinguish between the environment and the pod as a whole, we will call the pod environment PodSandbox.
Container runtimes may interpret the PodSandbox concept differently based on how they operate internally. For runtimes relying on a hypervisor, a sandbox naturally represents a virtual machine. For others, it can be a set of Linux namespaces.
In short, a PodSandbox should have the following features.
NOTE: The resource specification does not include externalized costs to container setup that are not currently trackable as Pod constraints, e.g., filesystem setup, container image pulling, etc.
A container in a PodSandbox maps to an application in the Pod Spec. Linux containers are expected to share at least the network and IPC namespaces; sharing more namespaces is discussed in #1615.
Below is an example of the proposed interfaces.
// PodSandboxManager contains basic operations for sandbox.
type PodSandboxManager interface {
    Create(config *PodSandboxConfig) (string, error)
    Delete(id string) (string, error)
    List(filter PodSandboxFilter) []PodSandboxListItem
    Status(id string) PodSandboxStatus
}

// ContainerRuntime contains basic operations for containers.
type ContainerRuntime interface {
    Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error)
    Start(id string) error
    Stop(id string, timeout int) error
    Remove(id string) error
    List(filter ContainerFilter) ([]ContainerListItem, error)
    Status(id string) (ContainerStatus, error)
    Exec(id string, cmd []string, streamOpts StreamOptions) error
}

// ImageService contains image-related operations.
type ImageService interface {
    List() ([]Image, error)
    Pull(image ImageSpec, auth AuthConfig) error
    Remove(image ImageSpec) error
    Status(image ImageSpec) (Image, error)
    Metrics(image ImageSpec) (ImageMetrics, error)
}

type ContainerMetricsGetter interface {
    ContainerMetrics(id string) (ContainerMetrics, error)
}
All functions listed above are expected to be thread-safe.
The PodSandbox’s lifecycle is decoupled from the containers, i.e., a sandbox is created before any containers, and can exist after all containers in it have terminated.
Assume there is a pod with a single container C. To start a pod:
create sandbox Foo --> create container C --> start container C
To delete a pod:
stop container C --> remove container C --> delete sandbox Foo
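The two sequences above can be sketched against the proposed interfaces. This is only an illustration, not Kubelet's actual code: the narrowed `sandboxManager` and `containerRuntime` interfaces, the empty config structs, the `callLog` recorder, the two fakes, and the `startPod`/`deletePod` helpers are all hypothetical. Only the method names and signatures are taken from the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder config types; the proposal does not spell out their fields.
type PodSandboxConfig struct{}
type ContainerConfig struct{}

// Subsets of the proposed interfaces, limited to the calls used here.
type sandboxManager interface {
	Create(config *PodSandboxConfig) (string, error)
	Delete(id string) (string, error)
}

type containerRuntime interface {
	Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error)
	Start(id string) error
	Stop(id string, timeout int) error
	Remove(id string) error
}

// callLog records the order of runtime calls (hypothetical helper).
type callLog struct{ calls []string }

func (l *callLog) add(call string) { l.calls = append(l.calls, call) }

type fakeSandboxes struct{ log *callLog }

func (f fakeSandboxes) Create(*PodSandboxConfig) (string, error) {
	f.log.add("create sandbox")
	return "Foo", nil
}

// The meaning of Delete's string result is not specified in the proposal;
// this fake simply echoes the ID.
func (f fakeSandboxes) Delete(id string) (string, error) {
	f.log.add("delete sandbox")
	return id, nil
}

type fakeContainers struct{ log *callLog }

func (f fakeContainers) Create(*ContainerConfig, *PodSandboxConfig, string) (string, error) {
	f.log.add("create container")
	return "C", nil
}
func (f fakeContainers) Start(string) error     { f.log.add("start container"); return nil }
func (f fakeContainers) Stop(string, int) error { f.log.add("stop container"); return nil }
func (f fakeContainers) Remove(string) error    { f.log.add("remove container"); return nil }

// startPod: create sandbox --> create container --> start container.
func startPod(sm sandboxManager, cr containerRuntime) (sandboxID, containerID string, err error) {
	sbCfg := &PodSandboxConfig{}
	if sandboxID, err = sm.Create(sbCfg); err != nil {
		return "", "", err
	}
	if containerID, err = cr.Create(&ContainerConfig{}, sbCfg, sandboxID); err != nil {
		return sandboxID, "", err
	}
	return sandboxID, containerID, cr.Start(containerID)
}

// deletePod: stop container --> remove container --> delete sandbox.
// The unit of timeout is not stated in the proposal.
func deletePod(sm sandboxManager, cr containerRuntime, sandboxID, containerID string, timeout int) error {
	if err := cr.Stop(containerID, timeout); err != nil {
		return err
	}
	if err := cr.Remove(containerID); err != nil {
		return err
	}
	_, err := sm.Delete(sandboxID)
	return err
}

func main() {
	log := &callLog{}
	sm, cr := fakeSandboxes{log}, fakeContainers{log}
	sb, c, err := startPod(sm, cr)
	if err != nil {
		panic(err)
	}
	if err := deletePod(sm, cr, sb, c, 30); err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(log.calls, " --> "))
}
```

Because every step is an explicit call made by the caller, the ordering logic lives in one place and can be shared across runtimes.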
The container runtime must not apply any transition (such as starting a new container) unless explicitly instructed by Kubelet. It is Kubelet's responsibility to enforce garbage collection, restart policy, and otherwise react to changes in lifecycle.
The only transitions that are possible for a container are described below:
() -> Created        // A container can only transition to Created from the
                     // empty, nonexistent state. The ContainerRuntime.Create
                     // method causes this transition.
Created -> Running   // The ContainerRuntime.Start method may be applied to a
                     // Created container to move it to Running.
Running -> Exited    // The ContainerRuntime.Stop method may be applied to a
                     // Running container to move it to Exited.
                     // A container may also make this transition of its own
                     // volition.
Exited -> ()         // An exited container can be moved to the terminal empty
                     // state via a ContainerRuntime.Remove call.
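The transition list above can be encoded as a small lookup table. The following is a sketch under assumptions: the `State` and `Op` types, the `transitions` table, and the `Next` helper are hypothetical names introduced for illustration and are not part of the proposed interface. It models only transitions triggered by explicit ContainerRuntime calls; a container exiting on its own is not represented.

```go
package main

import (
	"errors"
	"fmt"
)

// State mirrors the states in the transition list; Nonexistent stands for "()".
type State string

const (
	Nonexistent State = "()"
	Created     State = "Created"
	Running     State = "Running"
	Exited      State = "Exited"
)

// Op names the ContainerRuntime method that triggers a transition.
type Op string

const (
	OpCreate Op = "Create"
	OpStart  Op = "Start"
	OpStop   Op = "Stop"
	OpRemove Op = "Remove"
)

// ErrInvalidTransition is returned for any (state, op) pair not in the table.
var ErrInvalidTransition = errors.New("invalid container state transition")

// transitions holds exactly the four transitions listed above.
var transitions = map[State]map[Op]State{
	Nonexistent: {OpCreate: Created},
	Created:     {OpStart: Running},
	Running:     {OpStop: Exited},
	Exited:      {OpRemove: Nonexistent},
}

// Next returns the state reached by applying op in state s, or an error
// (leaving the state unchanged) if the list does not allow that transition.
func Next(s State, op Op) (State, error) {
	if next, ok := transitions[s][op]; ok {
		return next, nil
	}
	return s, fmt.Errorf("%w: %s on a container in state %s", ErrInvalidTransition, op, s)
}

func main() {
	s := Nonexistent
	for _, op := range []Op{OpCreate, OpStart, OpStop, OpRemove} {
		next, err := Next(s, op)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s --%s--> %s\n", s, op, next)
		s = next
	}

	// Restarting an exited container in place is not an allowed transition.
	if _, err := Next(Exited, OpStart); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Under this reading, restarting an exited container means removing it and creating a new one, which keeps the restart decision with the caller.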
Kubelet is also responsible for gracefully terminating all the containers in the sandbox before deleting the sandbox. If Kubelet chooses to delete the sandbox with running containers in it, those containers should be forcibly deleted.
Note that every PodSandbox/container lifecycle operation (create, start,
stop, delete) should either return an error or block until the operation
succeeds. A successful operation should include a state transition of the
PodSandbox/container. E.g., if a Create call for a container does not
return an error, the container state should be "created" when the runtime is
queried.
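As an illustration of this contract, the sketch below shows a toy in-memory runtime whose `Create` records the new state before returning, so a nil error implies that a subsequent `Status` call reports "created". Everything here is an assumption made for the example: the `memRuntime` type, the `Name` field on `ContainerConfig`, and the `State` field on `ContainerStatus` are hypothetical, since the proposal does not define those structs. The mutex reflects the requirement that all interface functions be thread-safe.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical shapes for types the proposal leaves undefined.
type ContainerConfig struct{ Name string }
type PodSandboxConfig struct{}
type ContainerStatus struct{ State string }

// memRuntime is a toy in-memory runtime used only to illustrate the contract.
type memRuntime struct {
	mu     sync.Mutex // guards nextID and states
	nextID int
	states map[string]string
}

func newMemRuntime() *memRuntime {
	return &memRuntime{states: map[string]string{}}
}

// Create blocks until the container is recorded as "created", so a nil
// error guarantees that Status observes the new state.
func (r *memRuntime) Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	id := fmt.Sprintf("%s-%d", config.Name, r.nextID)
	r.states[id] = "created"
	return id, nil
}

// Status reports the recorded state, or an error for an unknown ID.
func (r *memRuntime) Status(id string) (ContainerStatus, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	state, ok := r.states[id]
	if !ok {
		return ContainerStatus{}, fmt.Errorf("container %q not found", id)
	}
	return ContainerStatus{State: state}, nil
}

func main() {
	rt := newMemRuntime()
	id, err := rt.Create(&ContainerConfig{Name: "C"}, &PodSandboxConfig{}, "Foo")
	if err != nil {
		panic(err)
	}
	status, err := rt.Status(id)
	if err != nil {
		panic(err)
	}
	fmt.Println(id, status.State) // prints "C-1 created"
}
```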
Kubernetes supports updates only to a very limited set of fields in the Pod Spec. These updates may require containers to be re-created by Kubelet, which can be achieved through the proposed imperative container-level interface. PodSandbox updates, on the other hand, are currently not required.
Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing discussion for supporting pre-start and post-stop hooks in #140.
These lifecycle hooks will be implemented by Kubelet via Exec calls to the
container runtime. This frees the runtimes from having to support hooks
natively.
Illustration of the container lifecycle and hooks:
       pre-start     post-start pre-stop      post-stop
           |              |         |             |
          exec           exec      exec          exec
           |              |         |             |
create --------> start ----------------> stop --------> remove
In order for the lifecycle hooks to function as expected, the Exec call
will need access to the container's filesystem (e.g., mount namespaces).
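A rough sketch of how a caller could drive the two currently supported hooks (post-start and pre-stop) through `Exec` is shown below. The `hookRuntime` interface is a hypothetical subset of the proposed ContainerRuntime; `StreamOptions` is left empty because the proposal does not define it; the helper names, the `traceRuntime` fake, and the hook commands are invented for illustration. Stopping the container even when the pre-stop hook fails is an assumption of this sketch, not something the proposal specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// StreamOptions is a placeholder; the proposal does not define its fields.
type StreamOptions struct{}

// hookRuntime is a hypothetical subset of the proposed ContainerRuntime.
type hookRuntime interface {
	Start(id string) error
	Stop(id string, timeout int) error
	Exec(id string, cmd []string, streamOpts StreamOptions) error
}

// startWithPostStart starts the container, then runs the post-start hook
// (if any) inside it via Exec.
func startWithPostStart(rt hookRuntime, id string, postStart []string) error {
	if err := rt.Start(id); err != nil {
		return err
	}
	if len(postStart) == 0 {
		return nil
	}
	return rt.Exec(id, postStart, StreamOptions{})
}

// stopWithPreStop runs the pre-stop hook (if any) via Exec, then stops the
// container. Assumption: the container is stopped even if the hook fails,
// and the hook error is reported only when Stop itself succeeds.
func stopWithPreStop(rt hookRuntime, id string, preStop []string, timeout int) error {
	var hookErr error
	if len(preStop) > 0 {
		hookErr = rt.Exec(id, preStop, StreamOptions{})
	}
	if err := rt.Stop(id, timeout); err != nil {
		return err
	}
	return hookErr
}

// traceRuntime is a fake that records the calls it receives.
type traceRuntime struct{ trace []string }

func (t *traceRuntime) Start(id string) error {
	t.trace = append(t.trace, "start")
	return nil
}

func (t *traceRuntime) Stop(id string, timeout int) error {
	t.trace = append(t.trace, "stop")
	return nil
}

func (t *traceRuntime) Exec(id string, cmd []string, _ StreamOptions) error {
	t.trace = append(t.trace, "exec "+strings.Join(cmd, " "))
	return nil
}

func main() {
	rt := &traceRuntime{}
	// Hook commands are made up for the example.
	if err := startWithPostStart(rt, "C", []string{"/bin/post-start"}); err != nil {
		panic(err)
	}
	if err := stopWithPreStop(rt, "C", []string{"/bin/pre-stop"}, 30); err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(rt.trace, " -> "))
}
```

The runtime in this sketch only needs to provide `Exec`; the hook ordering is decided entirely by the caller.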
There are several dimensions for container runtime extensibility.
As mentioned previously, this proposal will only address the Linux-based PodSandbox and containers. All Linux-specific configuration will be grouped into one field. A container runtime is required to enforce all configuration applicable to its platform, and should return an error otherwise.
The proposed interface is experimental, i.e., it will go through (many) changes until it stabilizes. The principle is to keep the interface minimal and extend it later if needed. This includes several features that are still in discussion and may be achieved alternatively:
[Status quo] Declarative pod-level interface
Allow integration at both pod- and container-level interfaces
Imperative pod-level interface: The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod(). This implies that the runtime needs to take over container lifecycle management (i.e., enforce restart policy), lifecycle hooks, liveness checks, etc. Kubelet will mainly be responsible for interfacing with the apiserver, and can potentially become a very thin daemon.
Docker shim maintenance cost is discounted.