March 2016
Within a pod there is often a need to initialize local data or adapt to the current cluster environment, and this is not easily achieved in the current container model. Containers start in parallel after volumes are mounted, leaving no opportunity for coordination between containers without specialization of the image. If two containers need to share common initialization data, both images must be altered to cooperate using filesystem or network semantics, which introduces coupling between images. Likewise, if an image requires configuration in order to start and that configuration is environment dependent, the image must be altered to add the necessary templating or retrieval.
This proposal introduces the concept of an init container, one or more containers started in sequence before the pod's normal containers are started. These init containers may share volumes, perform network operations, and perform computation prior to the start of the remaining containers. They may also, by virtue of their sequencing, block or delay the startup of application containers until some precondition is met. In this document we refer to the existing pod containers as app containers.
This proposal also provides a high-level design for volume containers, which initialize a particular volume, as a feature that specializes some of the tasks defined for init containers. The init container design anticipates the existence of volume containers and highlights where they will take over future work.
Each pod may have 0..N init containers defined along with the existing 1..M app containers.
On startup of the pod, after the network and volumes are initialized, the init containers are started in order. Each container must exit successfully before the next is invoked. If a container fails to start (due to the runtime) or exits with failure, it is retried according to the pod RestartPolicy. RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways pods will retry the failing init container with increasing backoff until it succeeds. To align with the design of application containers, init containers will only support "infinite retries" (RestartPolicyAlways) or "no retries" (RestartPolicyNever).
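As an illustration of that interplay, the following sketch (using the initContainers field introduced later in this proposal, with hypothetical image names) shows a pod whose init container must succeed before the app container starts; because restartPolicy is Never, a single init container failure fails the pod, whereas restartPolicy Always would retry the init container with increasing backoff:

pod:
  spec:
    restartPolicy: Never
    initContainers:
    - name: setup
      image: setup-image                  # hypothetical image; must exit 0 before "app" is started
      command: ["/bin/sh", "-c", "/prepare-data"]
    containers:
    - name: app
      image: application-image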
A pod cannot be ready until all init containers have succeeded. The ports on an init container are not aggregated under a service. A pod that is being initialized is in the Pending phase but should have a distinct condition. Each app container and all future init containers should have the reason PodInitializing. The pod should have a condition Initializing set to false until all init containers have succeeded, and true thereafter. If the pod is restarted, the Initializing condition should be set to false.
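A rough sketch of the intended pod status while initialization is still in progress (the layout and exact values are illustrative, not a final API):

pod:
  status:
    phase: Pending
    conditions:
    - type: Initializing
      status: "False"
    containerStatuses:
    - name: app
      state:
        waiting:
          reason: PodInitializing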
If the pod is "restarted" (all containers stopped and started), all init containers must execute again. Restartable conditions are defined as: a node restart, a change to the pod definition, or admin interaction.
Changes to the init container spec are limited to the container image field. Altering the container image field is equivalent to restarting the pod.
Because init containers can be restarted, retried, or reexecuted, container authors should make their init behavior idempotent by handling volumes that are already populated or the possibility that this instance of the pod has already contacted a remote system.
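For example, a sketch (hypothetical image and paths) of an init container that populates a shared volume but guards against re-execution by checking whether the data is already present:

pod:
  spec:
    initContainers:
    - name: seed-data
      image: data-seeder                  # hypothetical image
      # skip the expensive download if an earlier execution already populated the volume
      command: ["/bin/sh", "-c", "test -e /var/lib/data/.seeded || (/download-data /var/lib/data && touch /var/lib/data/.seeded)"]
      volumeMounts:
      - mountPath: /var/lib/data
        name: data
    containers:
    - name: run
      image: application-image
      volumeMounts:
      - mountPath: /var/lib/data
        name: data
    volumes:
    - emptyDir: {}
      name: data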
Each init container has all of the fields of an app container. The following fields are prohibited from being used on init containers by validation:

readinessProbe - init containers must exit for pod startup to continue, are not included in rotation, and so cannot define readiness distinct from completion.

Init container authors may use activeDeadlineSeconds on the pod and livenessProbe on the container to prevent init containers from failing forever. The active deadline includes init containers.
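For example, a sketch (image, probe, and timings are illustrative) that bounds total initialization time with a pod-level active deadline and uses a liveness probe to restart an init container that stops making progress:

pod:
  spec:
    activeDeadlineSeconds: 300            # the pod, including initialization, must finish within 5 minutes
    initContainers:
    - name: wait-for-service
      image: centos:centos7
      # write a heartbeat file on each retry so the liveness probe can observe progress
      command: ["/bin/sh", "-c", "until dig myservice; do touch /tmp/heartbeat; sleep 2; done"]
      livenessProbe:
        exec:
          command: ["/bin/sh", "-c", "test -e /tmp/heartbeat"]
        initialDelaySeconds: 30
        periodSeconds: 10
    containers:
    - name: run
      image: application-image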
Because init containers are semantically different in lifecycle from app containers (they are run serially, rather than in parallel), for backwards compatibility and design clarity they will be identified as distinct fields in the API:
pod:
  spec:
    containers: ...
    initContainers:
    - name: init-container1
      image: ...
      ...
    - name: init-container2
      ...
  status:
    containerStatuses: ...
    initContainerStatuses:
    - name: init-container1
      ...
    - name: init-container2
      ...
This separation also serves to make the order of container initialization clear - init containers are executed in the order that they appear, then all app containers are started at once.
The name of each app and init container in a pod must be unique - it is a validation error for any container to share a name.
While init containers are in the alpha state, they will be serialized as an annotation on the pod with the name pod.alpha.kubernetes.io/init-containers, and the status of the containers will be stored as pod.alpha.kubernetes.io/init-container-statuses. Mutation of these annotations is prohibited on existing pods.
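A rough sketch of the alpha serialization (the JSON value is abbreviated and illustrative; the exact encoding is an implementation detail):

pod:
  metadata:
    annotations:
      pod.alpha.kubernetes.io/init-containers: '[{"name": "init-container1", "image": "...", "command": ["..."]}]'
  spec:
    containers: ...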
Given the ordering and execution of init containers, the following rules for resource usage apply: for each resource, the effective pod request/limit is the higher of the sum of that resource's request/limit across all app containers and the highest request/limit for that resource among the init containers. So the following pod:
pod:
  spec:
    initContainers:
    - limits:
        cpu: 100m
        memory: 1GiB
    - limits:
        cpu: 50m
        memory: 2GiB
    containers:
    - limits:
        cpu: 10m
        memory: 1100MiB
    - limits:
        cpu: 10m
        memory: 1100MiB
has an effective pod limit of cpu: 100m, memory: 2200MiB (the highest init container cpu is larger than the sum of all app containers, and the sum of app container memory is larger than the max of all init containers). The scheduler, node, and quota must respect the effective pod request/limit.
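Spelled out as arithmetic for the pod above:

# cpu:    max(10m + 10m = 20m,             max(100m, 50m) = 100m)            = 100m
# memory: max(1100MiB + 1100MiB = 2200MiB, max(1GiB, 2GiB) = 2GiB = 2048MiB) = 2200MiB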
In the absence of a defined request or limit on a container, the effective request/limit will be applied. For example, the following pod:
pod:
  spec:
    initContainers:
    - limits:
        cpu: 100m
        memory: 1GiB
    containers:
    - request:
        cpu: 10m
        memory: 1100MiB
will have an effective request of 10m / 1100MiB, and an effective limit of 100m / 1GiB, i.e.:
pod:
  spec:
    initContainers:
    - request:
        cpu: 10m
        memory: 1GiB
      limits:
        cpu: 100m
        memory: 1100MiB
    containers:
    - request:
        cpu: 10m
        memory: 1GiB
      limits:
        cpu: 100m
        memory: 1100MiB
and thus have the QoS tier Burstable (because request is not equal to limit).
Quota and limits will be applied based on the effective pod request and limit.
Pod-level cgroups will be based on the effective pod request and limit, the same as the scheduler.
Container runtimes should treat the set of init and app containers as one large pool. An individual init container execution should be identical to that of an app container, including all standard container environment setup (network, namespaces, hostnames, DNS, etc).
All app container operations are permitted on init containers. The logs for an init container should be available for the duration of the pod lifetime or until the pod is restarted.
During initialization, app container status should be shown with the reason PodInitializing if any init containers are present. Each init container should show appropriate container status, and all init containers that are waiting for earlier init containers to finish should have the reason PendingInitialization.
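A sketch of the corresponding statuses for a pod with two init containers where the first is still running (names and layout are illustrative):

pod:
  status:
    initContainerStatuses:
    - name: init-container1
      state:
        running: {}
    - name: init-container2
      state:
        waiting:
          reason: PendingInitialization
    containerStatuses:
    - name: run
      state:
        waiting:
          reason: PodInitializing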
The container runtime should aggressively prune failed init containers. The container runtime should record whether all init containers have succeeded internally, and only invoke new init containers if a pod restart is needed (for Docker, if all containers terminate or if the pod infra container terminates). Init containers should follow backoff rules as necessary. The Kubelet must preserve at least the most recent instance of an init container to serve logs and data for end users and to track failure states. The Kubelet should prefer to garbage collect completed init containers over app containers, as long as the Kubelet is able to track that initialization has been completed. In the future, container state checkpointing in the Kubelet may remove or reduce the need to preserve old init containers.
For the initial implementation, the Kubelet will use the last termination container state of the highest indexed init container to determine whether the pod has completed initialization. During a pod restart, initialization will be restarted from the beginning (all initializers will be rerun).
All APIs that access containers by name should operate on both init and app containers. Because names are unique, the addition of init containers should be transparent to existing use cases.
A client with no knowledge of init containers should see appropriate container status reason and message fields while the pod is in the Pending phase, and so be able to communicate that to end users.
Wait for a service to be created
pod:
  spec:
    initContainers:
    - name: wait
      image: centos:centos7
      command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1"]
    containers:
    - name: run
      image: application-image
      command: ["/my_application_that_depends_on_myservice"]
Register this pod with a remote server
pod:
  spec:
    initContainers:
    - name: register
      image: centos:centos7
      command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
    containers:
    - name: run
      image: application-image
      command: ["/my_application_that_depends_on_myservice"]
Wait for an arbitrary period of time
pod:
  spec:
    initContainers:
    - name: wait
      image: centos:centos7
      command: ["/bin/sh", "-c", "sleep 60"]
    containers:
    - name: run
      image: application-image
      command: ["/static_binary_without_sleep"]
Clone a git repository into a volume (can be implemented by volume containers in the future):
pod:
  spec:
    initContainers:
    - name: download
      image: image-with-git
      command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
      volumeMounts:
      - mountPath: /var/lib/data
        name: git
    containers:
    - name: run
      image: centos:centos7
      command: ["/var/lib/data/binary"]
      volumeMounts:
      - mountPath: /var/lib/data
        name: git
    volumes:
    - emptyDir: {}
      name: git
Execute a template transformation based on environment (can be implemented by volume containers in the future):
pod:
  spec:
    initContainers:
    - name: copy
      image: application-image
      command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
      volumeMounts:
      - mountPath: /var/lib/data
        name: data
    - name: transform
      image: image-with-jinja
      command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
      volumeMounts:
      - mountPath: /var/lib/data
        name: data
    containers:
    - name: run
      image: application-image
      command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
      volumeMounts:
      - mountPath: /var/lib/data
        name: data
    volumes:
    - emptyDir: {}
      name: data
Perform a container build
pod:
  spec:
    initContainers:
    - name: copy
      image: base-image
      workingDir: /home/user/source-tree
      command: ["make"]
    containers:
    - name: commit
      image: image-with-docker
      command:
      - /bin/sh
      - -c
      - docker commit $(complex_bash_to_get_container_id_of_copy) myrepo:latest && docker push myrepo:latest
      volumeMounts:
      - mountPath: /var/run/docker.sock
        name: dockersocket
    volumes:
    - hostPath:
        path: /var/run/docker.sock
      name: dockersocket
Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not be able to rely on Kubelets implementing init containers. The management of feature skew between master and Kubelet is tracked in issue #4855.