Currently, most cascading deletion logic is implemented on the client side. For example, when deleting a replica set, kubectl uses a reaper to delete the created pods and then deletes the replica set. We plan to move cascading deletion to the server to simplify client-side logic. In this proposal, we present the garbage collector, which implements cascading deletion for all API resources in a generic way; we also present the finalizer framework, in particular the "orphan" finalizer, to enable flexible alternation between cascading deletion and orphaning.
Goals of the design include:
Non-goals include:
```go
type ObjectMeta struct {
	...
	OwnerReferences []OwnerReference
}
```
ObjectMeta.OwnerReferences: List of objects that this object depends on. If all objects in the list have been deleted, this object will be garbage collected. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.OwnerReferences pointing to `D`, set by the deployment controller when `R` is created. This field can be updated by any client that has the privilege to both update and delete the object. For safety reasons, we can add validation rules to restrict what resources can be set as owners. For example, Events will likely be banned from being owners.
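As an illustration, a controller creating a dependent object would append an owner reference built from the owner's metadata. The sketch below uses simplified local types mirroring the structs in this proposal; `newOwnerReference` is a hypothetical helper, not part of any real client library:

```go
package main

import "fmt"

// Simplified local stand-ins for the API types sketched in this proposal.
type UID string

type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
	UID        UID
}

type ObjectMeta struct {
	Name            string
	UID             UID
	OwnerReferences []OwnerReference
}

// newOwnerReference builds the entry a controller would append to a
// dependent's OwnerReferences when creating the dependent on behalf of
// an owner. (Illustrative helper.)
func newOwnerReference(apiVersion, kind string, owner ObjectMeta) OwnerReference {
	return OwnerReference{APIVersion: apiVersion, Kind: kind, Name: owner.Name, UID: owner.UID}
}

func main() {
	deployment := ObjectMeta{Name: "D", UID: "uid-d"}
	replicaSet := ObjectMeta{Name: "R"}
	// The deployment controller sets the reference when it creates R.
	replicaSet.OwnerReferences = append(replicaSet.OwnerReferences,
		newOwnerReference("extensions/v1beta1", "Deployment", deployment))
	fmt.Println(replicaSet.OwnerReferences[0].Name) // prints "D"
}
```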
```go
type OwnerReference struct {
	// Version of the referent.
	APIVersion string
	// Kind of the referent.
	Kind string
	// Name of the referent.
	Name string
	// UID of the referent.
	UID types.UID
}
```
OwnerReference struct: OwnerReference contains enough information to let you identify an owning object. Please refer to the inline comments for the meaning of each field. Currently, an owning object must be in the same namespace as the dependent object, so there is no namespace field.
The Garbage Collector is responsible for deleting an object if none of the owners listed in the object's OwnerReferences exist. The Garbage Collector consists of a scanner, a garbage processor, and a propagator.
- Scanner: periodically scans all resources in the cluster and adds each object to the Dirty Queue.
- Garbage Processor: consists of the Dirty Queue and workers. A worker dequeues an object, checks whether any owner listed in its OwnerReferences still exists, and if none does, requests the API server to delete the object.
- Propagator: maintains a DAG of owner-dependent relations among objects. Upon observing the deletion of an owner, it removes the owner from the DAG and adds the owner's dependents to the Dirty Queue.
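The Garbage Processor's core rule (an object becomes garbage once none of its listed owners exist) can be sketched with a toy in-memory store standing in for cluster state; all types and names here are illustrative:

```go
package main

import "fmt"

type UID string

type object struct {
	uid    UID
	owners []UID // owner UIDs from OwnerReferences
}

// store is a toy stand-in for the cluster state the processor consults.
type store map[UID]*object

// processDirtyItem returns true if the dequeued object should be deleted,
// i.e. it has owner references but none of the owners still exist. In the
// real design the deletion would be an API call with
// DeleteOptions.OrphanDependents=false.
func processDirtyItem(s store, obj *object) bool {
	if len(obj.owners) == 0 {
		return false // no owner references: never collected
	}
	for _, owner := range obj.owners {
		if _, exists := s[owner]; exists {
			return false // at least one owner still exists
		}
	}
	return true
}

func main() {
	s := store{"r1": {uid: "r1", owners: []UID{"d1"}}} // d1 already deleted
	fmt.Println(processDirtyItem(s, s["r1"]))          // prints "true": r1 is garbage
}
```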
Users may want to delete an owning object (e.g., a replica set) while orphaning the dependent objects (e.g., pods), that is, leaving the dependent objects untouched. We support such use cases by introducing the "orphan" finalizer. Finalizers are a generic API with uses beyond orphaning, so we first describe the generic finalizer framework, then the specific design of the "orphan" finalizer.
```go
type ObjectMeta struct {
	…
	Finalizers []string
}
```
ObjectMeta.Finalizers: List of finalizers that need to run before deleting the object. This list must be empty before the object is deleted from the registry. Each string in the list is an identifier for the responsible component that will remove the entry from the list. If the deletionTimestamp of the object is non-nil, entries in this list can only be removed. For safety reasons, updating finalizers requires special privileges. To enforce the admission rules, we will expose finalizers as a subresource and disallow directly changing finalizers when updating the main resource.
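The update rule stated above (once the deletionTimestamp is set, finalizer entries can only be removed) can be sketched as a validation check. `validateFinalizerUpdate` is a hypothetical helper, not the actual apiserver validation code:

```go
package main

import "fmt"

// validateFinalizerUpdate enforces the rule described in this proposal:
// after an object's deletionTimestamp is set, finalizers may be removed
// from the list but never added. (Illustrative sketch only.)
func validateFinalizerUpdate(deletionTimestampSet bool, old, updated []string) error {
	if !deletionTimestampSet {
		return nil // deletion has not started; this check does not apply
	}
	existing := make(map[string]bool, len(old))
	for _, f := range old {
		existing[f] = true
	}
	for _, f := range updated {
		if !existing[f] {
			return fmt.Errorf("cannot add finalizer %q after deletion has started", f)
		}
	}
	return nil
}

func main() {
	old := []string{"orphan"}
	fmt.Println(validateFinalizerUpdate(true, old, nil))                      // <nil>: removal is allowed
	fmt.Println(validateFinalizerUpdate(true, old, []string{"orphan", "x"})) // non-nil error: addition rejected
}
```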
Upon receiving a deletion request, the API server behaves as follows:

- If the `ObjectMeta.Finalizers` of the object being deleted is non-empty, it updates the DeletionTimestamp, but does not delete the object.
- If `ObjectMeta.Finalizers` is empty and options.GracePeriod is zero, it deletes the object. If options.GracePeriod is non-zero, it just updates the DeletionTimestamp.

```go
type DeleteOptions struct {
	…
	OrphanDependents bool
}
```
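The API server's deletion-handling rules can be summarized in a small decision function; `handleDelete` and its string outcomes are illustrative only, not the real handler:

```go
package main

import "fmt"

// handleDelete sketches the API server's decision on a deletion request,
// per the rules in this proposal. Inputs are simplified to the finalizer
// list and the requested grace period; the real handler also persists
// the resulting updates.
func handleDelete(finalizers []string, gracePeriodSeconds int64) string {
	if len(finalizers) > 0 {
		// Finalizers pending: record the deletion but keep the object.
		return "set DeletionTimestamp, keep object"
	}
	if gracePeriodSeconds == 0 {
		return "delete object"
	}
	// No finalizers, but a graceful deletion was requested.
	return "set DeletionTimestamp, keep object"
}

func main() {
	fmt.Println(handleDelete([]string{"orphan"}, 0)) // set DeletionTimestamp, keep object
	fmt.Println(handleDelete(nil, 0))                // delete object
	fmt.Println(handleDelete(nil, 30))               // set DeletionTimestamp, keep object
}
```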
DeleteOptions.OrphanDependents: allows a user to express whether the dependent objects should be orphaned. It defaults to true, because controllers before release 1.2 expect dependent objects to be orphaned.
We add a fourth component to the Garbage Collector, the "orphan" finalizer: upon observing that an object it is responsible for has a non-nil DeletionTimestamp and carries "orphan" in its Finalizers, it removes the object from the `OwnerReferences` of its dependents, and then removes "orphan" from the `ObjectMeta.Finalizers` of the object.

Controllers are responsible for adopting orphaned dependent resources. To do so, a controller checks whether matching dependent objects are missing an owner reference and, if so, adds a reference to itself.
There is a potential race between the "orphan" finalizer removing an owner reference and the controllers adding it back during adoption. Imagine this case: a user deletes an owning object and intends to orphan the dependent objects, so the GC removes the owner from the dependent object's OwnerReferences list; but the controller of the owner resource hasn't observed the deletion yet, so it adopts the dependent again and adds the reference back, resulting in the mistaken deletion of the dependent object. This race can be avoided by implementing Status.ObservedGeneration in all resources. Before updating the dependent object's OwnerReferences, the "orphan" finalizer checks the Status.ObservedGeneration of the owning object to ensure its controller has already observed the deletion.
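The race-avoidance check can be sketched as follows, assuming the owning object exposes Generation and Status.ObservedGeneration as described; the helper name is hypothetical:

```go
package main

import "fmt"

// canOrphanDependents sketches the check described above: the "orphan"
// finalizer only removes owner references once the owner's controller has
// observed the generation that carries the deletion, so the controller
// cannot race it by re-adopting the dependents. (Illustrative helper.)
func canOrphanDependents(generation, observedGeneration int64) bool {
	return observedGeneration >= generation
}

func main() {
	// Deletion bumped the owner's Generation to 5; the controller has only
	// observed 4, so orphaning must wait or the controller may re-adopt.
	fmt.Println(canOrphanDependents(5, 4)) // prints "false": wait
	fmt.Println(canOrphanDependents(5, 5)) // prints "true": safe to remove owner refs
}
```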
For the master, after upgrading to a version that supports cascading deletion, the OwnerReferences of existing objects remain empty, so the controllers will regard them as orphaned and start the adoption procedures. After the adoptions are done, server-side cascading will be effective for these existing objects.
For nodes, cascading deletion does not affect them.
For kubectl, we will keep kubectl's client-side cascading deletion logic for one more release.
This section presents an example of all components working together to enforce the cascading deletion or orphaning.
1. The user creates a deployment `D1`.
2. The Propagator of the GC observes the creation. It creates an entry for `D1` in the DAG.
3. The deployment controller observes the creation of `D1`. It creates the replica set `R1`, whose OwnerReferences field contains a reference to `D1`, and which has the "orphan" finalizer in its ObjectMeta.Finalizers map.
4. The Propagator observes the creation of `R1`. It creates an entry for `R1` in the DAG, with `D1` as its owner.
5. The replica set controller observes the creation of `R1` and creates Pods `P1`~`Pn`, all with `R1` in their OwnerReferences.
6. The Propagator observes the creation of `P1`~`Pn`. It creates entries for them in the DAG, with `R1` as their owner.

In case the user wants to cascadingly delete `D1`'s descendants, then:
1. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=false`. The API server checks if `D1` has the "orphan" finalizer in its Finalizers map; if so, it updates `D1` to remove the "orphan" finalizer. Then the API server deletes `D1`. This is allowed because `D1` now has an empty Finalizers map.
2. The Propagator of the GC observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent object, replica set `R1`, to the Dirty Queue.
3. The Garbage Processor picks up `R1` from the Dirty Queue. It finds that `R1` has an owner reference pointing to `D1`, and `D1` no longer exists, so it requests the API server to delete `R1`, with `DeleteOptions.OrphanDependents=false`. (The Garbage Processor should always set this field to false.)
4. The API server updates `R1` to remove the "orphan" finalizer if it's in `R1`'s Finalizers map. Then the API server deletes `R1`, as `R1` has an empty Finalizers map.
5. The Propagator observes the deletion of `R1`. It deletes `R1` from the DAG. It adds its dependent objects, Pods `P1`~`Pn`, to the Dirty Queue.
6. The Garbage Processor picks up `Px` (1 <= x <= n) from the Dirty Queue. It finds that `Px` has an owner reference pointing to `R1`, and `R1` no longer exists, so it requests the API server to delete `Px`, with `DeleteOptions.OrphanDependents=false`.
In case the user wants to orphan `D1`'s descendants, then:

1. The user deletes the deployment `D1`, with `DeleteOptions.OrphanDependents=true`.
2. The API server updates `D1`: it sets DeletionTimestamp=now and DeletionGracePeriodSeconds=0, increments the Generation by 1, and adds the "orphan" finalizer to ObjectMeta.Finalizers if it's not present yet. The API server does not delete `D1`, because its Finalizers map is not empty.
3. The deployment controller observes the update and updates `D1`'s ObservedGeneration. The deployment controller won't create more replica sets on `D1`'s behalf.
4. The "orphan" finalizer sees that `D1`'s ObservedGeneration has caught up with the deletion, so it updates `R1` to remove `D1` from its OwnerReferences. At last, it updates `D1`, removing itself from `D1`'s Finalizers map.
5. The API server handles the update of `D1`; because i) the DeletionTimestamp is non-nil, ii) the DeletionGracePeriodSeconds is zero, and iii) the last finalizer is removed from the Finalizers map, the API server deletes `D1`.
6. The Propagator observes the deletion of `D1`. It deletes `D1` from the DAG. It adds its dependent, replica set `R1`, to the Dirty Queue.
7. The Garbage Processor picks up `R1` from the Dirty Queue and skips it, because its OwnerReferences is empty.

The presented design will respect the setting in the deletion request of the last owner.
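The two branches of the walkthrough can be modeled end to end with a toy in-memory sketch that collapses the API server, Propagator, and Garbage Processor into a single function. All types and names here are illustrative, and finalizer bookkeeping is deliberately elided:

```go
package main

import "fmt"

type UID string

type object struct {
	uid    UID
	owners []UID // owner UIDs from OwnerReferences
}

type cluster map[UID]*object

// deleteWithPolicy deletes an owner and either cascades to its dependents
// (orphanDependents=false) or strips the owner reference and leaves them
// alive (orphanDependents=true), mirroring the walkthrough above.
func deleteWithPolicy(c cluster, uid UID, orphanDependents bool) {
	if orphanDependents {
		// "orphan" finalizer behavior: remove uid from dependents' owner lists.
		for _, obj := range c {
			kept := obj.owners[:0]
			for _, o := range obj.owners {
				if o != uid {
					kept = append(kept, o)
				}
			}
			obj.owners = kept
		}
	}
	delete(c, uid)
	// Garbage Processor behavior: repeatedly delete objects whose listed
	// owners have all disappeared, until nothing changes.
	for changed := true; changed; {
		changed = false
		for id, obj := range c {
			if len(obj.owners) == 0 {
				continue // no owner references: never collected
			}
			garbage := true
			for _, o := range obj.owners {
				if _, ok := c[o]; ok {
					garbage = false
					break
				}
			}
			if garbage {
				delete(c, id)
				changed = true
			}
		}
	}
}

// newCluster builds the walkthrough's state: deployment d1 owns replica
// set r1, which owns pod p1.
func newCluster() cluster {
	return cluster{
		"d1": {uid: "d1"},
		"r1": {uid: "r1", owners: []UID{"d1"}},
		"p1": {uid: "p1", owners: []UID{"r1"}},
	}
}

func main() {
	cascade := newCluster()
	deleteWithPolicy(cascade, "d1", false)
	fmt.Println(len(cascade)) // prints "0": r1 and p1 were collected

	orphan := newCluster()
	deleteWithPolicy(orphan, "d1", true)
	fmt.Println(len(orphan)) // prints "2": r1 (now ownerless) and p1 survive
}
```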
Propagating the grace period in a cascading deletion is a non-goal of this proposal. Nevertheless, the presented design can be extended to support it. A tentative solution is letting the garbage collector propagate the grace period when deleting dependent objects. To persist the grace period set by the user, the owning object should not be deleted from the registry until all its dependent objects are in the graceful deletion state. This could be ensured by introducing another finalizer, tentatively named the "populating graceful deletion" finalizer. Upon receiving the graceful deletion request, the API server adds this finalizer to the finalizers list of the owning object. Later, the GC will remove it when all dependents are in the graceful deletion state.
#25055 tracks this problem.
A tentative solution is introducing a "completing cascading deletion" finalizer, which will be added to the finalizers list of the owning object, and removed by the GC when all dependents are deleted. The user can watch for the deletion event of the owning object to ensure the cascading deletion process has completed.
```go
type DeleteOptions struct {
	…
	OrphanChildren bool
}
```
DeleteOptions.OrphanChildren: allows a user to express whether the child objects should be orphaned.
```go
type ObjectMeta struct {
	...
	ParentReferences []ObjectReference
}
```
ObjectMeta.ParentReferences: links the resource to its parent resources. For example, a replica set `R` created by a deployment `D` should have an entry in ObjectMeta.ParentReferences pointing to `D`. The link should be set when the child object is created, and can be updated after creation.
```go
type Tombstone struct {
	unversioned.TypeMeta
	ObjectMeta
	UID types.UID
}
```
Tombstone: a tombstone is created when an object is deleted and the user requires the children to be orphaned. Tombstone.UID: the UID of the original object.
The only new component is the Garbage Collector, which consists of a scanner, a garbage processor, and a propagator.
Scanner:
Garbage Processor:
Propagator:
Storage: we should add a REST storage for Tombstones. The index should be UID rather than namespace/name.
API Server: when handling a deletion request, if DeleteOptions.OrphanChildren is true, then the API Server either creates a tombstone with TTL if the tombstone doesn't exist yet, or updates the TTL of the existing tombstone. The API Server deletes the object after the tombstone is created.
Controllers: when creating child objects, controllers need to fill up their ObjectMeta.ParentReferences field. Objects that don’t have a parent should have the namespace object as the parent.
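The namespace-fallback rule for parents might be sketched as follows; `parentReferences` and the simplified `ObjectReference` type are illustrative, not part of this design's actual API:

```go
package main

import "fmt"

// Simplified stand-in for the ObjectReference used by ParentReferences.
type ObjectReference struct {
	Kind string
	Name string
}

// parentReferences sketches the rule above: a child keeps its explicit
// parents; an object created without any parent falls back to its
// namespace object as the parent. (Illustrative helper.)
func parentReferences(explicit []ObjectReference, namespace string) []ObjectReference {
	if len(explicit) > 0 {
		return explicit
	}
	return []ObjectReference{{Kind: "Namespace", Name: namespace}}
}

func main() {
	fmt.Println(parentReferences(nil, "default")) // prints "[{Namespace default}]"
	fmt.Println(parentReferences([]ObjectReference{{Kind: "Deployment", Name: "d1"}}, "default"))
}
```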
The main difference between the two designs is when to update the ParentReferences. In design #1, because a tombstone is created to indicate "orphaning" is desired, the updates to ParentReferences can be deferred until the deletion of the tombstone. In design #2, the updates need to be done before the parent object is deleted from the registry.
In case the garbage collector mistakenly deletes objects, we should provide a mechanism to stop the garbage collector and restore the objects.
We will add a `--enable-garbage-collector` flag to the controller manager binary to indicate whether the garbage collector should be enabled. An admin can stop the garbage collector in a running cluster by restarting the kube-controller-manager with `--enable-garbage-collector=false`.
Restoring mistakenly deleted objects
States should be stored in etcd. All components should remain stateless.
A preliminary design
This is a generic design for “undoing a deletion”, not specific to undoing cascading deletion.
- Add an `/archive` sub-resource to every resource; it's used to store the spec of deleted objects.
- Add a `kubectl restore` command, which takes a resource/name pair as input, creates the object with the spec stored in the `/archive`, and deletes the archived object.