Kubernetes Stateful Deployments
A solution for stateful, yet interchangeable, workloads.

The Challenge
Let us consider the situation where we need to run a stateful Kubernetes workload on a cluster; the standard solution is to run it as a StatefulSet.
StatefulSet is the workload API object used to manage stateful applications.
Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of their Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods that replace any that have failed.
— StatefulSets
It is the StatefulSet’s spec.volumeClaimTemplates that creates a PersistentVolumeClaim, and thus a PersistentVolume, for each Pod.
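For illustration, a minimal StatefulSet using spec.volumeClaimTemplates might look like the following sketch (name, image, and size are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example
spec:
  serviceName: example
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: nginx:1.25
          volumeMounts:
            - name: data
              mountPath: /data
  # Creates a PersistentVolumeClaim per Pod: data-example-0, data-example-1, ...
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```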
StatefulSets have particular scaling behavior; essentially, they scale up and down one Pod at a time:
- For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
- When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
- Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
- Before a Pod is terminated, all of its successors must be completely shutdown.
— StatefulSets
Let us now consider the situation where the Pods in our stateful Kubernetes workload are interchangeable and we want the workload to scale more like a Deployment, i.e., multiple Pods can be created or terminated simultaneously. One specific example of this scenario is a workload that uses a disk cache.
One possible alternative to using StatefulSets for this particular scenario is to use a Deployment where Pods use the Node’s local ephemeral storage.
Nodes have local ephemeral storage, backed by locally-attached writeable devices or, sometimes, by RAM. “Ephemeral” means that there is no long-term guarantee about durability.
Pods use ephemeral local storage for scratch space, caching, and for logs. The kubelet can provide scratch space to Pods using local ephemeral storage to mount emptyDir volumes into containers.
— Managing Resources for Containers
To ensure the Pod has sufficient ephemeral-storage to operate, we can use resource requests and limits.
You can use ephemeral-storage for managing local ephemeral storage. Each Container of a Pod can specify one or more of the following:
- spec.containers[].resources.limits.ephemeral-storage
- spec.containers[].resources.requests.ephemeral-storage
— Managing Resources for Containers
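Putting these together, a sketch of such a Deployment, with an emptyDir volume for the cache and ephemeral-storage requests and limits (all names and sizes illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cache-example
  template:
    metadata:
      labels:
        app: cache-example
    spec:
      containers:
        - name: cache-example
          image: nginx:1.25
          resources:
            requests:
              ephemeral-storage: 2Gi
            limits:
              ephemeral-storage: 4Gi
          volumeMounts:
            - name: cache
              mountPath: /cache
      volumes:
        # Scratch space backed by the Node’s local ephemeral storage
        - name: cache
          emptyDir:
            sizeLimit: 4Gi
```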
The challenge with this solution, however, is that we run into bin-packing problems, e.g., we end up with Nodes that have sufficient ephemeral-storage but too little CPU (or vice versa).
Another, more specific, challenge with this solution arises if one wants to take advantage of Kubernetes Volume Snapshots: a VolumeSnapshot is created from a PersistentVolumeClaim, and a Deployment using only emptyDir volumes has no PersistentVolumeClaims to snapshot.
In Kubernetes, a VolumeSnapshot represents a snapshot of a volume on a storage system.
— Volume Snapshots
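For reference, a sketch of requesting a snapshot of a PersistentVolumeClaim (class and claim names illustrative):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: cache-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class
  source:
    # The snapshot source references an existing PersistentVolumeClaim;
    # an emptyDir volume cannot be snapshotted
    persistentVolumeClaimName: cache-example-pvc
```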
If only the Kubernetes Deployment spec supported something like the StatefulSet’s spec.volumeClaimTemplates; it doesn’t.
A Solution
With the solution in place (explained below), we can use a Deployment such as the following for our workload.
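A sketch of such a Deployment follows; the annotation keys, volume.example.com/name and volume.example.com/size, are illustrative placeholders rather than the keys used in the sample code:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cache-example
  template:
    metadata:
      labels:
        app: cache-example
      annotations:
        volume.example.com/name: cache # illustrative: names the Pod volume to replace
        volume.example.com/size: 4Gi   # illustrative: size of the PersistentVolumeClaim to create
    spec:
      containers:
        - name: cache-example
          image: nginx:1.25
          volumeMounts:
            - name: cache
              mountPath: /cache
      volumes:
        # The webhook replaces this emptyDir with a reference to the
        # PersistentVolumeClaim it creates for the Pod
        - name: cache
          emptyDir: {}
```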
Things to observe:
- We introduce two Pod annotations: one specifies a volume name and the other its size
- The solution creates a PersistentVolumeClaim unique to the Deployment’s Pod, sized per the volume size annotation; it is also labeled with the key managed and the value true
- It then updates the Pod volume whose name matches the volume name annotation to reference the PersistentVolumeClaim, here replacing the emptyDir with the PersistentVolumeClaim (see the patch sketch after this list)
- It also updates the Pod’s labels with the key volume-claim/name, the value being the PersistentVolumeClaim’s name
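Conceptually, the webhook’s mutation amounts to a JSON Patch along these lines (shown here in YAML; the claim name is illustrative, and the target volume is assumed to be at index 0):

```yaml
- op: replace
  path: /spec/volumes/0 # assumes the named volume is the Pod’s first volume
  value:
    name: cache
    persistentVolumeClaim:
      claimName: cache-example-pvc # illustrative generated claim name
- op: add
  path: /metadata/labels/volume-claim~1name # "/" in the label key is escaped as "~1" in JSON Patch
  value: cache-example-pvc
```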
In this solution, we create a mutating admission webhook to perform these operations.
Admission webhooks are HTTP callbacks that receive admission requests and do something with them. You can define two types of admission webhooks, validating admission webhook and mutating admission webhook. Mutating admission webhooks are invoked first, and can modify objects sent to the API server to enforce custom defaults. After all object modifications are complete, and after the incoming object is validated by the API server, validating admission webhooks are invoked and can reject requests to enforce custom policies.
— Dynamic Admission Control
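Registering such a webhook for Pod creation might look like the following sketch (service name, namespace, and path are illustrative; the caBundle is elided):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: volume-claim-webhook # illustrative name
webhooks:
  - name: volume-claim-webhook.example.com
    admissionReviewVersions: ["v1"]
    # This webhook has side effects (it creates PersistentVolumeClaims);
    # NoneOnDryRun promises they are suppressed for dry-run requests
    sideEffects: NoneOnDryRun
    clientConfig:
      service:
        name: volume-claim-webhook
        namespace: default
        path: /mutate
      # caBundle: <elided>
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```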
The sample code for this mutating admission webhook builds on the article Kubernetes Dynamic Admission Control by Example; the logic is in the file app/app.js.
One bit of extra complexity, however, is that this mutating admission webhook has side effects, i.e., creating the PersistentVolumeClaim.
Webhooks typically operate only on the content of the AdmissionReview sent to them. Some webhooks, however, make out-of-band changes as part of processing admission requests.
Webhooks that make out-of-band changes (“side effects”) must also have a reconciliation mechanism (like a controller) that periodically determines the actual state of the world, and adjusts the out-of-band data modified by the admission webhook to reflect reality. This is because a call to an admission webhook does not guarantee the admitted object will be persisted as is, or at all. Later webhooks can modify the content of the object, a conflict could be encountered while writing to storage, or the server could power off before persisting the object.
— Dynamic Admission Control
As indicated, we also create a controller that performs two operations:
- If the mutating admission webhook creates a PersistentVolumeClaim but the Pod is never persisted, the controller deletes the PersistentVolumeClaim
- It also watches for Pod deletions and deletes the associated PersistentVolumeClaims created by the mutating admission webhook
Both operations can be accomplished with a CronJob that periodically deletes any PersistentVolumeClaims created by the mutating admission webhook that have no associated Pod; a minimal sketch follows, with a full implementation left to the reader.
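A minimal sketch of such a CronJob, assuming kubectl is available in the container image, the ServiceAccount has RBAC to list Pods and list/delete PersistentVolumeClaims, and the webhook labels its claims with managed=true (names, image, and schedule are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: volume-claim-cleanup # illustrative name
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: volume-claim-cleanup
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest # illustrative image providing kubectl
              command:
                - /bin/sh
                - -c
                - |
                  # For each PersistentVolumeClaim created by the webhook (label managed=true),
                  # delete it if no Pod references it via the volume-claim/name label
                  for pvc in $(kubectl get pvc -l managed=true -o name | cut -d/ -f2); do
                    if [ -z "$(kubectl get pods -l volume-claim/name="$pvc" -o name)" ]; then
                      kubectl delete pvc "$pvc"
                    fi
                  done
```

A production implementation would also want to skip claims created within the last few minutes, so as not to race the webhook before its Pod is persisted.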