Descheduler
This feature was introduced in Hydrolix version 5.11.
The Hydrolix descheduler periodically evaluates already-running pods and evicts those on underutilized nodes so the scheduler can repack them. For an overview of how the scheduler and descheduler work together as a cost-optimization loop, see Scheduler and Descheduler.
Enable the descheduler
Add the descheduler section to your HydrolixCluster spec:
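A minimal sketch of that section, assuming the descheduler block nests directly under spec as the descheduler.enabled field path in the configuration reference suggests:

```yaml
# Sketch only: placement under spec is assumed from the descheduler.enabled field path.
spec:
  descheduler:
    # Turns the descheduler on with the defaults described in the next section.
    enabled: true
```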
Descheduler defaults
When enabled with no other configuration, the descheduler uses the following defaults:
- Strategy: HighNodeUtilization
- Thresholds: 40% CPU, 40% Memory
- Interval: every 5 minutes
The descheduler computes node utilization from the sum of CPU and memory requests declared by pods scheduled to each node, not from real-time consumption. A threshold of 40 means 40% of the node's allocatable CPU or memory is reserved by pod requests, regardless of what those pods are actually consuming. Tune thresholds against the Allocated resources section of kubectl describe node, not kubectl top or Prometheus consumption metrics. For more detail, see the upstream descheduler documentation.
The descheduler only evaluates and evicts pods in the same Kubernetes namespace as your Hydrolix deployment. Pods in other namespaces are never touched.
Configuration reference
| Field | Type | Default | Description |
|---|---|---|---|
| descheduler.enabled | bool | false | Enable the descheduler |
| descheduler.strategy | string | HighNodeUtilization | HighNodeUtilization (consolidate) or LowNodeUtilization (spread) |
| descheduler.thresholds | dict | {cpu: 40, memory: 40} | Nodes below all thresholds are candidates for eviction |
| descheduler.target_thresholds | dict | none | Upper bound for LowNodeUtilization only. Ignored for HighNodeUtilization |
| descheduler.max_unavailable | int or string | 1 | Default maxUnavailable value applied to every PodDisruptionBudget the operator creates automatically |
| scale.<service>.pdb_max_unavailable | int or string | unset | Override descheduler.max_unavailable for a specific service |
| descheduler.protected_services | list | [] | Additional services (by app label) to shield from eviction. Each listed service receives a PDB with maxUnavailable: 0, which blocks all voluntary disruptions (including node drains and upgrades), not only descheduler evictions. The operator and descheduler are shielded this way automatically |
| descheduler.descheduling_interval | string | 5m | How often the descheduler evaluates node utilization |
| descheduler.image_version | string | v0.35.0 | Descheduler image tag |
| descheduler.evictor | dict | See the following example | Passthrough arguments for the DefaultEvictor plugin |
The default evictor configuration disables the PodsWithLocalStorage protection and enables the PodsWithPVC protection:
Default Evictor Configuration
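As a rough sketch, the default described above might look like the following. The PodsWithLocalStorage and PodsWithPVC values and the defaultDisabled and extraEnabled keys are named on this page. Grouping them under podProtections follows the upstream DefaultEvictor argument layout and is an assumption about how Hydrolix renders its default.

```yaml
spec:
  descheduler:
    evictor:
      # Assumed wrapper key, taken from the upstream DefaultEvictor arguments.
      podProtections:
        # Built-in protection switched off: pods using node-local storage can be evicted.
        defaultDisabled:
          - PodsWithLocalStorage
        # Extra protection switched on: pods with PVC-backed storage are not evicted.
        extraEnabled:
          - PodsWithPVC
```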
To change which pods the descheduler can evict, replace the descheduler.evictor block in your HydrolixCluster spec. The operator passes your block straight through to the descheduler's DefaultEvictor plugin, so any option the plugin accepts is valid. Common reasons to override:
- Allow eviction of pods with PVC-backed storage by removing PodsWithPVC from extraEnabled.
- Protect pods that use local storage on the node by removing PodsWithLocalStorage from defaultDisabled.
- Gate eviction by pod priority class by adding a priorityThreshold.
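The sketch below shows one possible override that combines two of these changes: it drops the PodsWithPVC protection and adds a priority gate. The podProtections wrapper and the priorityThreshold.name form come from the upstream DefaultEvictor arguments and are assumptions here, and the priority class name is hypothetical.

```yaml
spec:
  descheduler:
    evictor:
      podProtections:
        # Kept from the default: pods using node-local storage remain evictable.
        defaultDisabled:
          - PodsWithLocalStorage
        # extraEnabled is omitted, so the PodsWithPVC protection no longer applies.
      priorityThreshold:
        # Hypothetical priority class; upstream evicts only pods with a lower priority.
        name: hydrolix-critical
```

Because the block replaces the default rather than merging with it, restate any default entries you still want.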
Automatic PodDisruptionBudget creation
When you enable the descheduler, the operator automatically creates a PodDisruptionBudget (PDB) for every service with more than one replica. Each PDB uses:
- maxUnavailable: 1
- matchLabels: {app: <service-name>}
To change the maxUnavailable value, set descheduler.max_unavailable for a global default, or scale.<service>.pdb_max_unavailable for a specific service.
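For illustration, the following sketch sets both levels at once. It assumes scale is a mapping keyed by service name, as the scale.<service>.pdb_max_unavailable path suggests. The service name and replica count are made up.

```yaml
spec:
  descheduler:
    enabled: true
    # Global default for every automatically created PDB.
    max_unavailable: 2
  scale:
    # Illustrative service name and replica count.
    query-peer:
      replicas: 6
      # Overrides descheduler.max_unavailable for this service only.
      pdb_max_unavailable: 1
```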
Services with a single replica don't receive a PDB. If the same service appears under multiple scale entries, only one PDB is created for it.
Autoscaled services and PDBs
A service whose replicas field is a range (for example, 1-5) is evaluated by its lower bound. If the lower bound is 1, no PDB is created, even if the service typically runs several replicas. Set the lower bound to 2 or above to receive PDB protection during eviction.
The PDBs the operator creates protect your workloads during eviction: the descheduler never evicts a pod when doing so would violate an active PDB. This guarantees that no more than the configured maxUnavailable pods of a given service are disrupted at the same time.
To express max_unavailable as a percentage rather than an integer, use a quoted string:
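A minimal sketch, assuming the same spec.descheduler nesting used by the other fields. The 25% figure is only an example value.

```yaml
spec:
  descheduler:
    enabled: true
    # Percentage written as a quoted string rather than an integer.
    max_unavailable: "25%"
```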
Rounding behavior
Kubernetes rounds up percentage-based maxUnavailable values when applying them to a pod count. For example, "10%" of a service with 5 replicas resolves to 1 pod, not 0, so the descheduler can disrupt 20% of the service rather than the literal 10%. At small replica counts, prefer integer values to control eviction precisely.
Eviction strategies
The descheduler offers two eviction strategies. HighNodeUtilization (the default) evicts pods from underutilized nodes to consolidate workloads onto fewer nodes. Choose this strategy for cost-driven autoscaling clusters. LowNodeUtilization evicts pods from overutilized nodes to balance workload across the cluster. Choose this strategy when even resource distribution matters more than minimizing node count.
HighNodeUtilization
Evicts pods from underutilized nodes to consolidate workloads onto fewer nodes. This is the recommended strategy for cost savings.
HighNodeUtilization Strategy
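A sketch of a configuration that matches the 40% figures discussed in this section, assuming the fields nest under spec.descheduler as their paths in the configuration reference suggest:

```yaml
spec:
  descheduler:
    enabled: true
    strategy: HighNodeUtilization
    # A node is a consolidation candidate only when it is below both values.
    thresholds:
      cpu: 40
      memory: 40
```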
A node is underutilized when all resource usages are below their respective threshold. In this example, a node qualifies when both its CPU utilization and memory utilization are below 40%. Raise the thresholds to be more aggressive with consolidation.
Choosing thresholds
Set thresholds high enough that multiple nodes become candidates for draining, but not so high that you evict from nodes whose requested allocation is already a meaningful fraction of capacity. Inspect the Allocated resources section of kubectl describe node to see what percentage of each node's CPU and memory is reserved by scheduled pods, and tune thresholds against that distribution.
LowNodeUtilization
Evicts pods from overutilized nodes to spread workloads across the cluster.
LowNodeUtilization Strategy
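A sketch of a configuration consistent with the 20% and 80% figures discussed in this section, assuming the fields nest under spec.descheduler as their paths in the configuration reference suggest:

```yaml
spec:
  descheduler:
    enabled: true
    strategy: LowNodeUtilization
    # Nodes below both values count as underutilized and can receive pods.
    thresholds:
      cpu: 20
      memory: 20
    # Nodes above either value count as overutilized and are eviction sources.
    target_thresholds:
      cpu: 80
      memory: 80
```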
A node is overutilized when any resource usage exceeds its target_thresholds value, and underutilized when all resource usages are below thresholds. In this example, the descheduler considers nodes for eviction when their CPU or memory usage exceeds 80%, and only evicts if at least one other node sits below 20% on both resources to receive the displaced pods. Nodes between 20% and 80% on both resources are considered well-utilized and aren't disturbed.
Pods protected from eviction
Regardless of configuration, the descheduler never evicts:
- DaemonSet pods
- Mirror pods
- Pods in namespaces outside your deployment
- Pods protected by a PodDisruptionBudget that would be violated
- Pods with PVC-backed storage, because Hydrolix enables PodsWithPVC in the default descheduler.evictor config (override this by supplying your own evictor block)
- The operator and descheduler pods, which always receive a PDB with maxUnavailable: 0
- Any services listed in descheduler.protected_services, which also receive a PDB with maxUnavailable: 0
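As an illustration of the last item, a protected_services list might be written as follows. The two service names are placeholders for app labels in your own deployment, and the nesting under spec.descheduler is assumed from the field path.

```yaml
spec:
  descheduler:
    enabled: true
    protected_services:
      # Placeholder app labels; each listed service gets a PDB with maxUnavailable: 0.
      - query-head
      - intake-api
```

Because maxUnavailable: 0 also blocks node drains and upgrades for these services, keep the list short.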
Verify the descheduler
Watch descheduler evictions
Follow the descheduler pod's logs, for example with kubectl logs --follow in your deployment's namespace. The logs show which pods are evicted, which nodes they're evicted from, and how many evictions occur per cycle. The descheduler evaluates the cluster at the interval defined by descheduler.descheduling_interval (default 5m).
Descheduler Prometheus metrics
The descheduler emits Prometheus metrics. Useful signals include counts of pods evicted, eviction errors, and strategy run duration. For the complete list, see the Kubernetes descheduler v0.35.0 metrics documentation.
Confirm cost savings at the node level
If the Kubernetes Cluster Autoscaler is configured, watch the node count (for example, with kubectl get nodes --watch) shrink over multiple descheduler cycles as pods consolidate and the autoscaler removes the resulting empty nodes.
Disable the descheduler
Setting descheduler.enabled: false stops further evictions and removes the PodDisruptionBudgets the operator created automatically. Pods that were already evicted remain evicted; pods currently running continue to run.
Troubleshooting
The descheduler runs but never evicts anything
Check, in order:
- Utilizations lie within the thresholds. If every node sits above the HighNodeUtilization thresholds (default 40% CPU and 40% memory) or below the LowNodeUtilization target thresholds during evaluation, no nodes qualify for eviction. Inspect the Allocated resources section of kubectl describe node to see how each node's requested CPU and memory compare against the configured thresholds.
- Too many services are fully protected. Services in descheduler.protected_services, plus operator and descheduler, receive a PDB with maxUnavailable: 0, which blocks all eviction.
- Services with a replicas range starting at 1 receive no PDB. See Automatic PodDisruptionBudget creation for the lower-bound behavior.
Evictions are too aggressive
Lower descheduler.thresholds toward your cluster's typical request floor so fewer nodes qualify as underutilized. Raising the thresholds has the opposite effect and makes consolidation more aggressive. To establish that baseline, compare the Requests values in the Allocated resources section of kubectl describe node output across your nodes.