Descheduler

This feature was introduced in Hydrolix version 5.11.

The Hydrolix descheduler periodically evaluates already-running pods and evicts those on underutilized nodes so the scheduler can repack them. For an overview of how the scheduler and descheduler work together as a cost-optimization loop, see Scheduler and Descheduler.

Enable the descheduler

Add the descheduler section to your HydrolixCluster spec:

Enable the Descheduler
spec:
  descheduler:
    enabled: true

Descheduler defaults

When enabled with no other configuration, the descheduler uses the following defaults:

  • Strategy: HighNodeUtilization
  • Thresholds: 40% CPU, 40% Memory
  • Interval: every 5 minutes

The descheduler computes node utilization from the sum of CPU and memory requests declared by pods scheduled to each node, not from real-time consumption. A threshold of 40 means 40% of the node's allocatable CPU or memory is reserved by pod requests, regardless of what those pods are actually consuming. Tune thresholds against the Allocated resources section of kubectl describe node, not kubectl top or Prometheus consumption metrics. For more detail, see the upstream descheduler documentation.

The descheduler only evaluates and evicts pods in the same Kubernetes namespace as your Hydrolix deployment. Pods in other namespaces are never touched.

Configuration reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| descheduler.enabled | bool | false | Enable the descheduler |
| descheduler.strategy | string | HighNodeUtilization | HighNodeUtilization (consolidate) or LowNodeUtilization (spread) |
| descheduler.thresholds | dict | {cpu: 40, memory: 40} | Nodes below all thresholds are candidates for eviction |
| descheduler.target_thresholds | dict | none | Upper bound for LowNodeUtilization only. Ignored for HighNodeUtilization |
| descheduler.max_unavailable | int or string | 1 | Default maxUnavailable value applied to every PodDisruptionBudget the operator creates automatically |
| scale.<service>.pdb_max_unavailable | int or string | unset | Override descheduler.max_unavailable for a specific service |
| descheduler.protected_services | list | [] | Additional services (by app label) to shield from eviction. Each listed service receives a PDB with maxUnavailable: 0, which blocks all voluntary disruptions (including node drains and upgrades), not only descheduler evictions. The operator and descheduler are shielded this way automatically |
| descheduler.descheduling_interval | string | 5m | How often the descheduler evaluates node utilization |
| descheduler.image_version | string | v0.35.0 | Descheduler image tag |
| descheduler.evictor | dict | See the following example | Passthrough arguments for the DefaultEvictor plugin |

The default evictor configuration disables the PodsWithLocalStorage protection and enables the PodsWithPVC protection:

Default Evictor Configuration
evictor:
  podProtections:
    defaultDisabled:
      - PodsWithLocalStorage
    extraEnabled:
      - PodsWithPVC

To change which pods the descheduler can evict, replace the descheduler.evictor block in your HydrolixCluster spec. The operator passes your block straight through to the descheduler's DefaultEvictor plugin, so any option the plugin accepts is valid. Common reasons to override:

  • Allow eviction of pods with PVC-backed storage by removing PodsWithPVC from extraEnabled.
  • Protect pods that use local storage on the node by removing PodsWithLocalStorage from defaultDisabled.
  • Gate eviction by pod priority class by adding a priorityThreshold.
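As an illustration of the third case, the following sketch keeps the default pod protections and adds a priorityThreshold. It assumes that a supplied evictor block fully replaces the default one, which is why the podProtections entries are restated. The value 10000 is an arbitrary example, not a recommendation.

```yaml
spec:
  descheduler:
    enabled: true
    evictor:
      # Restated defaults: assumed necessary because this block
      # replaces the default evictor configuration.
      podProtections:
        defaultDisabled:
          - PodsWithLocalStorage
        extraEnabled:
          - PodsWithPVC
      # Illustrative value: only pods whose priority is below this
      # threshold are considered for eviction.
      priorityThreshold:
        value: 10000
```

The upstream DefaultEvictor plugin also accepts a priority class name in place of a numeric value (priorityThreshold.name). Check the upstream documentation for the image version you run before relying on either form.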

Automatic PodDisruptionBudget creation

When you enable the descheduler, the operator automatically creates a PodDisruptionBudget (PDB) for every service with more than one replica. Each PDB uses:

  • maxUnavailable: 1
  • matchLabels: {app: <service-name>}

To change the maxUnavailable value, set descheduler.max_unavailable for a global default, or scale.<service>.pdb_max_unavailable for a specific service.

Services with a single replica don't receive a PDB. If the same service appears under multiple scale entries, only one PDB is created for it.
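A minimal sketch of combining the two settings, assuming the scale section sits alongside descheduler under spec. The service name query-peer and the replica count are illustrative only; substitute a service from your own deployment.

```yaml
spec:
  descheduler:
    enabled: true
    # Global default for every automatically created PDB.
    max_unavailable: 2
  scale:
    # Hypothetical service entry, shown for illustration.
    query-peer:
      replicas: 6
      # Per-service override: allow only one disrupted pod at a time.
      pdb_max_unavailable: 1
```

With this configuration, every multi-replica service would receive a PDB with maxUnavailable: 2, except the overridden service, which would receive maxUnavailable: 1.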

Autoscaled services and PDBs

A service whose replicas field is a range (for example, 1-5) is evaluated by its lower bound. If the lower bound is 1, no PDB is created, even if the service typically runs several replicas. Set the lower bound to 2 or above to receive PDB protection during eviction.
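For example, a range whose lower bound is 2 qualifies the service for a PDB. The service name below is an illustrative assumption, and the range uses the same form as the 1-5 example above.

```yaml
spec:
  scale:
    # Hypothetical service entry, shown for illustration.
    query-peer:
      # Lower bound of 2, so the operator creates a PDB for this service.
      replicas: 2-5
```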

The PDBs the operator creates protect your workloads during eviction: the descheduler never evicts a pod when doing so would violate an active PDB. This guarantees that no more than the configured maxUnavailable pods of a given service are disrupted at the same time.

To express max_unavailable as a percentage rather than an integer, use a quoted string:

Percentage-Based Max Unavailable
spec:
  descheduler:
    enabled: true
    max_unavailable: "10%"

Rounding behavior

Kubernetes rounds up percentage-based maxUnavailable values when applying them to a pod count. For example, "10%" of a service with 5 replicas resolves to 1 pod, not 0, so the descheduler can disrupt 20% of the service rather than the literal 10%. At small replica counts, prefer integer values to control eviction precisely.
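The rounding can be worked through for a few replica counts. In this sketch, n is the number of replicas the PDB covers and p is the configured percentage; both symbols are introduced here only for illustration.

```latex
\[
\text{allowed disruptions} = \left\lceil \frac{p}{100} \times n \right\rceil
\]
\[
p = 10,\ n = 5:\quad \lceil 0.5 \rceil = 1 \quad (1/5 = 20\%\ \text{of the service})
\]
\[
p = 10,\ n = 20:\quad \lceil 2.0 \rceil = 2 \quad (2/20 = 10\%\ \text{of the service})
\]
\[
p = 10,\ n = 25:\quad \lceil 2.5 \rceil = 3 \quad (3/25 = 12\%\ \text{of the service})
\]
```

The effective percentage equals the configured one only when the product is a whole number, and the gap is largest at small replica counts.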

Eviction strategies

The descheduler offers two eviction strategies. HighNodeUtilization (the default) evicts pods from underutilized nodes to consolidate workloads onto fewer nodes. Choose this strategy for cost-driven autoscaling clusters. LowNodeUtilization evicts pods from overutilized nodes to balance workload across the cluster. Choose this strategy when even resource distribution matters more than minimizing node count.

HighNodeUtilization

Evicts pods from underutilized nodes to consolidate workloads onto fewer nodes. This is the recommended strategy for cost savings.

HighNodeUtilization Strategy
spec:
  descheduler:
    enabled: true
    strategy: HighNodeUtilization
    thresholds:
      cpu: 40
      memory: 40

A node is underutilized when every resource is below its threshold. In this example, a node qualifies when both its requested CPU and its requested memory are below 40% of the node's allocatable capacity. Raise the thresholds to consolidate more aggressively: higher values make more nodes qualify as underutilized.

Choosing thresholds

Set thresholds high enough that multiple nodes become candidates for draining, but not so high that you evict from nodes whose requested allocation is already a meaningful fraction of capacity. Inspect the Allocated resources section of kubectl describe node to see what percentage of each node's CPU and memory is reserved by scheduled pods, and tune thresholds against that distribution.
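As a hypothetical illustration, suppose kubectl describe node shows that most nodes have 45% to 50% of their CPU and memory requested. With the default thresholds of 40, none of those nodes would qualify as underutilized. A sketch of a more aggressive setting for such a cluster might look like this. The value 55 is an assumption chosen for the example, not a recommended setting.

```yaml
spec:
  descheduler:
    enabled: true
    strategy: HighNodeUtilization
    thresholds:
      # Illustrative values: nodes with less than 55% of CPU and
      # less than 55% of memory requested become eviction candidates.
      cpu: 55
      memory: 55
```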

LowNodeUtilization

Evicts pods from overutilized nodes to spread workloads across the cluster.

LowNodeUtilization Strategy
spec:
  descheduler:
    enabled: true
    strategy: LowNodeUtilization
    thresholds:
      cpu: 20
      memory: 20
    target_thresholds:
      cpu: 80
      memory: 80

A node is overutilized when any resource usage exceeds its target_thresholds value, and underutilized when all resource usages are below thresholds. In this example, the descheduler considers nodes for eviction when their CPU or memory usage exceeds 80%, and only evicts if at least one other node sits below 20% on both resources to receive the displaced pods. Nodes between 20% and 80% on both resources are considered well-utilized and aren't disturbed.

Pods protected from eviction

With the default configuration, the descheduler never evicts:

  • DaemonSet pods
  • Mirror pods
  • Pods in namespaces outside your deployment
  • Pods protected by a PodDisruptionBudget that would be violated
  • Pods with PVC-backed storage, because Hydrolix enables PodsWithPVC in the default descheduler.evictor config (override this by supplying your own evictor block)
  • The operator and descheduler pods, which always receive a PDB with maxUnavailable: 0
  • Any services listed in descheduler.protected_services, which also receive a PDB with maxUnavailable: 0
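A minimal sketch of shielding an additional service follows. The app label query-head is an illustrative assumption; use the app label of the service you want to protect.

```yaml
spec:
  descheduler:
    enabled: true
    protected_services:
      # Hypothetical app label, shown for illustration. The listed
      # service receives a PDB with maxUnavailable: 0.
      - query-head
```

Because maxUnavailable: 0 blocks all voluntary disruptions, including node drains and upgrades, keep this list short.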

Verify the descheduler

Watch descheduler evictions

Watch Descheduler Logs
kubectl logs -n <hydrolix-namespace> deployment/descheduler

The logs show which pods are evicted, which nodes they're evicted from, and how many evictions occur per cycle. The descheduler evaluates the cluster at the interval defined by descheduler.descheduling_interval (default 5m).
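To change how often the logs show an evaluation cycle, set the interval explicitly. The 15m value below is illustrative and follows the same duration format as the 5m default.

```yaml
spec:
  descheduler:
    enabled: true
    # Illustrative value: evaluate node utilization every 15 minutes
    # instead of the default 5 minutes.
    descheduling_interval: 15m
```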

Descheduler Prometheus metrics

The descheduler emits Prometheus metrics. Useful signals include counts of pods evicted, eviction errors, and strategy run duration. For the complete list, see the Kubernetes descheduler v0.35.0 metrics documentation.

Confirm cost savings at the node level

If the Kubernetes Cluster Autoscaler is configured, watch node count shrink over multiple descheduler cycles as pods consolidate and the autoscaler removes the resulting empty nodes:

Watch Node Count
kubectl get nodes

Disable the descheduler

Setting descheduler.enabled: false stops further evictions and removes the PodDisruptionBudgets the operator created automatically. Pods that were already evicted remain evicted; pods currently running continue to run.
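In the HydrolixCluster spec, that change looks like the following fragment, which mirrors the structure of the enable example:

```yaml
spec:
  descheduler:
    # Stops further evictions and removes the automatically created PDBs.
    enabled: false
```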

Troubleshooting

The descheduler runs but never evicts anything

Check, in order:

  1. No node crosses the configured thresholds. With HighNodeUtilization, a node qualifies only when its requested CPU and memory are both below the thresholds (default 40% each), so nothing is evicted if every node exceeds at least one threshold. With LowNodeUtilization, nothing is evicted if every node sits below the target thresholds, or if no node sits below thresholds to receive the displaced pods. Inspect the Allocated resources section of kubectl describe node to see how each node's requested CPU and memory compare against the configured thresholds.
  2. Too many services are fully protected. Services in descheduler.protected_services, plus the operator and descheduler, receive a PDB with maxUnavailable: 0, which blocks every eviction of their pods.
  3. Services with a replicas range starting at 1 receive no PDB. See Automatic PodDisruptionBudget creation for the lower-bound behavior.

Evictions are too aggressive

Lower descheduler.thresholds toward your cluster's typical request floor so fewer nodes qualify as underutilized. With HighNodeUtilization, raising the thresholds makes more nodes eligible for eviction, not fewer. To establish that baseline, compare the Requests percentages in the Allocated resources section of kubectl describe node output across your nodes.