Vertical Pod Autoscaling
Warning
This is an optional beta feature and is subject to change.
This feature was introduced in Hydrolix v5.7.4.
HDX Autoscaler supports two types of scaling: horizontal scaling (adding or removing pod replicas) and vertical scaling (adjusting resource limits per pod). For information about horizontal scaling, see HDX Autoscaler with Prometheus.
Vertical Pod Autoscaling (VPA) automatically adjusts CPU, memory, and ephemeral storage resource limits for pods based on observed usage, reducing the need for manual resource tuning and helping prevent out-of-memory (OOM) events.
When VPA is enabled, HDX Autoscaler monitors actual resource consumption and dynamically adjusts pod limits within the configured ranges. This complements horizontal pod autoscaling by optimizing resource allocation per pod rather than scaling the number of pods.
When to use VPA⚓︎
- Variable workloads: Workloads with changing resource patterns over time, where static limits may be too restrictive during peaks or wasteful during low-usage periods.
- Preventing OOM events: Components that occasionally exceed memory limits, causing out-of-memory crashes. VPA automatically increases limits based on observed high-water marks.
- New deployments: When resource requirements are unknown, VPA discovers appropriate limits through observation rather than requiring upfront capacity planning.
- Reducing manual tuning: Eliminates the need for operators to continuously monitor and adjust resource allocations as workload characteristics evolve.
- Cost optimization: Balances availability with efficiency by maintaining headroom above usage without excessive over-provisioning.
VPA works best for workloads where vertical scaling (more resources per pod) is more appropriate than horizontal scaling (more pods). For example, memory-intensive components like intake-head, intake-indexer, merge-peer, or reaper may benefit from VPA to handle varying data volumes without manual intervention.
Vertical and horizontal scaling comparison⚓︎
Vertical scaling (VPA) and horizontal scaling (HPA) address different aspects of resource management and work together rather than competing.
Horizontal scaling adds or removes pod replicas based on workload metrics like CPU usage or request queue depth. This distributes work across more instances and provides redundancy. Horizontal scaling responds relatively quickly (seconds to minutes) and is ideal for handling variable traffic patterns or workloads that can be distributed across multiple instances.
Vertical scaling adjusts the CPU, memory, and storage limits for individual pods based on observed resource consumption. This optimizes the resources allocated to each pod instance. Vertical scaling operates on longer timescales (hours to days) to avoid disruption from frequent pod restarts and focuses on right-sizing resources rather than distributing load.
Working together⚓︎
Both scaling approaches can be configured simultaneously. Horizontal scaling handles short-term load fluctuations by adding pods, while vertical scaling adjusts per-pod resources over time based on sustained usage patterns. The different timescales prevent conflicts: horizontal scaling reacts to immediate demand, while vertical scaling gradually optimizes resource allocation.
For example, a deployment might use horizontal scaling to add replicas during traffic spikes and vertical scaling to ensure each replica has appropriate memory limits based on observed usage patterns. The combination provides both immediate load response and long-term resource optimization.
Configuration⚓︎
VPA is configured in the hdxscalers block with resource-specific parameters. Each scaler defines the resource type to manage (CPU, memory, or storage) and the allowed range of values.
Key parameters⚓︎
| Field | Description | Required |
|---|---|---|
| `deployment` | Kubernetes Deployment to scale | Yes |
| `container` | Name of the autoscaled container inside the Deployment | Yes |
| `resource` | Resource to manage (`cpu`, `memory`, or `storage`) | Yes |
| `range` | Allowed resource limit range (for example, `500m-2000m` or `2Gi-16Gi`) | Yes |
| `start` | Initial request/limit used when no VPA annotation exists | No |
Default values⚓︎
These default values apply to VPA scaling behavior:
| Setting | Default Value |
|---|---|
| Headroom | 20% |
| Headroom threshold | 50% |
| Horizon | 7d |
| Cool up/down | 15s |
| CPU step | 250m |
| Memory step | 128MiB |
| Storage step | 1MiB |
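The headroom, horizon, and cool up/down settings correspond to the per-scaler parameters headroom_percent, highwater_horizon_duration, cool_up_seconds, and cool_down_seconds referenced later on this page. The snippet below is a minimal sketch of overriding those defaults for a single scaler; the nesting of the hdxscalers block and the value formats are illustrative assumptions rather than a verbatim schema.

```yaml
hdxscalers:
  # Illustrative override of the default scaling behavior for one scaler.
  # Field names appear elsewhere on this page; nesting and value formats are assumed.
  - deployment: merge-peer           # example target Deployment
    container: merge-peer            # assumed container name, for illustration
    resource: memory
    range: 1Gi-8Gi
    headroom_percent: 25             # default: 20
    highwater_horizon_duration: 7d   # default: 7d
    cool_up_seconds: 30              # default: 15
    cool_down_seconds: 60            # default: 15
```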
Example⚓︎
This example configures VPA for the intake-head deployment to scale memory between 256MiB and 4GiB:
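The sketch below shows one way this might look, assuming the hdxscalers block accepts a list of scaler entries with the fields described above; the container name and start value are placeholders.

```yaml
# Basic VPA memory configuration (illustrative sketch; the exact schema may differ)
hdxscalers:
  - deployment: intake-head
    container: intake-head      # assumed container name, for illustration
    resource: memory
    range: 256Mi-4Gi            # allowed memory limit range
    start: 512Mi                # initial limit when no VPA annotation exists
    headroom_percent: 20        # keep the limit 20% above observed usage
```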
The autoscaler monitors memory usage and adjusts the pod's memory limit within the specified range. The headroom_percent setting ensures the limit stays above observed usage to prevent OOM conditions.
Error-condition responses⚓︎
This feature was introduced in Hydrolix v5.8.
HDX Autoscaler can respond immediately to error conditions instead of waiting for normal trend-based adjustments. When an error condition is detected for a vertically autoscalable resource, HDX Autoscaler upscales immediately to prevent service disruption.
Monitored error conditions⚓︎
HDX Autoscaler watches for the following error conditions:
- Memory: Out-of-memory (OOM) kill events from the Kubernetes API
- CPU: Continuous CPU saturation for a configured time window
- Ephemeral storage: Eviction events indicating ephemeral storage exhaustion
When any of these conditions occur, the autoscaler increases the resource limit by applying the configured headroom_percent:
- Memory and storage: The new limit is the current limit plus the headroom amount. For example, with a 1000Mi limit and 20% headroom, the new limit becomes 1200Mi.
- CPU saturation: The new limit is the observed high-water mark plus the headroom amount. For example, with a 1000m high-water mark and 20% headroom of the current 1500m limit, the new limit becomes 1300m (1000m + 300m).
This difference means CPU saturation scales based on actual observed usage, while memory and storage scale based on the current limit to provide immediate relief.
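Written as formulas, where L is the current limit, HWM is the observed high-water mark, and h is headroom_percent divided by 100 (a restatement of the behavior above, not additional configuration):

$$
L_{\text{new}}^{\text{mem/storage}} = L \times (1 + h)
\qquad
L_{\text{new}}^{\text{cpu}} = \mathrm{HWM} + h \times L
$$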
CPU saturation configuration⚓︎
CPU saturation monitoring requires additional configuration parameters to define when the CPU is considered saturated:
| Parameter | Description | Default |
|---|---|---|
| `cpu_sat_min_unused_percent` | Threshold of unused CPU percentage that defines saturation. Lower values make the trigger more sensitive. | N/A |
| `cpu_saturation_seconds` | Duration that saturation must persist continuously before triggering an upscale. | N/A |
The CPU is considered saturated when the unused CPU percentage falls below cpu_sat_min_unused_percent for the entire duration specified by cpu_saturation_seconds.
Error-condition example⚓︎
This example configures VPA with error-condition responses for the intake-head deployment, including CPU saturation monitoring:
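The sketch below is illustrative; it assumes the same hdxscalers list layout as the earlier example, and the container name and ranges are placeholders rather than recommended values.

```yaml
# VPA with error-condition responses (illustrative sketch; the exact schema may differ)
hdxscalers:
  - deployment: intake-head
    container: intake-head            # assumed container name, for illustration
    resource: cpu
    range: 500m-4000m
    headroom_percent: 20
    cpu_sat_min_unused_percent: 20    # saturated when less than 20% of CPU is unused
    cpu_saturation_seconds: 30        # saturation must persist for 30 seconds
  - deployment: intake-head
    container: intake-head
    resource: memory
    range: 512Mi-4Gi
    headroom_percent: 20              # OOM kill events trigger an immediate upscale
  - deployment: intake-head
    container: intake-head
    resource: storage
    range: 1Gi-10Gi
    headroom_percent: 20              # eviction events trigger an immediate upscale
```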
In this configuration:
- CPU scaling: Triggers immediate upscale when CPU has less than 20% unused capacity for 30 consecutive seconds
- Memory scaling: Triggers immediate upscale on OOM events
- Storage scaling: Triggers immediate upscale on ephemeral storage eviction events
Upscaling process⚓︎
When HDX Autoscaler detects an error condition and increases resource limits, Kubernetes restarts the affected pods to apply the new limits. This restart is brief but results in momentary service interruption for that specific pod. If multiple replicas exist, the service remains available through other pods during the restart.
Rate limiting⚓︎
The cool_up_seconds parameter prevents rapid successive scaling. After an error-condition triggers an upscale, the autoscaler waits for the configured cool-up period before allowing another upscale, even if additional error conditions occur. This prevents resource limits from increasing too rapidly.
Maximum range behavior⚓︎
When a pod reaches the maximum value in the configured range (for example, 4Gi in a 512Mi-4Gi range), error-condition responses can no longer increase the limit. The autoscaler logs these events but can't scale further. Monitor autoscaler logs to detect when pods consistently reach their maximum limits, which may indicate the need to increase the range or investigate the underlying cause.
Scale-down behavior⚓︎
Error-condition responses only trigger immediate upscaling. Scale-down continues to use the normal trend-based approach controlled by cool_down_seconds and highwater_horizon_duration. The autoscaler evaluates resource usage over time and gradually reduces limits when sustained usage patterns show excess headroom, preventing unnecessary scaling churn.
Storage monitoring reliability⚓︎
Storage error-condition monitoring depends on how the cluster reports ephemeral storage eviction events. The feature requires specific Kubernetes event messages that may not be consistently available across all cluster configurations and metric exporters.
Test storage monitoring in a non-production environment before relying on it. Monitor HDX Autoscaler logs after configuring storage VPA to verify scaling events occur as expected. Memory and CPU error-condition monitoring are more reliable and recommended for production workloads. Consider storage monitoring as a supplementary protection rather than the primary safeguard against disk exhaustion.
Verify error-condition responses⚓︎
When error-condition responses trigger, HDX Autoscaler logs the scaling decision. Monitor these logs to verify the feature is working correctly.
OOM response example:
In this example, the autoscaler detected an OOM event and immediately increased the memory limit from 1152Mi to 1536Mi.
CPU saturation response example:
In this example, the autoscaler detected CPU saturation and increased the CPU limit from 1250m to 1500m. The hwm=1017m value in the log indicates the high-water mark of actual CPU usage.
Troubleshoot vertical autoscaling⚓︎
- No scaling on OOM: Verify the container name in the VPA configuration matches the container experiencing OOM events
- CPU saturation not triggering: Check that `cpu_sat_min_unused_percent` and `cpu_saturation_seconds` are configured appropriately for the workload
- Secondary containers: Error-condition responses only apply to containers explicitly configured in the VPA settings. Other containers in the same pod restart without scaling changes