v6.0.12

Hardened the descheduler eviction policy, added retries for storage and certificate failures, and improved intake handling of token_list: null in stream settings.

This release contains bug fixes to Hydrolix v6.0. Refer to the release notes to see other notable feature announcements and information for this version.

Upgrade⚓︎

Don't skip minor versions when upgrading or downgrading

Skipping versions when upgrading or downgrading Hydrolix can result in database schema inconsistencies and cluster instability. Always upgrade or downgrade sequentially through each minor version.

Example:
Upgrade from 5.10.10 → 5.11.9 → 6.0.8, not 5.10.10 → 6.0.8.

Upgrade on GKE⚓︎

Upgrade on GKE
kubectl apply -f "https://www.hydrolix.io/operator/v6.0.12/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&gcp-storage-sa=${GCP_STORAGE_SA}"

Upgrade on EKS⚓︎

Upgrade on EKS
kubectl apply -f "https://www.hydrolix.io/operator/v6.0.12/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&aws-storage-role=${AWS_STORAGE_ROLE}"

Upgrade on LKE⚓︎

Upgrade on LKE
kubectl apply -f "https://www.hydrolix.io/operator/v6.0.12/operator-resources?namespace=$HDX_KUBERNETES_NAMESPACE"

Changelog⚓︎

Improvements⚓︎

Cluster operations improvements⚓︎

Updates to the descheduler eviction policy so it never evicts the descheduler or operator pods, and never causes involuntary downtime for single-replica workloads. The descheduler and operator pods are now protected from self-eviction through the DefaultEvictor label selector instead of a maxUnavailable: 0 Pod Disruption Budget (PDB). Nodes running descheduler or operator pods no longer block kubectl drain or cluster autoscaler scale-down. Single-replica workloads are never evicted. A new descheduler.min_replicas tunable (default 2) ensures pods whose owning workload has fewer than two replicas are skipped by descheduler runs. Set it to 0 to consider all pods regardless of replica count.

Bug fixes⚓︎

Cluster operations bug fixes⚓︎

Fixed an issue where a transient failure during initial certificate issuance on a freshly provisioned cluster could leave Traefik serving its self-signed default certificate for up to 24 hours. After a failed certificate check, the ACME client now retries after 15 minutes instead of waiting for the next daily check. Once a check succeeds, it returns to the configured interval.

Intake bug fixes⚓︎

Fixed a bug where a token_list: null value in stream settings failed to deserialize and aborted the entire configuration parse. token_list: null now matches the behavior of an empty list value.
Improved handling of Azure DNS responses for a removed storage error. It's now treated as a transient error and retried. Corrected error reporting for Azure DNS resolution failures and Google Cloud Storage no-such-bucket responses.