09 Jun 2025 - v5.3.0

Autoscaler can scale pods to a minimum, independent Prometheus operator

New features in 5.3.0

Scale pods to minimal with autoscaler

Services can now dynamically shrink to zero pods, cutting idle costs.

  • Use precision to choose how many decimals to keep when rounding the average-to-target ratio. Smaller values round to zero sooner, so scale-to-zero triggers more often.
  • Replica counts now use the deployment name/alias, not the app label, to fix cross-service scaling.
  • The cool-down period is respected after configuration changes or scaler restarts, preventing sudden swings.
  • Added logic to grow back from zero, so a service can rise above zero pods when load returns.
  • See Scale Your Cluster for more details.
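
The effect of precision on scale-to-zero can be sketched in shell. This is an illustration only: it assumes the ratio is the average divided by the target, the function name rounded_ratio and the sample numbers are hypothetical, and the real autoscaler's calculation may involve more than this rounding step.

```shell
# Sketch: round the average/target ratio to `precision` decimal places.
# rounded_ratio and its arguments are hypothetical names for illustration.
rounded_ratio() {
  avg="$1"; target="$2"; precision="$3"
  # LC_ALL=C keeps "." as the decimal separator regardless of locale.
  LC_ALL=C awk -v a="$avg" -v t="$target" -v p="$precision" \
    'BEGIN { printf("%." p "f\n", a / t) }'
}

# An average of 4 against a target of 100 gives a raw ratio of 0.04.
rounded_ratio 4 100 2   # prints 0.04: two decimals keep the ratio above zero
rounded_ratio 4 100 1   # prints 0.0: one decimal rounds the ratio to zero
```

With the same load, the lower precision is the one that reaches zero, which matches the note that smaller values trigger scale-to-zero more often.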

Enable an independent Prometheus operator in Hydrolix

  • Added support for ServiceMonitor to enable an independent Prometheus operator. The Hydrolix Prometheus integration can also be disabled as needed.

GKE

kubectl apply -f "https://www.hydrolix.io/operator/v5.3.0/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&gcp-storage-sa=${GCP_STORAGE_SA}"

EKS

kubectl apply -f "https://www.hydrolix.io/operator/v5.3.0/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&aws-storage-role=${AWS_STORAGE_ROLE}"

LKE

kubectl apply -f "https://www.hydrolix.io/operator/v5.3.0/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}"
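
The commands above expect the variables in the URL to be set first. A minimal sketch for the GKE variant follows; the namespace and service account values are placeholders (assumptions) and must be replaced with your own. The EKS variant uses AWS_STORAGE_ROLE instead of GCP_STORAGE_SA, and the LKE variant needs only the namespace.

```shell
# Placeholder values (hypothetical); replace with your own.
HDX_KUBERNETES_NAMESPACE="hydrolix"
GCP_STORAGE_SA="hdx-storage@example-project.iam.gserviceaccount.com"

# Build the operator resources URL for v5.3.0 (GKE form, as shown above).
OPERATOR_URL="https://www.hydrolix.io/operator/v5.3.0/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&gcp-storage-sa=${GCP_STORAGE_SA}"

# Print the command for review instead of running it directly.
echo "kubectl apply -f \"${OPERATOR_URL}\""
```

Printing the command first makes it easy to spot an empty variable in the URL before anything is applied to the cluster.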

Changelog

Updates

These changes include version upgrades and internal dependency bumps.

  • Adopted new packages to address a vulnerability: CVE-2025-27789.
  • Made changes to the following packages:
    • Removed next-pwa
    • axios v1.7.7 -> v1.8.2, which avoids possible SSRF and credential leakage when using absolute URLs
    • re2 v1.21.4 -> re2js v1.1.0
    • jest-worker v26.2.1 -> v27.4.5
    • babel v7.25.x -> v7.27.1 (multiple sub-packages)
    • babel/compat-data v7.25.7 -> v7.27.2
    • babel/parser v7.25.6 -> v7.27.2
    • react-virtualized v9.22.5 -> v9.22.6
    • recharts v2.12.7 -> v2.15.3
    • eslint-plugin-jsx-a11y v6.7.1 -> v6.10.2
    • react-select v5.6.1 -> v5.10.1
    • eslint-config-next v13.5.6 -> v13.5.11
  • Upgraded the ring Rust library to address a possible denial of service vulnerability. See CVE-2025-4432.
  • Updated the following libraries to address a vulnerability where strncpy functions didn't properly handle null-terminated strings. See snprintf(3) for more details.
    • catch2 from 2.13.8 to 3.8.1
    • google-cloud-cpp from 1.30.1 to 2.17.0
    • protobuf from 3.17.1 to 6.30.2

Improvements

These changes improve behavior, resilience, or usability across components.

API

  • Expanded the list of names forbidden for user projects to include Hydrolix and ClickHouse internal project names.

Intake

  • Introduced merge-peer graceful shutdown. Added tracking of disappearing merge-peers, as well as graceful shutdown, to the merge-controller.
  • Introduced adaptive memory coefficient computation into merge-controller. This is a step toward obviating the table-level setting memory_coefficient.
  • Added missing configuration support for auto values in intake-rs, including table_revision and transform_id.
    These fields now populate correctly when auto is specified in the table configuration, which simplifies configuration and reduces setup errors.
  • Simplified and removed summary metrics in intake-head to reduce memory pressure on Prometheus.
    • Removed hdx_sink_partition_rows_summary.
    • Removed hdx_upload_obj_store_duration_ns.
    • Reduced cardinality of hdx_sink_bucket_maint_duration_ns by using only base pod labels.
    • Reduced cardinality of hdx_upload_process_write_result_duration_ns by using only base pod labels.
  • Introduced support for pool-level memory limit settings in merge-controller. Resource limits can be set globally, per-project, and per-table.
  • Added bucket_duration metric to track merge bucket closure timing.
    The new bucket_duration metric in merge-controller tracks how long buckets remain open before closing. Its basis label indicates the reason for the closure:
    • full
    • idle_ttl
    • age_ttl
    • segment_ttl
    The older expired_buckets metric was removed.
  • Added counters to merge-controller tracking work progress. New metrics are partitions_dispatched and candidates_dispatched.
  • Added histograms to merge-controller: partition_per_candidate captures the count of partitions for each merge candidate, and candidate_mem_size captures the estimated memory size of dispatched candidates.
  • Added support for reading an optional configuration file in merge-controller managed by the operator. This enables operational reconfiguration of the merge-controller with per-cluster settings.
  • Removed the limit derived from available CPU count for MAX_OUTSTANDING_REQUESTS. Also removed constraints on ACCEPT_DATA_TIMEOUT for all ingestion services. These variables are now controlled solely by tunables.
  • Introduced a merge feature that allows download of candidate partitions for local execution. The feature must be enabled with the new tunable merge_download_partitions_enabled.

Core

  • Improved resilience when encountering corrupt partitions during a query. The new behavior skips corrupted blocks, drops rows read from affected blocks, and resynchronizes on block boundaries. Errors will continue to be returned to merge systems when encountering corrupt blocks.
  • You can now set custom column and row delimiters when using ClickHouse dictionaries. Supports the following formats:
    • CustomSeparated
    • CustomSeparatedWithNames
    • CustomSeparatedWithNamesAndTypes
    • See Custom Dictionaries for more information.
  • Fixed a segfault that occurred when a cancellation thread released telem::SpanScope while a query thread was still running. The query thread now releases the resource.
  • Deduplicated different declarations of the HdxQueryInfo struct. Previously, the declaration used was determined during the linking phase of the build.
  • Fixed incorrect handling of summary table SQL expressions using AS to create column aliases. By converting computed alias columns to ClickHouse identifiers, summary table construction occurs correctly. This fixes errors like Code: 47. DB::Exception: Missing columns:.
  • Corrected a build-time SHA mismatch by always using exactly 8 characters of the SHA. Previously, a spurious mismatch appended the unnecessarily frightening string -dirty to the version output returned under some error conditions.
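
To see why a fixed length matters, consider comparing the same commit hash abbreviated to different lengths. The sketch below shows one plausible way such a mismatch can arise; short_sha, the sample hash, and the 7-character abbreviation are hypothetical and are not taken from the Hydrolix build scripts.

```shell
# Truncate a commit SHA to exactly 8 characters (hypothetical helper).
short_sha() {
  printf '%s\n' "$1" | cut -c1-8
}

full_sha="3f9a1c7e5b2d4f60a8c1e2d3b4a59687c0d1e2f3"   # sample value
embedded="$(short_sha "$full_sha")"                    # SHA recorded at build time
seven="$(printf '%s\n' "$full_sha" | cut -c1-7)"       # a 7-character abbreviation

# Different lengths never compare equal, even for the same commit.
[ "$embedded" = "$seven" ] || echo "spurious mismatch: $embedded vs $seven"
# Normalizing both sides to 8 characters makes the comparison reliable.
[ "$embedded" = "$(short_sha "$full_sha")" ] && echo "match: $embedded"
```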

Operator

  • Added an anomaly detection tenant, which is off by default but supports configuration. Introduced the new tunable hdx_anomaly_detection.
  • Fixed a startup looping bug in the in-cluster Superset visualization tool.
  • Introduced the database connection pooling tool pgbouncer at version 1.24.1 into the cluster.
  • Corrected a logic error that disabled basic authentication on certain endpoints when unified authentication was disabled. Setting unified_auth: false no longer allows unauthenticated access to these endpoints.
  • Disabled the MySQL listener on tcp/9004 by default. Individual clusters may still enable the plaintext MySQL query interface by setting tunables.
  • Added fallback logic for incoming connections to the service IP on tcp/9444. If chproxy is unavailable, queries are passed directly to the query-head. This means that Grafana installations can use the same public URL without change.
  • Provided a Hydrolix cluster resource object validator for use with Kubernetes tooling. This allows detection of incorrectly spelled tunables and other misconfigurations.
  • Disabled unified auth for in-cluster Grafana to work around a constant 401 response interaction with Google OAuth. Adjusted the development scale profile and allowed developers to suppress the TLS requirement by respecting the existing pg_ssl_mode tunable.
  • Set the intake_head_raw_data_spill_config.enabled tunable to accept both string and boolean values, to better support its use in all versions of Hydrolix.
    API resources like tables and transforms can be modified and queried. Defaults to off. See Scale Your Cluster for more details.
  • Ensured that cool_down_seconds is transmitted to the hdxscaler Kubernetes ConfigMap. This corrects a condition in which the cluster was scaling down more rapidly than expected.

UI

  • Improved Search on the Data page to match against both project and table names, which makes resources easier to find on clusters with many projects. Previously, only table names were searched.
  • Introduced a safety measure to prevent deletion of the default transform. If a user accidentally tries to delete the current default transform on a table, a modal dialog prevents this and advises setting another transform as the default before deleting.
  • Improved management of query options and switched to project- and table-level query options API calls. Now, it's possible to remove a query option from a project or table in the UI.
  • Added a field displaying the stream ingestion URL to the Table > stream settings sidebar.
  • Added nine new Linode entries to the Regions selector on the New bucket page. New regions: de-fra-1, us-ord-10, us-sea-9, us-iad-10, in-bom-1, jp-tyo-1, sg-sin-1, gb-lon-1, and au-mel-1.
  • Added a preview of output columns on the New transform page. This allows users to see the output structure before finalizing a transform.