09 Jun 2025 - v5.2.1
18 days ago by Martin A. Brown
Intake spill-handling improvements, merge health metrics, and Prometheus 3.2.1.
Notable new features
Performance and architectural improvements
- Redesigned spill handling for `intake-head` nodes by introducing a central `spill-controller` to coordinate spilled data distribution. Defaults to the `off` state.
- `Intake-head` nodes now poll the controller instead of independently scanning object storage, reducing operational costs and storage contention.
- Spilled data is now stored in the storage target defined for each partition instead of always using the default storage.
- Local buffering before spills reduces the number of catalog-add operations during normal operation.
- Removed or modified several tunables related to spill functionality:
  - Deleted entries in the tunables `intake_head_catalog_spill_config` and `intake_head_raw_data_spill_config`:
    - `max_concurrent_fetch`
    - `fetch_lock_expire_duration`
    - `num_partitions`
    - `empty_fetch_pause_duration`
  - Remaining entries:
    - `enabled` (default `false`)
    - `max_concurrent_spill` (default 20)
    - `max_attempts_spill` (default 5)
  - Globally deprecated:
    - `spill_locks_cleanup_enabled`
    - `spill_locks_cleanup_schedule`
- These tunables are ignored if still present in the HydrolixCluster spec. References to these tunables can be safely removed from the operator configuration.
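To make the before/after shape of the spill config concrete, here is a minimal sketch (the dict layout and helper name are illustrative assumptions; only the key names come from these notes) that prunes the deleted entries from an existing config:

```python
# Key names deleted in v5.2.1, per the release notes above.
DELETED_KEYS = {
    "max_concurrent_fetch",
    "fetch_lock_expire_duration",
    "num_partitions",
    "empty_fetch_pause_duration",
}

def prune_spill_config(config: dict) -> dict:
    """Drop entries removed in v5.2.1, keeping the remaining tunables."""
    return {k: v for k, v in config.items() if k not in DELETED_KEYS}

old = {
    "enabled": True,
    "max_concurrent_spill": 20,
    "max_attempts_spill": 5,
    "max_concurrent_fetch": 8,   # deleted in v5.2.1
    "num_partitions": 16,        # deleted in v5.2.1
}
print(prune_spill_config(old))
# {'enabled': True, 'max_concurrent_spill': 20, 'max_attempts_spill': 5}
```

Leftover keys are harmless (the operator ignores them), but pruning keeps the spec readable.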
Added Merge Health metrics and endpoints
- Added Merge Health calculation to measure partition merge efficiency.
- Introduced:
- DirectHistogram: A custom Prometheus histogram for externally calculated bucket counts.
- AdminService: Centralized service for calculating and caching merge efficiencies.
- Metric Reporter: Background process that periodically updates Prometheus metrics from AdminService data.
- Exposed data through:
  - HTTP endpoints:
    - `/admin/efficiencies`
    - `/admin/efficiency/{project_id}`
    - `/admin/efficiency/{project_id}/{table_id}`
  - Prometheus metrics:
    - `efficiency`
    - `ideal_partition_count`
    - `actual_partition_count`
    - `partition_distribution`
- Metrics are labeled by:
- Project ID
- Project name
- Table ID
- Table name
- Merge pool
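The release notes don't define the efficiency formula, but given the exported metric names, a plausible reading is the ratio of the ideal to the actual partition count. The sketch below is an illustration of that assumption, not the shipped calculation:

```python
# Assumed model: efficiency compares ideal vs. actual partition counts,
# clamped to [0, 1]. A table fragmented into many small partitions that
# could merge into few ideal ones scores low.
def merge_efficiency(ideal_partition_count: int, actual_partition_count: int) -> float:
    if actual_partition_count <= 0:
        return 1.0  # no partitions to merge counts as fully efficient
    return min(1.0, ideal_partition_count / actual_partition_count)

print(merge_efficiency(10, 40))  # 0.25: heavy fragmentation, merges needed
print(merge_efficiency(10, 10))  # 1.0: already at the ideal count
```

Under this model, a low `efficiency` value for a merge pool signals a backlog of mergeable partitions.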
Prometheus upgrade
- Prometheus has been upgraded from version 2.55.1 to 3.2.1.
Breaking Changes
Support for CrunchyData Postgres Removed
Support for CrunchyData Postgres has been fully removed from the hdx-operator. Hydrolix now uses CloudNativePG (CNPG) instead.
To continue using an existing Crunchy cluster, set `catalog_db_host` manually before upgrading.
Upgrade
Upgrade on GKE
```shell
kubectl apply -f "https://www.hydrolix.io/operator/v5.2.1/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&gcp-storage-sa=${GCP_STORAGE_SA}"
```
Upgrade on EKS
```shell
kubectl apply -f "https://www.hydrolix.io/operator/v5.2.1/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&aws-storage-role=${AWS_STORAGE_ROLE}"
```
Upgrade on LKE
```shell
kubectl apply -f "https://www.hydrolix.io/operator/v5.2.1/operator-resources?namespace=$HDX_KUBERNETES_NAMESPACE"
```
Changelog
General
- Upgraded Go module dependencies to improve security and compatibility:
  - `golang.org/x/crypto` v0.35.0 -> v0.36.0
  - `golang.org/x/net` v0.33.0 -> v0.38.0
  - `golang.org/x/sync` v0.11.0 -> v0.12.0
  - `golang.org/x/sys` v0.30.0 -> v0.31.0
  - `golang.org/x/term` v0.29.0 -> v0.30.0
  - `golang.org/x/text` v0.22.0 -> v0.23.0
API
- Standardized audit models and added two new fields, `resource_type` and `user_id`, for easier filtering of audit records.
- Improved search flexibility in the UI and API to support advanced filtering by `user_id`, `resource_id`, and `resource_type`.
- Added new Service Account RBAC roles for more precise permission control. You can assign any combination of these roles:
  - `view_serviceaccount`: View all service accounts only.
  - `delete_serviceaccount`: Delete service accounts only.
  - `tokens_serviceaccount`: Generate access tokens only.
  - `add_serviceaccount`: Create new service accounts only.
- Added a `super_user` ability to enable or disable users without removing their accounts from the database. Uses the Keycloak `enabled` attribute.
  - A user with `disabled` status retains their `emailVerified` status and role information.
- Removed unsupported endpoints that returned `405` errors and corrected OpenAPI schemas.
- Reduced unnecessary audit log entries by limiting job status updates to active jobs, cutting down on noisy or no-op records.
Cluster Operations
- Added new Traefik tunables to control CORS settings and customize response headers. These tunables allow configuration of HTTP middleware behavior based on Traefik header settings.
  - `traefik_service_cors_headers`: Set allowed origins, methods, headers, and credential settings for CORS behavior in Traefik.
  - `traefik_service_custom_response_headers`: Set custom HTTP response headers for Traefik services.
Core
- Added ClickHouse v24.4 functions:
  - `seriesDecomposeSTL`
  - `arrayRandomSample`
  - `fromDaysSinceYearZero`
  - `toDaysSinceYearZero`
  - `byteSwap`
  - `formatQuery`
  - `formatQueryOrNull`
  - `formatQuerySingleLine`
  - `formatQuerySingleLineOrNull`
- Upgraded turbine core and third-party libraries:
  - `openssl` 3.3.2 -> 3.5 LTS
  - `bison` 3.7.1 -> 3.8
  - `double-conversion` 3.1.5 -> 3.3.1
  - `fmt` 9.1.0 -> 11.1.4
  - `spdlog` 1.9.2 -> 1.15.2
  - `zlib` 1.2.11 -> 1.3.1
  - `lz4` 1.9.2 -> 1.10.0
  - `snappy` 1.1.9 -> 1.2.2
  - `zstd` 1.5.6 -> 1.5.7
  - `absl-cpp` 20240722.0 -> 20250127.1
  - `c-ares` 1.32.2 -> 1.34.4
  - `cxxopts` 2.2.1 -> 3.2.0
  - `libpq` 16.5 -> 17.4
  - `libpqxx` 7.9.2 -> 7.10.1
  - `libiconv` 1.16 -> 1.18
  - `libxml2` 2.9.10 -> 2.14.1
  - `libcurl` 7.73.0 -> 8.13.0
  - `nlohmann_json` 3.11.2 -> 3.11.3
  - `protobuf` 3.17.1 -> 30.2
  - `readerwriterqueue` 1.0.5 -> 1.0.6
  - `google-cloud-cpp` 1.30.1 -> 2.36.0
  - `pugixml` 1.12.1 -> 1.15
  - `zookeeper-client-c` 3.7.1 -> 3.9.3
Data
- Upgraded Golang crypto to address an SSH connection vulnerability: `golang.org/x/crypto` v0.31.0 -> v0.35.0.
- Addressed a Tokio broadcast channel vulnerability (RUSTSEC-2025-0023) where unsafe cloning of `Send` but non-`Sync` types could lead to unsound behavior.
  - Updated Tokio from `1.43.0` to `1.44.2`.
- Changed the `merge-controller` deployment strategy to `Recreate` to avoid duplicate controllers during upgrades.
  - The `merge-controller` now fails to start if another `merge-controller` or `merge-head` pod is already running.
- Updated periodic job scheduling to support both 5-field and 6-field crontab formats, improving compatibility with traditional Unix cron syntax.
  - 5-field schedules (Minute Hour Day Month Day-of-Week) are now accepted by automatically inserting `0` for seconds.
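The normalization described above can be sketched in a few lines (function name is illustrative; Hydrolix's actual parser is not shown here):

```python
# Accept both 5-field (traditional Unix cron) and 6-field (with seconds)
# crontab expressions by prepending "0" seconds to 5-field entries, so a
# single 6-field parser handles both formats.
def normalize_crontab(expr: str) -> str:
    fields = expr.split()
    if len(fields) == 5:          # Minute Hour Day Month Day-of-Week
        fields.insert(0, "0")     # fire at second 0 of the scheduled minute
    if len(fields) != 6:
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    return " ".join(fields)

print(normalize_crontab("0 0 * * *"))     # "0 0 0 * * *" (daily at midnight)
print(normalize_crontab("30 0 2 1 * *"))  # already 6-field, unchanged
```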
- Added support for scheduling `partition-cleaner` using a new cron-style tunable, `partition_cleaner_schedule`.
  - This change prevents excessive object store API calls when no cleanup is needed.
  - The default schedule is `0 0 * * *`, which runs the job once daily at midnight local time.
- Reduced the number of buckets in the HTTP response duration histogram to simplify metrics and reduce overhead.
- Improved logging for periodic delete commands to show the number of files and amount of storage that would be deleted during dry runs.
- Improved handling of empty partitions. Empty summary partitions are now noted and skipped.
  - Empty raw partitions are noted and skipped, except those with `PRESERVE_LOCAL` or `HDX_INDEXER_DEBUG` set to `true`.
  - This prevents noisy logs when Hydrolix finds empty partitions.
Intake
- Improved bucket promotion for small ingests by adding creation and modification tracking to ingestion buckets.
- Buckets now become merge candidates based on age or activity, ensuring timely merges even when partition volume is low.
- Added support for the `hdx_query_max_bytes_before_external_group_by` setting, allowing fine control over when external grouping logic is applied during summary merges. This value is now passed as an environment variable to intake nodes and included in applicable queries when set.
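The decision the setting controls amounts to a threshold check. The sketch below illustrates that idea only; the function name and the "0 means unset" convention are assumptions, and the real spilling happens inside the query engine's aggregation path:

```python
# Illustrative threshold check: once the estimated in-memory GROUP BY state
# exceeds the configured byte limit, switch to external (disk-backed)
# aggregation. A limit of 0 is treated here as "unset" (never spill).
def use_external_group_by(estimated_bytes: int, max_bytes: int) -> bool:
    if max_bytes <= 0:
        return False  # unset: keep all aggregation state in memory
    return estimated_bytes > max_bytes

print(use_external_group_by(2_000_000_000, 1_000_000_000))  # True: spill
print(use_external_group_by(500_000_000, 1_000_000_000))    # False: in memory
```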
Merge
- Introduced `AdaptivePartitionSource`, a `PartitionSource` implementation that acts as an internal catalog query controller. This reduces unnecessary catalog queries during partition merges. Partition queries now track efficiency using the ratio of new to duplicate partitions and apply cooldowns based on recent results.
  - Prevents busy loops when no mergeable partitions are available.
  - Supported cooldown strategies:
    - `Exponential` (default): longer cooldown with lower efficiency. Default 2× factor between 0–60s.
    - `Noop`: no cooldown.
    - `Static`: fixed cooldown.
  - Supports configuration overrides via `/etc/merge/adaptive_config_override.toml`.
Operator
- Improved ACME certificate generation by adding pre-validation checks for readiness.
  - The `acme-renewal` job now uses BusyBox for lightweight HTTP checks through Traefik before requesting certificates.
  - Also updates `wait-for-dns.sh` to confirm DNS resolves to the correct external IP, supporting CNAMEs.
- Added the `metadata` tunable for custom labels and annotations. It propagates user-defined metadata to all Kubernetes workloads managed by the Hydrolix Operator: Deployments, StatefulSets, DaemonSets, and their pods.
  - Only the `labels` and `annotations` keys are considered.
  - Changes to `metadata` trigger restarts for all active workloads. Verify it's safe to restart before making changes.
  - Custom keys are validated to avoid conflicts with internal labels and must meet Kubernetes naming standards.
  - Invalid `metadata` entries are ignored without alerts.
- Replaced Redpanda Kafka ingest with direct HTTP ingest through `intake-head`.
  - Added rollback support to revert to an earlier version for any issues with this release. The rollback regenerates the `hydro.logs` pool using Kafka and Redpanda.
  - Before regenerating the Kafka-peer-based `hydro.logs`, delete the Redpanda PersistentVolumeClaim (PVC) to avoid retaining stale topic data.
- Updated OpenSSL to address vulnerabilities:
- 0.10.71 -> 0.10.72
- 0.9.106 -> 0.9.107
- Added `monitor_request_timestamp` to `hydro.monitor` records to capture the original time a request was made.
  - This change improves detection of delayed or dropped requests, especially when retries occur after HTTP 429 responses.
- Improved the `hdx-scaler` service:
  - Dynamic configuration changes now apply without requiring restarts.
  - Fixed an issue where setting `target_value: 0` could cause errors. The scaler now uses the minimum number of replicas when `target_value` is zero.
  - Added support for autoscaling deployments that aren't linked to Kubernetes Services. Supported deployment types include:
    - `alter-head`
    - `alter-peer`
    - `autoingest`
    - `batch-head`
    - `batch-peer`
    - `kafka-peer` (pooled)
    - `merge-head`
    - `merge-peer` (pooled)
    - `query-peer` (pooled)
    - `reaper`
    - `stream-peer` (pooled)
  - Improved metrics handling. Metrics with Not-a-Number (`NaN`) values are now treated as zero to prevent errors and keep scaling behavior consistent.
  - Added support for scaling and targeting specific pools.
UI
- Added a confirmation modal to the UI when cancelling an alter job.
- Updated Next.js to version 14.2.26 to fix an `x-middleware-subrequest-id` leak vulnerability.
- The transforms UI now uses React virtual tables. This and other changes allow sortable and resizable column headers, support for displaying much larger tables of data, column filters, and a full-page-width layout option.
- Improved security in the UI by preventing user-supplied regular expressions from causing Regular Expression Denial of Service (ReDoS) attacks. The `components-kit` page now uses a safer regex handling method.
- Added support for coverage settings in the UI transform creation template.
Bug fixes
Hydrolix engine
- Fixed a bug in an optimization related to `ORDER BY LIMIT`. It now returns the full set of rows expected.
UI
- The Column Analysis tab now retains selected columns when switching tabs or screens.
- The Validate button in the Transforms UI is no longer hidden when using Safari.
Merge and Data Lifecycle
- The `memory_coefficient` setting and summary `enable` flags had been erroneously disabled. They're now re-enabled to ensure normal operation.
Core
- Fixed a bug introduced by an LRU cache update that caused malformed path reads from presigned URLs. Presigned URL queries through `turbine_url` now return correct results without file name errors.
- Fixed an issue where LRU cache culling didn't reliably activate in some environments. The culling logic now triggers reliably to prevent storage exhaustion.