09 Jun 2025 - v5.2.1
18 days ago by Martin A. Brown
Intake spill-handling improvements, merge health metrics, and Prometheus 3.2.1.
Notable new features
Performance and architectural improvements
- Redesigned spill handling for `intake-head` nodes by introducing a central `spill-controller` to coordinate spilled data distribution. Defaults to the `off` state.
- `Intake-head` nodes now poll the controller instead of independently scanning object storage, reducing operational costs and storage contention.
- Spilled data is now stored in the storage target defined for each partition instead of always using the default storage.
- Local buffering before spills reduces the number of catalog-add operations during normal operation.
- Removed or modified several tunables related to spill functionality:
  - Deleted entries in the tunables `intake_head_catalog_spill_config` and `intake_head_raw_data_spill_config`:
    - `max_concurrent_fetch`
    - `fetch_lock_expire_duration`
    - `num_partitions`
    - `empty_fetch_pause_duration`
  - Remaining entries:
    - `enabled` (default `false`)
    - `max_concurrent_spill` (default 20)
    - `max_attempts_spill` (default 5)
  - Globally deprecated:
    - `spill_locks_cleanup_enabled`
    - `spill_locks_cleanup_schedule`
- These tunables are ignored if still present in the HydrolixCluster spec. References to these tunables can be safely removed from the operator configuration.
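To make the before/after shape of the spill config concrete, here is a minimal sketch (the dict layout and helper name are illustrative assumptions; only the key names come from these notes) that prunes the deleted entries from an existing config:

```python
# Key names deleted in v5.2.1, per the release notes above.
DELETED_KEYS = {
    "max_concurrent_fetch",
    "fetch_lock_expire_duration",
    "num_partitions",
    "empty_fetch_pause_duration",
}

def prune_spill_config(config: dict) -> dict:
    """Drop entries removed in v5.2.1, keeping the remaining tunables."""
    return {k: v for k, v in config.items() if k not in DELETED_KEYS}

old = {
    "enabled": True,
    "max_concurrent_spill": 20,
    "max_attempts_spill": 5,
    "max_concurrent_fetch": 8,   # deleted in v5.2.1
    "num_partitions": 16,        # deleted in v5.2.1
}
print(prune_spill_config(old))
# {'enabled': True, 'max_concurrent_spill': 20, 'max_attempts_spill': 5}
```

Leftover keys are harmless (the operator ignores them), but pruning keeps the spec readable.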
Added Merge Health metrics and endpoints
- Added Merge Health calculation to measure partition merge efficiency.
- Introduced:
- DirectHistogram: A custom Prometheus histogram for externally calculated bucket counts.
- AdminService: Centralized service for calculating and caching merge efficiencies.
- Metric Reporter: Background process that periodically updates Prometheus metrics from AdminService data.
- Exposed data through:
  - HTTP endpoints:
    - `/admin/efficiencies`
    - `/admin/efficiency/{project_id}`
    - `/admin/efficiency/{project_id}/{table_id}`
  - Prometheus metrics:
    - `efficiency`
    - `ideal_partition_count`
    - `actual_partition_count`
    - `partition_distribution`
- Metrics are labeled by:
- Project ID
- Project name
- Table ID
- Table name
- Merge pool
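The release notes don't define the efficiency formula, but given the exported metric names, a plausible reading is the ratio of the ideal to the actual partition count. The sketch below is an illustration of that assumption, not the shipped calculation:

```python
# Assumed model: efficiency compares ideal vs. actual partition counts,
# clamped to [0, 1]. A table fragmented into many small partitions that
# could merge into few ideal ones scores low.
def merge_efficiency(ideal_partition_count: int, actual_partition_count: int) -> float:
    if actual_partition_count <= 0:
        return 1.0  # no partitions to merge counts as fully efficient
    return min(1.0, ideal_partition_count / actual_partition_count)

print(merge_efficiency(10, 40))  # 0.25: heavy fragmentation, merges needed
print(merge_efficiency(10, 10))  # 1.0: already at the ideal count
```

Under this model, a low `efficiency` value for a merge pool signals a backlog of mergeable partitions.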
Prometheus upgrade
- Prometheus has been upgraded from version 2.55.1 to 3.2.1.
Breaking Changes
Support for CrunchyData Postgres Removed
Support for CrunchyData Postgres has been fully removed from the hdx-operator. Hydrolix now uses CloudNativePG (CNPG) instead.
To continue using an existing Crunchy cluster, set `catalog_db_host` manually before upgrading.
Upgrade
Upgrade on GKE
```shell
kubectl apply -f "https://www.hydrolix.io/operator/v5.2.1/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&gcp-storage-sa=${GCP_STORAGE_SA}"
```
Upgrade on EKS
```shell
kubectl apply -f "https://www.hydrolix.io/operator/v5.2.1/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&aws-storage-role=${AWS_STORAGE_ROLE}"
```
Upgrade on LKE
```shell
kubectl apply -f "https://www.hydrolix.io/operator/v5.2.1/operator-resources?namespace=$HDX_KUBERNETES_NAMESPACE"
```
Changelog
General
- Upgraded Go module dependencies to improve security and compatibility:
  - `golang.org/x/crypto` v0.35.0 -> v0.36.0
  - `golang.org/x/net` v0.33.0 -> v0.38.0
  - `golang.org/x/sync` v0.11.0 -> v0.12.0
  - `golang.org/x/sys` v0.30.0 -> v0.31.0
  - `golang.org/x/term` v0.29.0 -> v0.30.0
  - `golang.org/x/text` v0.22.0 -> v0.23.0
API
- Standardized audit models and added two new fields, `resource_type` and `user_id`, for easier filtering of audit records.
- Improved search flexibility in the UI and API to support advanced filtering by `user_id`, `resource_id`, and `resource_type`.
- Added new Service Account RBAC roles for more precise permission control. You can assign any combination of these roles:
  - `view_serviceaccount`: View all service accounts only.
  - `delete_serviceaccount`: Delete service accounts only.
  - `tokens_serviceaccount`: Generate access tokens only.
  - `add_serviceaccount`: Create new service accounts only.
- Added a `super_user` ability to enable or disable users without removing their accounts from the database. Uses the Keycloak `enabled` attribute.
  - A user with `disabled` status retains their `emailVerified` status and role information.
- Removed unsupported endpoints that returned `405` errors and corrected OpenAPI schemas.
- Reduced unnecessary audit log entries by limiting job status updates to active jobs, cutting down on noisy or no-op records.
Cluster Operations
- Added new Traefik tunables to control CORS settings and customize response headers. These tunables allow configuration of HTTP middleware behavior based on Traefik header settings.
  - `traefik_service_cors_headers`: Set allowed origins, methods, headers, and credential settings for CORS behavior in Traefik.
  - `traefik_service_custom_response_headers`: Set custom HTTP response headers for Traefik services.
Core
- Added ClickHouse v24.4 functions:
  - `seriesDecomposeSTL`
  - `arrayRandomSample`
  - `fromDaysSinceYearZero`
  - `toDaysSinceYearZero`
  - `byteSwap`
  - `formatQuery`
  - `formatQueryOrNull`
  - `formatQuerySingleLine`
  - `formatQuerySingleLineOrNull`
- Upgraded turbine core and third-party libraries:
  - `openssl` 3.3.2 -> 3.5 LTS
  - `bison` 3.7.1 -> 3.8
  - `double-conversion` 3.1.5 -> 3.3.1
  - `fmt` 9.1.0 -> 11.1.4
  - `spdlog` 1.9.2 -> 1.15.2
  - `zlib` 1.2.11 -> 1.3.1
  - `lz4` 1.9.2 -> 1.10.0
  - `snappy` 1.1.9 -> 1.2.2
  - `zstd` 1.5.6 -> 1.5.7
  - `absl-cpp` 20240722.0 -> 20250127.1
  - `c-ares` 1.32.2 -> 1.34.4
  - `cxxopts` 2.2.1 -> 3.2.0
  - `libpq` 16.5 -> 17.4
  - `libpqxx` 7.9.2 -> 7.10.1
  - `libiconv` 1.16 -> 1.18
  - `libxml2` 2.9.10 -> 2.14.1
  - `libcurl` 7.73.0 -> 8.13.0
  - `nlohmann_json` 3.11.2 -> 3.11.3
  - `protobuf` 3.17.1 -> 30.2
  - `readerwriterqueue` 1.0.5 -> 1.0.6
  - `google-cloud-cpp` 1.30.1 -> 2.36.0
  - `pugixml` 1.12.1 -> 1.15
  - `zookeeper-client-c` 3.7.1 -> 3.9.3
Data
- Upgraded Golang crypto to address an SSH connection vulnerability: `golang.org/x/crypto` v0.31.0 -> v0.35.0.
- Addressed a Tokio broadcast channel vulnerability (RUSTSEC-2025-0023) where unsafe cloning of `Send` but non-`Sync` types could lead to unsound behavior.
  - Updated Tokio from `1.43.0` to `1.44.2`.
- Changed the `merge-controller` deployment strategy to `Recreate` to avoid duplicate controllers during upgrades.
  - The `merge-controller` now fails to start if another `merge-controller` or `merge-head` pod is already running.
- Updated periodic job scheduling to support both 5-field and 6-field crontab formats, improving compatibility with traditional Unix cron syntax.
  - 5-field schedules (Minute Hour Day Month Day-of-Week) are now accepted by automatically inserting `0` for seconds.
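The normalization described above can be sketched in a few lines (function name is illustrative; Hydrolix's actual parser is not shown here):

```python
# Accept both 5-field (traditional Unix cron) and 6-field (with seconds)
# crontab expressions by prepending "0" seconds to 5-field entries, so a
# single 6-field parser handles both formats.
def normalize_crontab(expr: str) -> str:
    fields = expr.split()
    if len(fields) == 5:          # Minute Hour Day Month Day-of-Week
        fields.insert(0, "0")     # fire at second 0 of the scheduled minute
    if len(fields) != 6:
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    return " ".join(fields)

print(normalize_crontab("0 0 * * *"))     # "0 0 0 * * *" (daily at midnight)
print(normalize_crontab("30 0 2 1 * *"))  # already 6-field, unchanged
```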
- Added support for scheduling `partition-cleaner` using a new cron-style tunable, `partition_cleaner_schedule`.
  - This change prevents excessive object store API calls when no cleanup is needed.
  - The default schedule is `0 0 * * *`, which runs the job once daily at midnight local time.
- Reduced the number of buckets in the HTTP response duration histogram to simplify metrics and reduce overhead.
- Improved logging for periodic delete commands to show the number of files and amount of storage that would be deleted during dry runs.
- Improved handling of empty partitions. Empty summary partitions are now noted and skipped.
  - Empty raw partitions are noted and skipped, except those with `PRESERVE_LOCAL` or `HDX_INDEXER_DEBUG` set to `true`.
  - This prevents noisy logs when Hydrolix finds empty partitions.
Intake
- Improved bucket promotion for small ingests by adding creation and modification tracking to ingestion buckets.
- Buckets now become merge candidates based on age or activity, ensuring timely merges even when partition volume is low.
- Added support for the `hdx_query_max_bytes_before_external_group_by` setting, allowing fine control over when external grouping logic is applied during summary merges. This value is now passed as an environment variable to intake nodes and included in applicable queries when set.
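The decision the setting controls amounts to a threshold check. The sketch below illustrates that idea only; the function name and the "0 means unset" convention are assumptions, and the real spilling happens inside the query engine's aggregation path:

```python
# Illustrative threshold check: once the estimated in-memory GROUP BY state
# exceeds the configured byte limit, switch to external (disk-backed)
# aggregation. A limit of 0 is treated here as "unset" (never spill).
def use_external_group_by(estimated_bytes: int, max_bytes: int) -> bool:
    if max_bytes <= 0:
        return False  # unset: keep all aggregation state in memory
    return estimated_bytes > max_bytes

print(use_external_group_by(2_000_000_000, 1_000_000_000))  # True: spill
print(use_external_group_by(500_000_000, 1_000_000_000))    # False: in memory
```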
Merge
- Introduced `AdaptivePartitionSource`, a `PartitionSource` implementation that acts as an internal catalog query controller. This reduces unnecessary catalog queries during partition merges. Partition queries now track efficiency using the ratio of new to duplicate partitions and apply cooldowns based on recent results.
  - Prevents busy loops when no mergeable partitions are available.
  - Supported cooldown strategies:
    - `Exponential` (default): longer cooldown with lower efficiency. Default 2× factor between 0–60s.
    - `Noop`: no cooldown.
    - `Static`: fixed cooldown.
  - Supports configuration overrides via `/etc/merge/adaptive_config_override.toml`.
Operator
- Improved ACME certificate generation by adding pre-validation checks for readiness.
  - The `acme-renewal` job now uses BusyBox for lightweight HTTP checks through Traefik before requesting certificates.
  - Also updates `wait-for-dns.sh` to confirm DNS resolves to the correct external IP, supporting CNAMEs.
- Added the `metadata` tunable for custom labels and annotations. It propagates user-defined metadata to all Kubernetes workloads managed by the Hydrolix Operator: Deployments, StatefulSets, DaemonSets, and their pods.
  - Only the `labels` and `annotations` keys are considered.
  - Changes to `metadata` trigger restarts for all active workloads. Verify it's safe to restart before making changes.
  - Custom keys are validated to avoid conflicts with internal labels and must meet Kubernetes naming standards.
  - Invalid `metadata` entries are ignored without alerts.
- Replaced Redpanda Kafka ingest with direct HTTP ingest through `intake-head`.
  - Added rollback support to revert to an earlier version for any issues with this release. The rollback regenerates the `hydro.logs` pool using Kafka and Redpanda.
  - Before regenerating the Kafka-peer-based `hydro.logs`, delete the Redpanda PersistentVolumeClaim (PVC) to avoid retaining stale topic data.
- Updated OpenSSL to address vulnerabilities:
- 0.10.71 -> 0.10.72
- 0.9.106 -> 0.9.107
- Added `monitor_request_timestamp` to `hydro.monitor` records to capture the original time a request was made.
  - This change improves detection of delayed or dropped requests, especially when retries occur after HTTP 429 responses.
- Improved the `hdx-scaler` service:
  - Dynamic configuration changes now apply without requiring restarts.
  - Fixed an issue where setting `target_value: 0` could cause errors. The scaler now uses the minimum number of replicas when `target_value` is zero.
  - Added support for autoscaling deployments that aren't linked to Kubernetes Services. Supported deployment types include:
    - `alter-head`
    - `alter-peer`
    - `autoingest`
    - `batch-head`
    - `batch-peer`
    - `kafka-peer` (pooled)
    - `merge-head`
    - `merge-peer` (pooled)
    - `query-peer` (pooled)
    - `reaper`
    - `stream-peer` (pooled)
  - Improved metrics handling. Metrics with Not-a-Number (`NaN`) values are now treated as zero to prevent errors and keep scaling behavior consistent.
  - Added support for scaling and targeting specific pools.
UI
- Added a confirmation modal to the UI when cancelling an alter job.
- Updated Next.js to version 14.2.26 to fix an `x-middleware-subrequest-id` leak vulnerability.
- The transforms UI now uses React virtual tables. This and other changes allow sortable and resizable column headers, support for displaying much larger tables of data, column filters, and a full-page-width layout option.
- Improved security in the UI by preventing user-supplied regular expressions from causing Regular Expression Denial of Service (ReDoS) attacks. The `components-kit` page now uses a safer regex handling method.
- Added support for coverage settings in the UI transform creation template.
Bug fixes
Hydrolix engine
- Fixed a bug in an optimization related to `ORDER BY LIMIT`. It now returns the full set of rows expected.
UI
- The Column Analysis tab now retains selected columns when switching tabs or screens.
- The Validate button in the Transforms UI is no longer hidden when using Safari.
Merge and Data Lifecycle
- The `memory_coefficient` setting and summary `enable` flags had been erroneously disabled. They're now re-enabled to ensure normal operation.
Core
- Fixed a bug introduced by an LRU cache update that caused malformed path reads from presigned URLs. Presigned URL queries through `turbine_url` now return correct results without file name errors.
- Fixed an issue where LRU cache culling didn't reliably activate in some environments. The culling logic now triggers reliably to prevent storage exhaustion.