
v5.11.6

Notable new features

Query head pools

Clusters now support query head pools to allow definition of parallel, fully-isolated query infrastructure in a cluster. Earlier, all query pools and their clients shared the same query head. Now, both query head and query peer pools can be created and managed independently.

Query head pooling is supported for ClickHouse protocols. When the tunable query_head_pooling_enabled is enabled, incoming connections pass through a protocol-specific reverse proxy, which selects the query head pool based on the ClickHouse database parameter of the incoming connection. Client applications must be updated to send a database parameter to make use of this feature.

Routing rules in the HydrolixCluster spec direct connections to a specific query head pool.

The feature isn't enabled by default, but can be turned on with tunable query_head_pooling_enabled.

Autoscaler rewrite in Go

The Hydrolix Kubernetes autoscaler has been rewritten in Go for efficiency, flexibility, and observability improvements. It can completely replace the Python-based hdx-scaler, which remains the default option. Existing scaling behavior is preserved, and the hdxscalers configuration in the HydrolixCluster spec still works the same way.

New capabilities in hdx-scaler-go:

  • VPA for DaemonSets: Vertical pod autoscaling now supports DaemonSet workloads (set workload_kind: DaemonSet), enabling resource right-sizing for components like cAdvisor.
  • Suspended and dry run modes: Pause scaling without losing accumulated state (suspended: true), or observe what the scaler would do without applying changes (dry_run: true).
  • GOMEMLIMIT autoinjection: For Go services, the scaler can automatically set the GOMEMLIMIT environment variable as a percentage of the memory limit. Edit the VPAConfig custom resource directly to enable.
  • Validating webhooks: Custom Resource Definition (CRD) admission webhooks catch configuration errors at apply time.
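The GOMEMLIMIT bullet above describes setting the variable as a percentage of the memory limit. The arithmetic can be sketched as follows. This is an illustration only, not the code in hdx-scaler-go: the function names, the supported suffixes, and the round-down behavior are assumptions.

```python
# Sketch: derive a GOMEMLIMIT value from a Kubernetes memory limit.
# Function names and rounding are assumptions, not hdx-scaler-go internals.

_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory_limit(limit: str) -> int:
    """Convert a quantity such as '512Mi' or '2Gi' to bytes.

    Only whole numbers with binary suffixes (or plain byte counts)
    are handled in this sketch.
    """
    for suffix, factor in _BINARY_SUFFIXES.items():
        if limit.endswith(suffix):
            return int(limit[: -len(suffix)]) * factor
    return int(limit)


def gomemlimit_for(limit: str, percent: int) -> str:
    """Return a byte count usable as GOMEMLIMIT, rounded down."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return str(parse_memory_limit(limit) * percent // 100)


print(gomemlimit_for("2Gi", 90))
```

The Go runtime reads a plain number in GOMEMLIMIT as bytes, so no unit suffix is needed on the result.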

Enable the new scaler with spec.scale.hdx-scaler.replicas: 0 and spec.scale.hdx-scaler-go.replicas: 1 in the HydrolixCluster spec. The Go scaler doesn't include a terminal UI. Use kubectl get hpaconfig -o wide and Prometheus metrics on :27183/metrics instead.
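The two replica settings above can be expressed as a single fragment of the HydrolixCluster spec. The sketch below builds that fragment as a Python dictionary and prints it as JSON. The helper name is hypothetical, and the rollback values (Python scaler at 1 replica, Go scaler at 0) are an assumption made by mirroring the documented settings.

```python
import json


def scaler_switch_fragment(use_go_scaler: bool) -> dict:
    """Build the spec.scale fragment that selects one of the two scalers.

    use_go_scaler=True reproduces the documented settings. The False
    branch mirrors them and is an assumption of this sketch.
    """
    python_replicas, go_replicas = (0, 1) if use_go_scaler else (1, 0)
    return {
        "spec": {
            "scale": {
                "hdx-scaler": {"replicas": python_replicas},
                "hdx-scaler-go": {"replicas": go_replicas},
            }
        }
    }


print(json.dumps(scaler_switch_fragment(True), indent=2))
```

Setting both values in one change avoids a window in which neither scaler, or both scalers, are running.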

SQL validation on transforms and functions

The Config API exposes a new /parse_sql/ endpoint for validating SQL in transforms, functions, and table merge settings. It catches errors such as missing commas, unmatched parentheses, or references to non-existent columns at configuration time rather than at query or ingest time. Validation errors return an HTTP 400 response and a description of what's wrong.

In a future release, endpoints that accept SQL for transforms, functions, and table merge settings will require the validation to succeed.

Breaking changes

  • The /users endpoint no longer returns users and service accounts in a single response.

    The /users endpoint, however, provides an interim compatibility query parameter search_service_accounts=true, which returns only service accounts. Client applications should ultimately switch to the dedicated /service_accounts endpoint instead. See also API endpoints for users and service accounts.

    Refactored the /users endpoint to separate users and service accounts. This corrects pagination that was previously broken for large userbases (1000+ Keycloak users). The endpoint now supports query parameters for filtering by email, username, first_name, last_name, search, and exact. Keycloak-to-database user sync has been moved to a scheduled background task to eliminate race conditions on concurrent requests.

    Service accounts will be inaccessible from this endpoint in a subsequent release.
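A client could build the filtered /users requests described above as follows. This is a sketch: base_url is a hypothetical value that must point at whatever API prefix serves /users on your cluster, and the helper name is illustrative. Only the query parameter names come from these notes.

```python
from urllib.parse import urlencode

# Query parameters named in the release notes for the /users endpoint.
ALLOWED_FILTERS = {
    "email", "username", "first_name", "last_name",
    "search", "exact", "search_service_accounts",
}


def users_url(base_url: str, **filters) -> str:
    """Build a /users URL with the given filters (hypothetical helper)."""
    unknown = sorted(set(filters) - ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filter(s): {', '.join(unknown)}")
    # Render booleans as lowercase strings, for example true.
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in filters.items()
    }
    url = f"{base_url.rstrip('/')}/users"
    return f"{url}?{urlencode(encoded)}" if encoded else url


# Interim compatibility request that returns only service accounts.
print(users_url("https://hdx.example.com/config/v1", search_service_accounts=True))
```

Because search_service_accounts=true is an interim parameter, new integrations are better served by calling the dedicated /service_accounts endpoint directly.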


Deprecation notices

  • merge-head is deprecated and will be removed in v6.0. If your configuration explicitly uses merge-head, migrate to merge-controller before upgrading to v6.0. merge-controller has been the default since v5.10.

  • Legacy internal user accounts hdx@hydrolix.net and hdx.readonly@hydrolix.net are deprecated and will be removed in a future release.

    Starting in this release, cluster-internal automations use service accounts internal.hdx and internal.hdx.readonly. These will replace the legacy administrative user accounts.

    If you have external scripts, SIEM connectors, or integrations that authenticate using these legacy accounts, create a dedicated service account with the permissions your integration needs, issue a service account token, and configure the external service to use the new token. Automations should switch to using service accounts. See Manage Service Accounts.

    See Audit reporting for internal accounts for guidance on discovering usage.

  • Introduced new table-specific endpoint /bucket_settings for managing both spread_list and column_value_mapping. This replaces the separate /column_value_mapping endpoint, which is now deprecated.

Don't skip minor versions when upgrading or downgrading

Skipping versions when upgrading or downgrading Hydrolix can result in database schema inconsistencies and cluster instability. Always upgrade or downgrade sequentially through each minor version.

Example:
Upgrade from 5.7.9 → 5.8.6 → 5.9.5, not 5.7.9 → 5.9.5.
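The sequential-upgrade rule can be sketched as a small planning helper. It is an illustration under stated assumptions: the release list is supplied by the caller, the helper picks the highest known patch release for each intermediate minor version, and it only handles upgrades within one major version. Downgrades follow the same rule in reverse and aren't covered here.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'major.minor.patch' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def upgrade_path(current: str, target: str, releases: list[str]) -> list[str]:
    """List the versions to apply in order, visiting every minor version."""
    cur, tgt = parse_version(current), parse_version(target)
    if cur[0] != tgt[0]:
        raise ValueError("this sketch only handles one major version")
    if tgt < cur:
        raise ValueError("this sketch only plans upgrades")
    # Track the highest known patch release for each minor version.
    latest_patch: dict[int, tuple[int, int, int]] = {}
    for release in releases:
        rel = parse_version(release)
        if rel[0] == cur[0]:
            latest_patch[rel[1]] = max(latest_patch.get(rel[1], rel), rel)
    path = [current]
    for minor in range(cur[1] + 1, tgt[1]):
        if minor not in latest_patch:
            raise ValueError(f"no known release for {cur[0]}.{minor}")
        path.append(".".join(str(part) for part in latest_patch[minor]))
    if target != current:
        path.append(target)
    return path


print(upgrade_path("5.7.9", "5.9.5", ["5.7.9", "5.8.4", "5.8.6", "5.9.5"]))
```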

Apply the new Hydrolix operator

If you have a self-managed installation, apply the new operator directly with these kubectl command examples. If you're using Hydrolix-supplied tools to manage your installation, follow the procedure prescribed by those tools.

GKE

kubectl apply -f "https://www.hydrolix.io/operator/v5.11.6/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&gcp-storage-sa=${GCP_STORAGE_SA}"

EKS

kubectl apply -f "https://www.hydrolix.io/operator/v5.11.6/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&aws-storage-role=${AWS_STORAGE_ROLE}"

LKE and AKS

kubectl apply -f "https://www.hydrolix.io/operator/v5.11.6/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}"

Monitor the upgrade process

Kubernetes jobs named init-cluster and init-turbine-api will automatically run to upgrade your entire installation to match the new operator's version number. This takes a few minutes, during which you can observe your pods' restarts with your Kubernetes monitoring tool.

Ensure both the init-cluster and init-turbine-api jobs have completed successfully and that the turbine-api pod has restarted without errors. After that, view the UI and use the API of your new installation as a final check.

If the turbine-api pod doesn't restart successfully, or other functionality is missing, check the logs of the init-cluster and init-turbine-api jobs for details about failures. This can be done using the k9s utility or with the kubectl command:

% kubectl logs -l app=init-cluster
% kubectl logs -l app=init-turbine-api

If you still need help, contact Hydrolix support.

Changelog

Updates

Intake updates

  • Updated the Rust object_store library from version 0.11 to 0.12 and adapted code to the changed types.

Cluster operations updates

  • Updated the following Python dependencies across internal services to address known security vulnerabilities:

  • Upgraded the bundled chproxy-based http-proxy to v0.6.2, adding support for query head pooling.

  • Updated Rust dependencies across hdx-gate, hdx-node-rs, and hdx-pod-metrics to address two critical HTTP request-smuggling vulnerabilities (CVE-2026-2833 and CVE-2026-2835) in pingora-core, updated to 0.8.0.

  • Updated the default version of the Anomaly Detection LLM proxy. Corrects a bug in which the AWS Secret Access Keys were transmitted incorrectly in the litellm_params block instead of the secured block.

    • janus v0.2.1 → v0.2.2

Config API updates

  • Updated third-party Python dependencies for the Config API:
    • django 5.2.9 → 5.2.12
    • django-guardian 3.2.0 → 3.3.0
    • urllib3 2.6.2 → 2.6.3
    • gunicorn[gevent] 23.0.0 → 25.1.0
    • psycopg[binary,pool] 3.3.0 → 3.3.3
    • pyjwt[crypto] 2.10.1 → 2.11.0
    • prometheus-client 0.23.0 → 0.24.1
    • werkzeug 3.1.4 → 3.1.6

Improvements

Intake improvements

  • Introduced a priority-based scheduling model with two tiers to prevent large bulk imports from starving latency-sensitive autoingest jobs. Now, the autoingest service enqueues jobs at the higher priority, and can skip expensive listing operations.

  • Refactored the indexing pipeline to separate responsibilities in the ingestion system. Summary table indexing is now logically independent of raw table indexing, allowing the former to be managed in a separate process. The improvements make the software more amenable to communication over RPC, opening up opportunity for performance improvements.

  • Batched catalog deletes in the reaper instead of issuing one per partition, reducing Postgres WAL pressure during large-scale reap operations. Also fixed a bug where failed reap events were incorrectly acknowledged instead of being returned to the queue.

  • Reduced memory consumption and improved task scheduling efficiency on clusters with many storage definitions.
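The two-tier scheduling model in the first item above can be illustrated with a small priority queue. This is a generic sketch of the technique, not the Hydrolix implementation: the tier names, the in-memory heap, and the class are all assumptions.

```python
import heapq
import itertools

# Lower numbers are served first. Tier names are illustrative.
AUTOINGEST = 0
BULK_IMPORT = 1


class TwoTierQueue:
    """Serve higher-priority jobs first, FIFO within each tier."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        # A monotonically increasing counter keeps insertion order per tier.
        self._sequence = itertools.count()

    def push(self, tier: int, job: str) -> None:
        heapq.heappush(self._heap, (tier, next(self._sequence), job))

    def pop(self) -> str:
        _tier, _seq, job = heapq.heappop(self._heap)
        return job

    def __len__(self) -> int:
        return len(self._heap)


queue = TwoTierQueue()
queue.push(BULK_IMPORT, "bulk-1")
queue.push(BULK_IMPORT, "bulk-2")
queue.push(AUTOINGEST, "auto-1")
print([queue.pop() for _ in range(len(queue))])
```

An autoingest job enqueued after a backlog of bulk work is still dequeued first, which is the starvation case the change addresses.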

Core improvements

  • Added column encoding_scheme to table metadata queries. This exposes the encoding scheme for each underlying table column.

  • Added endpoint /ping, identical to /healthcheck, to the ClickHouse HTTP server. It returns OK if the server is up and healthy.

  • Enabled distributed_aggregation_memory_efficient by default for all queries, allowing the query head to spill GROUP BY aggregation states to disk when memory thresholds are exceeded. This reduces the risk of OOM errors on large analytical queries.

Query improvements

  • Added parameterized query support on the HTTP query interface. Pass param_-prefixed parameters (for example, param_user_id=123) that substitute into {name:Type} placeholders in SQL queries to improve query security.

  • Added streaming query support to the HTTP Query API. Clients can now specify query option hdx_query_streaming_result=true as an HTTP header or query parameter. The server no longer materializes the entire query result, which decreases the time to first byte (TTFB) of the response data. The X-Hdx-Query-Stats are emitted as the last line of the response body.

  • Added dynamic query peer availability polling and refined partition assignment strategy in the query head. Instead of assigning all partitions immediately, the query head assigns work in a rolling fashion to take advantage of fast and newly-available peers. By decreasing batch sizes assigned to query peers, memory usage is easier to predict. This should reduce out-of-memory (OOM) events in query infrastructure.
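The parameterized and streaming query options above can be combined on the client side roughly as follows. This sketch only builds the URL options and splits a streamed body. The helper names, the example table, and the assumption that the response is newline-delimited with the statistics on the final non-empty line are illustrative. How the SQL text itself is sent is left to your existing client code.

```python
from urllib.parse import urlencode

# SQL with {name:Type} placeholders. The table and column are hypothetical.
SQL = "SELECT count() FROM my_project.my_table WHERE user_id = {user_id:UInt64}"


def query_options(params: dict, streaming: bool = False) -> str:
    """Encode param_-prefixed values and the optional streaming option."""
    options = {f"param_{name}": str(value) for name, value in params.items()}
    if streaming:
        options["hdx_query_streaming_result"] = "true"
    return urlencode(options)


def split_streamed_body(body: str) -> tuple[list[str], str]:
    """Separate result lines from the trailing query statistics line.

    Assumes newline-delimited output whose last non-empty line carries
    the statistics. The format of that line is not interpreted here.
    """
    lines = [line for line in body.splitlines() if line]
    if not lines:
        return [], ""
    return lines[:-1], lines[-1]


print(query_options({"user_id": 123}, streaming=True))
```

Keeping values out of the SQL text means user input is never concatenated into the statement, which is the security benefit the notes refer to.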

Cluster operations improvements

  • Added feature to render HDX custom resource specifications exhaustively, including defaults. Prior to this feature, existing outputs lacked empty fields which were translated to defaults during cluster operation. This feature allows external infrastructure management systems to detect all configuration changes.

  • Each workload connecting to the catalog database now uses its own dedicated Postgres user named after the workload (for example: merge_head, query_peer, batch_head), replacing the shared query_api user. This makes it easier to identify per-component connection counts and query activity in pg_stat_activity during troubleshooting. The query_api user is retained for rollback compatibility but is no longer actively used.

  • Rewrote the ACME certificate management service in Go, replacing the previous bash-based implementation. The service now runs as a persistent deployment with automatic renewal, retry with backoff on transient failures, and support for wildcard certificates and subject alternative names (SANs).

  • Added a traefik_read_timeout tunable to handle slow data sources. If the timeout is reached, Traefik closes the connection and responds with an HTTP 499 Client Closed Request error. Default value is 60 (seconds).

  • Introduced a descheduler pod that evaluates node utilization and periodically evicts pods from underutilized nodes so the scheduler can repack them onto busier nodes. This helps free up empty nodes during downscaling to manage cloud costs. The pod runs the Kubernetes SIG Descheduler (v0.35.0).

  • Introduced an optional kube-scheduler with two strategies for influencing node assignment. With MostAllocated, services are assigned to minimize node count to control costs. Strategy LeastAllocated spreads services across available nodes. Services can be assigned to this non-default, optional scheduler.

  • Added defaultQueryTimeRange and per-index queryTimeRange configuration properties to Kibana Gateway. The default query time range is 24h, with support for table-level overrides using index settings.

Config API improvements

  • Added new organization-, project-, and table-level circuit breaker query options hdx_query_max_perc_before_external_sort and hdx_query_max_memory_usage_perc. Use these to limit percentage of memory devoted to in-memory sorting before spilling to disk and maximum allowable memory, instead of using the similarly named query options which accept bytes.

  • Implemented support for the Kafka SASL Salted Challenge Response Authentication Mechanism (SCRAM) credential type, known as kafka_sasl_scram. In-cluster clients can now authenticate to Kafka servers that require the SCRAM-SHA-256 and SCRAM-SHA-512 SASL mechanisms.

  • Added SQL validation for transforms, functions, and table merge settings using the hdx_verify_sql() function. This prepares the Config API to validate SQL syntax at creation and update time in a future release, catching errors before they reach query execution.

  • Dictionary file API responses now include a sha256 hash alongside the existing cloud-specific etag. The SHA256 hash is consistent across cloud providers, improving change detection in multi-cloud deployments. Existing dictionary files will show null for the sha256 field until re-uploaded.
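The sha256 field on dictionary files lends itself to client-side change detection. The sketch below hashes local content and compares it with a value taken from the API response. It assumes the field is a hex-encoded digest, and it treats a null value as "needs upload". Both are choices of this example, and the helper names are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, Optional


def sha256_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def needs_upload(stream: BinaryIO, remote_sha256: Optional[str]) -> bool:
    """Decide whether local content differs from the stored file.

    A null remote hash (files not re-uploaded since the upgrade) can't
    be compared, so this sketch conservatively reports True.
    """
    if remote_sha256 is None:
        return True
    return sha256_hex(stream) != remote_sha256.lower()


# In practice the stream would come from open(path, "rb").
print(needs_upload(io.BytesIO(b"abc"), None))
```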

UI improvements

  • Improved the "All" option in the Log Level dropdown on the System Health page to select all individual log levels rather than applying an "All" label. Users can now deselect individual levels after selecting "All" for more granular filtering.

Bug fixes

Core fixes

  • Partition locks and generated partitions are now properly cleaned up when an alter job fails or is canceled. Previously, stale locks from a failed alter job would persist indefinitely, blocking subsequent alter operations on the same partitions, and couldn't be removed by the prune-locks job because they still had associated catalog rows.

  • Fixed a denullify: true bug on complex types in summary tables. Some subelements with denullify enabled weren't propagated to summary tables, causing type mismatch errors during data ingestion.

  • Fixed a bug in calculating the disk-full danger percentage. Earlier, the red zone was incorrectly calculated as only 1 + the normal cache cull threshold, giving 76% instead of the expected 90%.

API fixes

  • Enforced shard key validation upon creation using SQL identifier validation rules. Earlier, invalid column names were incorrectly accepted as shard keys.

  • Expanded conditions under which alter jobs can be canceled to include failed jobs. Earlier, a failed job had to be retried before it could be canceled.

  • Added workaround for default recursive listing when creating storages on Azure buckets. When many files exist in a bucket, the Config API could time out awaiting a response from Azure's list_objects without delimiter support. This fix uses Azure's walk_blobs which accepts delimiters.

  • Ignored uuid in request bodies, by declaring it read-only. Earlier, a difference between the UUID in the path and the body resulted in surprising behavior. Now, uuid in POST is ignored and the actual UUID is generated by the system. A uuid in a PUT or PATCH method is also ignored.

  • Updated table properties memory_coefficient and token_auth_enabled to accept null value on PUT and PATCH requests. This aligns with null being the default on table creation.

  • Enforced shard_key immutability on PUT requests. Previously, a PUT that omitted the shard_key field, either by excluding settings entirely or sending an empty settings: {}, could silently clear the existing value.

  • Improved error handling for invalid SQL during summary table creation and updates. Previously, empty or malformed SQL in the settings.summary.sql field caused a 500 Internal Server Error instead of returning a 400 Bad Request with a descriptive validation message.

  • Resolved a 500 error on the /users endpoint when exclude_pending=false on clusters using Google OAuth. The user serializer assumed a username attribute was always present, which isn't the case for OAuth-provisioned users.

  • Updating a project's deployment_id to its default (autogenerated) value is now idempotent. The uniqueness validation was incorrectly flagging the project's own current deployment_id as a duplicate.

  • Improved logic for sending and deleting invites. Earlier, race conditions between multiple simultaneous invites could result in HTTP 500 errors. Also, allowed pending invites to be deleted even if a job exists or the user is protected. Protected users are never deleted.

  • Restored appropriate permissions allowing authorized users to resend invites, create bulk invites, and claim invites. Earlier, under some circumstances, users who should have been authorized were receiving HTTP 403.

  • Allowed autoingest table PATCH requests to accept null for source_credential and bucket_credential. When null is provided, the credentials now fall back to cluster defaults, matching the behavior on table creation. Previously, passing null for these fields caused a validation error.

  • Allowed the catalog upload endpoint to accept blank, null, or omitted shard_key values. In all three cases, the API now coerces the value to the default 42bc986dc5eec4d3 (meaning "no particular value") instead of rejecting the request with an HTTP 400 error.

  • The API now returns an HTTP 400 Bad Request error when attempting to delete a project that has a pending alter job. Previously, the delete returned a 500 Internal Server Error from an uncaught exception, and the alter job status was changed to Canceled.

  • Added validation to prevent setting suppress, virtual, or ignore on primary timestamp fields. A migration automatically corrects any existing transforms or templates with these invalid settings on upgrade.

  • Fixed the /config/v1/ root endpoint to return 403 Forbidden instead of 500 Internal Server Error when accessed without sufficient permissions.

  • Fixed a SIEM-related database migration that could cause init-turbine-api to fail during cluster upgrade if a dynamic secret had not been provisioned.

  • Improved the migration speed of SIEM data source access_details to credentials by caching necessary secrets during migration and suppressing the usual publication until complete. The one-time migration now completes quickly with fewer resources.

  • Removed a race condition in which an OAuth token issued before a password reset was still accepted. Now, the Config API correctly rejects tokens if the password has changed since the token was issued.

  • Corrected an RBAC migration script to create permission ingest_table if it doesn't exist prior to the permissions adjustment.

  • Fixed an error preventing new clusters from sending initial invitation emails. Validation requirements for API users differ from administrative users with Kubernetes access. Now, the latter can send invitations without validation.

  • Changed the account used when executing the hdx_verify_sql table function in the /config/v1/parse_sql endpoint. Earlier, the requesting account's permissions were checked when calling into the separate container. This would fail if the user lacked the select_sql permission. Now, any authenticated user can call the endpoint.

  • Expanded the validation for Kinesis checkpointers to include GCP URLs. Earlier, only AWS ARNs were accepted.

  • Limited the protected table list to the hydro project. Earlier, tables named audit_logs, monitor, and logs were inadvertently protected and undeletable in other projects.

  • Retired usage of an ancient statistical package in favor of simple inline implementations of several functions used by the presigned URL code. This avoids an unnecessary dependency.
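The shard_key coercion described above for the catalog upload endpoint behaves roughly like the following sketch. The helper name and payload shape are hypothetical, and treating whitespace-only strings as blank is an assumption of this example.

```python
from typing import Optional

# Default shard key from the release notes ("no particular value").
DEFAULT_SHARD_KEY = "42bc986dc5eec4d3"


def coerce_shard_key(payload: dict) -> str:
    """Return the shard key to use for an upload request payload.

    Omitted, null, and blank values all map to the default.
    """
    value: Optional[str] = payload.get("shard_key")
    if value is None or value.strip() == "":
        return DEFAULT_SHARD_KEY
    return value


for example in ({}, {"shard_key": None}, {"shard_key": ""}, {"shard_key": "abc123"}):
    print(example, "->", coerce_shard_key(example))
```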

Intake fixes

  • Fixed the akamai_siem_source_request_error_count Prometheus metric for Akamai SIEM ingest sources to correctly register with the required return_code and error labels.

  • Corrected purge handling for all jobs. Formerly, some canceled and failed jobs had NULL in the updated_at column, which made them invisible to the purge logic. Now, updated_at is set in all job management code paths, and the purge query falls back to the created_at column when updated_at is NULL so that orphaned tasks are cleaned up. The job_purge_age default is also now 48 hours instead of 90 days.

  • Corrected reporting, handling, and logging of failed tasks in the batch controller, which manifested as a stuck batch job. Now, any task failure causes the job to fail.

  • Fixed a regression introduced during batch system refactoring. The bug caused the reporting API to return incorrect values for estimated when handling alter jobs.

  • Resolved a race condition in intake head shutdown sequence that could cause a panic when in-flight HTTP handlers attempted to write to an already-closed channel. A two-phase shutdown coordination now ensures all active handlers complete before channel cleanup runs.

  • Fixed the name of a protocol in the protocols list. The raw protocol was formerly not discoverable.

  • Resolved a bug where summary table ingestion could fail with a Missing columns error when a column referenced by the summary transform was suppressed in the raw table. The failure was latent; it only surfaced when an unrelated schema change invalidated a stale internal cache used during partition key computation.

  • Fixed empty partitions being incorrectly treated as errors.
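The purge fix in the list above relies on a fallback from updated_at to created_at. The logic can be sketched as follows. It is an illustration rather than the actual purge query: the function name and the inclusive age comparison are assumptions, and any job-status conditions the real query applies are omitted.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# New default for job_purge_age stated in the release notes.
DEFAULT_JOB_PURGE_AGE = timedelta(hours=48)


def is_purgeable(
    created_at: datetime,
    updated_at: Optional[datetime],
    now: datetime,
    purge_age: timedelta = DEFAULT_JOB_PURGE_AGE,
) -> bool:
    """Return True when a job is old enough to purge.

    Falls back to created_at when updated_at is NULL (None), so jobs
    that never recorded an update are no longer invisible to the purge.
    """
    reference = updated_at if updated_at is not None else created_at
    return now - reference >= purge_age


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
orphan_created = now - timedelta(hours=72)
print(is_purgeable(orphan_created, None, now))
```

In SQL terms, the same fallback is what COALESCE(updated_at, created_at) expresses.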

UI fixes

  • Fixed the storage and user edit pages in the UI to wait for fresh data before rendering the form. This prevents cached data from overwriting fresh data.

  • Suppressed the 10,000-row limit warning in the query editor for INSERT INTO statements. Previously, the warning appeared for all queries, including INSERT statements that don't return result sets.

  • Improved error handling in the UI when deleting a storage that's still referenced in a table's storage map. The UI now surfaces the API's descriptive error message instead of showing a generic "Something went wrong" dialog.

  • Fixed the Users and Pending Invites lists in the UI to automatically refresh after deleting a user or invite. Previously, deleted entries remained visible until the user switched tabs or manually refreshed the page.

  • The UI now revalidates cluster version data on login to avoid serving stale cached values. Before this change, the version number was missing after logging in until the page was manually refreshed.

  • Corrected the breadcrumb navigation to display the SIEM name instead of showing a missing or incorrect label when viewing a SIEM configuration page.

  • Fixed the Column Policies page to display a column's current name consistently. Previously, when a column had an additional name set as current, the page showed the original field name in the selection list but the additional name in the blocked list.

  • The Service Accounts list page now displays assigned roles. Also fixed incorrect role display when navigating pagination.

Cluster operations fixes

  • Added missing apiUrl property to the Kibana Gateway instance and set the default software version to 1.1.23. This allows the Kibana Gateway software to reach the cluster's Config API.

  • Removed static validations of scaling fields and application targets in the HDX Autoscaler. The autoscaler now uses the centrally-defined fields for Horizontal Pod Autoscaling (HPA), Vertical Pod Autoscaling (VPA), and the scale settings. This corrects validation discrepancies.

  • Explicitly set limit_cpu in the definition of the operator's own Kubernetes Deployment object to avoid a corner case of definition order. Other settings like overcommit aren't subject to this problem.

  • Fixed incorrect type validation for kubernetes_premium_storage_class and kubernetes_storage_class tunables.

  • Gave each init job (backup-keycloak-db, cert-expiry-check, check-bucket-access, init-turbine-api, load-sample-project, wait-for-db-access) its own scale profile instead of sharing the init-cluster scale. Previously, setting spec.scale.init-turbine-api.memory: 1Gi in the cluster spec had no effect because each job's scale lookup key was hardcoded to init-cluster in the operator, so only changes to spec.scale.init-cluster were applied.

  • Increased the scale_min memory floor for init-turbine-api from the inherited default of 512Mi to 2Gi. This reduced out-of-memory (OOM) kills and CrashLoopBackOff errors when scale_min: true was set on the cluster.

  • Fixed false validation errors in HDX Scaler configuration (for example, per_pod reported as invalid, intake-router reported as an unknown service). The validator used a static list of services and settings that was incomplete; it now uses actual service names generated by the operator.

  • Corrected two race conditions in IP and service management. The fix prevents service list and IP desynchronization by always refetching all EndpointSlices whenever the service list changes and by correctly handling DELETE events.

  • Increased turbine API pod startup probe tolerance to 60 seconds. Earlier, 10 seconds was insufficient startup tolerance on large clusters. This bugfix avoids the temporary appearance of a problem on startup by allowing more time for turbine API to load and report ready.

  • Fixed hdx-scaler pod discovery for intake-head and http-head deployments. Both deployments use app: stream-head as their label selector, causing the scaler to mix pods from both and produce incorrect per-deployment metrics. Added per-component Services that select on the component label so each scaler discovers pods through its own EndpointSlice.

  • Corrected the HKT / HKW config-report verbose cluster reporting output, and fixed crashes in the same code when no namespace is supplied. Both were corner cases of a refactoring of software internals in the prior release.

  • Corrected consistency and round-tripping of Kubernetes resources into YAML rendered spec and back into a cluster. Earlier, information loss occurred in several places when processing code defaults, scale settings, pool configurations, and runtime-computed values. The collected fixes address classes of bugs. Specific examples of bugs fixed are: values in kubernetes_profile are no longer lost; targeting fields are now preserved in round-tripping; merge-i, -ii and -iii are correctly present or absent, depending on spec configuration; pool entries in a rendered spec no longer contain null scale fields.