v5.11.6
Notable new features
Query head pools
Clusters now support query head pools to allow definition of parallel, fully-isolated query infrastructure in a cluster. Earlier, all query pools and their clients shared the same query head. Now, both query head and query peer pools can be created and managed independently.
Query head pooling is supported for ClickHouse protocols. When the tunable query_head_pooling_enabled is enabled, incoming connections pass through a protocol-specific reverse proxy, which selects the query head pool based on the ClickHouse database parameter of the incoming connection. Client applications must be updated to send a database parameter to make use of this feature.
Routing rules in the HydrolixCluster spec direct connections to a specific query head pool.
The feature isn't enabled by default, but can be turned on with tunable query_head_pooling_enabled.
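As a rough illustration of the client-side change, the sketch below builds a ClickHouse-style HTTP request URL that carries a database parameter. The host name, port, and database name are placeholders, not values from this release, and which database value maps to which pool depends on the routing rules in your HydrolixCluster spec.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, database: str, sql: str) -> str:
    """Return a query URL that carries the SQL text and a database parameter.

    The database value is what the reverse proxy is described as using to
    choose a query head pool. base_url is a hypothetical placeholder.
    """
    params = urlencode({"database": database, "query": sql})
    return f"{base_url}/?{params}"


if __name__ == "__main__":
    # Hypothetical host and database name, for illustration only.
    print(build_query_url("https://hdx.example.com:8088", "analytics", "SELECT 1"))
```

Clients that use a native driver rather than HTTP would set the equivalent database option in their connection settings.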
Autoscaler rewrite in Go
The Hydrolix Kubernetes autoscaler has been rewritten in Go for efficiency, flexibility, and observability improvements. It can completely replace the Python-based hdx-scaler, which remains the default option. Existing scaling behavior is preserved, and the hdxscalers configuration in the HydrolixCluster spec still works the same way.
New capabilities in hdx-scaler-go:
- VPA for DaemonSets: Vertical pod autoscaling now supports DaemonSet workloads (set workload_kind: DaemonSet), enabling resource right-sizing for components like cAdvisor.
- Suspended and dry run modes: Pause scaling without losing accumulated state (suspended: true), or observe what the scaler would do without applying changes (dry_run: true).
- GOMEMLIMIT autoinjection: For Go services, the scaler can automatically set the GOMEMLIMIT environment variable as a percentage of the memory limit. Edit the VPAConfig custom resource directly to enable.
- Validating webhooks: Custom Resource Definition (CRD) admission webhooks catch configuration errors at apply time.
Enable the new scaler with spec.scale.hdx-scaler.replicas: 0 and spec.scale.hdx-scaler-go.replicas: 1 in the HydrolixCluster spec. The Go scaler doesn't include a terminal UI. Use kubectl get hpaconfig -o wide and Prometheus metrics on :27183/metrics instead.
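The two replica settings above can be expressed as a single merge-style patch document. The sketch below only builds and prints that JSON. How you apply it (for example with kubectl patch and a merge patch, or by editing the spec in version control) depends on your tooling, and the resource name and namespace are left out because they are deployment-specific.

```python
import json


def scaler_switch_patch() -> dict:
    """Build a patch that disables the Python scaler and enables the Go scaler.

    The keys mirror the spec paths named in the release notes:
    spec.scale.hdx-scaler.replicas and spec.scale.hdx-scaler-go.replicas.
    """
    return {
        "spec": {
            "scale": {
                "hdx-scaler": {"replicas": 0},
                "hdx-scaler-go": {"replicas": 1},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(scaler_switch_patch(), indent=2))
```

Swapping the two replica values gives the patch for rolling back to the Python scaler.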
SQL validation on transforms and functions
The Config API exposes a new /parse_sql/ API endpoint for validating SQL in transforms, functions, and table merge settings. It catches errors such as missing commas, unmatched parentheses, or references to non-existent columns at configuration time rather than at query or ingest time. Validation errors return an HTTP 400 response and a description of what's wrong.
In a future release, endpoints that accept SQL for transforms, functions, and table merge settings will require the validation to succeed.
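A client-side pre-check might look like the sketch below. The request shape is an assumption: the release notes don't document the HTTP method, the body field name, or the authentication scheme, so the POST method, the "sql" field, and the bearer token header are hypothetical placeholders. Only the endpoint path and the meaning of HTTP 400 come from the notes.

```python
import json
from urllib.request import Request


def build_parse_sql_request(base_url: str, token: str, sql: str) -> Request:
    """Prepare (but do not send) a validation request for a SQL string."""
    # ASSUMPTION: POST, the "sql" field name, and bearer auth are illustrative.
    body = json.dumps({"sql": sql}).encode("utf-8")
    return Request(
        f"{base_url}/config/v1/parse_sql/",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def describe_result(status: int, body: str) -> str:
    """Map a response to a short message; HTTP 400 signals a validation error."""
    if status == 400:
        return f"invalid SQL: {body}"
    if 200 <= status < 300:
        return "SQL accepted"
    return f"unexpected status {status}"
```

Sending the request with urllib.request.urlopen and passing the status and body to describe_result would complete the check. Consult the Config API reference for the actual request schema.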
Breaking changes
- The /users endpoint no longer returns users and service accounts in a single response.

  The /users endpoint, however, provides an interim compatibility query parameter search_service_accounts=true, which returns only service accounts. Client applications should ultimately switch to the dedicated /service_accounts endpoint instead. See also API endpoints for users and service accounts.

  Refactored the /users endpoint to separate users and service accounts. This corrects pagination that was previously broken for large userbases (1000+ Keycloak users). The endpoint now supports query parameters for filtering by email, username, first_name, last_name, search, and exact. Keycloak-to-database user sync has been moved to a scheduled background task to eliminate race conditions on concurrent requests.

  Service accounts will be inaccessible from this endpoint in a subsequent release.
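For client code that currently relies on the combined response, the sketch below shows one way to build /users request URLs with the filter parameters listed above and the interim search_service_accounts=true parameter. The base URL is a placeholder, and the value formats accepted by each filter (for example, whether exact takes true/false) are assumptions.

```python
from urllib.parse import urlencode

# Filter names taken from the release notes.
ALLOWED_FILTERS = {"email", "username", "first_name", "last_name", "search", "exact"}


def users_url(base_url: str, service_accounts_only: bool = False, **filters: str) -> str:
    """Build a /users URL with optional filters.

    service_accounts_only adds the interim compatibility parameter, which the
    notes describe as returning only service accounts.
    """
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    params = dict(sorted(filters.items()))
    if service_accounts_only:
        params["search_service_accounts"] = "true"
    query = urlencode(params)
    return f"{base_url}/users" + (f"?{query}" if query else "")


if __name__ == "__main__":
    # Hypothetical base URL, for illustration only.
    print(users_url("https://hdx.example.com/config/v1", email="a@example.com"))
    print(users_url("https://hdx.example.com/config/v1", service_accounts_only=True))
```

New integrations would be better served by calling the dedicated /service_accounts endpoint directly, since the compatibility parameter is described as interim.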
Deprecation notices
- merge-head is deprecated and will be removed in v6.0. If your configuration explicitly uses merge-head, migrate to merge-controller before upgrading to v6.0. merge-controller has been the default since v5.10.
- Legacy internal user accounts hdx@hydrolix.net and hdx.readonly@hydrolix.net are deprecated and will be removed in a future release.

  Starting in this release, cluster-internal automations use service accounts internal.hdx and internal.hdx.readonly. These will replace the legacy administrative user accounts.

  If you have external scripts, SIEM connectors, or integrations that authenticate using these legacy accounts, create a dedicated service account with the permissions your integration needs, issue a service account token, and configure the external service to use the new token. Automations should switch to using service accounts. See Manage Service Accounts.

  See Audit reporting for internal accounts for guidance on discovering usage.
- Introduced new table-specific endpoint /bucket_settings for managing both spread_list and column_value_mapping. This replaces the separate /column_value_mapping endpoint, which is now deprecated.
Upgrade instructions
Don't skip minor versions when upgrading or downgrading
Skipping versions when upgrading or downgrading Hydrolix can result in database schema inconsistencies and cluster instability. Always upgrade or downgrade sequentially through each minor version.
Example:
Upgrade from 5.7.9 → 5.8.6 → 5.9.5, not 5.7.9 → 5.9.5.
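The sequential rule can be checked mechanically when planning an upgrade. The sketch below is a planning aid only: it assumes you supply a complete list of available release versions, that stepping through the latest known patch of each intermediate minor version is acceptable, and that you are upgrading rather than downgrading. None of these helper names come from Hydrolix tooling.

```python
def parse_version(version: str) -> tuple:
    """Turn '5.8.6' into (5, 8, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upgrade_path(current: str, target: str, releases: list) -> list:
    """Return the versions to step through, one minor version at a time."""
    cur, tgt = parse_version(current), parse_version(target)
    latest = {}  # (major, minor) -> highest known patch release
    for release in list(releases) + [target]:
        parsed = parse_version(release)
        minor = parsed[:2]
        if minor not in latest or parsed > latest[minor]:
            latest[minor] = parsed
    path = [current]
    for minor in sorted(latest):
        # Only minors strictly between the current and target minor versions.
        if cur[:2] < minor < tgt[:2]:
            path.append(".".join(str(n) for n in latest[minor]))
    if tgt != cur:
        path.append(target)
    return path


if __name__ == "__main__":
    print(upgrade_path("5.7.9", "5.9.5", ["5.7.9", "5.8.2", "5.8.6", "5.9.5"]))
    # ['5.7.9', '5.8.6', '5.9.5']
```

If the release list is missing a minor version, the function cannot detect the gap, so the list should come from an authoritative source.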
Apply the new Hydrolix operator
If you have a self-managed installation, apply the new operator directly with these kubectl command examples. If you're using Hydrolix-supplied tools to manage your installation, follow the procedure prescribed by those tools.
GKE
EKS
LKE and AKS
Monitor the upgrade process
Kubernetes jobs named init-cluster and init-turbine-api will automatically run to upgrade your entire installation to match the new operator's version number. This will take a few minutes, during which time you can observe your pods' restarts with your Kubernetes monitor tool.
Ensure both the init-cluster and init-turbine-api jobs have completed successfully and that the turbine-api pod has restarted without errors. After that, view the UI and use the API of your new installation as a final check.
If the turbine-api pod doesn't restart successfully, or other functionality is missing, check the logs of the init-cluster and init-turbine-api jobs for details about failures. This can be done using the k9s utility or with the kubectl command:
If you still need help, contact Hydrolix support.
Changelog
Updates
Intake updates
- Updated the Rust object_store library from version 0.11 to 0.12 and adapted code to the changed types. PR
Cluster operations updates
- Updated the following Python dependencies across internal services to address known security vulnerabilities:
  - urllib3 v1 → v2: CVE-2025-66418, CVE-2025-66471, and CVE-2026-21441
  - pyasn1 v0.6.1 → v0.6.2
  - aiohttp v3.12.14 → v3.13.3: CVE-2025-69228 and CVE-2025-69223
  - werkzeug v3.1.4 → v3.1.6: CVE-2026-27199
  - flask v2.3.2 → v3.1.3: CVE-2026-27205
- Upgraded the bundled chproxy-based http-proxy to v0.6.2, adding support for query head pooling.
- Updated Rust dependencies across hdx-gate, hdx-node-rs, and hdx-pod-metrics to address two critical HTTP request-smuggling vulnerabilities (CVE-2026-2833 and CVE-2026-2835) in pingora-core, updated to 0.8.0.
- Updated the default version of the Anomaly Detection LLM proxy. Corrects a bug in which the AWS Secret Access Keys were transmitted incorrectly in the litellm_params block instead of the secured block.
  - janus v0.2.1 → v0.2.2
Config API updates
- Updated third-party Python dependencies for the Config API:
- django 5.2.9 → 5.2.12
- django-guardian 3.2.0 → 3.3.0
- urllib3 2.6.2 → 2.6.3
- gunicorn[gevent] 23.0.0 → 25.1.0
- psycopg[binary,pool] 3.3.0 → 3.3.3
- pyjwt[crypto] 2.10.1 → 2.11.0
- prometheus-client 0.23.0 → 0.24.1
- werkzeug 3.1.4 → 3.1.6
Improvements
Intake improvements
- Introduced a priority-based scheduling model with two tiers to prevent large bulk imports from starving latency-sensitive autoingest jobs. Now, the autoingest service enqueues jobs at the higher priority, and can skip expensive listing operations.
- Refactored the indexing pipeline to separate responsibilities in the ingestion system. Summary table indexing is now logically independent of raw table indexing, allowing the former to be managed in a separate process. The improvements make the software more amenable to communication over RPC, opening up opportunities for performance improvements.
- Batched catalog deletes in the reaper instead of issuing one per partition, reducing Postgres WAL pressure during large-scale reap operations. Also fixed a bug where failed reap events were incorrectly acknowledged instead of being returned to the queue.
- Reduced memory consumption and improved task scheduling efficiency on clusters with many storage definitions.
Core improvements
- Added column encoding_scheme to table metadata queries. This exposes the encoding scheme for each underlying table column.
- Added endpoint /ping, identical to /healthcheck, to the ClickHouse HTTP server. It returns OK if the server is up and healthy.
- Enabled distributed_aggregation_memory_efficient by default for all queries, allowing the query head to spill GROUP BY aggregation states to disk when memory thresholds are exceeded. This reduces the risk of OOM errors on large analytical queries.
Query improvements
- Added parameterized query support on the HTTP query interface. Pass param_-prefixed parameters (for example, param_user_id=123) that substitute into {name:Type} placeholders in SQL queries to improve query security.
- Added streaming query support to the HTTP Query API. Clients can now specify query option hdx_query_streaming_result=true as an HTTP header or query parameter. The server no longer materializes the entire query result, which decreases the time to first byte (TTFB) of the response data. The X-Hdx-Query-Stats are emitted as the last line of the response body.
- Added dynamic query peer availability polling and refined partition assignment strategy in the query head. Instead of assigning all partitions immediately, the query head assigns work in a rolling fashion to take advantage of fast and newly-available peers. By decreasing batch sizes assigned to query peers, memory usage is easier to predict. This should reduce out-of-memory (OOM) events in query infrastructure.
Cluster operations improvements
- Added feature to render HDX custom resource specifications exhaustively, including defaults. Prior to this feature, existing outputs lacked empty fields which were translated to defaults during cluster operation. This feature allows external infrastructure management systems to detect all configuration changes.
- Each workload connecting to the catalog database now uses its own dedicated Postgres user named after the workload (for example: merge_head, query_peer, batch_head), replacing the shared query_api user. This makes it easier to identify per-component connection counts and query activity in pg_stat_activity during troubleshooting. The query_api user is retained for rollback compatibility but is no longer actively used.
- Rewrote the ACME certificate management service in Go, replacing the previous bash-based implementation. The service now runs as a persistent deployment with automatic renewal, retry with backoff on transient failures, and support for wildcard certificates and subject alternative names (SANs).
- Added a traefik_read_timeout tunable to handle slow data sources. If the timeout is reached, Traefik closes the connection and responds with an HTTP 499 Client Closed Request error. Default value is 60 (seconds).
- Introduced a descheduler pod that evaluates node utilization and periodically evicts pods from underutilized nodes so the scheduler can repack them onto busier nodes. This helps free up empty nodes during downscaling to manage cloud costs. The pod runs the Kubernetes SIG Descheduler (v0.35.0).
- Introduced an optional kube-scheduler with two strategies for influencing node assignment. With MostAllocated, services are assigned to minimize node count to control costs. Strategy LeastAllocated spreads services across available nodes. Services can be assigned to this non-default, optional scheduler.
- Added defaultQueryTimeRange and per-index queryTimeRange configuration properties to Kibana Gateway. The default query time range is 24h, with support for table-level overrides using index settings.
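The difference between the two scheduler strategies can be seen in a toy model. The sketch below is a deliberate simplification and not the scheduler's actual scoring code: it represents each node by a single allocated fraction and ignores whether the pod fits, along with every other scoring plugin.

```python
def pick_node(nodes: dict, strategy: str) -> str:
    """Pick a node name given {name: fraction_allocated} and a strategy.

    MostAllocated prefers the fullest node (bin packing, fewer nodes in use).
    LeastAllocated prefers the emptiest node (spreading across nodes).
    """
    if strategy == "MostAllocated":
        return max(nodes, key=nodes.get)
    if strategy == "LeastAllocated":
        return min(nodes, key=nodes.get)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    # Hypothetical node names and utilization values.
    nodes = {"node-a": 0.80, "node-b": 0.20, "node-c": 0.55}
    print(pick_node(nodes, "MostAllocated"))   # node-a
    print(pick_node(nodes, "LeastAllocated"))  # node-b
```

Packing work onto the fullest nodes is what lets lightly used nodes drain, which is also the condition the descheduler item above is meant to encourage.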
Config API improvements
- Added new organization-, project-, and table-level circuit breaker query options hdx_query_max_perc_before_external_sort and hdx_query_max_memory_usage_perc. Use these to limit the percentage of memory devoted to in-memory sorting before spilling to disk and the maximum allowable memory, instead of using the similarly named query options which accept bytes.
- Implemented support for the Kafka SASL Salted Challenge Response Authentication Mechanism credential type. Known as kafka_sasl_scram, it allows in-cluster clients to authenticate to Kafka servers requiring SCRAM-SHA-256 and SCRAM-SHA-512 SASL mechanisms.
- Added SQL validation for transforms, functions, and table merge settings using the hdx_verify_sql() function. This prepares the Config API to validate SQL syntax at creation and update time in a future release, catching errors before they reach query execution.
- Dictionary file API responses now include a sha256 hash alongside the existing cloud-specific etag. The SHA256 hash is consistent across cloud providers, improving change detection in multi-cloud deployments. Existing dictionary files will show null for the sha256 field until re-uploaded.
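One use of the new sha256 field is deciding whether a local dictionary file differs from the uploaded copy. The sketch below assumes the API reports the hash as a hexadecimal digest of the file contents, which the release notes don't state, and treats a null value as "unknown, upload to be safe".

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_upload(local_bytes: bytes, remote_sha256: Optional[str]) -> bool:
    """Decide whether a local file should be (re)uploaded.

    remote_sha256 is the value from the API response. It is None for files
    uploaded before this release, in which case the hash can't be compared.
    """
    if remote_sha256 is None:
        return True
    # ASSUMPTION: the API value is a hex digest; compare case-insensitively.
    return sha256_hex(local_bytes) != remote_sha256.lower()
```

Reading the file with open(path, "rb").read() and passing the bytes to needs_upload would complete the check for a real dictionary file.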
UI improvements
- Improved the "All" option in the Log Level dropdown on the System Health page to select all individual log levels rather than applying an "All" label. Users can now deselect individual levels after selecting "All" for more granular filtering.
Bug fixes
Core fixes
- Partition locks and generated partitions are now properly cleaned up when an alter job fails or is canceled. Previously, stale locks from a failed alter job would persist indefinitely, blocking subsequent alter operations on the same partitions, and couldn't be removed by the prune-locks job because they still had associated catalog rows.
- Fixed a denullify: true bug on complex types in summary tables. Some subelements with denullify enabled weren't propagated to summary tables, causing type mismatch errors during data ingestion.
- Fixed a bug in calculating the disk full danger percentage. Earlier, the red zone was incorrectly calculated as only the normal cache cull threshold plus 1, giving 76% instead of the expected 90%.
API fixes
- Enforced shard key validation upon creation using SQL identifier validation rules. Earlier, invalid column names were incorrectly accepted as shard keys.
- Expanded conditions under which alter jobs can be canceled to include failed jobs. Earlier, a failed job had to be retried before it could be canceled.
- Added workaround for default recursive listing when creating storages on Azure buckets. When many files exist in a bucket, the Config API could time out awaiting a response from Azure's list_objects without delimiter support. This fix uses Azure's walk_blobs which accepts delimiters.
- Ignored uuid in request bodies, by declaring it read-only. Earlier, a difference between the UUID in the path and the body resulted in surprising behavior. Now, uuid in POST is ignored and the actual UUID is generated by the system. A uuid in a PUT or PATCH method is also ignored.
- Updated table properties memory_coefficient and token_auth_enabled to accept null value on PUT and PATCH requests. This aligns with null being the default on table creation.
- Enforced shard_key immutability on PUT requests. Previously, a PUT that omitted the shard_key field, either by excluding settings entirely or sending an empty settings: {}, could silently clear the existing value.
- Improved error handling for invalid SQL during summary table creation and updates. Previously, empty or malformed SQL in the settings.summary.sql field caused a 500 Internal Server Error instead of returning a 400 Bad Request with a descriptive validation message.
- Resolved a 500 error on the /users endpoint when exclude_pending=false on clusters using Google OAuth. The user serializer assumed a username attribute was always present, which isn't the case for OAuth-provisioned users.
- Updating a project's deployment_id to its default (autogenerated) value is now idempotent. The uniqueness validation was incorrectly flagging the project's own current deployment_id as a duplicate.
- Improved logic for sending and deleting invites. Earlier, race conditions between multiple simultaneous invites could result in HTTP 500 errors. Also, allowed pending invites to be deleted even if a job exists or the user is protected. Protected users are never deleted.
- Restored appropriate permissions allowing authorized users to resend invites, create bulk invites, and claim invites. Earlier, under some circumstances, users who should have been authorized were receiving HTTP 403.
- Allowed autoingest table PATCH requests to accept null for source_credential and bucket_credential. When null is provided, the credentials now fall back to cluster defaults, matching the behavior on table creation. Previously, passing null for these fields caused a validation error.
- Allowed the catalog upload endpoint to accept blank, null, or omitted shard_key values. In all three cases, the API now coerces the value to the default 42bc986dc5eec4d3 (meaning "no particular value") instead of rejecting the request with an HTTP 400 error.
- The API now returns an HTTP 400 Bad Request error when attempting to delete a project that has a pending alter job. Previously, the delete returned a 500 Internal Server Error from an uncaught exception, and the alter job status was changed to Canceled.
- Added validation to prevent setting suppress, virtual, or ignore on primary timestamp fields. A migration automatically corrects any existing transforms or templates with these invalid settings on upgrade.
- Fixed the /config/v1/ root endpoint to return 403 Forbidden instead of 500 Internal Server Error when accessed without sufficient permissions.
- Fixed a SIEM-related database migration that could cause init-turbine-api to fail during cluster upgrade if a dynamic secret had not been provisioned.
- Improved the migration speed of SIEM data source access_details to credentials by caching necessary secrets during migration and suppressing the usual publication until complete. The one-time migration now completes quickly with fewer resources.
- Removed a race condition in which an OAuth token issued before a password reset was accepted. Now, the Config API correctly rejects tokens if the password has changed since the token was issued.
- Corrected an RBAC migration script to create permission ingest_table if it doesn't exist prior to the permissions adjustment.
- Fixed an error preventing new clusters from sending initial invitation emails. Validation requirements for API users differ from administrative users with Kubernetes access. Now, the latter can send invitations without validation.
- Changed the account used when executing the hdx_verify_sql table function in the /config/v1/parse_sql endpoint. Earlier, the requesting account's permissions were checked when calling into the separate container. This would fail if the user lacked the select_sql permission. Now, any authenticated user can call the endpoint.
- Expanded the validation for Kinesis checkpointers to include GCP URLs. Earlier, only AWS ARNs were accepted.
- Limited the protected table list to the hydro project. Earlier, tables named audit_logs, monitor, and logs were inadvertently protected and undeletable in other projects.
- Retired usage of an ancient statistical package, in favor of simple inline implementations for several functions used by the presigned URL code. This avoids an unnecessary dependency.
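The shard_key coercion described for the catalog upload endpoint can be summarized in a few lines. The sketch below mirrors the three cases named in the fix (blank, null, omitted). Treating a whitespace-only string as blank is an assumption, and the function name is hypothetical rather than taken from the Config API source.

```python
# Default value quoted in the release notes, meaning "no particular value".
DEFAULT_SHARD_KEY = "42bc986dc5eec4d3"


def coerce_shard_key(payload: dict) -> str:
    """Return the shard key to store for an upload request payload.

    Omitted, null (None), and blank values all map to the default instead of
    being rejected with an HTTP 400 error.
    """
    value = payload.get("shard_key")  # None when the field is omitted or null
    # ASSUMPTION: whitespace-only strings count as blank.
    if value is None or value.strip() == "":
        return DEFAULT_SHARD_KEY
    return value
```

For example, coerce_shard_key({}) and coerce_shard_key({"shard_key": ""}) both return the default, while an explicit value is passed through unchanged.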
Intake fixes
- Fixed the akamai_siem_source_request_error_count Prometheus metric for Akamai SIEM ingest sources to correctly register with the required return_code and error labels.
- Corrected purge handling for all jobs. Formerly, some cancelled and failed jobs would have NULL in the updated_at column, invisible to the purge logic. First, updated_at is set in all job management code paths. Second, the purge query relies on the created_at variable when updated_at is NULL for cleaning up orphaned tasks. The job_purge_age default is also now 48 hours instead of 90 days.
- Corrected reporting, handling, and logging of failed tasks in the batch controller, which manifested as a stuck batch job. Now, any task failure causes the job to fail.
- Fixed a regression introduced during batch system refactoring. The bug caused the reporting API to return incorrect values for estimated when handling alter jobs.
- Resolved a race condition in intake head shutdown sequence that could cause a panic when in-flight HTTP handlers attempted to write to an already-closed channel. A two-phase shutdown coordination now ensures all active handlers complete before channel cleanup runs.
- Fixed the name of a protocol in the protocols list. The raw protocol was formerly not discoverable.
- Resolved a bug where summary table ingestion could fail with a Missing columns error when a column referenced by the summary transform was suppressed in the raw table. The failure was latent; it only surfaced when an unrelated schema change invalidated a stale internal cache used during partition key computation.
- Fixed empty partitions being incorrectly treated as errors.
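The purge fix above amounts to falling back from updated_at to created_at when the former is NULL. The sketch below shows that rule in isolation. It is not the actual purge query, it ignores job status, and the function and constant names are hypothetical. Only the fallback rule and the 48-hour default come from the notes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# New default from the release notes (previously 90 days).
JOB_PURGE_AGE = timedelta(hours=48)


def is_purgeable(
    created_at: datetime,
    updated_at: Optional[datetime],
    now: datetime,
    purge_age: timedelta = JOB_PURGE_AGE,
) -> bool:
    """Return True when a job is older than the purge age.

    When updated_at is None (NULL in the database), created_at is used, so
    jobs that never recorded an update are no longer invisible to purging.
    """
    reference = updated_at if updated_at is not None else created_at
    return now - reference > purge_age


if __name__ == "__main__":
    now = datetime(2026, 1, 10, tzinfo=timezone.utc)
    old = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(is_purgeable(created_at=old, updated_at=None, now=now))  # True
```

In SQL the same fallback is commonly written with COALESCE over the two timestamp columns.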
UI fixes
- Fixed the storage and user edit pages in the UI to wait for fresh data before rendering the form. This prevents cached data from overwriting fresh data.
- Suppressed the 10,000-row limit warning in the query editor for INSERT INTO statements. Previously, the warning appeared for all queries, including INSERT statements that don't return result sets.
- Improved error handling in the UI when deleting a storage that's still referenced in a table's storage map. The UI now surfaces the API's descriptive error message instead of showing a generic "Something went wrong" dialog.
- Fixed the Users and Pending Invites lists in the UI to automatically refresh after deleting a user or invite. Previously, deleted entries remained visible until the user switched tabs or manually refreshed the page.
- The UI now revalidates cluster version data on login to avoid serving stale cached values. Before this change, the version number was missing after logging in until the page was manually refreshed.
- Corrected the breadcrumb navigation to display the SIEM name instead of showing a missing or incorrect label when viewing a SIEM configuration page.
- Fixed the Column Policies page to display a column's current name consistently. Previously, when a column had an additional name set as current, the page showed the original field name in the selection list but the additional name in the blocked list.
- The Service Accounts list page now displays assigned roles. Also fixed incorrect role display when navigating pagination.
Cluster operations fixes
- Added missing apiUrl property to the Kibana Gateway instance and set the default software version to 1.1.23. This allows the Kibana Gateway software to reach the cluster's Config API.
- Removed static validations of scaling fields and application targets in the HDX Autoscaler. Now uses the centrally-defined fields for Horizontal (HPA) and Vertical Pod Autoscaling (VPA) and the scale settings. This corrects validation discrepancies.
- Explicitly set limit_cpu in the definition of the operator's own Kubernetes Deployment object to avoid a corner case of definition order. Other settings like overcommit aren't subject to this problem.
- Fixed incorrect type validation for kubernetes_premium_storage_class and kubernetes_storage_class tunables.
- Gave each init job (backup-keycloak-db, cert-expiry-check, check-bucket-access, init-turbine-api, load-sample-project, wait-for-db-access) its own scale profile instead of sharing the init-cluster scale. Previously, setting spec.scale.init-turbine-api.memory: 1Gi in the cluster spec had no effect because each job's scale lookup key was hardcoded to init-cluster in the operator, so only changes to spec.scale.init-cluster were applied.
- Increased the scale_min memory floor for init-turbine-api from the inherited default of 512Mi to 2Gi. This reduced out-of-memory (OOM) kills and CrashLoopBackOff errors when scale_min: true was set on the cluster.
- Fixed false validation errors in HDX Scaler configuration (for example, per_pod reported as invalid, intake-router reported as an unknown service). The validator used a static list of services and settings that was incomplete; it now uses actual service names generated by the operator.
- Corrected two race conditions in IP and service management. The fix prevents service list and IP desynchronization by always refetching all EndpointSlices whenever the service list changes and by correctly handling DELETE events.
- Increased turbine API pod startup probe tolerance to 60 seconds. Earlier, 10 seconds was insufficient startup tolerance on large clusters. This bugfix avoids the temporary appearance of a problem on startup by allowing more time for turbine API to load and report ready.
- Fixed hdx-scaler pod discovery for intake-head and http-head deployments. Both deployments use app: stream-head as their label selector, causing the scaler to mix pods from both and produce incorrect per-deployment metrics. Added per-component Services that select on the component label so each scaler discovers pods through its own EndpointSlice.
- Corrected the verbose cluster reporting output of HKT / HKW config-report, and fixed crashes in the same code when no namespace is supplied. Both were corner cases of a refactoring of software internals in the prior release.
- Corrected consistency and round-tripping of Kubernetes resources into YAML rendered spec and back into a cluster. Earlier, information loss occurred in several places when processing code defaults, scale settings, pool configurations, and runtime-computed values. The collected fixes address whole classes of bugs. Specific examples of bugs fixed are: values in kubernetes_profile are no longer lost; targeting fields are now preserved in round-tripping; merge-i, -ii and -iii are correctly present or absent, depending on spec configuration; pool entries in a rendered spec no longer contain null scale fields.