v5.9.5

Notable new features⚓︎

Column-level access control⚓︎

Added column-level access control to the RBAC system. Data administrators can restrict access to a table by constructing lists of blocked columns. These column policies can be attached to Hydrolix roles. All access policies are enforced at query execution time in the query system, enabling fine-grained data security for compliance and data governance requirements.

RBAC-enabled ingest and service endpoints⚓︎

Enable RBAC authorization on various endpoints using the tunable enable_traefik_authorization, which is false by default. This covers the /ingest, /prometheus, /version, /grafana, /superset, and /kibana endpoints.
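As a rough sketch, the tunable could be switched on by patching the HydrolixCluster/hdx object. This assumes the tunable sits directly under spec, as the other tunables shown in these notes do, and that HDX_KUBERNETES_NAMESPACE holds your namespace. The function name and the KUBECTL override are illustrative only and are not part of Hydrolix tooling.

```shell
# Hypothetical helper: merge-patch the HydrolixCluster/hdx object to turn on
# endpoint authorization. Assumes the tunable lives directly under "spec".
# Set KUBECTL=echo to preview the command instead of running it.
enable_traefik_authorization() {
  "${KUBECTL:-kubectl}" patch hydrolixcluster hdx \
    --namespace "${HDX_KUBERNETES_NAMESPACE:?set HDX_KUBERNETES_NAMESPACE}" \
    --type merge \
    --patch '{"spec": {"enable_traefik_authorization": true}}'
}
```

Running `KUBECTL=echo enable_traefik_authorization` first prints the command so you can review it before applying the change to a live cluster.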

Parquet support⚓︎

Added Parquet format support to the HTTP streaming, Kafka, Kinesis, and TCP ingestion systems, with transform configuration "format": "parquet" and HTTP header Content-Type: application/vnd.apache.parquet. Ingestion supports flattening, pointers, and pretransforms.
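A minimal sketch of sending a Parquet file over HTTP streaming might look like the following. Only the Content-Type value comes from these notes. The /ingest/event path, the x-hdx-table and x-hdx-transform headers, the hostname pattern, and the environment variable names are assumptions to check against your cluster's ingest documentation, and the referenced transform is assumed to already set "format": "parquet".

```shell
# Hypothetical helper: POST a local Parquet file to the HTTP streaming endpoint.
# Assumed (not confirmed by these notes): the /ingest/event path, the
# x-hdx-table and x-hdx-transform headers, and the hostname pattern.
# Set CURL=echo to preview the request instead of sending it.
send_parquet() {
  file=${1:?usage: send_parquet FILE}
  "${CURL:-curl}" --fail --silent --show-error \
    --header "Content-Type: application/vnd.apache.parquet" \
    --header "x-hdx-table: ${HDX_TABLE:?set HDX_TABLE to project.table}" \
    --header "x-hdx-transform: ${HDX_TRANSFORM:?set HDX_TRANSFORM}" \
    --data-binary "@${file}" \
    "https://${HDX_HOSTNAME:?set HDX_HOSTNAME}.hydrolix.live/ingest/event"
}
```

The `--data-binary` option sends the file bytes unmodified, which matters for a binary format such as Parquet. If endpoint authorization is enabled on your cluster, the request would also need credentials.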

Intelligent Pod Scheduling during low resources⚓︎

Added Kubernetes PriorityClass for all workloads to enable intelligent pod scheduling during resource constraints. Critical workloads like intake-head can now preempt lower-priority workloads during traffic spikes, preventing data loss while new nodes provision. Includes new priority_classes tunable for overriding default priority assignments.

Enhanced cluster health monitoring⚓︎

Enhanced operator cluster health monitoring to automate post-upgrade checks and provide detailed status reporting. The HydrolixCluster status now includes clusterStatus (Ready/Not Ready/Upgrading/Scaled Off), categorized issues (critical vs non-critical), and health checks for all managed resources. Includes new tunables for configuring which resources to ignore during health evaluation.
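One way to read the new status from the command line is sketched below. It assumes the value is exposed at .status.clusterStatus on the HydrolixCluster/hdx object and that HDX_KUBERNETES_NAMESPACE is set. The function name and the KUBECTL override are illustrative.

```shell
# Hypothetical helper: print the operator-reported cluster status.
# Assumes the field path .status.clusterStatus on the HydrolixCluster/hdx object.
# Set KUBECTL=echo to preview the command instead of running it.
cluster_status() {
  "${KUBECTL:-kubectl}" get hydrolixcluster hdx \
    --namespace "${HDX_KUBERNETES_NAMESPACE:?set HDX_KUBERNETES_NAMESPACE}" \
    --output jsonpath='{.status.clusterStatus}'
}
```

If the field path differs in your version, viewing the whole object with `--output yaml` shows the full status block.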

Breaking changes⚓︎

Renamed `cluster_logs` endpoint and added RBAC⚓︎

Renamed cluster_logs Config API endpoint to cluster_spec with JSON response format and added RBAC permissions. Accounts with view_clusterspec, user_admin, or super_admin permission can access the endpoint.

Quesma has been renamed to Kibana Gateway⚓︎

Service name: Occurrences of the name quesma have been renamed to kibana_gateway (or kibana-gateway, depending on context).

Tunable name: The quesma_config tunable has been renamed to kibana_gateway_config.

Tunable schema: The tunable schema remains unchanged except for one new optional key. The version key was added to support custom image tag specification. For example:

spec:
  kibana_gateway_config:
    version: v2.0

Kibana Gateway upgrade instructions

Update the HydrolixCluster/hdx spec tunable name from quesma_config to kibana_gateway_config.
In the HydrolixCluster/hdx object, if custom scaling is defined using the spec.scale key, change the scaling key from quesma to kibana-gateway.
If external access for Quesma is enabled, change the subdomain in quesma.${HDX_HOSTNAME}.hydrolix.live to kibana-gateway.${HDX_HOSTNAME}.hydrolix.live.
After making these changes, confirm that the Quesma pod has been terminated automatically and that the Kibana Gateway pod is up and running. If you need help, contact Hydrolix support.
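The final check in the steps above could be sketched as follows. It assumes pod names contain the service name (quesma or kibana-gateway), which these notes do not confirm, and that HDX_KUBERNETES_NAMESPACE is set. The function name and the KUBECTL override are illustrative.

```shell
# Hypothetical helper: list pods that belong to the old or new service.
# Assumption: pod names contain "quesma" or "kibana-gateway".
# After a successful migration only kibana-gateway pods should be listed.
gateway_pods() {
  "${KUBECTL:-kubectl}" get pods \
    --namespace "${HDX_KUBERNETES_NAMESPACE:?set HDX_KUBERNETES_NAMESPACE}" \
    --no-headers | grep -E 'quesma|kibana-gateway'
}
```

A listing that still shows a quesma pod after a few minutes suggests one of the renames above was missed.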

Upgrade instructions⚓︎

Upgrade and downgrade restrictions⚓︎

Do not skip minor versions when upgrading or downgrading

Skipping versions when upgrading or downgrading Hydrolix can result in database schema inconsistencies and cluster instability. Always upgrade or downgrade sequentially through each minor version.

Example:
Upgrade from 5.7.9 → 5.8.6 → 5.9.5, not 5.7.9 → 5.9.5.
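The rule can be expressed as a small pre-flight check. This is an illustrative helper and not a Hydrolix tool. It assumes versions look like MAJOR.MINOR.PATCH, with an optional leading v, and that both versions share the same major number.

```shell
# Hypothetical helper: succeed only if the move from one version to another
# changes the minor version by at most one step, in either direction.
check_upgrade_path() {
  from=${1#v}; to=${2#v}
  from_major=${from%%.*}; to_major=${to%%.*}
  from_rest=${from#*.}; to_rest=${to#*.}
  from_minor=${from_rest%%.*}; to_minor=${to_rest%%.*}

  if [ "$from_major" -ne "$to_major" ]; then
    echo "cannot compare across major versions: $1 -> $2" >&2
    return 2
  fi

  gap=$((to_minor - from_minor))
  if [ "$gap" -ge -1 ] && [ "$gap" -le 1 ]; then
    echo "ok: $1 -> $2"
  else
    echo "skips a minor version: $1 -> $2" >&2
    return 1
  fi
}
```

For example, `check_upgrade_path 5.7.9 5.8.6` succeeds, while `check_upgrade_path 5.7.9 5.9.5` returns a non-zero status.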

Apply the new Hydrolix operator⚓︎

If you have a self-managed installation, apply the new operator directly with the kubectl command examples below. If you're using Hydrolix-supplied tools to manage your installation, follow the procedure prescribed by those tools.

GKE⚓︎

kubectl apply -f "https://www.hydrolix.io/operator/v5.9.5/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&gcp-storage-sa=${GCP_STORAGE_SA}"

EKS⚓︎

kubectl apply -f "https://www.hydrolix.io/operator/v5.9.5/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}&aws-storage-role=${AWS_STORAGE_ROLE}"

LKE and AKS⚓︎

kubectl apply -f "https://www.hydrolix.io/operator/v5.9.5/operator-resources?namespace=${HDX_KUBERNETES_NAMESPACE}"

Monitor the upgrade process⚓︎

Kubernetes jobs named init-cluster and init-turbine-api will automatically run to upgrade your entire installation to match the new operator's version number. This takes a few minutes, during which you can watch your pods restart with your Kubernetes monitoring tool.

Ensure both the init-cluster and init-turbine-api jobs have completed successfully and that the turbine-api pod has restarted without errors. After that, view the UI and use the API of your new installation as a final check.
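One possible way to block until both jobs finish is `kubectl wait`, sketched below. The job names come from the text above. The function name, the 600-second default timeout, and the KUBECTL override are assumptions for illustration.

```shell
# Hypothetical helper: wait for both init jobs to reach the Complete condition.
# The 600s default timeout is an arbitrary choice, not a Hydrolix recommendation.
# Set KUBECTL=echo to preview the command instead of running it.
wait_for_init_jobs() {
  namespace=${1:?usage: wait_for_init_jobs NAMESPACE [TIMEOUT]}
  timeout=${2:-600s}
  "${KUBECTL:-kubectl}" wait --namespace "$namespace" \
    --for=condition=complete --timeout="$timeout" \
    job/init-cluster job/init-turbine-api
}
```

A non-zero exit status means at least one job did not complete within the timeout, which is a cue to inspect the job logs.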

If the turbine-api pod doesn't restart successfully, or other functionality is missing, check the logs of the init-cluster and init-turbine-api jobs for details about failures. This can be done using the k9s utility or with the kubectl command:

% kubectl logs -l app=init-cluster
% kubectl logs -l app=init-turbine-api

If you still need help, contact Hydrolix support.

Changelog⚓︎

Updates⚓︎

API updates⚓︎

Updated Python dependencies to include security fixes for CVE-2024-39330 (path traversal in django-storages) and CVE-2025-50181 (SSRF vulnerability in urllib3 via boto3).
- django: 5.0.14 → 5.2.8
- django-storages: 1.14.3 → 1.14.6
- boto3: 1.34.50 → 1.35.99
Upgraded Gunicorn from 20.1.0 to 23.0.0.

Intake updates⚓︎

Upgraded Rust environment from 1.90.0 to 1.91.0 to ensure cargo fmt is available during build.

Cluster Operations updates⚓︎

Updated default HTTP proxy (Chproxy) version from 0.5.1 to 0.6.1.

Improvements⚓︎

Config API improvements⚓︎

Added Column-Level Access Control (CLAC) to the RBAC system, allowing per-table access policies.
Updated dictionary file upload endpoint help text to clarify that the file must be a local path rather than a URI. The filename field is now required.
Added query parameter filters to the /columns endpoint to support UI pagination for separate alias and additional names tables.
Email validation now ensures user email addresses meet Keycloak username requirements, preventing account creation failures.
Prevented deletion of dictionary files that are currently in use by dictionaries. This avoids query-head "dictionary file syncing failed" errors.
Added ability to reference credentials by name in addition to ID across all Config API endpoints. Supports both credential and credential_id fields on Storage, Table autoingest settings, and other credential references, enabling portable Configuration-as-Code workflows across clusters.
Added endpoint column_value_mapping for each table for assigning column values to storage locations. Using PUT and PATCH to manage these mappings avoids the risks of using PUT on a full table.
Renamed cluster_logs endpoint to cluster_spec with JSON response format and added RBAC permissions (super_admins, user_admins, and users with view_clusterspec permission can access).
Updated API documentation for credentials endpoint with type-specific schemas and added missing tags to endpoints with unlabeled responses.
Enabled service accounts to create other service accounts and issue tokens when they have appropriate RBAC permissions. This change removes previous restrictions that prevented service accounts from performing these operations.
Improved the sqlperms endpoint performance.
Updated default flush settings for new tables to optimize partition sizing and data retention. Hot data partition width reduced from 55 to 5 minutes for better query performance, and cold data max age reduced from 10 years to 1 year. Existing tables retain their configured settings.

Cluster Operations improvements⚓︎

Enhanced operator cluster health monitoring to automate post-upgrade checks and provide detailed status reporting. Cluster status and health checks are available for all managed resources.
Improved ZooKeeper connection handling with connection and command timeouts to increase responsiveness in the face of bad nodes or unresolvable paths. Also added more detail in the INFO-level logs for troubleshooting.
Added conditional Keycloak podAntiAffinity that applies only when replicas > 1, preventing scheduling issues in single-replica deployments.

Scale settings for intake-indexer sidecar in intake-head pools can now be overridden, allowing customization through the scale_profile configuration. For example:

spec:
  pools:
    intake-head-test:
      cpu: "1"
      name: intake-head-test
      replicas: "1"
      scale_profile: test
      service: intake-head
  scale:
    profile:
      test:
        intake-indexer:
          memory: 256Mi

Added debugging utilities to the tooling pod image, including networking tools (nc, telnet, traceroute, tcpdump), process debugging utilities (lsof, strace, htop), JSON tooling (jq), and Kubernetes interaction tools (kubectl, k9s).
Increased default memory allocation for turbine-api to 2Gi across all scale profiles, addressing increased memory requirements from v5.7 configuration changes.
Added host-based Traefik routing for Grafana using the <hostname>-grafana.<domain> URL pattern to facilitate smooth migration by the operator.
Added preferred pod anti-affinity rules and logic for ZooKeeper, RabbitMQ, Redpanda, and Traefik. This improves reliability by distributing pods across nodes while preserving custom node affinity configurations.
Added Kubernetes PriorityClasses for all workloads to enable intelligent pod scheduling during resource constraints. Critical workloads like intake-head can now preempt lower-priority workloads during traffic spikes, preventing data loss while new nodes provision. Includes new priority_classes tunable for overriding default priority assignments.
Enhanced kibana_gateway_config tunable to support multiple Hydrolix projects.
Added Subject Alternative Name (SAN) support for ACME-generated SSL certificates through the new alt_names tunable. This enables a single certificate to cover multiple domain names, eliminating SSL errors when accessing services through different hostnames.
A new hdx-vpa-metrics service can offload VPA metric collection from HDX Scaler to a dedicated service, improving performance. Controlled by new hdx_vpa_metrics tunable with sub-keys for enabled, poll_interval, filter_monitored_pods, and metrics_port.
Enhanced HDX Scaler cooldown logic to only apply cooldowns when scaling actions are actually taken, preventing unnecessary delays between scale operations. Previously, cooldowns were applied even when no scaling occurred, which could delay subsequent scaling decisions.
Added events API permissions to the hdx-scaler role. Previously, the hdx-scaler service account lacked the necessary RBAC permissions to watch Kubernetes events, resulting in continuous 403 Forbidden errors being logged to CloudWatch audit logs. This increased CloudWatch costs.
Fixed the hdxscaler exponentially weighted moving average (EWMA) calculation producing incorrect, negative values under CPU saturation.
Changing hdxscaler settings no longer requires a restart.

Security improvements⚓︎

RBAC authorization can now be added to the /ingest, /prometheus, /version, /grafana, /superset, and /kibana endpoints. Enable this authorization using the tunable enable_traefik_authorization, which is false by default.
Added runAsNonRoot security context to all Kubernetes workloads at both container and pod levels.
Updated hdx-pg-monitor and hdx-scaler containers to run as a non-root turbine user.
Updated hdx-pod-metrics container to run as a non-root user.
Fixed security vulnerability where Traefik metrics endpoints were publicly accessible when ip_allowlist was set to 0.0.0.0/0. Metrics ports are now exposed through a separate internal traefik-metrics service, preventing unauthorized access while preserving Prometheus scraping capability.
Converted indexer sidecar to init-container in intake-head pods to ensure proper termination order during scale-down events. This prevents data loss by ensuring the indexer remains available until the stream container terminates.

Core improvements⚓︎

Added catalog_resp_time_ms to query_detail_runtime_stats for summary table queries, providing catalog read timing metrics previously only available for turbine storage queries.
Lowered socket receive_timeout in ClickHouse settings from 1,000 seconds to 20 seconds to reduce delays in cancel responses.
Improved dictionary loading when dictionary files cannot be fetched. Turbine now logs errors and continues operating, allowing operations to work for tables that don't require the unavailable dictionary.
Added Phase 1 support for ClickHouse JSON data type. Includes read/write operations for JSON partitions and fixes for summary table compatibility.
Added observability columns to hdx.active_queries table including query_id, initial_query_id, memory_usage, peak_memory_usage, and host_addr. These additions enable query-level memory tracking and correlation of distributed query execution across query-head and query-peer nodes.

Intake improvements⚓︎

Enhanced cloud storage transfer error handling with retry logic for router communication failures and "target busy" responses, preventing data loss during transient failures.
intake-peers now clean up abandoned transfers from unexpectedly terminated senders.
Added initial Parquet format support to intake with transform configuration "format": "parquet" and HTTP header Content-Type: application/vnd.apache.parquet. Supports JSON gadgets (flattening, pointers, pretransforms).
Extended Parquet format support to Kafka, Kinesis, and TCP ingestion mechanisms. Also fixed issue reading compressed data from Kafka.
Optimized query performance on the catalog table by adding a unique key and composite index.

UI improvements⚓︎

Added API and Docs links to sidebar on pages where they were previously missing.
Updated Query Options UI to support the new unified hdx_query_max_before_external_group_by configuration option.
Added complete UI support for column policies management, enabling administrators to modify Column-Level Access Control (CLAC) policies.

Bug Fixes⚓︎

Config API fixes⚓︎

Fixed catalog_urls endpoint to return 400 Bad Request instead of 500 Internal Server Error when date parameters are incorrectly formatted.
Fixed Internal Server Error when deleting projects with orphaned jobs.
Table columns and row policies are now deleted when their parent table is deleted.
Added validation to prevent duplicate row policy names within the same table, returning 400 errors if duplicate naming is attempted.
Fixed password complexity error handling to return proper 400 errors instead of 500 Internal Server Error when passwords don't satisfy password complexity checks.
Fixed 500 Internal Server Error when updating query options on summary tables via the /query_options endpoint.
Fixed credential migration logic to ensure unique credential names, regardless of case.
Updated guardian migration from 0002 to 0003 in release file.
Fixed a bug in which turbine-api pods did not release stale database connections after a catalog (PostgreSQL) failover or restart, causing connection pool exhaustion.

Cluster Operations fixes⚓︎

Removed REPLICATION permission from hdx-pg-monitor role, preventing permission-related errors during PostgreSQL monitoring operations.
Fixed a race condition where concurrent authentication requests could cause invalid token errors.
hdx_auth metrics are now disabled when unified_auth is disabled in Traefik plugin mode, preventing unnecessary metric collection.
Added PostgreSQL init-container to fix persistent volume mount permissions for non-root PostgreSQL containers.
Fixed key_prefix string formatting to correctly handle braces when using Python f-string format, preventing logs from being written to incorrect directories.
Fixed HDX Scaler to properly terminate orphaned scaler tasks when configuration is reloaded. Previously, removed scaler configurations would continue running, causing unexpected scaling behavior.
Fixed HDX Scaler VPA to no longer fail with "duplicate container name" errors when attempting to scale deployments using initContainers.
Fixed HDX Scaler Kubernetes watcher to properly reconnect when the connection to the Kubernetes API server is lost, preventing scaler failures and ensuring continuous monitoring of cluster resources.
ACME certificate renewal job now supports alternative names for SSL certificates. The acme-renewal job can now perform certificate renewal for multi-domain configurations using this format in the Hydrolix spec:

alt_names:
  - test.hydrolix.dev
  - yourname.yourdomain.com
Fixed TLS configuration for MySQL and Thanos routes by disabling TLS passthrough mode, which was not functioning correctly. MySQL (port 9004) and Thanos-sidecar (port 19091) routes now use disable_tls instead of passthrough_tls, resolving connectivity issues to facilitate Tableau integration.

Core fixes⚓︎

Disallowed ORDER BY clauses in summary table SQL definitions, preventing configuration errors that silently caused missing columns during summary execution.
Fixed segmentation fault during query server shutdown.
Fixed a deadlock that occurred when a query-peer restarted while many concurrent queries were running. QUERY_WAS_CANCELLED exceptions are now thrown so that the affected queries terminate properly.
Fixed query-peer crashes when operating with low OS file descriptor limits. Queries now fail gracefully with informative error messages.

Intake fixes⚓︎

Table name resolution now considers the project when resolving a table name to table ID, preventing incorrect table selection when table names are not unique across projects.
Fixed merge target overrides to persist on config reload, preventing overrides from being reset when configuration is refreshed.
Periodic file deletion now uses the filename path directly instead of converting to a string, ensuring deletion of log files.
Fixed batch job path construction in turbine-api to use correct base URL for all batch operations. This ensures batch job operations (commit, retry, cancel, status, errors) work correctly with legacy batch deployments.
Fixed a bug preventing the job_purge periodic task from deleting jobs with a NULL value for the updated_at field.

Operator fixes⚓︎

Added default priorityClass for http-head, intake-peer, and intake-router to be the same as intake-head.
Fixed hdx-scaler not reloading configuration related to metrics aggregations.

UI fixes⚓︎

Fixed Alter Job page table rendering to wait for data load completion, preventing null values from sometimes appearing in table columns after page reload.
Fixed delete modal text overflow when resource names are too long. The "delete" modal titles no longer overlap the close button.
Fixed transform validation page to eliminate excess API requests for sources, dictionaries, and functions endpoints by fetching only project-specific data. This mitigates "disappearing SQL" events in the UI.
Added validation to ensure the Parent Table field matches the table name in the SQL query when creating summary tables, displaying an error message for mismatches.
Added validation to prevent invalid dash character entry in number input fields across all UI forms. This prevents strings like "1-5" from being entered.
Transforms may now be used with Shadow Tables.
The bulk user invite form now correctly displays error messages when inviting duplicate users.
Fixed summary table edit form to clear non-field validation errors on submit, preventing previous error messages from blocking form resubmission.