Pod - Query Head
Overview⚓︎
Clusters typically run two instances of the query head application. At least one query head must be running in any cluster for the query subsystem to function.
The reverse proxy, Traefik, directs incoming connections to Query Interfaces in a query head pod.
Scaling choices depend on query load, data volume, and expected responsiveness.
In clusters configured with Query pools, the query head directs queries to the requested or default query pool.
Responsibilities⚓︎
At startup, the query head application:
- Verifies that the primary storage bucket is available TODO: all storage buckets?
- Listens on network sockets
- Awaits incoming queries over any configured Query Interfaces
- Reloads the cluster config.json data file from primary storage every 5 seconds
- Spawns a thread upon receiving a query
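The periodic configuration reload can be sketched as a simple polling loop. In this sketch, `fetch` is a hypothetical callable standing in for a read of `config.json` from primary storage:

```python
import json
import time

def poll_cluster_config(fetch, interval_secs=5.0, cycles=3):
    """Reload config.json on a fixed interval (models the 5-second
    refresh). `fetch` returns the raw JSON bytes from primary storage."""
    config = None
    for _ in range(cycles):
        config = json.loads(fetch())
        time.sleep(interval_secs)
    return config
```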
When handling a query received over any interface, the query head:
- Parses the incoming SQL query and generates an optimized plan
- Constructs query options from cluster config, HTTP, and SQL SETTINGS clauses according to Query Options Precedence
- Queries the Catalog for the partitions relevant to the query
- Retrieves available query peer instances for the query from Zookeeper
- Hashes partition filenames for assignment to available query peers
- Transmits the assignment to each query peer
- Collects the intermediate results and completes the aggregation and sorting stage of query handling
- Returns the query result to the client
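The hash-based assignment step above can be sketched as follows. The actual hash function and assignment policy are internal details; MD5 and simple modulo bucketing here are illustrative only:

```python
import hashlib

def assign_partitions(partitions, peers):
    """Distribute partition files across query peers by hashing each
    filename. Deterministic for a fixed peer list, so the same partition
    maps to the same peer while the peer set is unchanged."""
    assignments = {peer: [] for peer in peers}
    for name in partitions:
        # Illustrative hash: the real function is an internal detail.
        idx = int(hashlib.md5(name.encode()).hexdigest(), 16) % len(peers)
        assignments[peers[idx]].append(name)
    return assignments
```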
Recoverable errors are handled according to configuration. For example, upon a query peer failure, a query head might assign the work to a different query peer if `hdx_query_max_attempts` has not been exceeded.
Upon non-recoverable errors, the query head logs the failure and returns an informative error message to the client.
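The reassignment-on-failure behavior can be sketched like this. The function names and the exact retry policy are assumptions; only the `hdx_query_max_attempts` limit comes from the document:

```python
def send_with_retries(peers, batch, send, max_attempts):
    """Reassign a failed batch to another peer until max_attempts is
    exhausted (models the hdx_query_max_attempts limit)."""
    last_error = None
    for attempt, peer in enumerate(peers):
        if attempt >= max_attempts:
            break
        try:
            return send(peer, batch)  # success: intermediate results
        except ConnectionError as exc:
            last_error = exc  # recoverable: try the next peer
    raise RuntimeError("query failed after %d attempts" % max_attempts) from last_error
```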
Deployment and scale⚓︎
| Attribute | Value |
|---|---|
| Type | ReplicaSet in a Deployment |
| Replicas | default 2, minimum 1 |
| CPU | 14 |
| Memory | 36 GiB |
| Storage | default 5 GiB, recommend 50 GiB |
The table shows typical production scale.
Storage inside the cluster is ephemeral. It provides local disk for work operations and optional query caching.
TODO: Are there recommended ratios for CPU, memory and storage?
On query heads, memory is used to accumulate query results until the limits set in `hdx_query_max_bytes_before_external_group_by` and `hdx_query_max_bytes_before_external_sort` are exceeded, after which GROUP BY and ORDER BY operations spill to disk. If a query head is experiencing out-of-memory (OOM) conditions, lower these values so that large operations spill to disk sooner.
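For example, the spill thresholds can be supplied per query in a SETTINGS clause. The threshold value, table, and query below are hypothetical:

```python
# Hypothetical threshold: spill to disk after 4 GiB per operation.
spill_bytes = 4 * 1024**3

# Illustrative query; the table name is a placeholder.
query = (
    "SELECT user_id, count() AS n FROM example.events "
    "GROUP BY user_id ORDER BY n DESC "
    f"SETTINGS hdx_query_max_bytes_before_external_group_by = {spill_bytes}, "
    f"hdx_query_max_bytes_before_external_sort = {spill_bytes}"
)
```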
Scaling considerations⚓︎
Clients connect to the query head over one of the configured Query Interfaces. The query heads require sufficient resources to handle the final stages of query processing, aggregation and sorting, as well as coordination with Zookeeper and the query peers.
Horizontal scaling is not usually necessary.
Vertical scaling can help if a query head encounters OOM conditions.
Autoscaling is not generally useful for the query system. Queries are a bursty workload and usually complete very quickly, so load-based signals are not a good indicator for autoscaling. Since scaling response time is measured in tens of seconds, queries have typically failed or completed long before any scaling change can help.
See also:
- Scale Profiles
- Scale your Cluster
- Custom Scale Profiles
Containers⚓︎
- `wait-for-zookeeper` is an InitContainer
- `query-head` runs the `turbine_server`, which implements the in-cluster Query Interfaces
Important files⚓︎
- `/usr/bin/turbine_server` - executable
- `/var/lib/turbine/turbine.ini` - local configuration file
Startup dependencies⚓︎
Runtime dependencies⚓︎
- Depends on Zookeeper for presence and absence of `query-peer` instances in each pool
- Distributes work to each Query Peer
- Consults Catalog in a read-only fashion for relevant partitions for every query
- Requires the Config API for authenticating clients
- Polls the `config.json` blob Storage path every 5 seconds for updates
Runtime provides⚓︎
- Provides all externally-accessible Query Interfaces
Network listeners⚓︎
External⚓︎
External traffic is managed by the Traefik application router before passing to the `query-head` application.
When TLS is enabled (standard), the following sockets are open:
- tcp/9440 - `traefik` handles TLS and passes plaintext to pod tcp/9000; set port using tunable `native_tls_port`
- tcp/8088 - `traefik` handles TLS and passes plaintext to pod tcp/8088; set port using tunable `clickhouse_http_port`
When TLS is disabled (non-standard), the following sockets are open:
- tcp/9000 - ClickHouse native protocol; set port using tunable `native_port`
- tcp/8088 - ClickHouse HTTP; set port using tunable `clickhouse_http_port`
When tunable `disable_traefik_mysql_port` is false:
- tcp/9004 - MySQL protocol; set port using tunable `mysql_port`
Cluster internal⚓︎
- tcp/9000 - ClickHouse native protocol (not configurable)
- tcp/8088 - ClickHouse HTTP (not configurable)
- tcp/9004 - MySQL protocol (not configurable)
These ports are reachable from inside the cluster, for example from the tooling pod.
HTTP APIs⚓︎
All on tcp/8088⚓︎
- `/query` - HTTP Query API
- `/support` - Hydrolix core system build-time diagnostic output
- `/metrics` - Metrics
- `/` - ClickHouse HTTP API
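A client request to the HTTP Query API might be built as follows. The host name is a placeholder, authentication headers are omitted, and the request is constructed without being sent:

```python
from urllib.request import Request

# Build (without sending) a POST to the HTTP Query API on tcp/8088.
# The host name is a placeholder for your cluster's query endpoint.
req = Request(
    "http://query-head.my-cluster.internal:8088/query",
    data=b"SELECT 1",
    headers={"Content-Type": "text/plain"},
    method="POST",
)
```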
Configuration⚓︎
Application configuration is split across multiple sources.
Query options⚓︎
Query options provide flexible control over circuit breakers, format selection, and output destination.
Customers set query options using Query Interfaces or the Config API. See Query Options Precedence and Query Options Reference.
The config.json data file contains org, project, and table query options.
The incoming query may contain HTTP headers or query parameters with query options.
The query head resolves the different sources of query options to construct the set that will be used for each handled query.
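The resolution step can be sketched as a merge from lowest to highest precedence. The exact ordering is defined in Query Options Precedence; this sketch assumes SQL SETTINGS override HTTP values, which override cluster config:

```python
def resolve_query_options(cluster_opts, http_opts, sql_settings):
    """Merge query-option sources, later (higher-precedence) sources
    overriding earlier ones. Precedence order here is an assumption."""
    resolved = dict(cluster_opts)   # org/project/table options from config.json
    resolved.update(http_opts)      # HTTP headers or query parameters
    resolved.update(sql_settings)   # SQL SETTINGS clause
    return resolved
```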
Hydrolix Config API⚓︎
The query head depends on the Config API to authenticate clients.
Cluster⚓︎
- `CONFIG_API_ENDPOINT` - in-cluster http://turbine-api:3000/config
- `CACHE_SPACE_CAPACITY` - from storage scaling
- `CONTAINER_MEMORY_LIMIT_BYTES` - from memory scaling
Secrets⚓︎
- `general` - `CATALOG_DB_PASSWORD` - Credentials used for communicating with the Catalog
- `general` - `CI_USER_PASSWORD` - TODO: Why is this here?
Hydrolix tunables⚓︎
See Hydrolix Tunables List for detailed explanations.
Storage related⚓︎
These settings control interactions with storage buckets.
- `auth_http_read_timeout_ms`
- `auth_http_response_timeout_ms`
- `http_connect_timeout_ms`
- `http_ssl_connect_timeout_ms`
- `http_response_timeout_ms`
- `http_read_timeout_ms`
- `http_write_timeout_ms`
- `max_http_retries`
- `initial_exp_backoff_seconds`
- `max_exp_backoff_seconds`
- `exp_backoff_growth_factor_ms`
- `exp_backoff_additive_jitter`
Local disk⚓︎
These tunables influence local disk behavior.
- `disable_disk_cache`
- `disk_cache_cull_start_perc`
- `disk_cache_cull_stop_perc`
- `disk_cache_redzone_start_perc`
- `disk_cache_entry_max_ttl_minutes`
- `io_perf_mappings`
Application⚓︎
These settings set global limits and control the behavior of the query head.
- `enable_query_auth`
- `default_query_pool`
- `max_concurrent_queries`
- `max_server_memory_usage`
- `user_acl_refresh_interval_secs`
- `user_token_refresh_interval_secs`
- `user_token_expiration_secs`
- `str_dict_enabled`
- `str_dict_nr_threads`
- `str_dict_min_dict_size`
Network⚓︎
These settings control application configuration as a client of the Domain Name System (DNS).
- `use_hydrolix_dns_resolver`
- `dns_server_ip`
- `dns_aws_max_ttl_secs`
- `dns_aws_max_resolution_attempts`
- `dns_azure_max_ttl_secs`
- `dns_azure_max_resolution_attempts`
- `dns_gcs_max_ttl_secs`
- `dns_gcs_max_resolution_attempts`
Deprecated and other⚓︎
These are old, deprecated, or unused tunables.
- `use_https_with_s3` - DEPRECATED: use `db_bucket_url` or `db_bucket_http_enabled`
Metrics⚓︎
See Query Metrics.
Key health metrics⚓︎
TODO: What are the most important metrics?
Other application metrics⚓︎
TODO: Are there any additional metrics that could fall into the "interesting" category?
Logs⚓︎
Logging output from the query head is handled by Vector and can be found in two places:
- primary storage bucket path `logs/query-head/query-head/<YYYY-MM-DD>/<timestamp>-<hash>.log.gz`
- in Hydrologs, under the system project `hydro` and table `logs`
Effects of breakage⚓︎
- The query head is crucial to query system availability in the cluster. If all query heads are failing, clients will be unable to query and will receive network errors or timeouts.