Pod - Query Head
Overview⚓︎
Clusters typically run two instances of the query head application. At least one query head must be running in any cluster for the query subsystem to function.
The reverse proxy, Traefik, directs incoming connections to Query Interfaces in a query head pod.
Scaling choices depend on query load, data volume, and expected responsiveness.
In clusters configured with Query pools, the query head directs queries to the requested or default query pool.
Responsibilities⚓︎
At startup, the query head application:
- Verifies that the primary storage bucket is available TODO: all storage buckets?
- Listens on network sockets
- Awaits incoming queries over any configured Query Interfaces
- Reloads the cluster config.json data file from primary storage every 5 seconds
- Spawns a thread upon receiving a query
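The periodic configuration reload can be sketched as a simple polling loop. In this sketch, `fetch` is a hypothetical callable standing in for a read of `config.json` from primary storage:

```python
import json
import time

def poll_cluster_config(fetch, interval_secs=5.0, cycles=3):
    """Reload config.json on a fixed interval (models the 5-second
    refresh). `fetch` returns the raw JSON bytes from primary storage."""
    config = None
    for _ in range(cycles):
        config = json.loads(fetch())
        time.sleep(interval_secs)
    return config
```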
When handling a query received over any interface, the query head:
- Parses the incoming SQL query and generates an optimized plan
- Constructs query options from cluster config, HTTP, and SQL SETTINGS clauses according to Query Options Precedence
- Queries the Catalog for the partitions relevant to the query
- Retrieves available query peer instances for the query from Zookeeper
- Hashes partition filenames for assignment to available query peers
- Transmits the assignment to each query peer
- Collects the intermediate results and completes the aggregation and sorting stage of query handling
- Returns the query result to the client
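The hash-based assignment step above can be sketched as follows. The actual hash function and assignment policy are internal details; MD5 and simple modulo bucketing here are illustrative only:

```python
import hashlib

def assign_partitions(partitions, peers):
    """Distribute partition files across query peers by hashing each
    filename. Deterministic for a fixed peer list, so the same partition
    maps to the same peer while the peer set is unchanged."""
    assignments = {peer: [] for peer in peers}
    for name in partitions:
        # Illustrative hash: the real function is an internal detail.
        idx = int(hashlib.md5(name.encode()).hexdigest(), 16) % len(peers)
        assignments[peers[idx]].append(name)
    return assignments
```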
Recoverable errors are handled according to configuration. For example, upon a query peer failure, a query head might assign the work to a different query peer if `hdx_query_max_attempts` has not been exceeded.
Upon non-recoverable errors, the query head logs the failure and returns an informative error message to the client.
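The reassignment-on-failure behavior can be sketched like this. The function names and the exact retry policy are assumptions; only the `hdx_query_max_attempts` limit comes from the document:

```python
def send_with_retries(peers, batch, send, max_attempts):
    """Reassign a failed batch to another peer until max_attempts is
    exhausted (models the hdx_query_max_attempts limit)."""
    last_error = None
    for attempt, peer in enumerate(peers):
        if attempt >= max_attempts:
            break
        try:
            return send(peer, batch)  # success: intermediate results
        except ConnectionError as exc:
            last_error = exc  # recoverable: try the next peer
    raise RuntimeError("query failed after %d attempts" % max_attempts) from last_error
```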
Deployment and scale⚓︎
| Attribute | Value |
|---|---|
| Type | ReplicaSet in a Deployment |
| Replicas | default 2, minimum 1 |
| CPU | 14 |
| Memory | 36 GiB |
| Storage | default 5 GiB, recommend 50 GiB |
The table shows typical production scale.
Storage inside the cluster is ephemeral. It provides local disk for work operations and optional query caching.
TODO: Are there recommended ratios for CPU, memory and storage?
On query heads, memory is used to accumulate query results until the limits set in `hdx_query_max_bytes_before_external_group_by` and `hdx_query_max_bytes_before_external_sort` are exceeded, after which GROUP BY and ORDER BY operations spill to disk. If a query head is experiencing out-of-memory (OOM) conditions, lower these values so that large operations spill to disk sooner.
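For example, the spill thresholds can be supplied per query in a SETTINGS clause. The threshold value, table, and query below are hypothetical:

```python
# Hypothetical threshold: spill to disk after 4 GiB per operation.
spill_bytes = 4 * 1024**3

# Illustrative query; the table name is a placeholder.
query = (
    "SELECT user_id, count() AS n FROM example.events "
    "GROUP BY user_id ORDER BY n DESC "
    f"SETTINGS hdx_query_max_bytes_before_external_group_by = {spill_bytes}, "
    f"hdx_query_max_bytes_before_external_sort = {spill_bytes}"
)
```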
Scaling considerations⚓︎
Clients connect to the query head over one of the configured Query Interfaces. The query heads require sufficient resources to handle the final stages of query processing, aggregation and sorting, as well as coordination with Zookeeper and the query peers.
Horizontal scaling is not usually necessary.
Vertical scaling can help if a query head encounters OOM conditions.
Autoscaling is not generally useful for the query system. Queries are a bursty workload and usually complete very quickly, so load-based signals are not a good indicator for autoscaling. Since scaling response time is measured in tens of seconds, queries have typically failed or completed long before any scaling change can help.
See also:
- Scale Profiles
- Scale your Cluster
- Custom Scale Profiles
Containers⚓︎
- `wait-for-zookeeper` is an InitContainer
- `query-head` runs the `turbine_server`, which implements the in-cluster Query Interfaces
Important files⚓︎
- `/usr/bin/turbine_server` - executable
- `/var/lib/turbine/turbine.ini` - local configuration file
Startup dependencies⚓︎
Runtime dependencies⚓︎
- Depends on Zookeeper for presence and absence of `query-peer` instances in each pool
- Distributes work to each Query Peer
- Consults Catalog in a read-only fashion for relevant partitions for every query
- Requires the Config API for authenticating clients
- Polls the `config.json` blob Storage path every 5 seconds for updates
Runtime provides⚓︎
- Provides all externally-accessible Query Interfaces
Network listeners⚓︎
External⚓︎
External traffic is managed by the Traefik application router before passing to the `query-head` application.
When TLS is enabled (standard), the following sockets are open:
- tcp/9440 - `traefik` handles TLS and passes plaintext to pod tcp/9000; set port using tunable `native_tls_port`
- tcp/8088 - `traefik` handles TLS and passes plaintext to pod tcp/8088; set port using tunable `clickhouse_http_port`
When TLS is disabled (non-standard), the following sockets are open:
- tcp/9000 - ClickHouse native protocol; set port using tunable `native_port`
- tcp/8088 - ClickHouse HTTP; set port using tunable `clickhouse_http_port`
When tunable `disable_traefik_mysql_port` is false:
- tcp/9004 - MySQL protocol; set port using tunable `mysql_port`
Cluster internal⚓︎
- tcp/9000 - ClickHouse native protocol (not configurable)
- tcp/8088 - ClickHouse HTTP (not configurable)
- tcp/9004 - MySQL protocol (not configurable)
These ports are reachable from inside the cluster, for example from the tooling pod.
HTTP APIs⚓︎
All on tcp/8088⚓︎
- `/query` - HTTP Query API
- `/support` - Hydrolix core system build-time diagnostic output
- `/metrics` - Metrics
- `/` - ClickHouse HTTP API
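A client request to the HTTP Query API might be built as follows. The host name is a placeholder, authentication headers are omitted, and the request is constructed without being sent:

```python
from urllib.request import Request

# Build (without sending) a POST to the HTTP Query API on tcp/8088.
# The host name is a placeholder for your cluster's query endpoint.
req = Request(
    "http://query-head.my-cluster.internal:8088/query",
    data=b"SELECT 1",
    headers={"Content-Type": "text/plain"},
    method="POST",
)
```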
Configuration⚓︎
Application configuration is split across multiple sources.
Query options⚓︎
Query options provide flexible control over circuit breakers, format selection, and output destination.
Customers set query options using Query Interfaces or the Config API. See Query Options Precedence and Query Options Reference.
The config.json data file contains org, project, and table query options.
The incoming query may contain HTTP headers or query parameters with query options.
The query head resolves the different sources of query options to construct the set that will be used for each handled query.
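The resolution step can be sketched as a merge from lowest to highest precedence. The exact ordering is defined in Query Options Precedence; this sketch assumes SQL SETTINGS override HTTP values, which override cluster config:

```python
def resolve_query_options(cluster_opts, http_opts, sql_settings):
    """Merge query-option sources, later (higher-precedence) sources
    overriding earlier ones. Precedence order here is an assumption."""
    resolved = dict(cluster_opts)   # org/project/table options from config.json
    resolved.update(http_opts)      # HTTP headers or query parameters
    resolved.update(sql_settings)   # SQL SETTINGS clause
    return resolved
```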
Hydrolix Config API⚓︎
The query head depends on the Config API to authenticate clients.
Cluster⚓︎
- `CONFIG_API_ENDPOINT` - in-cluster http://turbine-api:3000/config
- `CACHE_SPACE_CAPACITY` - from storage scaling
- `CONTAINER_MEMORY_LIMIT_BYTES` - from memory scaling
Secrets⚓︎
- `general` - `CATALOG_DB_PASSWORD` - Credentials used for communicating with the Catalog
- `general` - `CI_USER_PASSWORD` - TODO: Why is this here?
Hydrolix tunables⚓︎
See Hydrolix Tunables List for detailed explanations.
Storage related⚓︎
These settings control interactions with storage buckets.
- `auth_http_read_timeout_ms`
- `auth_http_response_timeout_ms`
- `http_connect_timeout_ms`
- `http_ssl_connect_timeout_ms`
- `http_response_timeout_ms`
- `http_read_timeout_ms`
- `http_write_timeout_ms`
- `max_http_retries`
- `initial_exp_backoff_seconds`
- `max_exp_backoff_seconds`
- `exp_backoff_growth_factor_ms`
- `exp_backoff_additive_jitter`
Local disk⚓︎
These tunables influence local disk behavior.
- `disable_disk_cache`
- `disk_cache_cull_start_perc`
- `disk_cache_cull_stop_perc`
- `disk_cache_redzone_start_perc`
- `disk_cache_entry_max_ttl_minutes`
- `io_perf_mappings`
Application⚓︎
These settings set global limits and control the behavior of the query head.
- `enable_query_auth`
- `default_query_pool`
- `max_concurrent_queries`
- `max_server_memory_usage`
- `user_acl_refresh_interval_secs`
- `user_token_refresh_interval_secs`
- `user_token_expiration_secs`
- `str_dict_enabled`
- `str_dict_nr_threads`
- `str_dict_min_dict_size`
Network⚓︎
These settings control application configuration as a client of the Domain Name System (DNS).
- `use_hydrolix_dns_resolver`
- `dns_server_ip`
- `dns_aws_max_ttl_secs`
- `dns_aws_max_resolution_attempts`
- `dns_azure_max_ttl_secs`
- `dns_azure_max_resolution_attempts`
- `dns_gcs_max_ttl_secs`
- `dns_gcs_max_resolution_attempts`
Deprecated and other⚓︎
These are old, deprecated, or unused tunables.
- `use_https_with_s3` - DEPRECATED: use `db_bucket_url` or `db_bucket_http_enabled`
Metrics⚓︎
See Query Metrics.
Key health metrics⚓︎
TODO: What are the most important metrics?
Other application metrics⚓︎
TODO: Are there any additional metrics that could fall into the "interesting" category?
Logs⚓︎
Logging output from the query head is handled by Vector and can be found in two places:
- primary storage bucket path `logs/query-head/query-head/<YYYY-MM-DD>/<timestamp>-<hash>.log.gz`
- in Hydrologs, under the system project `hydro` and table `logs`
Effects of breakage⚓︎
- The query head is crucial to query system availability in the cluster. If all query heads are failing, clients will be unable to query and will receive network errors or timeouts.