Observability

Your Hydrolix stack includes Prometheus, an open-source metrics database. While your stack runs, Hydrolix continuously updates its Prometheus instance with metrics. You can query, view, and actively monitor this information through your stack's Grafana instance after performing a one-time setup.

Using Grafana

Since version 2.10.12, your Grafana instance comes with default data sources built in:

  • hdx-monitoring-prometheus, which points to the Prometheus service and is used by our monitoring dashboards
  • hdx-monitoring-query, which points to the Hydrolix cluster and is used by our monitoring dashboards
  • hdx-query, the default data source you can use to create dashboards from your Hydrolix cluster

If you are using an earlier version of Hydrolix, either upgrade or follow the steps below to connect your stack's Prometheus instance as a data source in its Grafana instance.

Preparing the data source

  1. Visit your stack's Grafana instance at https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/grafana.

  2. Select ⚙️ Configuration from the left menu bar, and then select Data Sources.

  3. Click the Add data source button.

  4. Select Prometheus from the list of compatible data sources.

  5. Enter the name hdx-monitoring-prometheus in the Name field.

  6. Enter https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/prometheus as the data source's URL.

Entering your Prometheus instance's URL.

Finally, click Save & Test. The message "Data source is working" should appear immediately, completing this setup.

If you do not see that success message, confirm that the URL you provided is correct and that your stack has at least one Prometheus service running. If you still have trouble getting Grafana to connect to Prometheus, contact Hydrolix support.
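If you manage Grafana declaratively, the same data source can also be provisioned with a configuration file instead of the UI steps above. A sketch, assuming a standard Grafana provisioning setup (the file path is hypothetical; substitute your own hostname):

```yaml
# Hypothetical path: /etc/grafana/provisioning/datasources/hdx.yaml
apiVersion: 1
datasources:
  - name: hdx-monitoring-prometheus
    type: prometheus
    access: proxy
    url: https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/prometheus
```

Grafana loads provisioning files at startup, so this achieves the same result as the Save & Test flow above.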

Importing Monitoring Dashboard

Hydrolix publishes prebuilt monitoring dashboards for its services as Grafana community dashboards.

You can easily add these dashboards to your Grafana deployment:

  1. Visit your stack's Grafana instance at https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/grafana.

  2. Select + Create from the left menu bar, and then select Import.

  3. Enter the Hydrolix dashboard ID and click Load.

  4. Hydrolix publishes several dashboard IDs:

    1. 14443 for HTTP streaming ingest monitoring
    2. 14444 for Kafka ingest monitoring
    3. 14446 for merge service monitoring
    4. 14447 for overall system metrics such as CPU, disk, etc.
    5. 14442 for table/partitions monitoring
    6. 14543 for an overview of your Hydrolix cluster
    7. 14846 for setting up alerts on the different components
    8. 15042 for query performance monitoring

Entering your dashboard ID for import.

When you import a dashboard, Grafana prompts you to select the data sources it should use:

  • hdx-monitoring-prometheus for Prometheus
  • hdx-monitoring-query for ClickHouse (only for the Table / Partitions dashboard)

Based on your deployment you may want specific dashboards or all of them.

Querying Grafana

After setting up Prometheus as a data source, you can query, graph, and monitor your Hydrolix metrics through all the tools and techniques Grafana makes available.

For a simple example, select 🧭 Explore from Grafana's left menu bar, and then enter a basic metric query such as process_open_fds into the text entry field. The result is a multi-line graph showing the open file descriptors in use by several of your stack's components.

To turn this static graph into a dynamic monitor, select 5s from the pull-down menu next to the Run query button. The graph then refreshes itself every five seconds.

Viewing a simple metric query.

For a complete list of the metrics that Hydrolix makes available, see Hydrolix's Metrics.

For more information on using Grafana and Prometheus together, you may consult Prometheus's documentation on that topic.

Using Prometheus directly

Prometheus has its own web-based UI, available by visiting https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/prometheus in your web browser.

This view is far more basic than Grafana's, and is suitable for quickly entering queries and seeing simple graphed results. It is available immediately, without any additional setup.
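The same endpoint also exposes Prometheus's standard HTTP API, so you can query metrics programmatically. A minimal sketch in Python of building an instant-query request, using the /prometheus path shown above (the hostname is a placeholder, and the service label value is only an example):

```python
from urllib.parse import urlencode

# Placeholder host; substitute your own stack's hostname.
BASE = "https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/prometheus"

def instant_query_url(promql: str) -> str:
    """Build an instant-query URL for Prometheus's HTTP API (/api/v1/query)."""
    return f"{BASE}/api/v1/query?{urlencode({'query': promql})}"

url = instant_query_url('process_open_fds{service="stream-peer"}')
print(url)
# Fetch the URL with urllib.request.urlopen(url) (or curl); the response is
# JSON with a top-level "status" field and the samples under data.result.
```

The urlencode call percent-escapes the braces and quotes of the PromQL label matcher, which is easy to get wrong when assembling such URLs by hand.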

Hydrolix's metrics

The tables below list the available metrics and the components that update them.

If more than one component uses a given metric, querying it returns results from all relevant components. You can restrict results to a specific component by adding a service label to your query, e.g. process_open_fds{service="stream-peer"}.

For more information about metric types, refer to Prometheus's documentation.

General metrics

These metrics track general process statistics, along with counters related to data ingestion.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| bytes_written | Counter | Batch peer, Stream peer | Bytes written to the indexer. |
| partitions_created | Counter | Batch peer, Stream peer | Count of partitions created. |
| process_cpu_seconds_total | Counter | Batch peer, Stream head, Stream peer | Total user and system CPU time spent, in seconds. |
| process_max_fds | Gauge | Batch peer, Stream head, Stream peer | Maximum number of open file descriptors. |
| process_open_fds | Gauge | Batch peer, Stream head, Stream peer | Number of open file descriptors. |
| process_resident_memory_bytes | Gauge | Batch peer, Stream head, Stream peer | Resident memory size, in bytes. |
| process_start_time_seconds | Gauge | Batch peer, Stream head, Stream peer | Start time of the process since the Unix epoch, in seconds. |
| process_virtual_memory_bytes | Gauge | Batch peer, Stream head, Stream peer | Virtual memory size, in bytes. |
| process_virtual_memory_max_bytes | Gauge | Batch peer, Stream head, Stream peer | Maximum amount of virtual memory available, in bytes. |
| promhttp_metric_handler_requests_in_flight | Gauge | Batch peer, Stream head, Stream peer | Current number of scrapes being served. |
| promhttp_metric_handler_requests_total | Counter | Batch peer, Stream head, Stream peer | Total number of scrapes, by HTTP status code. |
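Counters such as process_cpu_seconds_total only ever increase, so in practice they are queried with PromQL's rate() function rather than read directly. A sketch of the underlying idea, computing a per-second rate from two counter samples (the sample values are made up for illustration):

```python
def per_second_rate(t1: float, v1: float, t2: float, v2: float) -> float:
    """Per-second increase of a counter between two samples (t, value),
    which is the core of what PromQL's rate() computes over a window
    (ignoring counter resets, which rate() also handles)."""
    return (v2 - v1) / (t2 - t1)

# Two hypothetical samples of process_cpu_seconds_total, 60 seconds apart:
r = per_second_rate(0, 120.0, 60, 150.0)
print(r)  # 0.5 CPU-seconds per second, i.e. roughly 50% of one core
```

In Grafana you would express the same thing as rate(process_cpu_seconds_total[5m]) rather than computing it by hand.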

Query metrics

These metrics track query activity.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| net_connect_attempts_total | Histogram | Head/Query peer | Histogram of TCP connection attempts to the storage service. |
| net_connect_seconds | Histogram | Head/Query peer | Histogram of time to connect over TCP to the storage service, in seconds. |
| net_dns_resolve_seconds | Histogram | Head/Query peer | Histogram of DNS resolution time to the storage service, in seconds. |
| net_http_response_time | Histogram | Head/Query peer | Histogram of HTTP response time from the storage service, in seconds. |
| net_http_response_bytes | Histogram | Head/Query peer | Histogram of HTTP bytes downloaded from the storage service. |
| net_http_attempts_total | Histogram | Head/Query peer | Histogram of HTTP connection attempts to the storage service. |
| net_http_status_code | Histogram | Head/Query peer | Histogram of HTTP status codes returned by the storage service. |
| vfs_cache_hitmiss_total | Histogram | Head/Query peer | Histogram of cache status (bucket 0 = cache miss, 1 = cache hit). |
| vfs_cache_read_bytes | Histogram | Head/Query peer | Histogram of bytes read from cache. |
| vfs_net_read_bytes | Histogram | Head/Query peer | Histogram of bytes read from the network. |
| vfs_cache_lru_file_eviction_total | Histogram | Head/Query peer | Histogram of cache file evictions. |
| epoll_cpu_seconds | Histogram | Head/Query peer | Histogram of CPU used, in seconds. |
| epoll_io_seconds | Histogram | Head/Query peer | Histogram of I/O, in seconds. |
| epoll_poll_seconds | Histogram | Head/Query peer | Histogram of waits for file descriptors, in seconds. |
| hdx_storage_r_catalog_partitions_total | Histogram | Head/Query peer | Histogram of per-query catalog partition count. |
| hdx_storage_r_partitions_read_total | Histogram | Head/Query peer | Histogram of per-query partitions read. |
| hdx_storage_r_partitions_per_core_total | Histogram | Head/Query peer | Histogram of per-core partitions used. |
| hdx_storage_r_peers_used_total | Histogram | Query peer | Histogram of total storage peers used. |
| hdx_storage_r_cores_used_total | Histogram | Query peer | Histogram of total cores used. |
| hdx_storage_r_catalog_timerange | Histogram | Head/Query peer | Histogram of query time range distribution. |
| hdx_partition_columns_read_total | Histogram | Head/Query peer | Histogram of columns read. |
| hdx_partition_block_decode_seconds | Histogram | Head/Query peer | Histogram of time spent decoding HDX blocks, in seconds. |
| hdx_partition_open_seconds | Histogram | Head/Query peer | Histogram of time spent opening HDX partitions, in seconds. |
| hdx_partition_read_seconds | Histogram | Head/Query peer | Histogram of time spent reading HDX partitions, in seconds. |
| hdx_partition_skipped_total | Histogram | Head/Query peer | Histogram of partitions skipped due to no matching columns. |
| hdx_partition_blocks_read_total | Histogram | Head/Query peer | Histogram of partition blocks read. |
| hdx_partition_blocks_avail_total | Histogram | Head/Query peer | Histogram of partition blocks available. |
| hdx_partition_index_decision | Histogram | Head/Query peer | Histogram of partition index decisions (bucket 0 = full scan, 1 = partial scan, 2 = no match). |
| hdx_partition_index_lookup_seconds | Histogram | Head/Query peer | Histogram of index lookup time, in seconds. |
| hdx_partition_index_blocks_skipped_percent | Histogram | Head/Query peer | Histogram of index blocks skipped, as a percentage. |
| hdx_partition_index_blocks_skipped_total | Histogram | Head/Query peer | Histogram of index blocks skipped, in total. |
| hdx_partition_rd_w_err_total | Histogram | Head/Query peer | Histogram of errors (bucket 0 = read error, 1 = write error, 3 = error). |
| query_iowait_seconds | Histogram | Head/Query peer | Histogram of query I/O wait, in seconds. |
| query_cpuwait_seconds | Histogram | Head/Query peer | Histogram of query CPU wait, in seconds. |
| query_hdx_ch_conv_seconds | Histogram | Head/Query peer | Histogram of time spent converting HDX blocks to ClickHouse format, in seconds. |
| query_health | Histogram | Head/Query peer | Histogram of query health (bucket 0 = initiated, 1 = succeeded, 2 = error). |
| query_peer_availability | Histogram | Head/Query peer | Histogram of query peer availability (bucket 0 = primary_peer_available, 1 = secondary_peer_available, 2 = no_reachable_peers). |
| query_attempts_total | Histogram | Head/Query peer | Histogram of total query attempts. |
| query_response_seconds | Histogram | Head/Query peer | Histogram of query response time, in seconds. |
| query_rows_read_total | Histogram | Head/Query peer | Histogram of total query rows read. |
| query_read_bytes | Histogram | Head/Query peer | Histogram of total query bytes read. |
| query_rows_written_total | Histogram | Head/Query peer | Histogram of total query rows written. |
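Most of the query metrics above are histograms, which Prometheus exposes as cumulative _bucket series; a quantile such as the median response time is then estimated with PromQL's histogram_quantile() function. A sketch of how that estimation works, using made-up buckets for query_response_seconds (this is an illustration of the interpolation idea, not Prometheus's exact implementation, which also handles the +Inf bucket and edge cases):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    interpolating linearly within the bucket containing the target rank."""
    total = buckets[-1][1]          # count in the last bucket is the grand total
    rank = q * total                # observation rank we are looking for
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Target rank falls in this bucket: interpolate within it.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical cumulative buckets: 50 queries under 0.1 s, 90 under 0.5 s, 100 under 1.0 s.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0)]
median = histogram_quantile(0.5, buckets)   # ≈ 0.1 s
p90 = histogram_quantile(0.9, buckets)      # ≈ 0.5 s
```

In Grafana you would write this as, e.g., histogram_quantile(0.9, rate(query_response_seconds_bucket[5m])).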

Batch metrics

These metrics track activity specific to batch ingestions.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| processed_count | Counter | Batch peer | Count of items processed. |
| processed_failure | Counter | Batch peer | Count of processing failures. |
| processing_duration_histo | Histogram | Batch peer | Histogram of batch processing durations, in milliseconds. |
| processing_duration_summary | Summary | Batch peer | Summary of batch processing durations, in milliseconds. |
| rows_read | Counter | Batch peer | Count of rows read. |

Merge metrics

These metrics correspond to Hydrolix's merge service.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| merge_duration_summary | Summary | Merge peer | Merge processing duration, in milliseconds. |
| merge_duration_histo | Histogram | Merge peer | Merge processing duration, in milliseconds. |
| merge_sdk_duration_summary | Summary | Merge peer | Merge SDK processing duration, in milliseconds. |
| merge_sdk_duration_histo | Histogram | Merge peer | Merge SDK processing duration, in milliseconds. |
| merge_candidate_histo | Histogram | Merge peer | Partitions per merge candidate. |
| merge_success | Counter | Merge peer | Count of merge successes. |
| merge_failure | Counter | Merge peer | Count of merge failures. |

Streaming metrics

HTTP Stream Ingest

These metrics are specific to the use of streaming data sources.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| http_source_byte_count | Counter | Stream head | Count of bytes processed. |
| http_source_request_count | Counter | Stream head | Count of HTTP requests. |
| http_source_request_duration_ns | Histogram | Stream head | Histogram of HTTP request durations, in nanoseconds. |
| http_source_request_error_count | Counter | Stream head | Count of HTTP request failures. |
| http_source_row_count | Counter | Stream head | Count of rows processed. |
| http_source_value_count | Counter | Stream head | Count of values processed. |
| kinesis_source_byte_count | Counter | Stream peer | Count of bytes read from Kinesis. |
| kinesis_source_checkpoint_count | Counter | Stream peer | Count of Kinesis checkpoint operations. |
| kinesis_source_checkpoint_duration_ns | Histogram | Stream peer | Duration of Kinesis checkpoint operations, in nanoseconds. |
| kinesis_source_checkpoint_error_count | Counter | Stream peer | Count of errors in Kinesis checkpoint operations. |
| kinesis_source_error_count | Counter | Stream peer | Count of errors in Kinesis source reads. |
| kinesis_source_lag_ms | Gauge | Stream peer | Measure of lag in the Kinesis source, in milliseconds. |
| kinesis_source_operation_count | Counter | Stream peer | Count of operations on Kinesis. |
| kinesis_source_operation_duration_ns | Histogram | Stream peer | Histogram of the duration of operations on Kinesis, in nanoseconds. |
| kinesis_source_record_count | Counter | Stream peer | Count of records read from Kinesis. |
| kinesis_source_row_count | Counter | Stream peer | Count of rows read from Kinesis. |
| kinesis_source_value_count | Counter | Stream peer | Count of values read from Kinesis. |

Kafka Ingest

These metrics are specific to the use of Kafka data sources.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| kafka_source_byte_count | Counter | Stream peer | Count of bytes read from Kafka. |
| kafka_source_commit_duration_ns | Histogram | Stream peer | Kafka commit duration, in nanoseconds. |
| kafka_source_read_count | Counter | Stream peer | Count of Kafka reads. |
| kafka_source_read_duration_ns | Histogram | Stream peer | Kafka read duration, in nanoseconds. |
| kafka_source_read_error_count | Counter | Stream peer | Count of Kafka read errors. |
| kafka_source_row_count | Counter | Stream peer | Count of rows processed. |
| kafka_source_value_count | Counter | Stream peer | Count of values processed. |

Go environment metrics

These metrics track resources used by Hydrolix's Go environments.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| go_gc_duration_seconds | Summary | Batch peer, Stream head, Stream peer | A summary of the pause duration of garbage collection cycles. |
| go_goroutines | Gauge | Batch peer, Stream head, Stream peer | Number of goroutines that currently exist. |
| go_info | Gauge | Batch peer, Stream head, Stream peer | Information about the Go environment. |
| go_memstats_alloc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes allocated and still in use. |
| go_memstats_alloc_bytes_total | Counter | Batch peer, Stream head, Stream peer | Total number of bytes allocated, even if freed. |
| go_memstats_buck_hash_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used by the profiling bucket hash table. |
| go_memstats_frees_total | Counter | Batch peer, Stream head, Stream peer | Total number of frees. |
| go_memstats_gc_cpu_fraction | Gauge | Batch peer, Stream head, Stream peer | The fraction of this program's available CPU time used by the GC since the program started. |
| go_memstats_gc_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for garbage collection system metadata. |
| go_memstats_heap_alloc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes allocated and still in use. |
| go_memstats_heap_idle_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes waiting to be used. |
| go_memstats_heap_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes that are in use. |
| go_memstats_heap_objects | Gauge | Batch peer, Stream head, Stream peer | Number of allocated objects. |
| go_memstats_heap_released_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes released to the OS. |
| go_memstats_heap_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes obtained from the system. |
| go_memstats_last_gc_time_seconds | Gauge | Batch peer, Stream head, Stream peer | Number of seconds since 1970 of the last garbage collection. |
| go_memstats_lookups_total | Counter | Batch peer, Stream head, Stream peer | Total number of pointer lookups. |
| go_memstats_mallocs_total | Counter | Batch peer, Stream head, Stream peer | Total number of mallocs. |
| go_memstats_mcache_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by mcache structures. |
| go_memstats_mcache_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for mcache structures obtained from the system. |
| go_memstats_mspan_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by mspan structures. |
| go_memstats_mspan_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for mspan structures obtained from the system. |
| go_memstats_next_gc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes when the next garbage collection will take place. |
| go_memstats_other_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for other system allocations. |
| go_memstats_stack_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by the stack allocator. |
| go_memstats_stack_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes obtained from the system for the stack allocator. |
| go_memstats_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes obtained from the system. |
| go_threads | Gauge | Batch peer, Stream head, Stream peer | Number of OS threads created. |
