Prometheus Integration

The Hydrolix stack includes Prometheus, an open-source metrics database. Hydrolix continuously updates its Prometheus instance with metrics information. You can query, view, and monitor this information through the stack's Grafana instance, or access it from your own monitoring platform.

Using Prometheus Directly

Prometheus has its own web-based UI, available by visiting https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/prometheus in your web browser.

This view is far more basic than Grafana's and is suited to quickly entering queries and viewing simple graphed results. Hydrolix makes it available immediately, without any additional setup.
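Besides the web UI, a Prometheus server exposes an HTTP query API at /api/v1/query. The sketch below only builds the request URL. It assumes the built-in instance serves that API under the same /prometheus path prefix as the UI, which this page does not confirm, and it leaves out whatever authentication your cluster requires. The helper name instant_query_url is hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    # urlencode percent-escapes braces, quotes, and other PromQL punctuation.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder hostname from this page; substitute your own cluster's hostname.
BASE = "https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/prometheus"
print(instant_query_url(BASE, "process_open_fds"))
```

The resulting URL can be fetched with any HTTP client; the response is JSON.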

Using a Separate Prometheus Server

How It Works

Rather than using the built-in Prometheus server to display and report metrics, you can use an external Prometheus server. To do this, configure the internal server to forward metrics to your external server and enable remote writing on the external server.

This solution uses Prometheus' Remote Write Server functionality to link the two servers.

Both servers are doing work

Even though this uses a separate external Prometheus server, Hydrolix's internal Prometheus server still uses memory and CPU. It collects and aggregates metrics, then forwards the data to your external Prometheus server for query.

Steps

  1. Tell Hydrolix to send data to the external Prometheus server. Include this line in the spec section of your hydrolixcluster.yaml file. This example assumes your external Prometheus server is running at the default port 9090, and that firewalls allow traffic on that port:

    spec:
    ...
      prometheus_remote_write_url: http://<prometheus server hostname>:9090/api/v1/write
    
  2. Run the external Prometheus server with the --web.enable-remote-write-receiver flag.

  3. If the external Prometheus installation uses basic auth, set the username in your hydrolixcluster.yaml file and set the password in a curated secret.

    Edit the hydrolixcluster.yaml file to add one line:

    spec:
    ...
      prometheus_remote_write_username: <username>
    

    Apply this change to the Hydrolix cluster:

    kubectl apply -f hydrolixcluster.yaml
    

    Create a file named prom-secret.yaml with these contents:

    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: curated
      namespace: $HDX_KUBERNETES_NAMESPACE
    stringData:
      PROMETHEUS_REMOTE_WRITE_PASSWORD: <password>
    type: Opaque
    

    Finally, use the shell to interpolate the $HDX_KUBERNETES_NAMESPACE variable into a new file named secrets.yaml, then use the Kubernetes command line tool (kubectl) to apply the generated secret to your Kubernetes cluster:

    eval "echo \"$(cat prom-secret.yaml)\"" > secrets.yaml
    kubectl apply -f secrets.yaml
    

    Not the same as Prometheus Remote Read/Write

    Hydrolix can also serve as the database for a Prometheus installation, providing longer retention and cost savings for large volumes of data. The settings for that feature have names very similar to the settings described here and are easy to confuse with them.
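The variable interpolation in step 3 can also be done without shell eval. The following is a minimal sketch using Python's string.Template. The manifest text is copied from the step above; the function name render_secret and the sample namespace "hydrolix" are hypothetical, and <password> is still a placeholder you must replace yourself.

```python
from string import Template

# Manifest from step 3; $HDX_KUBERNETES_NAMESPACE is the only template variable.
SECRET_TEMPLATE = """\
---
apiVersion: v1
kind: Secret
metadata:
  name: curated
  namespace: $HDX_KUBERNETES_NAMESPACE
stringData:
  PROMETHEUS_REMOTE_WRITE_PASSWORD: <password>
type: Opaque
"""


def render_secret(namespace: str) -> str:
    """Return the manifest with the namespace variable filled in."""
    # substitute() raises KeyError if the template names a variable we did not supply.
    return Template(SECRET_TEMPLATE).substitute(HDX_KUBERNETES_NAMESPACE=namespace)


# "hydrolix" is an example namespace, not a required value.
print(render_secret("hydrolix"))
```

Writing the printed output to a file gives you a manifest that can be passed to kubectl apply -f.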

Hydrolix's Metrics

This table lists the metrics available, and which components update them.

If more than one component uses a given metric, querying it returns results from all relevant components. You can restrict results to a specific component by adding a service label matcher to your query, e.g. process_open_fds{service="stream-peer"}.
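As an illustration of the label filtering described above, this sketch assembles a PromQL selector from a metric name and exact-match labels. The helper name selector is hypothetical; only the service label and the process_open_fds example come from this page.

```python
def selector(metric: str, **labels: str) -> str:
    """Return a PromQL selector with exact-match label filters."""
    if not labels:
        return metric

    def esc(value: str) -> str:
        # Backslashes and double quotes must be escaped inside a label value.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    matchers = ",".join(f'{key}="{esc(val)}"' for key, val in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


print(selector("process_open_fds", service="stream-peer"))
```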

For more information about metric types, refer to Prometheus's documentation.

General Metrics

These metrics track various counters and statistics regarding data ingestion.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| bytes_written | Counter | Batch peer, Stream peer | Bytes written to the indexer. |
| partitions_created | Counter | Batch peer, Stream peer | Count of partitions created. |
| process_cpu_seconds_total | Counter | Batch peer, Stream head, Stream peer | Total user and system CPU time spent in seconds. |
| process_max_fds | Gauge | Batch peer, Stream head, Stream peer | Maximum number of open file descriptors. |
| process_open_fds | Gauge | Batch peer, Stream head, Stream peer | Number of open file descriptors. |
| process_resident_memory_bytes | Gauge | Batch peer, Stream head, Stream peer | Resident memory size in bytes. |
| process_start_time_seconds | Gauge | Batch peer, Stream head, Stream peer | Start time of the process since unix epoch in seconds. |
| process_virtual_memory_bytes | Gauge | Batch peer, Stream head, Stream peer | Virtual memory size in bytes. |
| process_virtual_memory_max_bytes | Gauge | Batch peer, Stream head, Stream peer | Maximum amount of virtual memory available in bytes. |
| promhttp_metric_handler_requests_in_flight | Gauge | Batch peer, Stream head, Stream peer | Current number of scrapes being served. |
| promhttp_metric_handler_requests_total | Counter | Batch peer, Stream head, Stream peer | Total number of scrapes by HTTP status code. |
| upload_duration | Summary | Any intake peer | Time spent uploading a file, in milliseconds. |
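Counters such as bytes_written only increase until the process restarts, so they are usually read as a rate, for example rate(bytes_written[5m]) in PromQL. The sketch below shows the underlying arithmetic in simplified form: it sums the increases between consecutive samples and treats any drop as a counter reset. It does not reproduce the extrapolation that Prometheus's own rate() performs, and the sample values are made up.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across consecutive samples, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        total += (cur - prev) if cur >= prev else cur
    return total


def per_second_rate(samples: Sequence[float], window_seconds: float) -> float:
    """Average per-second increase over the sampling window."""
    return counter_increase(samples) / window_seconds


# Made-up counter readings taken over 60 seconds, with a restart after the third.
readings = [100, 160, 220, 30, 90]
print(per_second_rate(readings, 60))
```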

Query Metrics

These metrics track activity specific to queries.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| net_connect_attempts_total | Histogram | Head/Query peer | Histogram of TCP connection attempted to storage service |
| net_connect_seconds | Histogram | Head/Query peer | Histogram of time to connect over TCP to storage service in seconds |
| net_dns_resolve_seconds | Histogram | Head/Query peer | Histogram of DNS resolution time to storage service in seconds. |
| net_http_response_time | Histogram | Head/Query peer | Histogram HTTP response time to storage service in seconds |
| net_http_response_bytes | Histogram | Head/Query peer | Histogram of HTTP bytes downloaded from the storage service |
| net_http_attempts_total | Histogram | Head/Query peer | Histogram of HTTP connection attempted to storage service |
| net_http_status_code | Histogram | Head/Query peer | Histogram of HTTP status code result from storage service |
| vfs_cache_hitmiss_total | Histogram | Head/Query peer | Histogram of cache status if bucket = 0 cache miss, and 1 cache hit |
| vfs_cache_read_bytes | Histogram | Head/Query peer | Histogram bytes read from cache |
| vfs_net_read_bytes | Histogram | Head/Query peer | Histogram bytes read from network |
| vfs_cache_lru_file_eviction_total | Histogram | Head/Query peer | Histogram cache eviction of files |
| epoll_cpu_seconds | Histogram | Head/Query peer | Histogram CPU used in seconds |
| epoll_io_seconds | Histogram | Head/Query peer | Histogram I/O in seconds |
| epoll_poll_seconds | Histogram | Head/Query peer | Histogram wait for file descriptor in seconds |
| hdx_storage_r_catalog_partitions_total | Histogram | Head/Query peer | Histogram of per query catalog partition count |
| hdx_storage_r_partitions_read_total | Histogram | Head/Query peer | Histogram of per query partition read count |
| hdx_storage_r_partitions_per_core_total | Histogram | Head/Query peer | Histogram of per core partition used count |
| hdx_storage_r_peers_used_total | Histogram | Query peer | Histogram of storage used total |
| hdx_storage_r_cores_used_total | Histogram | Query peer | Histogram of Cores used total |
| hdx_storage_r_catalog_timerange | Histogram | Head/Query peer | Histogram of query time range distribution |
| hdx_partition_columns_read_total | Histogram | Head/Query peer | Histogram of column read |
| hdx_partition_block_decode_seconds | Histogram | Head/Query peer | Histogram of time spent decoding hdx blocks in seconds |
| hdx_partition_open_seconds | Histogram | Head/Query peer | Histogram of time spent opening hdx partition in seconds |
| hdx_partition_read_seconds | Histogram | Head/Query peer | Histogram of time spent reading hdx partition in seconds |
| hdx_partition_skipped_total | Histogram | Head/Query peer | Histogram of partition skip count due to no matching columns |
| hdx_partition_blocks_read_total | Histogram | Head/Query peer | Histogram of partition read count |
| hdx_partition_blocks_avail_total | Histogram | Head/Query peer | Histogram of partition blocks available |
| hdx_partition_index_decision | Histogram | Head/Query peer | Histogram of partition decision if bucket = 0 fullscan, 1 partial scan and 2 no match |
| hdx_partition_index_lookup_seconds | Histogram | Head/Query peer | Histogram of index lookup in seconds |
| hdx_partition_index_blocks_skipped_percent | Histogram | Head/Query peer | Histogram of skipped index blocked in percentage |
| hdx_partition_index_blocks_skipped_total | Histogram | Head/Query peer | Histogram of skipped index blocked in total |
| hdx_partition_rd_w_err_total | Histogram | Head/Query peer | Histogram of errors if bucket = 0 read error, 1 written error and 3 error |
| query_iowait_seconds | Histogram | Head/Query peer | Histogram query IO wait in seconds |
| query_cpuwait_seconds | Histogram | Head/Query peer | Histogram query cpu wait in seconds |
| query_hdx_ch_conv_seconds | Histogram | Head/Query peer | Histogram of time spent converting hdx blocks to clickhouse in seconds |
| query_health | Histogram | Head/Query peer | Histogram of query health if bucket = 0 initiated error, 1 succeeded and 2 error |
| query_peer_availability | Histogram | Head/Query peer | Histogram of query peer availability if bucket = 0 primary_peer_available, 1 secondary_peer_available and 2 no_reachable_peers |
| query_attempts_total | Histogram | Head/Query peer | Histogram of query attempts total |
| query_response_seconds | Histogram | Head/Query peer | Histogram of query response total in seconds |
| query_rows_read_total | Histogram | Head/Query peer | Histogram of query rows read total |
| query_read_bytes | Histogram | Head/Query peer | Histogram of query read bytes total |
| query_rows_written_total | Histogram | Head/Query peer | Histogram of query rows written total |
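Histogram metrics are usually read as quantiles. In PromQL that would typically look like histogram_quantile(0.95, sum by (le) (rate(query_response_seconds_bucket[5m]))), assuming these histograms are exposed with the standard _bucket series and le label, which this page does not state. The sketch below illustrates the idea behind such an estimate: find the bucket that contains the target rank and interpolate linearly inside it. It is a simplified illustration, not Prometheus's exact implementation, and the bucket values are made up.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by bound,
    with math.inf as the final bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls past the last finite bucket; report that bound.
                return lower_bound
            # Interpolate linearly between the bucket's bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative bucket counts for a response-time histogram, in seconds.
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(estimate_quantile(0.95, sample))
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.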

Batch Metrics

These metrics track activity specific to batch ingestion.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| processed_count | Counter | Batch peer | Count of items processed. |
| processed_failure | Counter | Batch peer | Count of processing failures. |
| processing_duration_histo | Histogram | Batch peer | Histogram of Batch processing durations in milliseconds. |
| processing_duration_summary | Summary | Batch peer | Summary of Batch processing durations in milliseconds. |
| rows_read | Counter | Batch peer | Count of rows read. |

Merge Metrics

These metrics correspond to Hydrolix's merge service.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| merge_duration_summary | Summary | Merge peer | Merge processing duration, in milliseconds. |
| merge_duration_histo | Histogram | Merge peer | Merge processing duration, in milliseconds. |
| merge_sdk_duration_summary | Summary | Merge peer | Merge SDK processing duration, in milliseconds. |
| merge_sdk_duration_histo | Histogram | Merge peer | Merge SDK processing duration, in milliseconds. |
| merge_candidate_histo | Histogram | Merge peer | Partitions per merge candidate. |
| merge_candidate_inactive | Counter | Merge peer | Merge candidates skipped due to an inactive partition within the candidate. |
| merge_candidate_construction_summary | Summary | Merge head | Time spent building merge candidates, in milliseconds. |
| merge_queue_full | Counter | Merge head | Times candidate generation was skipped due to a full queue. |
| merge_success | Counter | Merge peer | Count of merge successes. |
| merge_failure | Counter | Merge peer | Count of merge failures. |
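Because merge_success and merge_failure are both counters, a failure ratio can be derived from them, for example with rate(merge_failure[5m]) / (rate(merge_success[5m]) + rate(merge_failure[5m])) in PromQL. The sketch below shows the same arithmetic on two made-up increases; the function name failure_ratio is hypothetical.

```python
def failure_ratio(successes: float, failures: float) -> float:
    """Fraction of merge attempts that failed; 0.0 when nothing was attempted."""
    total = successes + failures
    return failures / total if total else 0.0


# Made-up increases of merge_success and merge_failure over the same window.
print(failure_ratio(successes=95, failures=5))
```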

RabbitMQ

For RabbitMQ's own metrics, refer to the RabbitMQ Prometheus documentation: https://www.rabbitmq.com/prometheus.html

Streaming Metrics

HTTP Stream Ingest

These metrics are specific to the use of streaming data sources.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| hdx_sink_backlog_bytes_count | Gauge | Intake head | Total bytes of all partition buckets in sink backlog waiting to be indexed. Only produced when intake_head_index_backlog_enabled is true. |
| hdx_sink_backlog_items_count | Gauge | Intake head | Total count of partition buckets in sink backlog waiting to be indexed. Only produced when intake_head_index_backlog_enabled is true. |
| hdx_sink_backlog_dropped_bytes_count | Counter | Intake head | Total bytes of partition buckets dropped due to backlog growing too big. Only produced when intake_head_index_backlog_enabled is true. |
| hdx_sink_backlog_dropped_items_count | Counter | Intake head | Count of partition buckets dropped due to backlog growing too big. Only produced when intake_head_index_backlog_enabled is true. |
| hdx_sink_backlog_delivery_count | Counter | Intake head | Count of backlog buckets successfully handed off to indexing. Only produced when intake_head_index_backlog_enabled is true. |
| hdx_sink_backlog_trim_duration_ns | Histogram | Intake head | Time to trim the backlog in nanoseconds. Only produced when intake_head_index_backlog_enabled is true. |
| http_source_byte_count | Counter | Stream head | Count of bytes processed. |
| http_source_request_count | Counter | Stream head | Count of http requests. |
| http_source_request_duration_ns | Histogram | Stream head | A histogram of HTTP request durations in nanoseconds. |
| http_source_request_error_count | Counter | Stream head | Count of http request failures. |
| http_source_row_count | Counter | Stream head | Count of rows processed. |
| http_source_value_count | Counter | Stream head | Count of values processed. |
| kinesis_source_byte_count | Counter | Stream peer | Count of bytes read from Kinesis. |
| kinesis_source_checkpoint_count | Counter | Stream peer | Count of Kinesis checkpoint operations. |
| kinesis_source_checkpoint_duration_ns | Histogram | Stream peer | Duration of Kinesis checkpoint operations. |
| kinesis_source_checkpoint_error_count | Counter | Stream peer | Count of errors in Kinesis checkpoint operations. |
| kinesis_source_error_count | Counter | Stream peer | Count of errors in Kinesis source reads. |
| kinesis_source_lag_ms | Gauge | Stream peer | Measure of lag in Kinesis source. |
| kinesis_source_operation_count | Counter | Stream peer | Count of operations on Kinesis. |
| kinesis_source_operation_duration_ns | Histogram | Stream peer | Histogram of duration of operations on Kinesis. |
| kinesis_source_record_count | Counter | Stream peer | Count of records read from Kinesis. |
| kinesis_source_row_count | Counter | Stream peer | Count of rows read from Kinesis. |
| kinesis_source_value_count | Counter | Stream peer | Count of values read from Kinesis. |

Redpanda

For Redpanda's own metrics, refer to the Redpanda monitoring documentation: https://docs.redpanda.com/docs/cluster-administration/monitoring/

Kafka Ingest

These metrics are specific to the use of Kafka data sources.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| kafka_source_byte_count | Counter | Stream peer | Count of bytes read from Kafka. |
| kafka_source_commit_duration_ns | Histogram | Stream peer | Kafka commit duration. |
| kafka_source_read_count | Counter | Stream peer | Count of Kafka reads. |
| kafka_source_read_duration_ns | Histogram | Stream peer | Kafka read duration. |
| kafka_source_read_error_count | Counter | Stream peer | Count of Kafka errors. |
| kafka_source_row_count | Counter | Stream peer | Count of rows processed. |
| kafka_source_value_count | Counter | Stream peer | Count of values processed. |

DNS Metrics

These metrics track DNS lookups performed by ingest components.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| dns_num_ips_in_cache | Histogram | (ingest) | The size of the IP pool used in the DNS system. |
| dns_lookup_time | Histogram | (ingest) | Milliseconds per lookup. |
| dns_ttl | Histogram | (ingest) | TTLs received per lookup. |

Go Environment Metrics

These metrics track resources used by Hydrolix's Go environments.

| Metric | Type | Components | Purpose |
| --- | --- | --- | --- |
| go_gc_duration_seconds | Summary | Batch peer, Stream head, Stream peer | A summary of the pause duration of garbage collection cycles. |
| go_goroutines | Gauge | Batch peer, Stream head, Stream peer | Number of goroutines that currently exist. |
| go_info | Gauge | Batch peer, Stream head, Stream peer | Information about the Go environment. |
| go_memstats_alloc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes allocated and still in use. |
| go_memstats_alloc_bytes_total | Counter | Batch peer, Stream head, Stream peer | Total number of bytes allocated, even if freed. |
| go_memstats_buck_hash_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used by the profiling bucket hash table. |
| go_memstats_frees_total | Counter | Batch peer, Stream head, Stream peer | Total number of frees. |
| go_memstats_gc_cpu_fraction | Gauge | Batch peer, Stream head, Stream peer | The fraction of this program's available CPU time used by the GC since the program started. |
| go_memstats_gc_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for garbage collection system metadata. |
| go_memstats_heap_alloc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes allocated and still in use. |
| go_memstats_heap_idle_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes waiting to be used. |
| go_memstats_heap_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes that are in use. |
| go_memstats_heap_objects | Gauge | Batch peer, Stream head, Stream peer | Number of allocated objects. |
| go_memstats_heap_released_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes released to OS. |
| go_memstats_heap_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes obtained from system. |
| go_memstats_last_gc_time_seconds | Gauge | Batch peer, Stream head, Stream peer | Number of seconds since 1970 of last garbage collection. |
| go_memstats_lookups_total | Counter | Batch peer, Stream head, Stream peer | Total number of pointer lookups. |
| go_memstats_mallocs_total | Counter | Batch peer, Stream head, Stream peer | Total number of mallocs. |
| go_memstats_mcache_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by mcache structures. |
| go_memstats_mcache_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for mcache structures obtained from system. |
| go_memstats_mspan_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by mspan structures. |
| go_memstats_mspan_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for mspan structures obtained from system. |
| go_memstats_next_gc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes when next garbage collection will take place. |
| go_memstats_other_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for other system allocations. |
| go_memstats_stack_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by the stack allocator. |
| go_memstats_stack_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes obtained from system for stack allocator. |
| go_memstats_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes obtained from system. |
| go_threads | Gauge | Batch peer, Stream head, Stream peer | Number of OS threads created. |

PostgreSQL Pool Metrics

| Metric | Type | Purpose |
| --- | --- | --- |
| pgx_pool_total_acquire_count | Count | The cumulative count of successful acquires from the pool. |
| pgx_pool_total_acquire_duration_ns_count | Count | The total duration of all successful acquires from the pool. |
| pgx_pool_total_acquire_cancel_count | Count | The cumulative count of acquires from the pool that were canceled by a context. |
| pgx_pool_total_acquire_empty_count | Count | The cumulative count of successful acquires from the pool that waited for a resource to be released or constructed because the pool was empty. |
| pgx_pool_total_conns_opened_count | Count | The cumulative count of new connections opened. |
| pgx_pool_total_destroyed_max_lifetime_count | Count | The cumulative count of connections destroyed because they exceeded MaxConnLifetime. |
| pgx_pool_total_destroyed_max_idle_count | Count | The cumulative count of connections destroyed because they exceeded MaxConnIdleTime. |
| pgx_pool_current_size | Gauge | The total number of resources currently in the pool. |
| pgx_pool_current_constructing | Gauge | The number of connections with construction in progress in the pool. |
| pgx_pool_current_acquired | Gauge | The number of currently acquired connections in the pool. |
| pgx_pool_current_idle | Gauge | The number of currently idle connections in the pool. |
| pgx_pool_max | Gauge | The maximum size of the pool. |
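Two derived figures are often more useful than the raw pool metrics: how full the pool is (pgx_pool_current_acquired divided by pgx_pool_max) and the average time per acquire (pgx_pool_total_acquire_duration_ns_count divided by pgx_pool_total_acquire_count). The sketch below shows both calculations. It assumes the _ns in the duration metric's name means nanoseconds, which the table does not state; the function names and sample readings are made up.

```python
def pool_utilization(acquired: float, max_size: float) -> float:
    """Fraction of the pool currently checked out (0.0 to 1.0)."""
    return acquired / max_size if max_size > 0 else 0.0


def mean_acquire_ms(total_duration_ns: float, acquire_count: float) -> float:
    """Average time per successful acquire, in milliseconds.

    Assumes the duration total is in nanoseconds.
    """
    if acquire_count == 0:
        return 0.0
    return total_duration_ns / acquire_count / 1_000_000


# Made-up readings, for illustration only.
print(pool_utilization(acquired=8, max_size=10))
print(mean_acquire_ms(total_duration_ns=4_500_000_000, acquire_count=1_500))
```

A utilization that stays near 1.0 together with a growing pgx_pool_total_acquire_empty_count suggests callers are waiting on the pool.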