Prometheus Integration
The Hydrolix stack includes Prometheus, an open-source metrics database. Hydrolix continuously updates its Prometheus instance with metrics. You can query, view, and actively monitor this information using a stack's Grafana instance, or you can access it from your own monitoring platform.
Using Prometheus Directly
Prometheus has its own web-based UI, available by visiting https://YOUR-HYDROLIX-HOSTNAME.hydrolix.live/prometheus in your web browser.
This view is far more basic than Grafana's, but it is suitable for quickly entering queries and seeing simple, graphed results. Hydrolix makes this feature available immediately, without any additional setup.
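For example, pasting a query like the one below into the expression field and switching to the graph tab produces a simple per-component chart. The metric name comes from the tables later on this page; treat this as an illustrative sketch rather than a canned dashboard query.

```promql
# CPU usage per second over the last five minutes, one series
# per Hydrolix component that reports this counter.
rate(process_cpu_seconds_total[5m])
```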
Using a Separate Prometheus Server
How It Works
Rather than using the built-in Prometheus server to display and report metrics, you can use an external Prometheus server. To do this, configure the internal server to forward metrics to your external server and enable remote writing on the external server.
This solution uses Prometheus' remote write receiver functionality to link the two servers.
Both servers are doing work
Even though this uses a separate external Prometheus server, Hydrolix's internal Prometheus server still uses memory and CPU. It collects and aggregates metrics, then forwards the data to your external Prometheus server for querying.
Steps

1. Tell Hydrolix to send data to the external Prometheus server. Include this line in the `spec` section of your hydrolixcluster.yaml file. This example assumes your external Prometheus server is running at the default port 9090, and that firewalls allow traffic on that port:

   ```yaml
   spec:
     ...
     prometheus_remote_write_url: http://<prometheus server hostname>:9090/api/v1/write
   ```

2. Run the external Prometheus server with the `--web.enable-remote-write-receiver` switch.

3. If the external Prometheus installation uses basic auth, set the username in your hydrolixcluster.yaml file and set the password in a curated secret.

   Edit the hydrolixcluster.yaml file to add one line:

   ```yaml
   spec:
     ...
     prometheus_remote_write_username: <username>
   ```

   Apply this change to the Hydrolix cluster:

   ```shell
   kubectl apply -f hydrolixcluster.yaml
   ```

   Create a file named prom-secret.yaml with these contents:

   ```yaml
   ---
   apiVersion: v1
   kind: Secret
   metadata:
     name: curated
     namespace: $HDX_KUBERNETES_NAMESPACE
   stringData:
     PROMETHEUS_REMOTE_WRITE_PASSWORD: <password>
   type: Opaque
   ```

   Finally, use the Kubernetes command line tool (`kubectl`) to interpolate the `$HDX_KUBERNETES_NAMESPACE` variable and apply the generated secret to your Kubernetes cluster:

   ```shell
   eval "echo \"$(cat prom-secret.yaml)\"" > secrets.yaml
   kubectl apply -f secrets.yaml
   ```
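Once these changes are applied, one way to confirm the link is working is to run a query on the external Prometheus server (in its UI or via its HTTP API) for any Hydrolix metric listed later on this page, for example:

```promql
# Should return one series per Hydrolix component whose metrics have
# been forwarded to the external server via remote write.
process_start_time_seconds
```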
Not the same as Prometheus Remote Read/Write
Hydrolix can also serve as the database for a Prometheus installation, providing longer retention and cost savings for large volumes of data. The settings for that feature are named very similarly to these settings and are easy to confuse.
Hydrolix's Metrics
The tables below list the available metrics and the components that update them.
If more than one component uses a given metric, querying it will return results from all relevant components. You can restrict results to a specific component by adding a `service` label to your query, e.g. `process_open_fds{service="stream-peer"}`.
For more information about metric types, refer to Prometheus's documentation.
General Metrics
These metrics track general process counters and statistics reported by the ingest components.
Metric | Type | Components | Purpose |
---|---|---|---|
bytes_written | Counter | Batch peer, Stream peer | Bytes written to the indexer. |
partitions_created | Counter | Batch peer, Stream peer | Count of partitions created. |
process_cpu_seconds_total | Counter | Batch peer, Stream head, Stream peer | Total user and system CPU time spent in seconds. |
process_max_fds | Gauge | Batch peer, Stream head, Stream peer | Maximum number of open file descriptors. |
process_open_fds | Gauge | Batch peer, Stream head, Stream peer | Number of open file descriptors. |
process_resident_memory_bytes | Gauge | Batch peer, Stream head, Stream peer | Resident memory size in bytes. |
process_start_time_seconds | Gauge | Batch peer, Stream head, Stream peer | Start time of the process since unix epoch in seconds. |
process_virtual_memory_bytes | Gauge | Batch peer, Stream head, Stream peer | Virtual memory size in bytes. |
process_virtual_memory_max_bytes | Gauge | Batch peer, Stream head, Stream peer | Maximum amount of virtual memory available in bytes. |
promhttp_metric_handler_requests_in_flight | Gauge | Batch peer, Stream head, Stream peer | Current number of scrapes being served. |
promhttp_metric_handler_requests_total | Counter | Batch peer, Stream head, Stream peer | Total number of scrapes by HTTP status code. |
upload_duration | Summary | Any intake peer | Time spent uploading a file, in milliseconds. |
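Counters such as `bytes_written` are usually most useful as rates. As an illustrative example (not an official dashboard query), the following shows ingest throughput per component:

```promql
# Bytes written to the indexer per second over the last five minutes,
# broken out by the component that reported them.
sum by (service) (rate(bytes_written[5m]))
```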
Query Metrics
These metrics track query activity.
Metric | Type | Components | Purpose |
---|---|---|---|
net_connect_attempts_total | Histogram | Head/Query peer | Histogram of TCP connection attempts to the storage service. |
net_connect_seconds | Histogram | Head/Query peer | Histogram of time to connect over TCP to the storage service, in seconds. |
net_dns_resolve_seconds | Histogram | Head/Query peer | Histogram of DNS resolution time for the storage service, in seconds. |
net_http_response_time | Histogram | Head/Query peer | Histogram of HTTP response time from the storage service, in seconds. |
net_http_response_bytes | Histogram | Head/Query peer | Histogram of HTTP bytes downloaded from the storage service. |
net_http_attempts_total | Histogram | Head/Query peer | Histogram of HTTP connection attempts to the storage service. |
net_http_status_code | Histogram | Head/Query peer | Histogram of HTTP status codes returned by the storage service. |
vfs_cache_hitmiss_total | Histogram | Head/Query peer | Histogram of cache status: bucket 0 is a cache miss, 1 is a cache hit. |
vfs_cache_read_bytes | Histogram | Head/Query peer | Histogram of bytes read from the cache. |
vfs_net_read_bytes | Histogram | Head/Query peer | Histogram of bytes read from the network. |
vfs_cache_lru_file_eviction_total | Histogram | Head/Query peer | Histogram of cache file evictions. |
epoll_cpu_seconds | Histogram | Head/Query peer | Histogram of CPU used, in seconds. |
epoll_io_seconds | Histogram | Head/Query peer | Histogram of I/O time, in seconds. |
epoll_poll_seconds | Histogram | Head/Query peer | Histogram of time waiting on file descriptors, in seconds. |
hdx_storage_r_catalog_partitions_total | Histogram | Head/Query peer | Histogram of per-query catalog partition count. |
hdx_storage_r_partitions_read_total | Histogram | Head/Query peer | Histogram of per-query partition read count. |
hdx_storage_r_partitions_per_core_total | Histogram | Head/Query peer | Histogram of per-core partitions used. |
hdx_storage_r_peers_used_total | Histogram | Query peer | Histogram of total storage peers used. |
hdx_storage_r_cores_used_total | Histogram | Query peer | Histogram of total cores used. |
hdx_storage_r_catalog_timerange | Histogram | Head/Query peer | Histogram of query time range distribution. |
hdx_partition_columns_read_total | Histogram | Head/Query peer | Histogram of columns read. |
hdx_partition_block_decode_seconds | Histogram | Head/Query peer | Histogram of time spent decoding HDX blocks, in seconds. |
hdx_partition_open_seconds | Histogram | Head/Query peer | Histogram of time spent opening HDX partitions, in seconds. |
hdx_partition_read_seconds | Histogram | Head/Query peer | Histogram of time spent reading HDX partitions, in seconds. |
hdx_partition_skipped_total | Histogram | Head/Query peer | Histogram of partitions skipped due to no matching columns. |
hdx_partition_blocks_read_total | Histogram | Head/Query peer | Histogram of partition blocks read. |
hdx_partition_blocks_avail_total | Histogram | Head/Query peer | Histogram of partition blocks available. |
hdx_partition_index_decision | Histogram | Head/Query peer | Histogram of index decisions: bucket 0 is a full scan, 1 is a partial scan, and 2 is no match. |
hdx_partition_index_lookup_seconds | Histogram | Head/Query peer | Histogram of index lookup time, in seconds. |
hdx_partition_index_blocks_skipped_percent | Histogram | Head/Query peer | Histogram of index blocks skipped, as a percentage. |
hdx_partition_index_blocks_skipped_total | Histogram | Head/Query peer | Histogram of index blocks skipped, in total. |
hdx_partition_rd_w_err_total | Histogram | Head/Query peer | Histogram of errors: bucket 0 is a read error, 1 is a write error, and 3 is another error. |
query_iowait_seconds | Histogram | Head/Query peer | Histogram of query I/O wait, in seconds. |
query_cpuwait_seconds | Histogram | Head/Query peer | Histogram of query CPU wait, in seconds. |
query_hdx_ch_conv_seconds | Histogram | Head/Query peer | Histogram of time spent converting HDX blocks to ClickHouse format, in seconds. |
query_health | Histogram | Head/Query peer | Histogram of query health: bucket 0 is initiated, 1 is succeeded, and 2 is an error. |
query_peer_availability | Histogram | Head/Query peer | Histogram of query peer availability: bucket 0 is primary_peer_available, 1 is secondary_peer_available, and 2 is no_reachable_peers. |
query_attempts_total | Histogram | Head/Query peer | Histogram of total query attempts. |
query_response_seconds | Histogram | Head/Query peer | Histogram of query response time, in seconds. |
query_rows_read_total | Histogram | Head/Query peer | Histogram of total query rows read. |
query_read_bytes | Histogram | Head/Query peer | Histogram of total query bytes read. |
query_rows_written_total | Histogram | Head/Query peer | Histogram of total query rows written. |
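Assuming these histograms follow the standard Prometheus convention of exposing `*_bucket` series, percentiles can be estimated with `histogram_quantile()`. For example, a sketch of a 95th-percentile query latency:

```promql
# Approximate 95th-percentile query response time over the last five minutes.
# Assumes the histogram exposes the conventional *_bucket series.
histogram_quantile(0.95, sum by (le) (rate(query_response_seconds_bucket[5m])))
```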
Batch Metrics
These metrics track activity specific to batch ingestion.
Metric | Type | Components | Purpose |
---|---|---|---|
processed_count | Counter | Batch peer | Count of items processed. |
processed_failure | Counter | Batch peer | Count of processing failures. |
processing_duration_histo | Histogram | Batch peer | Histogram of Batch processing durations in milliseconds. |
processing_duration_summary | Summary | Batch peer | Summary of Batch processing durations in milliseconds. |
rows_read | Counter | Batch peer | Count of rows read. |
Merge Metrics
These metrics correspond to Hydrolix's merge service.
Metric | Type | Components | Purpose |
---|---|---|---|
merge_duration_summary | Summary | Merge peer | Merge processing duration, in milliseconds. |
merge_duration_histo | Histogram | Merge peer | Merge processing duration, in milliseconds. |
merge_sdk_duration_summary | Summary | Merge peer | Merge SDK processing duration, in milliseconds. |
merge_sdk_duration_histo | Histogram | Merge peer | Merge SDK processing duration, in milliseconds. |
merge_candidate_histo | Histogram | Merge peer | Partitions per merge candidate. |
merge_candidate_inactive | Counter | Merge peer | Merge candidates skipped due to an inactive partition within the candidate. |
merge_candidate_construction_summary | Summary | Merge head | Time spent building merge candidates, in milliseconds. |
merge_queue_full | Counter | Merge head | Times candidate generation was skipped due to a full queue. |
merge_success | Counter | Merge peer | Count of merge successes. |
merge_failure | Counter | Merge peer | Count of merge failures. |
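As an example of how these counters might be combined, the following sketch estimates the fraction of merges that failed over the last hour:

```promql
# Fraction of merge attempts that failed in the last hour.
sum(rate(merge_failure[1h]))
  / (sum(rate(merge_success[1h])) + sum(rate(merge_failure[1h])))
```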
RabbitMQ
For metrics exposed by the RabbitMQ component, refer to the RabbitMQ Prometheus documentation: https://www.rabbitmq.com/prometheus.html
Streaming Metrics
HTTP Stream Ingest
These metrics are specific to the use of streaming data sources.
Metric | Type | Components | Purpose |
---|---|---|---|
hdx_sink_backlog_bytes_count | Gauge | Intake head | Total bytes of all partition buckets in sink backlog waiting to be indexed. Only produced when intake_head_index_backlog_enabled is true. |
hdx_sink_backlog_items_count | Gauge | Intake head | Total count of partition buckets in sink backlog waiting to be indexed. Only produced when intake_head_index_backlog_enabled is true. |
hdx_sink_backlog_dropped_bytes_count | Counter | Intake head | Total bytes of partition buckets dropped due to the backlog growing too big. Only produced when intake_head_index_backlog_enabled is true. |
hdx_sink_backlog_dropped_items_count | Counter | Intake head | Count of partition buckets dropped due to the backlog growing too big. Only produced when intake_head_index_backlog_enabled is true. |
hdx_sink_backlog_delivery_count | Counter | Intake head | Count of backlog buckets successfully handed off to indexing. Only produced when intake_head_index_backlog_enabled is true. |
hdx_sink_backlog_trim_duration_ns | Histogram | Intake head | Time to trim the backlog, in nanoseconds. Only produced when intake_head_index_backlog_enabled is true. |
http_source_byte_count | Counter | Stream head | Count of bytes processed. |
http_source_request_count | Counter | Stream head | Count of http requests. |
http_source_request_duration_ns | Histogram | Stream head | A histogram of HTTP request durations in nanoseconds. |
http_source_request_error_count | Counter | Stream head | Count of http request failures. |
http_source_row_count | Counter | Stream head | Count of rows processed. |
http_source_value_count | Counter | Stream head | Count of values processed. |
kinesis_source_byte_count | Counter | Stream peer | Count of bytes read from Kinesis. |
kinesis_source_checkpoint_count | Counter | Stream peer | Count of Kinesis checkpoint operations. |
kinesis_source_checkpoint_duration_ns | Histogram | Stream peer | Duration of Kinesis checkpoint operations. |
kinesis_source_checkpoint_error_count | Counter | Stream peer | Count of errors in Kinesis checkpoint operations. |
kinesis_source_error_count | Counter | Stream peer | Count of errors in Kinesis source reads. |
kinesis_source_lag_ms | Gauge | Stream peer | Measure of lag in Kinesis source. |
kinesis_source_operation_count | Counter | Stream peer | Count of operations on Kinesis. |
kinesis_source_operation_duration_ns | Histogram | Stream peer | Histogram of duration of operations on Kinesis. |
kinesis_source_record_count | Counter | Stream peer | Count of records read from Kinesis. |
kinesis_source_row_count | Counter | Stream peer | Count of rows read from Kinesis. |
kinesis_source_value_count | Counter | Stream peer | Count of values read from Kinesis. |
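For example, the request counters above can be turned into an HTTP ingest error ratio, and the backlog gauges can be watched directly (the backlog series only exist when intake_head_index_backlog_enabled is true). Both queries below are illustrative sketches:

```promql
# Fraction of HTTP streaming ingest requests that failed in the last five minutes.
sum(rate(http_source_request_error_count[5m]))
  / sum(rate(http_source_request_count[5m]))

# Current sink backlog size in bytes, per intake head.
hdx_sink_backlog_bytes_count
```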
Redpanda
For metrics exposed by the Redpanda component, refer to the Redpanda monitoring documentation: https://docs.redpanda.com/docs/cluster-administration/monitoring/
Kafka Ingest
These metrics are specific to the use of Kafka data sources.
Metric | Type | Components | Purpose |
---|---|---|---|
kafka_source_byte_count | Counter | Stream peer | Count of bytes read from Kafka. |
kafka_source_commit_duration_ns | Histogram | Stream peer | Kafka commit duration. |
kafka_source_read_count | Counter | Stream peer | Count of Kafka reads. |
kafka_source_read_duration_ns | Histogram | Stream peer | Kafka read duration. |
kafka_source_read_error_count | Counter | Stream peer | Count of Kafka errors. |
kafka_source_row_count | Counter | Stream peer | Count of rows processed. |
kafka_source_value_count | Counter | Stream peer | Count of values processed. |
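For example, a sketch of Kafka ingest throughput and error rate built from these counters:

```promql
# Bytes read from Kafka per second over the last five minutes.
sum(rate(kafka_source_byte_count[5m]))

# Kafka read errors per second over the same window.
sum(rate(kafka_source_read_error_count[5m]))
```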
DNS Metrics
These metrics track DNS resolution activity in the ingest components.
Metric | Type | Components | Purpose |
---|---|---|---|
dns_num_ips_in_cache | Histogram | (ingest) | The size of the IP pool used in the DNS system. |
dns_lookup_time | Histogram | (ingest) | Milliseconds per lookup. |
dns_ttl | Histogram | (ingest) | TTLs received per lookup. |
Go Environment Metrics
These metrics track resources used by Hydrolix's Go environments.
Metric | Type | Components | Purpose |
---|---|---|---|
go_gc_duration_seconds | Summary | Batch peer, Stream head, Stream peer | A summary of the pause duration of garbage collection cycles. |
go_goroutines | Gauge | Batch peer, Stream head, Stream peer | Number of goroutines that currently exist. |
go_info | Gauge | Batch peer, Stream head, Stream peer | Information about the Go environment. |
go_memstats_alloc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes allocated and still in use. |
go_memstats_alloc_bytes_total | Counter | Batch peer, Stream head, Stream peer | Total number of bytes allocated, even if freed. |
go_memstats_buck_hash_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used by the profiling bucket hash table. |
go_memstats_frees_total | Counter | Batch peer, Stream head, Stream peer | Total number of frees. |
go_memstats_gc_cpu_fraction | Gauge | Batch peer, Stream head, Stream peer | The fraction of this program's available CPU time used by the GC since the program started. |
go_memstats_gc_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for garbage collection system metadata. |
go_memstats_heap_alloc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes allocated and still in use. |
go_memstats_heap_idle_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes waiting to be used. |
go_memstats_heap_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes that are in use. |
go_memstats_heap_objects | Gauge | Batch peer, Stream head, Stream peer | Number of allocated objects. |
go_memstats_heap_released_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes released to OS. |
go_memstats_heap_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes obtained from system. |
go_memstats_last_gc_time_seconds | Gauge | Batch peer, Stream head, Stream peer | Number of seconds since 1970 of last garbage collection. |
go_memstats_lookups_total | Counter | Batch peer, Stream head, Stream peer | Total number of pointer lookups. |
go_memstats_mallocs_total | Counter | Batch peer, Stream head, Stream peer | Total number of mallocs. |
go_memstats_mcache_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by mcache structures. |
go_memstats_mcache_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for mcache structures obtained from system. |
go_memstats_mspan_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by mspan structures. |
go_memstats_mspan_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for mspan structures obtained from system. |
go_memstats_next_gc_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of heap bytes when next garbage collection will take place. |
go_memstats_other_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes used for other system allocations. |
go_memstats_stack_inuse_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes in use by the stack allocator. |
go_memstats_stack_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes obtained from system for stack allocator. |
go_memstats_sys_bytes | Gauge | Batch peer, Stream head, Stream peer | Number of bytes obtained from system. |
go_threads | Gauge | Batch peer, Stream head, Stream peer | Number of OS threads created. |
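For example, heap usage can be compared across components with a query such as:

```promql
# Heap bytes currently allocated and in use, per Hydrolix component.
sum by (service) (go_memstats_heap_alloc_bytes)
```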
PostgreSQL Pool Metrics
Metric | Type | Purpose |
---|---|---|
pgx_pool_total_acquire_count | Counter | The cumulative count of successful acquires from the pool. |
pgx_pool_total_acquire_duration_ns_count | Counter | The total duration of all successful acquires from the pool. |
pgx_pool_total_acquire_cancel_count | Counter | The cumulative count of acquires from the pool that were canceled by a context. |
pgx_pool_total_acquire_empty_count | Counter | The cumulative count of successful acquires from the pool that waited for a resource to be released or constructed because the pool was empty. |
pgx_pool_total_conns_opened_count | Counter | The cumulative count of new connections opened. |
pgx_pool_total_destroyed_max_lifetime_count | Counter | The cumulative count of connections destroyed because they exceeded MaxConnLifetime. |
pgx_pool_total_destroyed_max_idle_count | Counter | The cumulative count of connections destroyed because they exceeded MaxConnIdleTime. |
pgx_pool_current_size | Gauge | The total number of resources currently in the pool. |
pgx_pool_current_constructing | Gauge | The number of connections with construction in progress in the pool. |
pgx_pool_current_acquired | Gauge | The number of currently acquired connections in the pool. |
pgx_pool_current_idle | Gauge | The number of currently idle connections in the pool. |
pgx_pool_max | Gauge | The maximum size of the pool. |
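For example, pool saturation can be estimated by comparing acquired connections against the pool maximum; this is an illustrative sketch, not a built-in dashboard query:

```promql
# Fraction of the PostgreSQL connection pool currently in use.
pgx_pool_current_acquired / pgx_pool_max
```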