
Use Intake Pool Monitoring

Query heartbeat data, configure pool monitoring, and set up alerts for the monitor-ingest service.

For an overview of how intake pool monitoring works, see Intake pool monitoring.

View active pools⚓︎

Check which pools are being monitored and their recent activity:

View Active Pool Monitoring
SELECT
  intake_pool,
  COUNT(*) as heartbeat_count,
  MIN(timestamp) as first_heartbeat,
  MAX(timestamp) as last_heartbeat,
  dateDiff('second', MIN(timestamp), MAX(timestamp)) as monitoring_duration_seconds
FROM hydro.monitor
WHERE timestamp > NOW() - INTERVAL 5 MINUTE
GROUP BY intake_pool
ORDER BY heartbeat_count DESC

Expected results⚓︎

  • Monitored pools should show ~300 heartbeats over five minutes (60/min × 5 min)
  • Exempted pools shouldn't appear in results
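The expected count is straightforward arithmetic, and it can be handy to encode it when scripting checks against query output. A minimal sketch (the function names and the 10% tolerance are illustrative, not part of the service):

```python
# Sanity-check heartbeat counts from the "View Active Pool Monitoring" query.
# Assumes ~60 heartbeats per pool per minute; the 10% tolerance is illustrative.

EXPECTED_PER_MINUTE = 60
TOLERANCE = 0.10  # accept +/-10% jitter (assumption)

def expected_heartbeats(window_minutes):
    """Heartbeats a healthy pool should emit over the window."""
    return EXPECTED_PER_MINUTE * window_minutes

def within_expected(count, window_minutes, tolerance=TOLERANCE):
    """True if an observed count is within tolerance of the expected total."""
    expected = expected_heartbeats(window_minutes)
    return abs(count - expected) <= expected * tolerance

print(within_expected(298, 5))  # True: close to the expected 300
print(within_expected(150, 5))  # False: roughly half the heartbeats are missing
```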

Detect data drops⚓︎

Identify gaps in heartbeat data that may indicate ingestion issues:

Detect Data Drops
SELECT
  intake_pool,
  toStartOfMinute(timestamp) as minute,
  COUNT(*) as heartbeats,
  MIN(timestamp) as first_heartbeat,
  MAX(timestamp) as last_heartbeat
FROM hydro.monitor
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY intake_pool, minute
HAVING heartbeats < 50  -- Expected ~60 per minute
ORDER BY minute DESC, intake_pool

Status interpretation⚓︎

  • Healthy pools: 55-65 heartbeats per minute
  • Degraded: 30-54 heartbeats per minute
  • Severe issues: <30 heartbeats per minute
  • Complete outage: No results for pool
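For reporting scripts, these bands translate directly into a small classifier. A sketch under the thresholds above (the function name is ours; counts above 65 are treated as healthy here for simplicity):

```python
def classify_pool(heartbeats_per_minute):
    """Map a per-minute heartbeat count to the status bands above.
    None means the pool returned no rows at all; counts above 65
    are treated as healthy here for simplicity."""
    if heartbeats_per_minute is None:
        return "complete outage"
    if heartbeats_per_minute >= 55:
        return "healthy"
    if heartbeats_per_minute >= 30:
        return "degraded"
    return "severe"

print(classify_pool(58))    # healthy
print(classify_pool(None))  # complete outage
```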

Compare latency across pools⚓︎

Measure ingestion latency to identify performance differences:

Compare Latency Across Pools
SELECT
  intake_pool,
  COUNT(*) as samples,
  AVG(dateDiff('millisecond', monitor_request_timestamp, timestamp)) as avg_latency_ms,
  quantile(0.50)(dateDiff('millisecond', monitor_request_timestamp, timestamp)) as p50_latency_ms,
  quantile(0.95)(dateDiff('millisecond', monitor_request_timestamp, timestamp)) as p95_latency_ms,
  quantile(0.99)(dateDiff('millisecond', monitor_request_timestamp, timestamp)) as p99_latency_ms,
  MAX(dateDiff('millisecond', monitor_request_timestamp, timestamp)) as max_latency_ms
FROM hydro.monitor
WHERE timestamp > NOW() - INTERVAL 15 MINUTE
GROUP BY intake_pool
ORDER BY avg_latency_ms DESC

Interpret latency⚓︎

  • Consistently high p95/p99 latency indicates pool performance issues
  • Sudden spikes in max latency may indicate temporary resource constraints
  • Significant variance between pools suggests load imbalance
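Load imbalance in particular is easy to flag programmatically from the query output above. A sketch that compares each pool's p95 latency to the fleet median (the 2x factor is an illustrative choice, not a documented threshold):

```python
import statistics

def imbalanced_pools(p95_by_pool, factor=2.0):
    """Return pools whose p95 latency exceeds `factor` times the fleet median."""
    median = statistics.median(p95_by_pool.values())
    return sorted(pool for pool, p95 in p95_by_pool.items() if p95 > factor * median)

# pool-c's p95 is far above the fleet median of 14 ms:
print(imbalanced_pools({"pool-a": 12.0, "pool-b": 14.0, "pool-c": 95.0}))  # ['pool-c']
```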

Identify missing heartbeats⚓︎

Detect prolonged gaps that indicate outages:

Identify Missing Heartbeats
SELECT
  intake_pool,
  gap_start,
  gap_end,
  gap_seconds
FROM (
  SELECT
    intake_pool,
    timestamp as gap_start,
    leadInFrame(timestamp, 1) OVER (PARTITION BY intake_pool ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as gap_end,
    dateDiff('second', timestamp, leadInFrame(timestamp, 1) OVER (PARTITION BY intake_pool ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) as gap_seconds
  FROM hydro.monitor
  WHERE timestamp > NOW() - INTERVAL 1 HOUR
)
WHERE gap_seconds > 5
ORDER BY gap_seconds DESC, intake_pool
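The window-function logic can be mirrored client-side, which makes it easy to experiment with gap thresholds on exported data. A sketch assuming per-pool timestamps as Unix seconds:

```python
def find_gaps(timestamps, min_gap_seconds=5):
    """Return (gap_start, gap_end, gap_seconds) for consecutive heartbeats
    of a single pool that are more than min_gap_seconds apart."""
    ts = sorted(timestamps)
    return [(a, b, b - a) for a, b in zip(ts, ts[1:]) if b - a > min_gap_seconds]

# Heartbeats at t=0..3, then silence until t=20:
print(find_gaps([0, 1, 2, 3, 20]))  # [(3, 20, 17)]
```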

Thresholds⚓︎

  • 5-10 seconds: Minor issue, investigate if recurring
  • 10-60 seconds: Moderate issue, likely pod restart or network problem
  • 60 seconds or more: Unresponsive pool requiring immediate attention
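Like the status bands earlier, these thresholds map naturally onto a small severity function; this sketch uses our own names and treats gaps of 5 seconds or less as normal:

```python
def gap_severity(gap_seconds):
    """Map a heartbeat gap to the severity bands above."""
    if gap_seconds >= 60:
        return "unresponsive"
    if gap_seconds >= 10:
        return "moderate"
    if gap_seconds > 5:
        return "minor"
    return "normal"

print(gap_severity(7))    # minor
print(gap_severity(120))  # unresponsive
```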

Enable monitor-ingest⚓︎

Enable the service in the HydrolixCluster resource:

Enable Monitor-Ingest Service
apiVersion: hydrolix.io/v1
kind: HydrolixCluster
metadata:
  name: hdx
spec:
  monitor_ingest: true

The service is disabled by default (monitor_ingest: false).

Exempt pools from monitoring⚓︎

Exclude specific pools using the monitor_ingest_pool_exemptions tunable.

Configure Pool Exemptions
spec:
  monitor_ingest: true
  monitor_ingest_pool_exemptions:
    - hydrologs-intake-head
    - internal-diagnostics-pool
    - test-intake-pool

Common exemption use cases⚓︎

  • Internal system pools - Pools used for cluster logs or diagnostics (for example, hydrologs)
  • Dedicated test pools - Pools reserved for testing or development
  • Low-priority pools - Pools where monitoring overhead isn't justified
  • Third-party integrations - Pools managed by external systems

Configuration tunables⚓︎

For configuration options including timeout settings and pool exemptions, see the Hydrolix tunables reference.

Prometheus metrics⚓︎

The monitor-ingest service exposes metrics for monitoring and alerting:

Metric                           Type       Description
hydromon_retry_timeout_exceeded  counter    Failed heartbeat requests that exceeded the retry timeout
hydromon_submission_in_progress  gauge      Heartbeat requests currently in progress
hydromon_time_for_submission     histogram  Time from request initiation to completion
hydromon_time_for_request        histogram  Time spent on the HTTP request/response
hydromon_error                   counter    Failed heartbeat requests

Example Prometheus queries⚓︎

Average number of in-progress heartbeat submissions over five minutes (hydromon_submission_in_progress is a gauge, so use avg_over_time rather than rate):

Average In-Progress Submissions Over the Last 5 Minutes
avg_over_time(hydromon_submission_in_progress[5m])

Failed submission rate over five minutes:

Failed Submission Rate Over the Last 5 Minutes
rate(hydromon_error[5m])

Submission latency (p95) over five minutes:

Submission latency (p95) Over the Last 5 Minutes
histogram_quantile(0.95, rate(hydromon_time_for_submission[5m]))
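histogram_quantile estimates a quantile by linear interpolation within histogram buckets, so the precision of a p95 figure depends on how finely hydromon_time_for_submission is bucketed. A simplified sketch of the interpolation, using cumulative (upper_bound, count) pairs as Prometheus stores them (the real function also handles staleness and malformed bucket data):

```python
import math

def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile.
    `buckets` is a sorted list of (upper_bound, cumulative_count)
    pairs ending with (inf, total)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the +Inf bucket
            # Linear interpolation within the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 90 of 100 submissions finished within 0.5s, all within 1s;
# the p95 lands halfway through the 0.5s-1.0s bucket (~0.75s):
print(histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]))
```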

Alerting strategies⚓︎

Use Prometheus alerting rules on these metrics to catch heartbeat failures as they develop.

Critical: High error rate⚓︎

Alert when heartbeat submission errors increase:

Critical Alert: High Error Rate
- alert: IntakePoolHeartbeatErrors
  expr: |
    rate(hydromon_error[5m]) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High heartbeat error rate detected"
    description: "Error rate is {{ $value }} per second (threshold: 0.1/sec)"
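To make the threshold concrete: rate(...[5m]) is a per-second average over the window, so 0.1/sec corresponds to roughly 30 failed heartbeats in five minutes. A quick sketch of that arithmetic (the function name is ours):

```python
def errors_implied_by_rate(rate_per_second, window_seconds=300):
    """Approximate failure count over the rate() window implied by a per-second rate."""
    return rate_per_second * window_seconds

print(errors_implied_by_rate(0.1))  # ~30 failed heartbeats over 5 minutes
```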

Warning: Slow heartbeat submissions⚓︎

Alert when submissions are taking too long:

Warning Alert: Slow Submissions
- alert: IntakePoolSlowHeartbeats
  expr: |
    histogram_quantile(0.95, rate(hydromon_time_for_submission[5m])) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Slow heartbeat submissions"
    description: "P95 submission time is {{ $value }} seconds (threshold: 5s)"

Warning: Retry timeouts⚓︎

Alert when heartbeat requests exceed retry timeout:

Warning Alert: Retry Timeouts
- alert: IntakePoolRetryTimeouts
  expr: |
    rate(hydromon_retry_timeout_exceeded[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Heartbeat retry timeouts detected"
    description: "Timeout rate is {{ $value }} per second (threshold: 0.05/sec)"

Troubleshooting⚓︎

Use these diagnostic steps to resolve common issues with intake pool monitoring.

No data in hydro.monitor table⚓︎

Check if the monitor-ingest service is enabled⚓︎

Check Service Configuration
kubectl get hdx -n <namespace> -o yaml | grep monitor_ingest

Verify monitor-ingest pod is running⚓︎

Check Pod Status
kubectl get pods -n <namespace> | grep monitor-ingest

Check monitor-ingest pod logs for errors⚓︎

View Pod Logs
kubectl logs -n <namespace> -l app=monitor-ingest

Pool not appearing in results⚓︎

Verify pool service type⚓︎

Verify Pool Service Type
# Pool must be one of these service types:
pools:
  - name: my-pool
    service: intake-head  # or http-head, stream-head

Check if pool is exempted⚓︎

Check Pool Exemption Status
spec:
  monitor_ingest_pool_exemptions:
    - my-pool  # Remove if this pool should be monitored

Unexpected latency spikes⚓︎

Check for resource constraints⚓︎

Check Resource Usage
kubectl top pods -n <namespace> | grep intake

Review Prometheus metrics for correlation⚓︎

Review Error and Latency Metrics
rate(hydromon_error[5m])
histogram_quantile(0.95, rate(hydromon_time_for_submission[5m]))

Investigate intake pool logs⚓︎

View Intake Pool Logs
kubectl logs -n <namespace> -l service=intake-head