Use Intake Pool Monitoring
Query heartbeat data, configure pool monitoring, and set up alerts for the monitor-ingest service.
For an overview of how intake pool monitoring works, see Intake pool monitoring.
View active pools
Check which pools are being monitored and their recent activity:
Expected results
- Monitored pools should show ~300 heartbeats over five minutes (60/min × 5 min)
- Exempted pools shouldn't appear in results
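The expected count is the heartbeat rate multiplied by the window length. A minimal Python sketch of that arithmetic follows. The `tolerance` parameter and its 10% default are assumptions made for illustration, not documented thresholds.

```python
def expected_heartbeats(rate_per_minute: int = 60, window_minutes: int = 5) -> int:
    """Expected heartbeat count for one monitored pool over the window."""
    return rate_per_minute * window_minutes


def is_near_expected(observed: int, expected: int, tolerance: float = 0.1) -> bool:
    """True when observed is within +/- tolerance (a fraction) of expected.

    The 10% default is a hypothetical value, not a documented threshold.
    """
    return abs(observed - expected) <= tolerance * expected


print(expected_heartbeats())        # 300
print(is_near_expected(290, 300))   # True
print(is_near_expected(150, 300))   # False
```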
Detect data drops
Identify gaps in heartbeat data that may indicate ingestion issues:
Status interpretation
- Healthy pools: 55-65 heartbeats per minute
- Degraded: 30-54 heartbeats per minute
- Severe issues: <30 heartbeats per minute
- Complete outage: No results for pool
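The ranges above can be applied to query results with a small classifier. This Python sketch assumes the per-minute count is already available as an integer, and uses `None` to model a pool that returned no rows. The list does not say how to treat counts above 65, so the final branch is an assumption.

```python
from typing import Optional


def classify_heartbeat_rate(per_minute: Optional[int]) -> str:
    """Map a pool's heartbeats-per-minute count to a status label.

    None models a pool that returned no rows at all.
    """
    if per_minute is None:
        return "complete outage"
    if per_minute < 30:
        return "severe"
    if per_minute < 55:
        return "degraded"
    if per_minute <= 65:
        return "healthy"
    # Counts above 65 are not covered by the documented ranges;
    # flagging them for review is an assumption of this sketch.
    return "above expected range"


for sample in (None, 12, 40, 60, 80):
    print(sample, "->", classify_heartbeat_rate(sample))
```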
Compare latency across pools
Measure ingestion latency to identify performance differences:
Interpret latency
- Consistently high p95/p99 latency indicates pool performance issues
- Sudden spikes in max latency may indicate temporary resource constraints
- Significant variance between pools suggests load imbalance
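To make the p95, p99, and max comparison concrete, the following Python sketch summarizes latency samples per pool using a nearest-rank percentile. The pool names and sample values are invented for illustration, and the percentile method used by the actual query engine may differ.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_by_pool: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute p95, p99, and max latency for each pool."""
    return {
        pool: {
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
            "max": max(values),
        }
        for pool, values in latencies_by_pool.items()
    }


# Hypothetical samples: pool-a is uniformly slow, pool-b has one spike.
samples = {
    "pool-a": list(range(1, 101)),
    "pool-b": [10] * 99 + [500],
}
for pool, stats in summarize(samples).items():
    print(pool, stats)
```

In this made-up data, pool-b has a low p95 and p99 but a large max, which matches the "sudden spike" pattern, while pool-a's high p95 and p99 match the "consistently high" pattern.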
Identify missing heartbeats
Detect prolonged gaps that indicate outages:
Thresholds
- 5-10 seconds: Minor issue, investigate if recurring
- 10-60 seconds: Moderate issue, likely pod restart or network problem
- 60 seconds or more: Unresponsive pool requiring immediate attention
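The thresholds above can be applied to a series of heartbeat timestamps as in the Python sketch below. It assumes timestamps are numeric seconds for a single pool. Because the documented ranges share their endpoints, assigning exactly 10 seconds to "moderate" and exactly 60 seconds to "unresponsive" is a choice made for this sketch.

```python
from typing import List, Tuple


def classify_gap(seconds: float) -> str:
    """Label a gap between consecutive heartbeats."""
    if seconds >= 60:
        return "unresponsive"
    if seconds >= 10:
        return "moderate"
    if seconds >= 5:
        return "minor"
    return "ok"


def find_gaps(timestamps: List[float], min_gap: float = 5.0) -> List[Tuple[float, float, str]]:
    """Return (gap_start, gap_seconds, label) for each pair of
    consecutive heartbeats that are at least min_gap seconds apart."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        delta = current - previous
        if delta >= min_gap:
            gaps.append((previous, delta, classify_gap(delta)))
    return gaps


# Hypothetical heartbeat times, in seconds.
print(find_gaps([0, 1, 2, 9, 10, 25, 100]))
```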
Enable monitor-ingest
Enable the service in the HydrolixCluster resource:
[Code block: Enable Monitor-Ingest Service]
The service is disabled by default (monitor_ingest: false).
Exempt pools from monitoring
Exclude specific pools using the monitor_ingest_pool_exemptions tunable.
[Code block: Configure Pool Exemptions]
Common exemption use cases
- Internal system pools - Pools used for cluster logs or diagnostics (for example, hydrologs)
- Dedicated test pools - Pools reserved for testing or development
- Low-priority pools - Pools where monitoring overhead isn't justified
- Third-party integrations - Pools managed by external systems
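The effect of an exemption list on what appears in monitoring results can be sketched as a simple filter. This Python example assumes exemptions are exact pool names. The actual value format and matching rules of `monitor_ingest_pool_exemptions` are not described here, and every pool name other than `hydrologs` is hypothetical.

```python
from typing import Iterable, List, Set


def monitored_pools(all_pools: Iterable[str], exemptions: Set[str]) -> List[str]:
    """Return the pool names that are not exempted, preserving input order.

    Exact-name matching is an assumption of this sketch.
    """
    return [name for name in all_pools if name not in exemptions]


# "intake-head" and "intake-test" are hypothetical pool names.
pools = ["intake-head", "hydrologs", "intake-test"]
exempt = {"hydrologs", "intake-test"}
print(monitored_pools(pools, exempt))  # ['intake-head']
```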
Configuration tunables
For configuration options including timeout settings and pool exemptions, see the Hydrolix tunables reference.
Prometheus metrics
The monitor-ingest service exposes metrics for monitoring and alerting:
| Metric | Type | Description |
|---|---|---|
| hydromon_retry_timeout_exceeded | counter | Failed heartbeat requests that exceeded retry timeout |
| hydromon_submission_in_progress | gauge | Heartbeat requests currently in progress |
| hydromon_time_for_submission | histogram | Time from request initiation to completion |
| hydromon_time_for_request | histogram | Time spent on HTTP request/response |
| hydromon_error | counter | Failed heartbeat requests |
Example Prometheus queries
In-progress heartbeat rate over five minutes:
[Code block: Heartbeat Submission Rate Over the Last 5 Minutes]
Failed submission rate over five minutes:
[Code block: Failed Submission Rate Over the Last 5 Minutes]
Submission latency (p95) over five minutes:
[Code block: Submission latency (p95) Over the Last 5 Minutes]
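As a rough illustration of queries of this kind, the Python sketch below builds PromQL expressions from the metric names in the table above. These are not the page's original queries. They assume the standard PromQL functions `rate`, `avg_over_time`, and `histogram_quantile`, and the conventional `_bucket` suffix on histogram series. The exported metric names may differ in a given cluster (for example, some clients append `_total` to counters). Because `hydromon_submission_in_progress` is a gauge, the sketch averages it over the window rather than applying `rate`, which is intended for counters.

```python
WINDOW = "5m"


def in_progress_avg(window: str = WINDOW) -> str:
    """Average number of in-flight heartbeat requests over the window."""
    return f"avg_over_time(hydromon_submission_in_progress[{window}])"


def error_rate(window: str = WINDOW) -> str:
    """Per-second rate of failed heartbeat requests over the window."""
    return f"rate(hydromon_error[{window}])"


def submission_latency_quantile(q: float = 0.95, window: str = WINDOW) -> str:
    """Quantile of submission time, assuming a *_bucket histogram series."""
    return (
        f"histogram_quantile({q}, "
        f"sum by (le) (rate(hydromon_time_for_submission_bucket[{window}])))"
    )


for expression in (in_progress_avg(), error_rate(), submission_latency_quantile()):
    print(expression)
```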
Alerting strategies
Use Prometheus alerting rules to detect heartbeat submission errors, slow submissions, and retry timeouts.
Critical: High error rate
Alert when heartbeat submission errors increase:
[Code block: Critical Alert: High Error Rate]
Warning: Slow heartbeat submissions
Alert when submissions are taking too long:
[Code block: Warning Alert: Slow Submissions]
Warning: Retry timeouts
Alert when heartbeat requests exceed retry timeout:
[Code block: Warning Alert: Retry Timeouts]
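The three alerts above could be expressed as Prometheus alerting rules along the following lines. This Python sketch only assembles the rule structure, using the standard rule fields `alert`, `expr`, `for`, `labels`, and `annotations`. Every alert name, threshold, and duration is a hypothetical placeholder, the `_bucket` suffix is the usual histogram convention rather than something stated here, and the unit of the latency threshold depends on how the histogram is recorded.

```python
import json


def alert_rule(name: str, expr: str, duration: str, severity: str, summary: str) -> dict:
    """Build one rule using Prometheus alerting-rule field names."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


# Assumes the conventional *_bucket series for the submission histogram.
P95_SUBMISSION = (
    "histogram_quantile(0.95, sum by (le) "
    "(rate(hydromon_time_for_submission_bucket[5m])))"
)

# All names, thresholds, and durations below are hypothetical placeholders.
rules = [
    alert_rule("MonitorIngestHighErrorRate", "rate(hydromon_error[5m]) > 0.1",
               "5m", "critical", "Heartbeat submission errors are elevated"),
    alert_rule("MonitorIngestSlowSubmissions", f"{P95_SUBMISSION} > 5",
               "10m", "warning", "Heartbeat submissions are slow"),
    alert_rule("MonitorIngestRetryTimeouts",
               "increase(hydromon_retry_timeout_exceeded[5m]) > 0",
               "5m", "warning", "Heartbeat requests exceeded the retry timeout"),
]

# Rule files are normally YAML; JSON is printed to avoid a YAML dependency.
print(json.dumps({"groups": [{"name": "monitor-ingest", "rules": rules}]}, indent=2))
```

Tune the thresholds against observed baseline values before using rules like these in production.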
Troubleshooting
Use these diagnostic steps to resolve common issues with intake pool monitoring.
No data in hydro.monitor table
Check if the monitor-ingest service is enabled
[Code block: Check Service Configuration]
Verify monitor-ingest pod is running
[Code block: Check Pod Status]
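One way to script this check is to list pods as JSON with `kubectl` and filter by name. The Python sketch below assumes the pod name contains `monitor-ingest`, which is an assumption, and the namespace in the usage note is a placeholder. Only the parsing step is exercised here, so the example runs without cluster access.

```python
import json
from typing import List, Tuple


def kubectl_get_pods_cmd(namespace: str) -> List[str]:
    """Argument list for listing pods in a namespace as JSON."""
    return ["kubectl", "get", "pods", "-n", namespace, "-o", "json"]


def monitor_ingest_pods(pods_json: str, name_fragment: str = "monitor-ingest") -> List[Tuple[str, str]]:
    """Return (pod name, phase) for pods whose name contains name_fragment.

    The default fragment is an assumption about how the pod is named.
    """
    items = json.loads(pods_json).get("items", [])
    return [
        (item["metadata"]["name"], item.get("status", {}).get("phase", "Unknown"))
        for item in items
        if name_fragment in item["metadata"]["name"]
    ]


# Hypothetical output in the shape returned by `kubectl get pods -o json`.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "monitor-ingest-abc12"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "intake-head-xyz34"}, "status": {"phase": "Running"}},
    ]
})
print(monitor_ingest_pods(sample))
```

Against a live cluster, the JSON could come from `subprocess.run(kubectl_get_pods_cmd("my-namespace"), capture_output=True, text=True, check=True).stdout`, where `my-namespace` stands in for the cluster's namespace.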
Check monitor-ingest pod logs for errors
[Code block: View Pod Logs]
Pool not appearing in results
Verify pool service type
[Code block: Verify Pool Service Type]
Check if pool is exempted
[Code block: Check Pool Exemption Status]
Unexpected latency spikes
Check for resource constraints
[Code block: Check Resource Usage]
Review Prometheus metrics for correlation
[Code block: Review Error and Latency Metrics]
Investigate intake pool logs
[Code block: View Intake Pool Logs]