Using Anomaly Detection

Limited Availability

Anomaly Detection is currently in Limited Availability. Contact support@hydrolix.io to learn more about access.

This guide covers how to use Anomaly Detection in your daily operations, from viewing anomalies in dashboards to investigating contributing factors for an incident.

Once Anomaly Detection is enabled, you'll have access to pre-built Grafana dashboards that display detected anomalies alongside your metrics.

Key features⚓︎

Anomaly Detection includes:

  • Anomaly indicators that mark when and where anomalies occurred
  • Contextual dashboards that correlate anomalous events with relevant metrics for faster, more accurate analysis
  • Natural language descriptions that provide LLM-backed interpretations of anomaly data and recommended next steps
  • Alerting through interoperation with Grafana alerts

Grafana Filter⚓︎

Use the rca_ids Grafana filter at the top of the dashboard to narrow results. Each RCA report is assigned a unique rca_id that groups related anomalies together. Filter to view anomalies and RCA reports associated with a specific RCA ID.
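The effect of the rca_ids filter can be sketched in Python. The record fields and ID values below are hypothetical; only the idea that every anomaly carries the rca_id of its RCA report comes from this guide.

```python
# Hypothetical anomaly records; each carries the rca_id that groups it
# with its RCA report, as described above.
anomalies = [
    {"rca_id": "rca-001", "metric": "ttlb_p95", "dimension": "cloudfront - UK"},
    {"rca_id": "rca-001", "metric": "error_rate", "dimension": "cloudfront - UK"},
    {"rca_id": "rca-002", "metric": "ttlb_p95", "dimension": "fastly - DE"},
]

def filter_by_rca(records, rca_id):
    """Keep only the anomalies that belong to one RCA report."""
    return [r for r in records if r["rca_id"] == rca_id]

selected = filter_by_rca(anomalies, "rca-001")
print(len(selected))  # → 2: both anomalies grouped under rca-001
```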

Root Cause Analysis (RCA)⚓︎

RCA uses AI-powered analysis to correlate anomalies and provide actionable insights. When RCA detects one or more related anomalies, it generates a report directly in the Anomaly Detection dashboard.

Understanding RCA output in an example scenario⚓︎

Scenario Description

Multiple anomalies are detected indicating high latency for CloudFront in the UK region on February 3, 2026.

A timeline of the anomalies⚓︎

Evidence Timeline

The Timeline chart shows multiple anomalies in ttlb_p95, or Time to Last Byte at the 95th percentile: the time it takes a CloudFront edge server to fully deliver a response to a client, measured at the 95th percentile across all requests in a given window. A spike in ttlb_p95 means that the slowest 5% of requests are taking significantly longer than normal to complete, which is a signal of performance degradation.
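The percentile behind ttlb_p95 can be illustrated with a small nearest-rank sketch; the sample values are invented for illustration, and the product's actual aggregation may differ.

```python
def p95(values):
    """Nearest-rank 95th percentile: the value below which ~95% of samples fall."""
    ranked = sorted(values)
    idx = max(0, int(round(0.95 * len(ranked))) - 1)
    return ranked[idx]

# Hypothetical TTLB samples (ms) for one window: mostly fast, with a slow tail.
window = [120, 130, 125, 140, 118, 135, 128, 122, 131, 900]
print(p95(window))  # → 900: the slow tail dominates the 95th percentile
```

Because p95 tracks the slow tail rather than the average, a handful of degraded requests is enough to move it, which is why it is a sensitive latency signal.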

A summary of what happened⚓︎

The RCA report generates a natural language summary of the anomalies.

What happened: error rate by CDN, ISO code, ASN

Severity and duration⚓︎

Key charts and LLM-generated problem analysis are included in the RCA report.

  • High-level description of what was detected
  • Issue duration
  • Severity assessment

Severity, duration, description

An impact statement⚓︎

Most impacted

The report also lists recommended actions:

  • Immediate steps to investigate further
  • Potential remediation actions, ordered by likelihood of resolving the issue
  • Monitoring adjustments to prevent recurrence

Actions

The report suggests shifting UK traffic to an alternate CDN provider or region, implementing client-side retries with exponential backoff, adjusting cache headers to reduce origin load, and setting alerts on ttlb_p95 for early detection of similar latency events.
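The client-side retry recommendation can be sketched as follows. The request function, attempt limit, and delays are all hypothetical; this is a generic exponential-backoff pattern, not code from the product.

```python
import random
import time

def fetch_with_backoff(request_fn, max_attempts=4, base_delay=0.5):
    """Retry a failing request with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage with a hypothetical flaky request that succeeds on the third try:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("slow edge response")
    return "ok"

print(fetch_with_backoff(flaky, base_delay=0.05))  # → ok, after two retries
```

Backoff with jitter spreads retries out over time, so a latency incident like the one above does not trigger a synchronized wave of retry traffic that worsens the degradation.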

Key suspects, confidence, and reasoning⚓︎

  • List of likely causes based on the anomaly patterns
  • Supporting evidence from correlated metrics

Key suspects · Confidence · Reasoning

The report indicates that issues at the CloudFront edge layer in the UK region are the most likely cause, potentially including upstream routing delays, edge server degradation, or misconfigured cache headers increasing origin load. The geographic isolation to UK traffic, with other regions unaffected, strongly supports a localized infrastructure issue rather than an origin-side problem.

Contributing anomalies⚓︎

The report ends with a list of the relevant anomalies.

Contributing anomalies

The report identifies a sustained ttlb_p95 anomaly isolated to a single CDN and region combination. It flags cloudfront - UK as the affected key across the cdn - country dimension, with the anomaly persisting from 9:25 AM to 1:52 PM UTC — indicating a prolonged, geographically localized performance degradation rather than a transient spike.
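The "single CDN and region" finding amounts to grouping contributing anomalies by their dimension key and checking the overall span they cover. A hedged sketch, with invented record fields and timestamps patterned on the scenario above:

```python
from datetime import datetime

# Hypothetical contributing-anomaly records for the cdn - country dimension.
contributing = [
    {"key": "cloudfront - UK", "start": "2026-02-03T09:25", "end": "2026-02-03T11:10"},
    {"key": "cloudfront - UK", "start": "2026-02-03T11:05", "end": "2026-02-03T13:52"},
]

def overall_span(records, key):
    """Earliest start and latest end among anomalies for one dimension key."""
    times = [(datetime.fromisoformat(r["start"]), datetime.fromisoformat(r["end"]))
             for r in records if r["key"] == key]
    return min(t[0] for t in times), max(t[1] for t in times)

start, end = overall_span(contributing, "cloudfront - UK")
print(start.time(), end.time())  # → 09:25:00 13:52:00
```

A multi-hour span across overlapping anomalies for one key, with no other keys affected, is what distinguishes a sustained localized degradation from a transient spike.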

Bring your own LLM considerations⚓︎

If you're using your own LLM for RCA:

  • Periodically review the generated reports to ensure they continue to provide useful insight with the configured metrics
  • Work with Hydrolix customer success engineers if you need to adjust model parameters

Alerting on anomalies⚓︎

Anomaly Detection integrates with Grafana's alerting system. When an alert fires:

  • Notifications include links to a time-scoped root cause analysis dashboard
  • The alert message contains anomaly-related context, including affected dimensions and severity

To reduce alert fatigue, anomaly-based alerts focus on statistically significant deviations rather than static thresholds that may not adapt to changing baselines.
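The difference from a static threshold can be illustrated with a rolling z-score check. The window values and cutoff below are illustrative only, not the product's actual detection model.

```python
import statistics

def is_anomalous(history, value, z_cutoff=3.0):
    """Flag a value that deviates from the recent baseline by > z_cutoff sigmas."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False  # flat baseline: no spread to measure against
    return abs(value - mean) / stdev > z_cutoff

# Hypothetical recent ttlb_p95 readings (ms) forming the baseline.
baseline = [120, 125, 118, 130, 122, 127, 124, 121]
print(is_anomalous(baseline, 128))  # → False: within normal variation
print(is_anomalous(baseline, 450))  # → True: statistically significant spike
```

Because the baseline is recomputed from recent history, the same 450 ms reading would not fire for a service whose normal latency sits near that level, which is what a fixed threshold cannot express.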


For questions about using Anomaly Detection or interpreting results, contact your Hydrolix Customer Success Engineer at support@hydrolix.io.