Hydrologs Query Examples

Overview

These examples provide basic and advanced query information to use the hydro.logs table for observability.

Use queries in Grafana

All query examples on this page work in Grafana dashboards with a timestamp alteration. Replace the static timestamp filters with dynamic time functions.

Instead of:

WHERE timestamp > NOW() - INTERVAL 1 DAY

Use:

WHERE $__timeFilter(timestamp)

This change lets Grafana apply the dashboard’s time range to your query.

General query tips

When filtering log messages for specific text, use position() instead of LIKE ‘%text%’ or ILIKE ‘%text%’. The position() function is more efficient because it’s optimized for substring searches without triggering full table scans or expression slowdowns.

Considerations when querying `hydro.logs`

This table is intentionally large and wide. It includes hundreds of fields to support deep observability and debugging. It's important to use targeted filters and time bounds to keep your queries fast and efficient.

System health

Top errors by component (last 24 hours)

Identify the components generating the most errors in the last 24 hours.

SELECT component, count(*) AS error_count
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
  AND level = 'error'
GROUP BY component
ORDER BY error_count DESC
LIMIT 10;

High error rates by app (last hour)

See which applications are generating the most errors in the last hour.

SELECT app, count(*) AS errors
FROM hydro.logs
WHERE level = 'error'
  AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app
ORDER BY errors DESC;

Group errors by app

This example groups similar error messages by app.

SELECT app,
       substring(error, 1, 80) AS error_prefix,
       count(*) AS occurrences
FROM hydro.logs
WHERE level = 'error'
  AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app, error_prefix
ORDER BY occurrences DESC
LIMIT 50;

For more specialized grouping, try the following options:

Use extract() or replaceRegexpAll() to normalize specific parts of the error message. For example, replace numbers or UUIDs with placeholders.
Group by component, container, or pod_name for deeper breakdowns of errors.

Group similar errors

This example replaces values with [num] instead of 1024 MB or 3.5 GB. The query groups errors like merge would use too much memory (estimated: 512 MB, allowed: 128 MB) into one normalized line.

SELECT app,
       replaceRegexpAll(error, '\\d+(\\.\\d+)?(\\s?[A-Z]+)?', '[num]') AS normalized_error,
       count(*) AS occurrences
FROM hydro.logs
WHERE level = 'error'
  AND position(error, 'memory') != 0
  AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app, normalized_error
ORDER BY occurrences DESC
LIMIT 50;

Group errors with scrub to make them more readable

This example scrubs long hex strings and UUIDs to make errors look unique and more readable:

SELECT app,
       replaceRegexpAll(error, '([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})|([a-f0-9]{32,})', '[id]') AS normalized_error,
       count(*) AS occurrences
FROM hydro.logs
WHERE level = 'error'
  AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app, normalized_error
ORDER BY occurrences DESC
LIMIT 50;

Count all logs from intake-head in the last 12 hours

SELECT count(*)
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 12 HOUR
  AND app = 'intake-head';

Application failures

`OOMKilled` or `out of memory` errors by pod

Find pods reporting out of memory errors.

SELECT kubernetes.pod_name, message, timestamp
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
  AND stream = 'stderr'
  AND level = 'error'
  AND (
    position(message, 'OOMKilled') != 0
    OR position(message, 'out of memory') != 0
  );

`CrashLoopBackOff` or container restart issues

List pods that have repeatedly failed to restart.

SELECT kubernetes.pod_name, count(*) AS crash_count
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 12 HOUR
  AND position(message, 'Back-off restarting failed container') != 0
GROUP BY kubernetes.pod_name
ORDER BY crash_count DESC;

Segfault detection by app

Look for fatal errors caused by segmentation faults across key components.

SELECT message, app
FROM hydro.logs
WHERE position(message, 'fatal error') != 0
  AND timestamp > NOW() - INTERVAL 1 WEEK
  AND app IN ('intake-head', 'query-head', 'query-peer', 'merge-peer', 'merge-head')
  AND component = 'signal_handler';

Query and ingest troubleshooting

Query failures or timeouts

Diagnose query timeouts or execution failures.


SELECT timestamp, message, query_id, error
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 2 HOUR
  AND level = 'error'
  AND app IN ('query-head', 'query-peer')
  AND (
    position(error, 'timeout') != 0
    OR position(error, 'failed') != 0
    OR position(message, 'query execution error') != 0
  );

Invalid or missing table references

Detect errors related to nonexistent or misconfigured tables.

SELECT timestamp, app, message
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 6 HOUR
  AND level = 'error'
  AND (
    position(message, 'table not found') != 0
    OR position(message, 'invalid table') != 0
  );

Data corruption and compression issues

ZSTD decompression failures by component

Find data corruption or decompression failures.

SELECT component, count(*) AS zstd_errors
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
  AND level = 'error'
  AND app = 'intake-head'
  AND position(error, 'failed to decompress ZSTD') != 0
GROUP BY component
ORDER BY zstd_errors DESC;

Daily error count - ZSTD decompression and fatal errors

Track fatal errors and ZSTD decompression failures by app and container.

SELECT toStartOfDay(timestamp) AS day,
       countIf(position(message, 'fatal error') != 0 AND app = 'intake-head' AND container = 'turbine') AS intake_head_segfault,
       countIf(position(exception, 'failed to decompress ZSTD') != 0) AS zstd_error,
       app, container
FROM hydro.logs
WHERE timestamp > '2024-07-24 15:00:53'
  AND app NOT IN ('query-peer', 'query-head')
GROUP BY app, container, day
HAVING intake_head_segfault != 0 OR zstd_error != 0
ORDER BY day ASC;

Partition issues

Corrupt partition extraction

Extract and list corrupt partitions based on failure messages from the past 24 hours.

SELECT DISTINCT extractAll(assumeNotNull(error), 'partition=(.*?\\.hdx)')[1] AS corrupt_partition
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
  AND position(error, 'failed') != 0;

Corrupt partitions by frequency

Find corrupt partitions and count how often each appears in decompression errors.

SELECT DISTINCT extractAll(assumeNotNull(error), 'partition=(.*?\\.hdx)')[1] AS corrupt_partition,
       COUNT(*) AS error_count
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
  AND app IN ('merge-peer')
  AND level IN ('error')
  AND position(error, 'BasicBuf: failed to decompress ZSTD block Data corruption detected') != 0
GROUP BY corrupt_partition
ORDER BY error_count DESC;

Aggregated log views

Log aggregation for error messages

Group logs, excluding info level, by timestamp, log level, and error message.

SELECT toStartOfMinute(timestamp) AS t, level, message, error, count()
FROM hydro.logs
WHERE container != 'turbine'
  AND level != 'info'
  AND app = 'intake-head'
  AND timestamp > '2024-08-12 02:30:00'
  AND timestamp < '2024-08-12 04:40:00'
GROUP BY t, level, message, error
ORDER BY t;

Error messages in the last five minutes

Find the most recent error messages logged by traefik.

SELECT message
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 5 MINUTE
  AND app = 'traefik'
LIMIT 10;