Hydrologs Query Examples
Examples of queries for hydro.logs
Overview
These examples provide basic and advanced query information to use the hydro.logs
table for observability.
Use queries in Grafana
All query examples on this page work in Grafana dashboards with a timestamp alteration. Replace the static timestamp filters with dynamic time functions.
Instead of:
WHERE timestamp > NOW() - INTERVAL 1 DAY
Use:
WHERE $__timeFilter(timestamp)
This change lets Grafana apply the dashboard’s time range to your query.
General query tips
When filtering log messages for specific text, use position()
instead of LIKE ‘%text%’
or ILIKE ‘%text%’
. The position()
function is more efficient because it’s optimized for substring searches without triggering full table scans or expression slowdowns.
Considerations when querying hydro.logs
hydro.logs
This table is intentionally large and wide. It includes hundreds of fields to support deep observability and debugging. It's important to use targeted filters and time bounds to keep your queries fast and efficient.
System health
Top errors by component (last 24 hours)
Identify the components generating the most errors in the last 24 hours.
SELECT component, count(*) AS error_count
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
AND level = 'error'
GROUP BY component
ORDER BY error_count DESC
LIMIT 10;
High error rates by app (last hour)
See which applications are generating the most errors in the last hour.
SELECT app, count(*) AS errors
FROM hydro.logs
WHERE level = 'error'
AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app
ORDER BY errors DESC;
Group errors by app
This example groups similar error messages by app.
SELECT app,
substring(error, 1, 80) AS error_prefix,
count(*) AS occurrences
FROM hydro.logs
WHERE level = 'error'
AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app, error_prefix
ORDER BY occurrences DESC
LIMIT 50;
For more specialized grouping, try the following options:
- Use
extract()
orreplaceRegexpAll()
to normalize specific parts of the error message. For example, replace numbers or UUIDs with placeholders. - Group by
component
,container
, orpod_name
for deeper breakdowns of errors.
Group similar errors
This example replaces values with [num]
instead of 1024 MB
or 3.5 GB
. The query groups errors like merge would use too much memory (estimated: 512 MB, allowed: 128 MB)
into one normalized line.
SELECT app,
replaceRegexpAll(error, '\\d+(\\.\\d+)?(\\s?[A-Z]+)?', '[num]') AS normalized_error,
count(*) AS occurrences
FROM hydro.logs
WHERE level = 'error'
AND position(error, 'memory') != 0
AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app, normalized_error
ORDER BY occurrences DESC
LIMIT 50;
Group errors with scrub to make them more readable
This example scrubs long hex strings and UUIDs to make errors look unique and more readable:
SELECT app,
replaceRegexpAll(error, '([a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12})|([a-f0-9]{32,})', '[id]') AS normalized_error,
count(*) AS occurrences
FROM hydro.logs
WHERE level = 'error'
AND timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY app, normalized_error
ORDER BY occurrences DESC
LIMIT 50;
Count all logs from intake-head in the last 12 hours
SELECT count(*)
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 12 HOUR
AND app = 'intake-head';
Application failures
OOMKilled
or out of memory
errors by pod
OOMKilled
or out of memory
errors by podFind pods reporting out of memory errors.
SELECT kubernetes.pod_name, message, timestamp
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
AND stream = 'stderr'
AND level = 'error'
AND (
position(message, 'OOMKilled') != 0
OR position(message, 'out of memory') != 0
);
CrashLoopBackOff
or container restart issues
CrashLoopBackOff
or container restart issuesList pods that have repeatedly failed to restart.
SELECT kubernetes.pod_name, count(*) AS crash_count
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 12 HOUR
AND position(message, 'Back-off restarting failed container') != 0
GROUP BY kubernetes.pod_name
ORDER BY crash_count DESC;
Segfault detection by app
Look for fatal errors caused by segmentation faults across key components.
SELECT message, app
FROM hydro.logs
WHERE position(message, 'fatal error') != 0
AND timestamp > NOW() - INTERVAL 1 WEEK
AND app IN ('intake-head', 'query-head', 'query-peer', 'merge-peer', 'merge-head')
AND component = 'signal_handler';
Query and ingest troubleshooting
Query failures or timeouts
Diagnose query timeouts or execution failures.
SELECT timestamp, message, query_id, error
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 2 HOUR
AND level = 'error'
AND app IN ('query-head', 'query-peer')
AND (
position(error, 'timeout') != 0
OR position(error, 'failed') != 0
OR position(message, 'query execution error') != 0
);
Invalid or missing table references
Detect errors related to nonexistent or misconfigured tables.
SELECT timestamp, app, message
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 6 HOUR
AND level = 'error'
AND (
position(message, 'table not found') != 0
OR position(message, 'invalid table') != 0
);
Data corruption and compression issues
ZSTD decompression failures by component
Find data corruption or decompression failures.
SELECT component, count(*) AS zstd_errors
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
AND level = 'error'
AND app = 'intake-head'
AND position(error, 'failed to decompress ZSTD') != 0
GROUP BY component
ORDER BY zstd_errors DESC;
Daily error count - ZSTD decompression and fatal errors
Track fatal errors and ZSTD decompression failures by app and container.
SELECT toStartOfDay(timestamp) AS day,
countIf(position(message, 'fatal error') != 0 AND app = 'intake-head' AND container = 'turbine') AS intake_head_segfault,
countIf(position(exception, 'failed to decompress ZSTD') != 0) AS zstd_error,
app, container
FROM hydro.logs
WHERE timestamp > '2024-07-24 15:00:53'
AND app NOT IN ('query-peer', 'query-head')
GROUP BY app, container, day
HAVING intake_head_segfault != 0 OR zstd_error != 0
ORDER BY day ASC;
Partition issues
Corrupt partition extraction
Extract and list corrupt partitions based on failure messages from the past 24 hours.
SELECT DISTINCT extractAll(assumeNotNull(error), 'partition=(.*?\\.hdx)')[1] AS corrupt_partition
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
AND position(error, 'failed') != 0;
Corrupt partitions by frequency
Find corrupt partitions and count how often each appears in decompression errors.
SELECT DISTINCT extractAll(assumeNotNull(error), 'partition=(.*?\\.hdx)')[1] AS corrupt_partition,
COUNT(*) AS error_count
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
AND app IN ('merge-peer')
AND level IN ('error')
AND position(error, 'BasicBuf: failed to decompress ZSTD block Data corruption detected') != 0
GROUP BY corrupt_partition
ORDER BY error_count DESC;
Aggregated log views
Log aggregation for error messages
Group logs, excluding info level, by timestamp, log level, and error message.
SELECT toStartOfMinute(timestamp) AS t, level, message, error, count()
FROM hydro.logs
WHERE container != 'turbine'
AND level != 'info'
AND app = 'intake-head'
AND timestamp > '2024-08-12 02:30:00'
AND timestamp < '2024-08-12 04:40:00'
GROUP BY t, level, message, error
ORDER BY t;
Error messages in the last five minutes
Find the most recent error messages logged by traefik.
SELECT message
FROM hydro.logs
WHERE timestamp > NOW() - INTERVAL 5 MINUTE
AND app = 'traefik'
LIMIT 10;
Updated 6 days ago