Summary Tables
Use summary tables in Hydrolix
Overview
Aggregating data before it's queried can decrease query latency and reduce system load. Hydrolix supports aggregation with summary tables. Use summary tables to store aggregations including minimum, maximum, count, and sum. In certain cases, summary tables can store statistical aggregations like unique counts and quantiles/percentiles.
When you use summary tables, Hydrolix aggregates data with the following steps:
- Hydrolix runs a summary SQL transform on raw table partitions to produce intermediate aggregation data.
- Hydrolix writes the output of that SQL statement to a summary table.
Queries can use the intermediate aggregation data stored in the summary table to process aggregation queries faster than they could by processing every individual piece of input data.
The series of transforms that intake peers apply to raw data.
More than one summary can exist on a parent table. For example, the diagram below shows a table that provides raw data for both minute and hourly summary tables:
Two summary tables are simultaneously produced from one parent table.
Alias columns
Alias columns in the view schema can't be used with summary tables. See Alias Columns for more information.
Summary transform structure
The summary transform defines a SQL statement that aggregates data over a period of time.
This example shows a general form for summary transforms:
SELECT
<time_field_aggregation> AS <alias>,
<function_on_column> AS <alias>,
.......
FROM
<parent_project.parent_table>
GROUP BY
<time_field_alias>
SETTINGS
hdx_primary_key = '<time_field_alias>'
Summary transforms are subject to the following limitations:
- In the
SETTINGS
clause, you must specify thehdx_primary_key
. Thehdx_primary_key
must match a SQL aggregation alias in yourSELECT
statement. This is a query-time alias, not an alias column defined in a view. - In the
FROM
clause, specify the project name and the table name of the parent table.
Summary transform example
In this example, the summary transform aggregates e-commerce data, minute by minute. The transform calculates:
- The sum of item cost
- The average tax
- The 95th percentile of distance
This example shows one way to define the summary:
SELECT
toStartOfMinute(timestamp) AS minute,
sum(cost) AS sum_cost,
avg(tax) AS avg_tax,
quantile(0.95)(distance) AS distance_p95
FROM project.parent_table
GROUP BY minute
SETTINGS hdx_primary_key = 'minute'
Limitations during transform
- When you create a summary transform, the API rejects any non-deterministic summary SQL clauses.
- Don't use functions like
now()
,today()
, orcurrent_timestamp()
in aWHERE
clause. This adds data that isn't in the source table. - It's safe to operate on timestamps used in the example, with
toStartOfMinute(timestamp)
or withINTERVAL
math. - Window functions like
PARTITION BY
orOVER
are also non-deterministic in the context of summary transform execution. They need access to all the underlying data for the window, which isn't guaranteed in distributed partitions, so window functions aren't supported in summary SQL.
Exclude columns from indexing
By default, a summary table indexes all non-aggregate columns.
To exclude columns from indexing, add this line to the transform SQL:
hdx_summary_override_indexes = 'column_1,column_2,column_3'
In this example, the sum_cost
and avg_tax
columns are not indexed.
SELECT
toStartOfMinute(timestamp) AS minute,
sum(cost) AS sum_cost,
avg(tax) AS avg_tax,
quantile(0.95)(distance) AS distance_p95
FROM project.parent_table
GROUP BY minute
SETTINGS hdx_primary_key = 'minute',
hdx_summary_override_indexes = 'sum_cost,avg_tax'
Create the summary table
You can create, update, and delete summary tables in Hydrolix by using the Hydrolix UI or the config API table management methods.
Do not use dictionaries in summary tables
Using dictionaries in summary tables breaks the Merge service. To include dictionary data in your summary table, do the lookup in the raw table, store the results as a field, then use that field in the summary. See Getting Started with Dictionary for an example of how to incorporate dictionary values into a transform.
Create a summary table with the UI
To create a summary table with the Hydrolix UI:
- In the top right of the Hydrolix UI, click + Add New.
- In the right sidebar, select Summary Table.
- Fill out the form:
- Select the project for the table
- Enter the table name and description
- Paste the summary SQL into the Summary SQL field. Be sure to include the
SETTINGS
clause.
- Click Create table.
Create a summary table with API
You can use tools like curl
, Postman, or HTTPie to make the request.
-
Log in to the API to get your bearer token. For example, with cURL:
export HDX_TOKEN=$( curl -X POST -H "Content-Type: application/json" \ https://$HDX_HOSTNAME/config/v1/login/ \ -d "{\"username\":\"$HDX_USER\",\"password\":\"$HDX_PASSWORD\"}" \ | jq -r ".auth_token.access_token" )
This command assumes you've set the
$HDX_HOSTNAME
,$HDX_USER
and
$HDX_PASSWORD
environment variables. -
Create the table using
POST
:{ "name": "my.summarytable", "description": "Minute-by-minute summary of parent table", "type": "summary", "settings": { "merge": { "enabled": true }, "summary": { "enabled": true, "sql": "SELECT toStartOfMinute(timestamp) AS minute, sum(cost) AS sum_cost, avg(tax) AS avg_tax, quantile(0.95)(distance) AS distance_p95 FROM project.parent_table GROUP BY minute SETTINGS hdx_primary_key='minute'" } } }
curl --request POST \ --url https://$HDX_HOSTNAME/config/v1/orgs/$HDX_ORG/projects/$HDX_PROJECT/tables/ \ --header "Authorization: Bearer $HDX_TOKEN" \ --header "Accept: application/json" \ --header "Content-Type: application/json" \ --data " { \"name\": \"my.summarytable\", \"description\": \"Minute by minute summary of parent table\", \"type\": \"summary\", \"settings\": { \"merge\": { \"enabled\": true }, \"summary\": { \"enabled\": true, \"sql\": \"SELECT toStartOfMinute(timestamp) AS minute, sum(cost) AS sum_cost, avg(tax) AS avg_tax, quantile(0.95)(distance) AS distance_p95 FROM project.parent_table GROUP BY minute SETTINGS hdx_primary_key='minute'\" } } } "
{ "project": "{project_id}", "name": "my.summarytable", "description": "Minute-by-minute summary of parent table", "uuid": "68af42f6-5479-4b07-b873-4c8281d8432a", "created": "2024-02-29T20:02:11.533434Z", "modified": "2024-02-29T20:02:15.423424Z", "settings": { "summary": { "sql": "select\n toStartOfMinute(timestamp) as minute,\n sum(cost) as sum_cost,\n avg(tax) as avgtax,\n quantile(0.95)(distance) as _95th\nfrom project.parent_table\ngroup by minute\nsettings hdx_primary_key='minute'", "enabled": true, "parent_table": "20f23a3f-d2d5-4657-87d3-0a99ed56bd6c" }, "stream": { "token_list": [], "hot_data_max_age_minutes": 60, "hot_data_max_active_partitions": 12, "hot_data_max_rows_per_partition": 1048576, "hot_data_max_minutes_per_partition": 5, "hot_data_max_open_seconds": 10, "hot_data_max_idle_seconds": 5, "cold_data_max_age_days": 3650, "cold_data_max_active_partitions": 168, "cold_data_max_rows_per_partition": 1048576, "cold_data_max_minutes_per_partition": 30, "cold_data_max_open_seconds": 60, "cold_data_max_idle_seconds": 30 }, "age": { "max_age_days": 0 }, "reaper": { "max_age_days": 1 }, "merge": { "enabled": true, "sql": "SELECT `toStartOfMinute(timestamp)`, _CAST(sumMergeState(_CAST(`sum(cost)`, 'AggregateFunction(sum, Nullable(Float64))')), 'String') AS `sum(cost)`, _CAST(avgMergeState(_CAST(`avg(tax)`, 'AggregateFunction(avg, Nullable(Float64))')), 'String') AS `avg(tax)`, _CAST(quantileMergeState(0.95)(_CAST(`quantile(0.95)(distance)`, 'AggregateFunction(quantile(0.95), Nullable(Float64))')), 'String') AS `quantile(0.95)(distance)` FROM `{STREAM}` GROUP BY `toStartOfMinute(timestamp)` FORMAT HDX SETTINGS hdx_primary_key = 'minute', optimize_fuse_sum_count_avg = 1, optimize_syntax_fuse_functions = 1, hdx_format_json_schema = '[{\"datatype\":{\"format\":\"%Y-%m-%d %H:%M:%S\",\"index\":false,\"primary\":true,\"type\":\"datetime\"},\"name\":\"toStartOfMinute(timestamp)\"},{\"datatype\":{\"denullify\":true,\"index\":false,\"type\":\"string\"},\"name\":\"sum(cost)\"},{\"datatype\":{\"denullify\":true,\"index\":false,\"type\":\"string\"},\"name\":\"avg(tax)\"},{\"datatype\":{\"denullify\":true,\"index\":false,\"type\":\"string\"},\"name\":\"quantile(0.95)(distance)\"}]'" }, "autoingest": [ { "enabled": false, "source": "", "pattern": "", "max_rows_per_partition": 12288000, "max_minutes_per_partition": 60, "max_active_partitions": 50, "dry_run": false, "transform": "52f65fa7-817f-4aaa-aa9f-80201c8188da" } ], "sort_keys": [], "shard_key": null, "max_future_days": 0, "max_request_bytes": 0 }, "url": "https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/68af42f6-5479-4b07-b873-4c8281d8432a", "type": "summary", "primary_key": "toStartOfMinute(timestamp)" }
Query the summary table
Use the aliases specified in the summary SQL to query data.
Don't use
SELECT *
This returns the internal intermediate states, not the summary. Always list each alias.
This query uses aliases from previous examples:
SELECT
minute,
sum_cost,
avg_tax,
distance_p95
FROM my.summary_table
WHERE minute BETWEEN '2023-01-01 00:00:00' AND '2023-01-02 00:00:00'
GROUP BY minute
ORDER BY minute
To learn more about the aliases in your summary table, see Describe the Schema.
Group by
is requiredYou must
GROUP BY
a subset of the columns used in the original summary transform.
Cluster load
Summary tables can reduce query load, but increase work for intake peers, merge peers, and indexers. Monitor the cluster's pods when adding summaries to understand the impact.
For more information about scaling, see the Scaling your Cluster page.
Table permissions
RBAC treats summary tables like any other tables. Users need these permissions to modify them:
view_table
add_table
change_table
delete_table
By default, the super_admin
permission has all of these permissions.
See the User Permissions (RBAC) documentation for more information.
Batch Ingest
Starting with version 4.14, batch ingests populate the summary table if you’re loading data into its parent table.
Aggregates of aggregates
Hydrolix stores intermediate data for re-aggregating. For example, you can roll up one-minute segments into hourly ones because intermediate states store what’s needed.
SELECT toStartOfHour(minute) AS hour,
sum_cost,
avg_tax,
distance_p95
FROM (
SELECT
minute,
sum_cost,
avg_tax,
distance_p95
FROM my.summarytable
WHERE minute BETWEEN '2023-01-01 00:00:00' AND '2023-01-02 00:00:00'
GROUP BY minute
)
GROUP BY hour
ORDER BY hour
Describe the schema
Hydrolix automatically creates alias names for common aggregation functions. Run DESCRIBE
to see what’s available.
DESCRIBE TABLE my.summarytable
This shows an example response with table information:
name type default_type default_expression
--------------- ------------------------------------------- ------------- -------------------------------------------------
minute DateTime - -
sum_cost Nullable(UInt64) ALIAS sumMerge(`sum(cost)`)
avg_tax Nullable(Float64) ALIAS avgMerge(`avg(tax)`)
distance_p95 Nullable(Float64) ALIAS quantileMerge(0.95)(`quantile(0.95)(distance)`)
sum(sum(cost)) AggregateFunction(sum, Nullable(UInt64)) - -
avg(avg(tax)) AggregateFunction(avg, Nullable(Float64)) - -
quantile(0.95)(distance) AggregateFunction(quantile(0.95), Nullable(Float64)) - -
Updated about 1 month ago