Summary Tables
Overview⚓︎
Aggregating data before it's queried can decrease query latency and reduce system load. Hydrolix supports aggregation with summary tables. Use summary tables to store aggregations including minimum, maximum, count, and sum. In certain cases, summary tables can store statistical aggregations like unique counts and quantiles/percentiles.
When you use summary tables, Hydrolix aggregates data with the following steps:
- Hydrolix runs a summary SQL transform on raw table partitions to produce intermediate aggregation data.
- Hydrolix writes the output of that SQL statement to a summary table.
Queries can use the intermediate aggregation data stored in the summary table to process aggregation queries faster than they could by processing every individual piece of input data.
More than one summary can exist on a parent table. For example, the diagram below shows a table that provides raw data for both minute and hourly summary tables:
Alias columns⚓︎
Alias columns in the view schema can't be used with summary tables. See Alias Columns for more information.
Summary transform structure⚓︎
The summary transform defines a SQL statement that aggregates data over a period of time.
This example shows a general form for summary transforms:
Summary transforms are subject to the following limitations:
- In the
SETTINGSclause, you must specify thehdx_primary_key. Thehdx_primary_keymust match a SQL aggregation alias in yourSELECTstatement. This is a query-time alias, not an alias column defined in a view. - In the
FROMclause, specify the project name and the table name of the parent table.
Summary transform example⚓︎
In this example, the summary transform aggregates e-commerce data, minute by minute. The transform calculates:
- The sum of item cost
- The average tax
- The 95th percentile of distance
This example shows one way to define the summary:
Limitations during transform⚓︎
- When you create a summary transform, the API rejects any non-deterministic summary SQL clauses.
- Don't use functions like
now(),today(), orcurrent_timestamp()in aWHEREclause. This adds data that isn't in the source table. - It's safe to operate on timestamps used in the example, with
toStartOfMinute(timestamp)or withINTERVALmath. - Window functions like
PARTITION BYorOVERare also non-deterministic in the context of summary transform execution. They need access to all the underlying data for the window, which isn't guaranteed in distributed partitions, so window functions aren't supported in summary SQL. - Aggregation functions in memory can return numeric types
uint16orint16, which are not supported by Hydrolix storage. When this occurs, these types are automatically promoted to the correspondinguint32orint32.
Exclude columns from indexing⚓︎
By default, a summary table indexes all non-aggregate columns.
To exclude columns from indexing, add this line to the transform SQL:
hdx_summary_override_indexes = 'column_1,column_2,column_3'
In this example, the sum_cost and avg_tax columns are not indexed.
Create the summary table⚓︎
You can create, update, and delete summary tables in Hydrolix by using the Hydrolix UI or the config API table management methods.
Do not use dictionaries in summary tables
Using dictionaries in summary tables breaks the Merge service. To include dictionary data in your summary table, do the lookup in the raw table, store the results as a field, then use that field in the summary. See Getting Started with Dictionary for an example of how to incorporate dictionary values into a transform.
Create a summary table with the UI⚓︎
To create a summary table with the Hydrolix UI:
- In the top right of the Hydrolix UI, click + Add New.
- In the right sidebar, select Summary Table.
- Fill out the form:
- Select the project for the table
- Enter the table name and description
- Paste the summary SQL into the Summary SQL field. Be sure to include the
SETTINGSclause.
- Click Create table.
Create a summary table with API⚓︎
You can use tools like curl, Postman, or HTTPie to make the request.
-
Log in to the API to get your bearer token. For example, with cURL:
This command assumes you've set the $HDX_HOSTNAME, $HDX_USER and
$HDX_PASSWORD environment variables.
-
Create the table using
POST:
=== "Sample cURL POST"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | |
=== "Sample Response"
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 | |
Query the summary table⚓︎
Use the aliases specified in the summary SQL to query data.
Don't use SELECT *
This returns the internal intermediate states, not the summary. Always list each alias.
This query uses aliases from previous examples:
To learn more about the aliases in your summary table, see Describe the Schema.
Group by is required
You must GROUP BY a subset of the columns used in the original summary transform.
Cluster load⚓︎
Summary tables can reduce query load, but increase work for intake peers, merge peers, and indexers. Monitor the cluster's pods when adding summaries to understand the impact.
For more information about scaling, see the Scaling your Cluster page.
Table permissions⚓︎
RBAC treats summary tables like any other tables. Users need these permissions to modify them:
view_tableadd_tablechange_tabledelete_table
By default, the super_admin permission has all of these permissions.
See the User Permissions (RBAC) documentation for more information.
Batch Ingest⚓︎
Starting with version 4.14, batch ingests populate the summary table if you’re loading data into its parent table.
Aggregates of aggregates⚓︎
Hydrolix stores intermediate data for re-aggregating. For example, you can roll up one-minute segments into hourly ones because intermediate states store what’s needed.
Describe the schema⚓︎
Hydrolix automatically creates alias names for common aggregation functions. Run DESCRIBE to see what’s available.
This shows an example response with table information: