Summary Tables

Overview⚓︎

Aggregating data before it's queried can decrease query latency and reduce system load. Hydrolix supports aggregation with summary tables. Use summary tables to store aggregations including minimum, maximum, count, and sum. In certain cases, summary tables can store statistical aggregations like unique counts and quantiles/percentiles.

When you use summary tables, Hydrolix aggregates data with the following steps:

Hydrolix runs a summary SQL transform on raw table partitions to produce intermediate aggregation data.
Hydrolix writes the output of that SQL statement to a summary table.

Queries can use the intermediate aggregation data stored in the summary table to process aggregation queries faster than they could by processing every individual piece of input data.

Simple Summary Table Setup Diagram

More than one summary can exist on a parent table. For example, the diagram below shows a table that provides raw data for both minute and hourly summary tables:

Diagram of two summary tables being created from one parent table

Calculated columns⚓︎

Calculated columns in the view schema can't be used with summary tables. See Calculated Columns for more information.

Summary transform structure⚓︎

The summary transform defines a SQL statement that aggregates data over a period of time.

This example shows a general form for summary transforms:

SELECT
   <time_field_aggregation> AS <alias>,
   <function_on_column> AS <alias>,
   .......
FROM
    <parent_project.parent_table>
GROUP BY 
        <time_field_alias>
SETTINGS 
        hdx_primary_key = '<time_field_alias>'

Summary transforms are subject to the following limitations:

In the SETTINGS clause, you must specify the hdx_primary_key. The hdx_primary_key must match a SQL aggregation alias in your SELECT statement. This is a query-time alias, not an alias column defined in a view.
In the FROM clause, specify the project name and the table name of the parent table.

Summary transform example⚓︎

In this example, the summary transform aggregates e-commerce data, minute by minute. The transform calculates:

The sum of item cost
The average tax
The 95th percentile of distance

This example shows one way to define the summary:

SELECT
    toStartOfMinute(timestamp) AS minute,
    sum(cost) AS sum_cost,
    avg(tax) AS avg_tax,
    quantile(0.95)(distance) AS distance_p95
FROM project.parent_table
GROUP BY minute
SETTINGS hdx_primary_key = 'minute'

Limitations during transform⚓︎

When you create a summary transform, the API rejects any non-deterministic summary SQL clauses.
Don't use functions like now(), today(), or current_timestamp() in a WHERE clause. This adds data that isn't in the source table.
It's safe to operate on timestamps used in the example, with toStartOfMinute(timestamp) or with INTERVAL math.
Window functions like PARTITION BY or OVER are also non-deterministic in the context of summary transform execution. They need access to all the underlying data for the window, which isn't guaranteed in distributed partitions, so window functions aren't supported in summary SQL.
Aggregation functions in memory can return numeric types uint16 or int16, which are not supported by Hydrolix storage. When this occurs, these types are automatically promoted to the corresponding uint32 or int32.

Exclude columns from indexing⚓︎

By default, a summary table indexes all non-aggregate columns.

To exclude columns from indexing, add this line to the transform SQL:

hdx_summary_override_indexes = 'column_1,column_2,column_3'

In this example, the sum_cost and avg_tax columns are not indexed.

hdx_summary_override_indexes

SELECT
    toStartOfMinute(timestamp) AS minute,
    sum(cost) AS sum_cost,
    avg(tax) AS avg_tax,
    quantile(0.95)(distance) AS distance_p95
FROM project.parent_table
GROUP BY minute
SETTINGS hdx_primary_key = 'minute',
         hdx_summary_override_indexes = 'sum_cost,avg_tax'

Create the summary table⚓︎

You can create, update, and delete summary tables in Hydrolix by using the Hydrolix UI or the config API table management methods.

Do not use dictionaries in summary tables

Using dictionaries in summary tables breaks the Merge service. To include dictionary data in your summary table, do the lookup in the raw table, store the results as a field, then use that field in the summary. See Getting Started with Dictionary for an example of how to incorporate dictionary values into a transform.

Create a summary table with the UI⚓︎

To create a summary table with the Hydrolix UI:

In the top right of the Hydrolix UI, click + Add New.
In the right sidebar, select Summary Table.
Fill out the form:
- Select the project for the table
- Enter the table name and description
- Paste the summary SQL into the Summary SQL field. Be sure to include the SETTINGS clause.
Click Create table.

Create a summary table with API⚓︎

You can use tools like curl, Postman, or HTTPie to make the request.

Log in to the API to get your bearer token. For example, with cURL:

cURL

   export HDX_TOKEN=$(
     curl -X POST -H "Content-Type: application/json" \
     https://$HDX_HOSTNAME/config/v1/login/ \
     -d "{\"username\":\"$HDX_USER\",\"password\":\"$HDX_PASSWORD\"}" \
     | jq -r ".auth_token.access_token"
   )

This command assumes you've set the $HDX_HOSTNAME, $HDX_USER and $HDX_PASSWORD environment variables.

Create the table using POST:

POST Body

   {
       "name": "my.summarytable",
       "description": "Minute-by-minute summary of parent table",
       "type": "summary",
       "settings": {
           "merge": {
               "enabled": true
           },
           "summary": {
               "enabled": true,
               "sql": "SELECT toStartOfMinute(timestamp) AS minute,
   sum(cost) AS sum_cost, avg(tax) AS avg_tax, quantile(0.95)(distance)
   AS distance_p95 FROM project.parent_table GROUP BY minute SETTINGS
   hdx_primary_key='minute'"
           }
       }
   }

=== "Sample cURL POST"

   ```curl
      curl --request POST \
      --url https://$HDX_HOSTNAME/config/v1/orgs/$HDX_ORG/projects/$HDX_PROJECT/tables/ \
      --header "Authorization: Bearer $HDX_TOKEN" \
      --header "Accept: application/json" \
      --header "Content-Type: application/json" \
      --data "
      {
      \"name\": \"my.summarytable\",
      \"description\": \"Minute by minute summary of parent table\",
      \"type\": \"summary\",
      \"settings\": {
          \"merge\": { \"enabled\": true },
          \"summary\": {
          \"enabled\": true,
          \"sql\": \"SELECT toStartOfMinute(timestamp) AS minute, sum(cost) AS sum_cost, avg(tax) AS avg_tax, quantile(0.95)(distance) AS distance_p95 FROM project.parent_table GROUP BY minute SETTINGS hdx_primary_key='minute'\"
          }
      }
      }
      "
   ```

=== "Sample Response"

   ```json
      {
          "project": "{project_id}",
          "name": "my.summarytable",
          "description": "Minute-by-minute summary of parent table",
          "uuid": "68af42f6-5479-4b07-b873-4c8281d8432a",
          "created": "2024-02-29T20:02:11.533434Z",
          "modified": "2024-02-29T20:02:15.423424Z",
          "settings": {
              "summary": {
                  "sql": "select\n    toStartOfMinute(timestamp) as minute,\n    sum(cost) as sum_cost,\n    avg(tax) as avgtax,\n    quantile(0.95)(distance) as _95th\nfrom project.parent_table\ngroup by minute\nsettings hdx_primary_key='minute'",
                  "enabled": true,
                  "parent_table": "20f23a3f-d2d5-4657-87d3-0a99ed56bd6c"
              },
              "stream": {
                  "token_list": [],
                  "hot_data_max_age_minutes": 60,
                  "hot_data_max_active_partitions": 12,
                  "hot_data_max_rows_per_partition": 1048576,
                  "hot_data_max_minutes_per_partition": 5,
                  "hot_data_max_open_seconds": 10,
                  "hot_data_max_idle_seconds": 5,
                  "cold_data_max_age_days": 3650,
                  "cold_data_max_active_partitions": 168,
                  "cold_data_max_rows_per_partition": 1048576,
                  "cold_data_max_minutes_per_partition": 30,
                  "cold_data_max_open_seconds": 60,
                  "cold_data_max_idle_seconds": 30
              },
              "age": {
                  "max_age_days": 0
              },
              "reaper": {
                  "max_age_days": 1
              },
              "merge": {
                  "enabled": true,
                  "sql": "SELECT `toStartOfMinute(timestamp)`, _CAST(sumMergeState(_CAST(`sum(cost)`, 'AggregateFunction(sum, Nullable(Float64))')), 'String') AS `sum(cost)`, _CAST(avgMergeState(_CAST(`avg(tax)`, 'AggregateFunction(avg, Nullable(Float64))')), 'String') AS `avg(tax)`, _CAST(quantileMergeState(0.95)(_CAST(`quantile(0.95)(distance)`, 'AggregateFunction(quantile(0.95), Nullable(Float64))')), 'String') AS `quantile(0.95)(distance)` FROM `{STREAM}` GROUP BY `toStartOfMinute(timestamp)` FORMAT HDX SETTINGS hdx_primary_key = 'minute', optimize_fuse_sum_count_avg = 1, optimize_syntax_fuse_functions = 1, hdx_format_json_schema = '[{\"datatype\":{\"format\":\"%Y-%m-%d %H:%M:%S\",\"index\":false,\"primary\":true,\"type\":\"datetime\"},\"name\":\"toStartOfMinute(timestamp)\"},{\"datatype\":{\"denullify\":true,\"index\":false,\"type\":\"string\"},\"name\":\"sum(cost)\"},{\"datatype\":{\"denullify\":true,\"index\":false,\"type\":\"string\"},\"name\":\"avg(tax)\"},{\"datatype\":{\"denullify\":true,\"index\":false,\"type\":\"string\"},\"name\":\"quantile(0.95)(distance)\"}]'"
              },
              "autoingest": [
                  {
                      "enabled": false,
                      "source": "",
                      "pattern": "",
                      "max_rows_per_partition": 12288000,
                      "max_minutes_per_partition": 60,
                      "max_active_partitions": 50,
                      "dry_run": false,
                      "transform": "52f65fa7-817f-4aaa-aa9f-80201c8188da"
                  }
              ],
              "sort_keys": [],
              "shard_key": null,
              "max_future_days": 0,
              "max_request_bytes": 0
          },
          "url": "https://{hostname}/config/v1/orgs/{org_id}/projects/{project_id}/tables/68af42f6-5479-4b07-b873-4c8281d8432a",
          "type": "summary",
          "primary_key": "toStartOfMinute(timestamp)"
      }
   ```

Query the summary table⚓︎

Use the aliases specified in the summary SQL to query data.

Don't use SELECT *

This returns the internal intermediate states, not the summary. Always list each alias.

This query uses aliases from previous examples:

SELECT
  minute,
  sum_cost,
  avg_tax,
  distance_p95
FROM my.summary_table
WHERE minute BETWEEN '2023-01-01 00:00:00' AND '2023-01-02 00:00:00'
GROUP BY minute
ORDER BY minute

To learn more about the aliases in your summary table, see Describe the Schema.

Group by is required

You must GROUP BY a subset of the columns used in the original summary transform.

Cluster load⚓︎

Summary tables can reduce query load, but increase work for intake peers, merge peers, and indexers. Monitor the cluster's pods when adding summaries to understand the impact.

For more information about scaling, see the Scaling your Cluster page.

Table permissions⚓︎

RBAC treats summary tables like any other tables. Users need these permissions to modify them:

view_table
add_table
change_table
delete_table

By default, the super_admin permission has all of these permissions.

See the User Permissions (RBAC) documentation for more information.

Batch Ingest⚓︎

Starting with version 4.14, batch ingests populate the summary table if you’re loading data into its parent table.

Aggregates of aggregates⚓︎

Hydrolix stores intermediate data for re-aggregating. For example, you can roll up one-minute segments into hourly ones because intermediate states store what’s needed.

SELECT toStartOfHour(minute) AS hour,
  sum_cost,
  avg_tax,
  distance_p95
FROM (
  SELECT
    minute,
    sum_cost,
    avg_tax,
    distance_p95
  FROM my.summarytable
  WHERE minute BETWEEN '2023-01-01 00:00:00' AND '2023-01-02 00:00:00'
  GROUP BY minute
)
GROUP BY hour
ORDER BY hour

Describe the schema⚓︎

Hydrolix automatically creates alias names for common aggregation functions. Run DESCRIBE to see what’s available.

Describe table

DESCRIBE TABLE my.summarytable

This shows an example response with table information:

Example table response

name             type                                         default_type   default_expression
---------------  -------------------------------------------  -------------  -------------------------------------------------
minute           DateTime                                     -              -
sum_cost         Nullable(UInt64)                             ALIAS          sumMerge(`sum(cost)`)
avg_tax          Nullable(Float64)                            ALIAS          avgMerge(`avg(tax)`)
distance_p95     Nullable(Float64)                            ALIAS          quantileMerge(0.95)(`quantile(0.95)(distance)`)
sum(sum(cost))   AggregateFunction(sum, Nullable(UInt64))     -              -
avg(avg(tax))    AggregateFunction(avg, Nullable(Float64))    -              -
quantile(0.95)(distance) AggregateFunction(quantile(0.95), Nullable(Float64)) -              -