Aggregating Data

Hydrolix supports aggregation via summary tables. You can use summary tables to store aggregations including min, max, count, and sum. In certain cases, summary tables can store statistical aggregations like percentiles and unique counts.

When you use summary tables, Hydrolix aggregates data with the following steps:

  1. Hydrolix runs a summary SQL transform on input to generate intermediate aggregation data.
  2. Hydrolix writes the output of that SQL statement to a summary table.

Later, queries can use the intermediate aggregation data stored in the summary table to process aggregation queries faster than they could by processing every individual piece of input data.

HTTP Streaming Architecture

With the HTTP Stream API, Hydrolix receives data via HTTP requests. The Stream Head adds the message to a queue for ingest. Summary Peers retrieve messages from the queue, enrich the raw data, and then execute their own aggregation.

1199

HTTP Stream API ingestion workflow

Summary Peers transform data twice:

  1. A duplicate of the enrichment transform applied by Stream Peers to the raw data destined for the parent table.
  2. A second summary transform that generates aggregation output destined for the summary table.
599

The series of transforms that Summary Peers apply to raw data.

Example

Consider raw data input that contains a string column called myColumn.
The summary transform must reference myColumn and set the data type to string unless the enrichment transform cast that column to a different type.

Merge

Summary tables increase the workload of Merge Peers. Whenever a merge operation alters partitions, the Merge Peer must also update the corresponding summary tables.

For more information about merge scaling, see the Merge and Scaling your Cluster pages.

Query

Use the aliases specified in your summary transform to query your data. The following example uses aliases specified in the HTTP Streaming example summary transform:

select 
  time, 
  sumcost, 
  avgtax, 
   _95th 
from my.summarytable 
  where time between '2023-01-01 00:00:00' and '2023-01-02 00:00:00'
group by time 
order by time

You must specify a GROUP BY containing a subset of the columns used in the GROUP BY clause of the summary table's summary transform.

Aggregates of Aggregates

Hydrolix stores summary data in intermediate formats. These intermediate formats don't just store an aggregation result; they also store supplemental data that enables accurate calculation of additional aggregations.

For example, you normally can't calculate a weighted average of two averages. But Hydrolix's intermediate formats also store weighting data for averages, so you can aggregate already-calculated averages.

As a result, you can aggregate the aggregate data stored in your summary tables. For example, the above example stores intermediate data in five minute segments. Intermediate states enable an hourly aggregation of those five minute segments:

select toStartOfHour(time) as hour,
  sumcost, 
  avgtax, 
   _95th 
from
( select 
    time, 
    sumcost, 
    avgtax,
    _95th 
  from my.summarytable 
  where time between '2023-01-01 00:00:00' and '2023-01-02 00:00:00'
  group by time 
)
group by hour
order by hour

Describe the Schema

To make summary data simpler to work with, Hydrolix automatically creates alias names for common aggregation functions. To see a full list of generated aliases, execute a DESCRIBE query on your table:

DESCRIBE TABLE my.summarytable

Query id: 732ac0b2-c6f8-46cd-be29-edfabccdcdd6

┌─name──────────────────────────────────────────────────┬─type─────────────────────────────────────────────────┬─default_type─┬─default_expression────────────────────────────────────────────────────────┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ toStartOfFiveMinute(time)                             │ DateTime                                             │              │                                                                           │         │                  │                │
│ sum(sumcost)                                          │ AggregateFunction(sum, Nullable(UInt64))             │              │                                                                           │         │                  │                │
│ avg(avgtax)                                           │ AggregateFunction(avg, Nullable(Int32))              │              │                                                                           │         │                  │                │
│ quantile(0.95)(arrayElement(map_double, 'double'))    │ AggregateFunction(quantile(0.95), Nullable(double))  │              │                                                                           │         │                  │                │
│ time                                                  │ DateTime                                             │ ALIAS        │ `toStartOfFiveMinute(timestamp)`                                          │         │                  │                │
│ sumcost                                               │ Nullable(UInt64)                                     │ ALIAS        │ sumMerge(`sum(sumcost)`)                                                  │         │                  │                │
│ avgtax                                                │ Nullable(Float64)                                    │ ALIAS        │ avgMerge(`avg(avgtax)`)                                                   │         │                  │                │
│ _95th                                                 │ Nullable(Float64)                                    │ ALIAS        │ quantileMerge(0.95)(`quantile(0.95)(arrayElement(map_double, 'double'))`) │         │                  │                │
└───────────────────────────────────────────────────────┴──────────────────────────────────────────────────────┴──────────────┴───────────────────────────────────────────────────────────────────────────┴─────────┴──────────────────┴────────────────┘
[query-head-64bcccdcc5-qk2km] 2023.01.30 18:29:33.255360 [ 9 ] {732ac0b2-c6f8-46cd-be29-edfabccdcdd6} <Information> executeQuery: Read 8 rows, 1.07 KiB in 0.00077993 sec., 10257 rows/sec., 1.33 MiB/sec.

8 rows in set. Elapsed: 0.271 sec.

query-peer :)