Writing Efficient Queries

The Hydrolix platform and the HDX data format are specifically designed to optimize storage for very large, high-cardinality datasets held on a distributed storage medium (such as S3, Google Cloud Storage, etc.). These optimizations include a columnar data structure, micro-indexes, index and data compression, and automated partitioning. To get the most out of the system and achieve a truly performant level of service, the following should be considered.

Adjusting expectations

With great power comes great responsibility. In our case, a system such as Hydrolix gives you the potential to access significantly more detail, across a significantly larger amount of data, over a longer period of time than you could previously.

This is important to understand: where a traditional data system is artificially limited by sampling, roll-ups, or restrictions on storage longevity (e.g. data can only be kept for 48 hours), these boundaries no longer exist within a Hydrolix deployed system.

We therefore need to adjust our expectations of how we query data from the system.

Efficient Query Concepts

Hydrolix is built to handle querying massive datasets stored in the cloud. While basic queries should perform well out of the box, some query patterns common to OLTP databases (Postgres, MySQL) can lead to unexpectedly slow queries, or queries that consume too much memory. The good news is that these queries can most often be rewritten to perform better by leveraging what Hydrolix and ClickHouse (the underlying SQL engine) are good at.

When tuning your Hydrolix queries, it's best to keep the following concepts in mind:

  • Your query will at some point be pulling data from the cloud. Limiting how much data and how often your query pulls from the cloud is in your best interest.
  • The data for your query is stored by column, meaning all values that you pull for a particular column are stored contiguously in memory. Operating on columns using ClickHouse functions is almost always faster than operating on rows using traditional SQL methods.
  • JOINs are better suited to row-oriented databases. In Hydrolix, JOINs (INNER, OUTER, etc.) between two or more large sets should be avoided (with the exception of ARRAY JOIN, which we will go into further below).
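
As a minimal sketch of that exception: ARRAY JOIN expands the elements of an array column into one row per element, operating on a single column rather than joining two large row sets. The table and its tags column here are hypothetical:

```sql
-- Hypothetical table with an Array(String) column named tags.
-- ARRAY JOIN emits one output row per array element, so no
-- second large dataset ever needs to be joined in.
SELECT
    request_id,
    tag
FROM sample.requests
ARRAY JOIN tags AS tag
WHERE timestamp >= now() - INTERVAL 1 HOUR
```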

Cloud Based Columnar Database

The Hydrolix HDX data format is a columnar data format, which means that data is stored as columns rather than rows. This structure is highly efficient when querying data, especially large and very wide datasets where only a few disparate columns are required to satisfy a query.

In order to be efficient when running queries, it is important to select only the columns that are needed to answer the query, naming each one explicitly:

SELECT
    general.time AS "time",
    general.country AS "country",
    general.network AS "network",
    general.user_perf_cnt_good AS "user_perf_cnt_good",
    general.user_perf_cnt_bad AS "user_perf_cnt_bad",
    general.admin_perf_cnt_good AS "admin_perf_cnt_good",
    general.admin_perf_cnt_bad AS "admin_perf_cnt_bad",
    general.primary_network AS "primary_network"
For example, the following query uses a wildcard in its nested select:

SELECT
    toStartOfDay(timestamp) AS day,
    countryCode,
    avg(performance) AS perf
FROM
    (
    SELECT
        *
    FROM
        sample.performance_logs
    WHERE
        timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00'
    LIMIT 10000)
WHERE
    countryCode = 4
GROUP BY
    day,
    countryCode
ORDER BY
    day

Time

An important element within Hydrolix is time. Time is core to the partitioning strategy used when storing data, so it is important to limit by time, and to limit often. Without a time constraint you may end up scanning all the data from all time when you are only really interested in the last 10 minutes.
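
For example, a query for recent data can be constrained with a simple time predicate (the table name is taken from the other examples on this page; the window is illustrative):

```sql
-- Restricting by time lets Hydrolix prune partitions, so only
-- the last 10 minutes of data is read from cloud storage.
SELECT
    countryCode,
    avg(performance) AS perf
FROM sample.performance_logs
WHERE timestamp >= now() - INTERVAL 10 MINUTE
GROUP BY countryCode
```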

select *

select * and wildcarding in queries is a significantly inefficient way of executing queries, because a wildcard request will retrieve every column across the whole dataset, regardless of whether that column is needed. The query at the end of the previous section is an example.

In that query, the nested select effectively requests all the columns between two dates, then takes the first 10,000 rows and passes them to the top select statement for processing. The top statement then runs another filter, countryCode = 4, on that data.

If the dataset is 100 columns wide, this can cause a huge amount of data to be retrieved before any filtering happens. A more optimal approach is to name the required columns in the nested query and push the predicate from the top query down into it, reducing the amount of data retrieved:

SELECT
    toStartOfDay(timestamp) AS day,
    countryCode,
    avg(performance) AS perf
FROM
    (
    SELECT
        timestamp,
        countryCode,
        performance
    FROM
        sample.performance_logs
    WHERE
        timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00'
        AND countryCode = 4
    LIMIT 10000)
GROUP BY
    day,
    countryCode
ORDER BY
    day

Don't rely on Limit

Relying on LIMIT can be dangerous. LIMIT does not actually limit the number of rows retrieved from storage; it limits the number of rows returned in the response to the end user. For example, users often use it to get an understanding of a table:

SELECT
        *
FROM
        sample.performance_logs
LIMIT 1

This query will effectively request ALL rows (and all columns) from storage, across all time, and only once they are retrieved return the first row in the set.

A better method to get information on the table is to use the DESC command:

DESC sample.performance_logs

Or, if a number of rows are required, limit the time window instead:

SELECT
        *
FROM
        sample.performance_logs
WHERE
        timestamp BETWEEN '2021-11-01 00:00:00' and '2021-11-01 00:00:01'

Predicates (other than Time)

Hydrolix uses a unique indexing strategy whereby, by default, columns are indexed (often even metrics). This means that by using predicates (and predicate push-down), smaller amounts of data can be selected for retrieval from cloud storage.

To write fast queries: the more predicates (WHERE column = x, etc.) you can use, the faster your query will be, as this ensures that only the data required to answer it is actually retrieved from storage.
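
As an illustrative sketch (column names reuse those from the earlier examples), each additional predicate narrows what must be fetched:

```sql
-- The time predicate prunes partitions; the countryCode and
-- performance predicates can use the column micro-indexes, so
-- only matching blocks are pulled from cloud storage.
SELECT
    countryCode,
    count() AS slow_requests
FROM sample.performance_logs
WHERE timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-02 00:00:00'
    AND countryCode = 4
    AND performance > 10
GROUP BY countryCode
```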

Circuit Breakers

Hydrolix can use circuit breakers that protect the system from inefficient queries. More information on these can be found in Query Options.
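
As one example of the kind of guard rail involved, the underlying ClickHouse engine supports per-query limits attached via a SETTINGS clause; whether and how these particular settings are exposed in your Hydrolix deployment is described in Query Options:

```sql
-- Abort the query if it runs longer than 30 seconds, or tries
-- to use more than roughly 10 GB of memory on a server.
SELECT count()
FROM sample.performance_logs
WHERE timestamp >= now() - INTERVAL 1 DAY
SETTINGS max_execution_time = 30, max_memory_usage = 10000000000
```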

