Query Efficiency

The Hydrolix platform and the HDX data format have been specifically designed using a number of characteristics that optimize storage for very large, high cardinality datasets while using a distributed storage medium (such as S3, Google Cloud Storage etc). These optimizations include a columnar data structure, micro-indexes, tight index and data compression and automated partitioning. In order to get the most out of the system and to get a truly performant level of service, the following should be considered.

Adjusting expectations

With great power comes great responsibility. In our case, by using a system such as Hydrolix you will have a greater potential to access a significantly greater amount of detail across a significantly larger amount of data for a longer period of time than you had previously.

Traditional data systems are unnaturally limited due to sampling, roll-ups, or restrictions in storage retention (e.g. data can only be kept for 48 hours). These boundaries no longer exist within a Hydrolix deployed system.

We therefore need to adjust our expectations in how we Query data from the system.

Efficient Query Concepts

Hydrolix is built to handle querying massive datasets stored in the cloud. While basic queries should perform well out-of-the-box, there are some query patterns common to OLTP databases (e.g. Postgres, MySQL) that can lead to unexpectedly slow or queries consuming too much memory. The good news is that most often these queries can be rewritten to behave in a more performant way that highlights what Hydrolix and ClickHouse (the underlying SQL engine) are good at.

👍

When tuning your Hydrolix queries, it's best to keep the following concepts in mind:

  • Your query will be pulling data from the cloud. Limiting how much data and how often your query pulls from the cloud is in your best interest.
  • The data for your query is stored by column, meaning all values that you pull for a particular column are stored contiguously in memory. Operating on columns using ClickHouse functions is almost always faster than operating on rows using traditional SQL methods.
  • A JOIN is more well-suited to row-oriented databases. In our system, JOINs (INNER, OUTER, etc) between two or more large sets should be avoided (with an exception for ARRAY JOINs, which we will discuss below.)

Cloud Based Columnar Database

The Hydrolix HDX data format is a columnar-based data format, which means that data is stored as columns rather than rows. This kind of structure is more efficient when querying large and very wide datasets where disparate columns are required to satisfy a query.

In order to be efficient when running queries, it is important to use only the columns that are are needed to answer the query.

Time

An important element within Hydrolix is time. Time is core to the partitioning strategy used when storing data. This means that it is important to limit by time and limit often. Without limiting by time, you may end up scanning all the data from all time when you are only really interested in the last 10 minutes.

select *

select * and using wildcards in queries is a significantly inefficient method of executing queries. This is because a wildcard request will retrieve all columns across the whole dataset, regardless if that column is needed or not. For example, if you were to run the following query:

SELECT
	toStartOfDay(timestamp) AS day,
	countryCode,
	avg(performance) AS perf
FROM
	(
	SELECT
		*,
    countIf(performance > 10 and user = 'normal') as slow_normal,
    countIf(performance > 10 and user = 'admin') as slow_admin
	FROM
		sample.performance_logs
	WHERE
		timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00'
	LIMIT 10000) AS subquery
WHERE
	countryCode = 4
GROUP BY
	day,
	countryCode,
  perf
ORDER BY
	day

In the above, the nested query requests all the columns between two dates, then gets the first 10,000 rows and passes this to the top SELECT statement for processing. The top statement then runs another filter, WHERE countryCode=4.

If the dataset's width was 100 columns wide, this SELECT *could cause a huge amount of unused data to be retrieved. A better approach would be to name only the columns needed in the nested query and move the countrycode = 4predicate in the top query into the nested query for more efficiency:

SELECT
	toStartOfDay(timestamp) AS day,
	countryCode,
	avg(performance) AS perf
FROM
	(
	SELECT
		timestamp,
		countryCode,
		performance
	FROM
		sample.performance_logs
	WHERE
		timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00'
    AND
    countryCode = 4
	LIMIT 10000) AS subquery
GROUP BY
	day,
	countryCode
ORDER BY
	day

Don't rely on Limit

Relying on LIMIT for tasks can be dangerous. Very simple queries can use LIMIT to keep the database from querying more rows than you need. However, if your query includes GROUP BY, ORDER BY, or any aggregate function, LIMIT does not actually limit the number of rows retrieved from storage. Instead, it just limits the number of rows that are provided in the final response to you. For example, this query will retrieve all rows in order to process the GROUP BY clause:

SELECT
  *
FROM
  sample.performance_logs
GROUP BY
  error_level
LIMIT 
  1

This query will request ALL rows (and all columns) from the database from all time from storage and then, once all are retrieved, return the first row in the set.

It's much better to limit the time window, as well as the number of rows returned:

SELECT
  *
FROM
  sample.performance_logs
WHERE
  timestamp BETWEEN '2024-11-01 00:00:00' and '2024-11-01 00:00:01'
GROUP BY
  error_level
LIMIT
  1

Use Predicates

Hydrolix uses a unique indexing strategy, where all columns are indexed by default. This even includes columns considered metrics. Speed up your queries by using predicates (and predicate push-down) -- smaller amounts of data will be selected to be retrieved from cloud storage.

The more predicates (e.g. where column = x) you can use, the faster your query will be, ensuring only the data required to answer is actually retrieved from storage.

Use Query Caching

When using SETTINGS use_query_cache = true at the end of queries, Hydrolix will cache the results of the query for future use. Subsequent query results issued within a predetermined amount of time will be served from a local cache, rather than from cloud storage. Read more about this feature in the ClickHouse documentation.

Using the example query from above,

SELECT * FROM test_parameterized(cost_limit = 65518)

a test cluster shows this much elapsed time:

Longer Query

However, appending SETTINGS use_query_cache = true to the query drastically reduces the query response time if the results are in the cache:

Shorter Query

Allow Null Primary Key Query Results

By default, the primary key column in a Hydrolix table cannot be null. Hydrolix rejects any data that contains a null primary key.

However, querying the primary key column with certain functions can can still generate a null result in that column. When this happens, Hydrolix converts that null value to the equivalent of 0 in the datetime format -- Unix epoch time zero, e.g. 1970-01-01 00:00:00.

Use the toNull() function to allow null results in the primary key column:

SELECT MAX(toNull(mytimestamp)) FROM x WHERE mytimestamp > now()

Circuit Breakers

Hydrolix has the ability to use Circuit breakers that can protect the system from inefficient queries, more information on these can be found here Query Options