Query Efficiency
The Hydrolix platform and the HDX data format have been specifically designed using a number of characteristics that optimize storage for very large, high cardinality datasets while using a distributed storage medium (such as S3, Google Cloud Storage etc). These optimizations include a columnar data structure, micro-indexes, tight index and data compression and automated partitioning. In order to get the most out of the system and to get a truly performant level of service, the following should be considered.
Adjusting expectations
With great power comes great responsibility. In our case, by using a system such as Hydrolix you will have a greater potential to access a significantly greater amount of detail across a significantly larger amount of data for a longer period of time than you had previously.
Traditional data systems are unnaturally limited due to sampling, roll-ups, or restrictions in storage retention (e.g. data can only be kept for 48 hours). These boundaries no longer exist within a Hydrolix deployed system.
We therefore need to adjust our expectations in how we Query data from the system.
Efficient Query Concepts
Hydrolix is built to handle querying massive datasets stored in the cloud. While basic queries should perform well out-of-the-box, there are some query patterns common to OLTP databases (e.g. Postgres, MySQL) that can lead to unexpectedly slow or queries consuming too much memory. The good news is that most often these queries can be rewritten to behave in a more performant way that highlights what Hydrolix and ClickHouse (the underlying SQL engine) are good at.
When tuning your Hydrolix queries, it's best to keep the following concepts in mind:
- Your query will be pulling data from the cloud. Limiting how much data and how often your query pulls from the cloud is in your best interest.
- The data for your query is stored by column, meaning all values that you pull for a particular column are stored contiguously in memory. Operating on columns using ClickHouse functions is almost always faster than operating on rows using traditional SQL methods.
- A
JOIN
is more well-suited to row-oriented databases. In our system,JOIN
s (INNER
,OUTER
, etc) between two or more large sets should be avoided (with an exception forARRAY JOIN
s, which we will discuss below.)
Cloud Based Columnar Database
The Hydrolix HDX data format is a columnar-based data format, which means that data is stored as columns rather than rows. This kind of structure is more efficient when querying large and very wide datasets where disparate columns are required to satisfy a query.
In order to be efficient when running queries, it is important to use only the columns that are are needed to answer the query.
Time
An important element within Hydrolix is time. Time is core to the partitioning strategy used when storing data. This means that it is important to limit by time and limit often. Without limiting by time, you may end up scanning all the data from all time when you are only really interested in the last 10 minutes.
select *
select *
select *
and using wildcards in queries is a significantly inefficient method of executing queries. This is because a wildcard request will retrieve all columns across the whole dataset, regardless if that column is needed or not. For example, if you were to run the following query:
SELECT
toStartOfDay(timestamp) AS day,
countryCode,
avg(performance) AS perf
FROM
(
SELECT
*,
countIf(performance > 10 and user = 'normal') as slow_normal,
countIf(performance > 10 and user = 'admin') as slow_admin
FROM
sample.performance_logs
WHERE
timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00'
LIMIT 10000) AS subquery
WHERE
countryCode = 4
GROUP BY
day,
countryCode,
perf
ORDER BY
day
In the above, the nested query requests all the columns between two dates, then gets the first 10,000 rows and passes this to the top SELECT statement for processing. The top statement then runs another filter, WHERE countryCode=4
.
If the dataset's width was 100 columns wide, this SELECT *
could cause a huge amount of unused data to be retrieved. A better approach would be to name only the columns needed in the nested query and move the countrycode = 4
predicate in the top query into the nested query for more efficiency:
SELECT
toStartOfDay(timestamp) AS day,
countryCode,
avg(performance) AS perf
FROM
(
SELECT
timestamp,
countryCode,
performance
FROM
sample.performance_logs
WHERE
timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00'
AND
countryCode = 4
LIMIT 10000) AS subquery
GROUP BY
day,
countryCode
ORDER BY
day
Don't rely on Limit
Limit
Relying on LIMIT for tasks can be dangerous. Very simple queries can use LIMIT to keep the database from querying more rows than you need. However, if your query includes GROUP BY, ORDER BY, or any aggregate function, LIMIT does not actually limit the number of rows retrieved from storage. Instead, it just limits the number of rows that are provided in the final response to you. For example, this query will retrieve all rows in order to process the GROUP BY clause:
SELECT
*
FROM
sample.performance_logs
GROUP BY
error_level
LIMIT
1
This query will request ALL rows (and all columns) from the database from all time from storage and then, once all are retrieved, return the first row in the set.
It's much better to limit the time window, as well as the number of rows returned:
SELECT
*
FROM
sample.performance_logs
WHERE
timestamp BETWEEN '2024-11-01 00:00:00' and '2024-11-01 00:00:01'
GROUP BY
error_level
LIMIT
1
Use Predicates
Hydrolix uses a unique indexing strategy, where all columns are indexed by default. This even includes columns considered metrics. Speed up your queries by using predicates (and predicate push-down) -- smaller amounts of data will be selected to be retrieved from cloud storage.
The more predicates (e.g. where column = x
) you can use, the faster your query will be, ensuring only the data required to answer is actually retrieved from storage.
Use Query Caching
When using SETTINGS use_query_cache = true
at the end of queries, Hydrolix will cache the results of the query for future use. Subsequent query results issued within a predetermined amount of time will be served from a local cache, rather than from cloud storage. Read more about this feature in the ClickHouse documentation.
Using the example query from above,
SELECT * FROM test_parameterized(cost_limit = 65518)
a test cluster shows this much elapsed time:
However, appending SETTINGS use_query_cache = true
to the query drastically reduces the query response time if the results are in the cache:
Allow Null Primary Key Query Results
By default, the primary key column in a Hydrolix table cannot be null. Hydrolix rejects any data that contains a null primary key.
However, querying the primary key column with certain functions can can still generate a null result in that column. When this happens, Hydrolix converts that null value to the equivalent of 0 in the datetime
format -- Unix epoch time zero, e.g. 1970-01-01 00:00:00
.
Use the toNull()
function to allow null results in the primary key column:
SELECT MAX(toNull(mytimestamp)) FROM x WHERE mytimestamp > now()
Circuit Breakers
Hydrolix has the ability to use Circuit breakers that can protect the system from inefficient queries, more information on these can be found here Query Options
Updated 18 days ago