Writing Efficient Queries
The Hydrolix platform and the HDX data format have been specifically designed using a number of characteristics that optimize storage for VERY large, high cardinality datasets all while using a distributed storage medium (such as S3, Google Cloud Storage etc). These optimisations include a columnar data structure, micro-indexes, index and data compression optimisations and automated partitioning and mean that in order to get the most out of the system and to get a truly performant level of service the following should be considered.
With great power, comes great responsibility or in our case by using a system such as Hydrolix you will have a greater potential to access a significantly greater amount of detail across a significantly larger amount of data for a longer period of time than you had previously.
This is important to understand as where your traditional data system is unnaturally limited due to sampling, roll-ups or restrictions in storage longevity (e.g. data can only be kept for 48 hours), these boundaries no longer exist within a Hydrolix deployed system.
We therefore need to adjust our expectations in how we Query data from the system.
Efficient Query Concepts
Hydrolix is built to handle querying massive datasets stored in the cloud. While basic queries should perform well out-of-the-box, there are some query patterns common to OLTP databases (Postgres, MySQL) that can lead to unexpectedly slow or queries consuming too much memory. The good news is that most often these queries can be rewritten to behave in a more performant way, that leverages what Hydrolix and ClickHouse (the underlying SQL engine) are good at.
When tuning your Hydrolix queries, it's best to keep the following concepts in mind:
- Your query will at some point be pulling data from the cloud. Limiting how much data and how often your query pulls from the cloud is in your best interest.
- The data for your query is stored by column, meaning all values that you pull for a particular column are stored contiguously in memory. Operating on columns using ClickHouse functions is almost always faster than operating on rows using traditional SQL methods.
JOINmore well-suited to row-oriented databases. In our system,
OUTER, etc) between two or more large sets should be avoided (with an exception for
ARRAY JOINs, which we will go into further!)
Cloud Based Columnar Database
The Hydrolix HDX data format is a columnar based data format, which means that data is stored as columns rather than rows. This kind of structure is really efficient when querying data, especially large and very wide datasets, where disparate columns are required to satisfy a query.
In order to be efficient when running queries it is important to only use the columns that are are needed to answer the query.
SELECT general.time AS "time", general.country AS "country", general.network AS "network", general.user_perf_cnt_good AS "user_perf_cnt_good", general.user_perf_cnt_bad AS "user_perf_cnt_bad", general.admin_perf_cnt_good AS "admin_perf_cnt_good", general.admin_perf_cnt_good AS "admin_perf_cnt_bad", general.primary_network AS "primary_network" SELECT toStartOfDay(timestamp) AS day, countryCode, average(performance) AS perf FROM ( SELECT *, countIf(performance > 10 and user = normal) as slow_normal, countIf(performance > 10 and user = admin) as slow_admin, if FROM sample.performance_logs WHERE timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00' LIMIT 10000) WHERE countryCode = 4 GROUP BY day, countryCode ORDER BY timestamp
An important element within Hydrolix is time. Time is core to the partitioning strategy used when storing data, this means that with data-sets its important to limit by time and limit often. Without limiting by time you may end up scanning all the data from all time when you are only really interested in the last 10 minutes.
select * and using wildcarding in queries is a significantly inefficient method of executing queries, this is because a wildcard request will retrieve all columns across the whole dataset regardless if that column is needed or not. For example if you were to run the following query:
In the above, the nested query effectively requests all the columns between two dates, then gets the first 10,000 rows and passes this to the top Select statement for processing. This second statement runs another filter,
where countryCode=4, on the data.
If the dataset's width is a 100 columns wide this can cause a huge amount of data to be retrieved to start. A more optimal approach would be to name the columns in the nested query, move the predicate in the top query down to reduce the :
SELECT toStartOfDay(timestamp) AS day, countryCode, average(performance) AS perf FROM ( SELECT timestamp, countryCode, performance FROM sample.performance_logs WHERE timestamp BETWEEN '2021-11-01 00:00:00' AND '2021-11-05 00:00:00' AND countryCode = 4 LIMIT 10000) GROUP BY day, countryCode ORDER BY timestamp
Don't rely on
Relying on Limit for tasks can be dangerous. LIMIT does not actually limit the number of rows "retrieved" from storage, it limits the number of rows that are provided in the response to the end-user. For example users often use this to get an understanding of a table:
SELECT * FROM sample.performance_logs LIMIT 1
This query will effectively request ALL rows (and all columns) from the database from all time from storage and then once all are retrieved issue the 1st row in the set.
A better method to get information on the table is to use the
Or if a number of rows are required limit the time window that
SELECT * FROM sample.performance_logs WHERE timestamp BETWEEN '2021-11-01 00:00:00' and '2021-11-01 00:00:01'
Predicates (other than Time)
Hydrolix uses a unique indexing strategy, where by default columns are indexed (often, even metrics). This means that by using predicates (and predicate push-down) smaller amounts of data can be selected to be retrieved from the cloud storage mechanism.
To write fast queries, the more predicates (
where column = x etc) you can use the faster your query will be as it ensures only the data required to answer is actually retrieved from storage.
Hydrolix has the ability to use Circuit breakers that can protect the system from inefficient queries, more information on these can be found here Query Options
Updated 10 months ago