Databricks Deployment
Overview⚓︎
Databricks is a unified, open analytics platform built on top of Apache Spark that offers features like a collaborative workspace, data lakehouse capabilities, and integrated machine learning workflows. With the help of the Hydrolix Connector for Apache Spark, you can improve query speed and reduce storage costs by using Hydrolix as the backing store for your Databricks cluster.
Requirements⚓︎
Dependencies⚓︎
- Hydrolix Cluster: A running Hydrolix cluster version 4.22.1 or higher. Deployment instructions for your preferred cloud vendor (AWS, Google, Linode, or Microsoft) can be found here.
- Databricks Spark Cluster: A Databricks account with a deployed Spark cluster.
Required User Permissions⚓︎
Querying your Hydrolix bucket using Spark requires the same permissions as querying through your cluster. Specifically, a user needs the following permissions at the levels indicated:
| Permission name | Level |
|---|---|
| `view_org` | Org |
| `view_hdxstorage` | Org |
| `catalog_urls_table` | Project or table |
| `view_table` | Project or table |
| `view_view` | Project or table |
| `view_transform` | Project or table |
| `select_sql` | Project or table |
If you will be querying multiple tables within the same Hydrolix project, it's easier to scope those permissions to the project level instead of granting the permissions for each table.
Runtime Requirements⚓︎
Java Runtime: Must use Java 11 or later
Databricks cluster compatibility⚓︎
The Hydrolix Connector for Apache Spark is compatible with these Databricks cluster configurations:
| Hydrolix Connector in production | Runtime version | Access mode: Dedicated (Single User) | Access mode: No isolation shared | Policy: Unrestricted | Notes |
|---|---|---|---|---|---|
| ✓ | Databricks 14.3 | ✓ | ✓ | ✓ | Works with Java 11. Set the environment variable `JNAME=zulu11-ca-amd64`. |
| ✓ | Databricks 15.4 | ✓ | ✓ | ✓ | Works with Java 17. Set the environment variable `JNAME=zulu17-ca-amd64`. |
| ✓ | Databricks 16.4 (Scala 2.12) | ✓ | ✓ | ✓ | Defaults to Java 17. |
Setup steps⚓︎
- Create a Summary Table of the Data You Will Be Querying
- Set the Hydrolix Connector Parameters
- Set the JNAME Environment Variable
- Upload and Install the Hydrolix Connector for Apache Spark
After completing these steps, you can start querying your data using Spark.
Optional steps:
- Create a summary table (recommended)
- (Google Cloud Platform only) Set the Hydrolix cluster storage bucket credential
Create a summary table⚓︎
Create a summary table, including a transform, of the data you plan to query. Using a summary table allows you to perform some aggregation operations in the cluster, effectively reducing query time and result data set size. Instructions for creating a summary table using the Hydrolix UI and API are on the Summary Tables page. While this step can be skipped, it is highly recommended.
The general structure of summary transforms and their limitations are explained in this section on creating summary transforms.
Set The Hydrolix Connector Parameters⚓︎
The Hydrolix Connector for Apache Spark requires the following configuration parameters. Set these parameters in the Databricks UI on your existing Spark cluster. Your Hydrolix cluster username and password, which are included in these parameters, should be stored as Databricks secrets.
| Property | Value | Description |
|---|---|---|
| `spark.sql.catalog.spark_catalog` | `io.hydrolix.connectors.spark.SparkTableCatalog` | The fully qualified name of the class to instantiate when you ask for the Hydrolix catalog. |
| `spark.sql.extensions` | `io.hydrolix.connectors.spark.SummaryUdfExtension` | A comma-separated list of fully qualified SQL extension classes. Using summary tables requires including `SummaryUdfExtension` in this set. |
| `spark.sql.catalog.spark_catalog.cluster_url` | `https://{myhost}.hydrolix.live` | The Hydrolix cluster URL. If this field is set, the `jdbc_url` and `api_url` fields are unnecessary and will be ignored. |
| `spark.sql.catalog.spark_catalog.jdbc_protocol` | `https` | Defaults to `https` if not provided. Used with the `cluster_url` and `jdbc_port` configs to derive the JDBC URL. |
| `spark.sql.catalog.spark_catalog.jdbc_port` | `8088` | Defaults to `8088` if not provided. Used with the `cluster_url` and `jdbc_protocol` configs to derive the JDBC URL. |
| `spark.sql.catalog.spark_catalog.jdbc_url` | `jdbc:clickhouse:https://{myhost}.hydrolix.live:8088?ssl=true` | JDBC URL of the Hydrolix query head. Note that the ClickHouse JDBC driver requires a valid database name in the URL, but the connector will read any database the user has access to. Ignored if `cluster_url` is provided. |
| `spark.sql.catalog.spark_catalog.api_url` | `https://{myhost}.hydrolix.live/config/v1/` | URL of the Hydrolix config API. It usually must end with `/config/v1/`, including the trailing slash. Ignored if `cluster_url` is provided. |
| `spark.sql.catalog.spark_catalog.username` | `{{secrets/path/to/username}}` | Username used to log in to the Hydrolix cluster. |
| `spark.sql.catalog.spark_catalog.password` | `{{secrets/path/to/password}}` | Password used to log in to the Hydrolix cluster. |
| `spark.sql.catalog.spark_catalog.hdx_partitions_per_task` | `1` | Optional. Defines how many Hydrolix partitions are read by each Spark partition. Defaults to `1`. For example, if this is set to `2` and partition planning returns 40 Hydrolix partitions, the query launches 20 Spark tasks, each processing 2 Hydrolix partitions. Left unset, 40 Spark tasks would be launched, each processing 1 Hydrolix partition. |
To set these parameters, expand the Advanced Options heading, open the Spark tab, and enter the key/value pairs into the Spark config. Separate each key from its value with a single space.
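For example, a minimal configuration based on the table above might look like the following sketch; `{myhost}` and the secret paths are placeholders to replace with your Hydrolix cluster's values:

```
spark.sql.catalog.spark_catalog io.hydrolix.connectors.spark.SparkTableCatalog
spark.sql.extensions io.hydrolix.connectors.spark.SummaryUdfExtension
spark.sql.catalog.spark_catalog.cluster_url https://{myhost}.hydrolix.live
spark.sql.catalog.spark_catalog.username {{secrets/path/to/username}}
spark.sql.catalog.spark_catalog.password {{secrets/path/to/password}}
```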
(Screenshot: this configuration entered in the Spark config field of the Databricks UI.)
Spark configuration is dynamically reloaded for every query
The Hydrolix Connector reloads its configuration for every query. To change the configuration inside spark-shell or a Jupyter notebook, call `spark.conf.set("some_setting", "value")`.
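For example, to change the partitions-per-task setting from a running notebook session (a sketch using the parameter described in the table above; the value `2` is arbitrary):

```python
# Reconfigure the connector for subsequent queries in this session
spark.conf.set("spark.sql.catalog.spark_catalog.hdx_partitions_per_task", "2")
```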
Verify user permissions
Make sure the user supplied in the Spark properties has the required permissions on the underlying Hydrolix cluster.
Set the JNAME environment variable⚓︎
Enable JDK 11 by setting `JNAME=zulu11-ca-amd64` as shown below. Other JVM implementations may work with the Hydrolix Connector as long as they're Java 11 or later.

Click the Create Compute button to create your Spark cluster.
Upload and Install the Hydrolix Connector⚓︎
You can obtain the Hydrolix Connector JAR here.
In your Spark Cluster's UI:
- Navigate to the Libraries tab and select Install new.
- Select the DBFS and JAR options as shown in the following image:

- Select Drop JAR here.
- In the popup local file manager window, locate the Hydrolix Connector JAR file you downloaded from S3, select it, then press Open.
The file should begin uploading. Before proceeding, wait for the progress bar to complete, the text "File upload is in progress" to disappear, and a green check mark to appear.

- Select Install.
Once installation is complete, restart your cluster. You can now start analyzing your Hydrolix data in Spark.
(Google Cloud Platform only) Set Hydrolix Cluster Storage Bucket Credential⚓︎
If you set up your Hydrolix cluster with a default GCP storage bucket and you would like to query that bucket, follow the GCP setup instructions to configure a credential with your storage bucket before querying.
Query⚓︎
After you have configured your cluster, you can use the Hydrolix Connector for Apache Spark in a Spark notebook.
To begin using the connector in a Spark notebook, run one of the following two commands depending on your use case:
- Python or Scala fragment: `sql("use hydrolix")`
- SQL fragment: `use hydrolix;`

Alternatively, you can prefix each table you want to query from your Hydrolix backend with `hydrolix.`, as shown in the sketch below.
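Both styles in one hypothetical notebook cell; the table name `my_project.my_table` is a placeholder:

```sql
-- Option 1: select the Hydrolix catalog once, then use project.table names
use hydrolix;
SELECT * FROM my_project.my_table LIMIT 10;

-- Option 2: fully qualify the table with the catalog name instead
SELECT * FROM hydrolix.my_project.my_table LIMIT 10;
```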
Summary table query instructions (current version)⚓︎
Summary table queries have unique requirements.
Wrap summary aliases with hdxAgg()⚓︎
Wrap summary aliases in SELECT statements with `hdxAgg()`. For example, you might run the following to query both an aliased column (`summary_column_alias`) and a non-summary column (`non_summary_column`):
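A sketch of such a query; the table name `my_project.my_summary_table` is hypothetical, and passing the alias as a quoted argument is an assumption (check the Summary Tables documentation for the exact argument form):

```sql
SELECT
  non_summary_column,
  hdxAgg('summary_column_alias')
FROM my_project.my_summary_table
GROUP BY non_summary_column;
```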
You can read more about creating and querying summary tables in the Summary Tables documentation.
The `hdxAgg()` function resolves name conflicts between summary table alias columns and Spark function names. For example, `count` may be a column name in a summary table, but `count()` is also a pre-existing Spark function. Using the `hdxAgg()` function disambiguates the two.
The `hdxAgg()` function also resolves name conflicts between multiple summary tables. For example, one summary table might have `count(timestamp) AS total` in its definition while another might have `count(distinct username) AS total` in its definition. The `hdxAgg()` function disambiguates `total()` and ensures it works when querying either table.
Load the Hydrolix datasource⚓︎
Load the Hydrolix datasource. Doing so loads the required hdxAgg() function. The options for loading the Hydrolix datasource are:
- Run `use hydrolix`
- Run any query against a non-summary table
- Run `io.hydrolix.connectors.spark.HdxUdfRegistry.enableSummaryTables(spark)`
Summary table query instructions (v1.0.0 only)⚓︎
The v1.0.0 release of the Hydrolix Connector for Apache Spark has unique instructions for querying summary tables.
- To enable querying the summary table you created during the Create a Summary Table step, run the following line in a Spark shell or in a PySpark session:

  ```text
  io.hydrolix.connectors.spark.HdxUdfRegistry.registerSummaryTable(spark, "{project.summary_table_name}")
  ```

- Reference any summary table column aliases using the syntax `summary_column_alias()`. For example, suppose there is a summary table called `my_project.my_summary_table` which includes the following in its definition:

  ```sql
  SELECT ... sum(salary) AS totalSalary, department, ...
  ```

  This table can then be queried as shown in the sketch after this list.
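For example, a sketch using the alias-as-function syntax described above (the GROUP BY shape is illustrative):

```sql
SELECT department, totalSalary()
FROM my_project.my_summary_table
GROUP BY department;
```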
Example queries⚓︎
The following examples query the hydro.logs table directly rather than through a summary table.
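For instance, a couple of sketches against `hydro.logs`; the column name `timestamp` is an assumption, so adjust it to match your table's transform:

```sql
use hydrolix;

-- Row count over the last hour (assumes the primary timestamp column is named "timestamp")
SELECT count(*) AS recent_rows
FROM hydro.logs
WHERE timestamp >= current_timestamp() - INTERVAL 1 HOUR;

-- Sample a few recent rows
SELECT *
FROM hydro.logs
ORDER BY timestamp DESC
LIMIT 10;
```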
Join Hydrolix tables with Unity Catalog tables⚓︎
To query Hydrolix data alongside data from non-Hydrolix tables, do the following:
- Use Dedicated access mode and Unrestricted policy in the cluster configuration.
- Review the Hydrolix Connector parameters that begin with `spark.sql.catalog` and replace any instances of `hydrolix` with `spark_catalog`. See Set The Hydrolix Connector Parameters for example parameter values.
Once configured, Hydrolix tables can be referenced by a short name, and the correct syntax for the USE command is:
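For example (assuming the catalog is registered as `spark_catalog` per the previous step):

```sql
use spark_catalog;
```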
Now you can join Hydrolix data with any other tables including Unity Catalog tables.
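A minimal join sketch; the Hydrolix table `my_project.requests` and the Unity Catalog table `main.sales.customers` are hypothetical names:

```sql
-- Hydrolix table (short name after `use spark_catalog`) joined with a
-- three-part Unity Catalog table name
SELECT c.customer_name, count(*) AS request_count
FROM my_project.requests AS r
JOIN main.sales.customers AS c
  ON r.customer_id = c.customer_id
GROUP BY c.customer_name;
```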
Group access to Unity Catalog tables
If a dedicated cluster is created for a user group, access to Unity Catalog tables is extended based on the group permissions, not on those of individual group members. In other words, the entire group must have access to the Unity table.
Troubleshoot⚓︎
Authentication Error⚓︎
If you see "access denied" errors from the Hydrolix database when you are making queries, ensure the cluster username and password are correct, and make sure that user has query permissions.
User Permissions⚓︎
Partitions in a table might be distributed across different storage buckets.
If the user set in your Hydrolix Connector configuration doesn't have the required permissions for querying all the storage buckets through the ClickHouse JDBC Driver, the cluster will be unable to sign partitions from a storage bucket for the table. This will result in the query failing and returning an error response.
Error querying summary tables⚓︎
If a query against a summary table fails because the `hdxAgg` function isn't available, the fix is to load the Hydrolix datasource. You can do so by doing any one of the following:
- Run `io.hydrolix.connectors.spark.HdxUdfRegistry.enableSummaryTables(spark)`. This performs a global setup step which creates the `hdxAgg` function.
- Run the `use hydrolix` fragment. This loads the Hydrolix datasource.
- Run any query against a non-summary table. This loads the Hydrolix datasource.
Limitations⚓︎
Read-only⚓︎
The Hydrolix Connector for Apache Spark is read-only. ALTER and INSERT statements aren't supported.