Hydrolix Spark Connector: Databricks Deployment
Analyze your Hydrolix data using Apache Spark and Databricks
Overview
Databricks is a unified, open analytics platform built on top of Apache Spark that offers features like a collaborative workspace, data lakehouse capabilities, and integrated machine learning workflows. With the help of the Hydrolix Spark Connector, you can improve query speed and reduce storage costs by using Hydrolix as the backing store for your Databricks cluster.
Requirements
Dependencies
- Hydrolix Cluster: A running Hydrolix cluster version 4.22.1 or higher. Deployment instructions for your preferred cloud vendor (AWS, Google, Linode, or Microsoft) can be found here.
- Databricks Spark Cluster: A Databricks account with a deployed Spark cluster.
Required User Permissions
Querying your Hydrolix bucket via Spark requires the same permissions as querying via your cluster. Specifically, a user needs the following permissions at the levels indicated:
| Permission name | Level |
| --- | --- |
| view_org | Org |
| view_hdxstorage | Org |
| catalog_urls_table | Project or table |
| view_table | Project or table |
| view_view | Project or table |
| view_transform | Project or table |
| select_sql | Project or table |
If you will be querying multiple tables within the same Hydrolix project, it's easier to scope those permissions to the project level instead of granting the permissions for each table.
Runtime Requirements
Java Runtime: Must use Java 11 or later
Upload the C++ Init Script
When you configure a Spark cluster with the Spark Connector, you will also need to install a C++ initialization script. This script is responsible for setting some environment variables and installing required C++ dependencies. Before setting up the cluster, first you'll upload this init script to your Databricks workspace.
You can obtain the C++ Init Script here. Download the contents into a file called install_libcxx.sh. Then navigate to your Databricks workspace and do the following:
- Choose or create a directory in which to store the init script.
- Select import from the vertical ellipsis (⋮) menu.
- Select browse.
- Find the init script file and select Open.
- Select import.
The imported install_libcxx.sh file should now be accessible in your workspace directory.
Setup Steps
- Create a Summary Table of the Data You Will Be Querying
- Configure the Spark Cluster
- Upload and Install the Hydrolix Spark Connector
After completing these steps, you can start querying your data using Spark.
Create a Summary Table of the Data You Will Be Querying
Create a summary table, including a transform, for the data you will be querying. A summary table pre-aggregates the data, which reduces query time. Instructions for creating a summary table via the Hydrolix UI and API are on the Summary Tables page. While this step can be skipped, it is highly recommended.
The general structure of summary transforms and their limitations are explained in this section on creating summary transforms.
Configure the Spark Cluster
Create a Spark cluster in your Databricks workspace with the following configuration:
- Policy: Unrestricted
- Access Mode: No Isolation Shared
- Databricks Runtime Version: 14.3 LTS (Scala 2.12, Spark 3.5.0) or later
Configure the remaining settings under Performance, Instance Profile, and Tags to your preference, then complete the remaining configuration steps below before selecting Create compute to create your Spark cluster.
Set the Hydrolix Spark Connector Parameters
The Hydrolix Spark Connector requires the following configuration parameters. These parameters can be specified in the Databricks UI when creating a cluster. Your Hydrolix cluster username and password, which are included in these parameters, should be stored in the Databricks Secrets Manager.
| Property | Value | Description |
| --- | --- | --- |
| spark.sql.catalog.hydrolix | io.hydrolix.connectors.spark.SparkTableCatalog | The fully qualified name of the class to instantiate when you ask for the hydrolix catalog. |
| spark.sql.extensions | io.hydrolix.connectors.spark.SummaryUdfExtension | A comma-separated list of fully qualified SQL extension classes. Using summary tables requires including SummaryUdfExtension in this set. |
| spark.sql.catalog.hydrolix.cluster_url | https://{myhost}.hydrolix.live | Hydrolix cluster URL. If this field is set, the jdbc_url and api_url fields are unnecessary and will be ignored. |
| spark.sql.catalog.hydrolix.jdbc_protocol | https | Defaults to https if not provided. Used with the cluster_url and jdbc_port configs to derive the JDBC URL. |
| spark.sql.catalog.hydrolix.jdbc_port | 8088 | Defaults to 8088 if not provided. Used with the cluster_url and jdbc_protocol configs to derive the JDBC URL. |
| spark.sql.catalog.hydrolix.jdbc_url | jdbc:clickhouse:https://{myhost}.hydrolix.live:8088?ssl=true | JDBC URL of the Hydrolix query head. Note that the ClickHouse JDBC driver requires a valid database name in the URL, but the connector will read any database the user has access to. Ignored if cluster_url is provided. |
| spark.sql.catalog.hydrolix.api_url | https://{myhost}.hydrolix.live/config/v1/ | URL of the Hydrolix config API; it usually must end with /config/v1/, including the trailing slash. Ignored if cluster_url is provided. |
| spark.sql.catalog.hydrolix.username | {{secrets/path/to/username}} | Username used to log in to the Hydrolix cluster. |
| spark.sql.catalog.hydrolix.password | {{secrets/path/to/password}} | Password used to log in to the Hydrolix cluster. |
| spark.sql.catalog.hydrolix.hdx_partitions_per_task | 1 | Optional. Defines how many HDX partitions are read by each Spark task. The default value is 1. For example, if this setting is 2 and partition planning returns 40 HDX partitions, the query launches 20 Spark tasks, each processing 2 HDX partitions. Left unset, 40 Spark tasks would be launched, each processing 1 HDX partition. |
To set these parameters, expand the Advanced Options heading, open the Spark tab, and enter the key/value pairs into the Spark config. Each key should be separated from its value by a single space, as in the following example (replace the placeholder values with your Hydrolix cluster's own):
spark.sql.catalog.hydrolix.jdbc_url jdbc:clickhouse:https://{myhost}.hydrolix.live:8088?ssl=true
spark.sql.catalog.hydrolix.api_url https://{myhost}.hydrolix.live/config/v1/
spark.sql.catalog.hydrolix io.hydrolix.connectors.spark.SparkTableCatalog
spark.sql.catalog.hydrolix.username {{secrets/path/to/username}}
spark.sql.catalog.hydrolix.password {{secrets/path/to/password}}
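Alternatively, if you set cluster_url, the JDBC and API URLs are derived for you, and adding spark.sql.extensions enables summary table support. In that case the Spark config might instead look like the following sketch; the hostname and secret paths are placeholders for your own values:
spark.sql.catalog.hydrolix io.hydrolix.connectors.spark.SparkTableCatalog
spark.sql.extensions io.hydrolix.connectors.spark.SummaryUdfExtension
spark.sql.catalog.hydrolix.cluster_url https://{myhost}.hydrolix.live
spark.sql.catalog.hydrolix.username {{secrets/path/to/username}}
spark.sql.catalog.hydrolix.password {{secrets/path/to/password}}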
Spark configuration is dynamically reloaded for every query.
The Spark Connector reloads its configuration before every query. To change a setting from spark-shell or a Jupyter notebook, use spark.conf.set("some_setting", "value"), as in the example below.
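For example, the following notebook line (a sketch; the value 2 is arbitrary) changes the number of HDX partitions read per Spark task for subsequent queries:
# Change a connector setting at runtime; it takes effect on the next query
spark.conf.set("spark.sql.catalog.hydrolix.hdx_partitions_per_task", "2")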
Verify user permissions.
Make sure the user supplied in the Spark properties has the required permissions on the underlying Hydrolix cluster.
Set the JNAME Environment Variable
Enable JDK 11 by setting the environment variable JNAME=zulu11-ca-amd64 for the cluster (under Advanced Options). Other JVM implementations may work with the Spark Connector as long as they're Java 11 or later.
Set the C++ Init Script
In the Init Scripts tab under Advanced options:
- Set Source: Workspace.
- Select the file icon, navigate to the C++ init script file you uploaded in an earlier step, and select Add.
- The install_libcxx.sh file should now be visible in the Init Scripts tab.
Click the "Create Compute" button to create your Spark cluster.
Upload and Install the Hydrolix Spark Connector
You can obtain the Spark Connector JAR here.
In your Spark Cluster's UI:
- Navigate to the Libraries tab and select Install new.
- Select the DBFS and JAR options.
- Select Drop JAR here.
- In the popup local file manager window, locate the Spark Connector JAR file you downloaded from S3, select it, then press Open.
The file should begin uploading. Wait for the progress bar to complete, the text "File upload is in progress" to disappear, and a green check mark to appear before proceeding.
- Select Install.
Once installation is complete, restart your cluster. You can now start analyzing your Hydrolix data in Spark.
(Google Cloud Platform only) Set Hydrolix Cluster Storage Bucket Credential
If you have set up your Hydrolix cluster with a default GCP storage bucket and you would like to query that bucket, follow the GCP setup instructions to configure a credential for your storage bucket before proceeding with querying.
Querying
After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.
To begin using the connector in a Spark notebook, run one of the following commands, depending on the language you're using:
- Python or Scala fragment:
sql("use hydrolix")
- SQL fragment:
use hydrolix;
Alternatively, you can prefix each table you want to query from your Hydrolix back end with the hydrolix catalog name, for example hydrolix.hydro.logs. A short notebook example follows.
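The following Python sketch shows both approaches side by side (hydro.logs is the example table used later on this page):
# Option 1: switch to the hydrolix catalog, then use project.table names
spark.sql("use hydrolix")
spark.sql("SELECT count(*) FROM hydro.logs").show()

# Option 2: fully qualify the table with the hydrolix catalog prefix
spark.sql("SELECT count(*) FROM hydrolix.hydro.logs").show()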
Summary table query instructions
Summary table queries have unique requirements.
- Wrap summary aliases in SELECT statements with hdxAgg(). For example, you might run the following to query both an aliased column (summary_column_alias) and a non-summary column (non_summary_column):
SELECT hdxAgg('summary_column_alias'), non_summary_column
FROM hydrolix.project.summary_table
GROUP BY <time_field_alias>
You can read more about creating and querying summary tables in the Summary Tables documentation.
The hdxAgg() function resolves name conflicts between summary table alias columns and Spark function names. For example, count may be a column name in a summary table, but count() is also a pre-existing Spark function. Using the hdxAgg() function disambiguates the two.
The hdxAgg() function also resolves name conflicts between multiple summary tables. For example, one summary table might have count(timestamp) AS total in its definition while another might have count(distinct username) AS total in its definition. The hdxAgg() function disambiguates total() and ensures it works when querying either table.
- Load the Hydrolix datasource. Doing so registers the required hdxAgg() function. The options for loading the Hydrolix datasource are (a combined example is sketched after this list):
  - Run use hydrolix
  - Run any query against a non-summary table
  - Run io.hydrolix.connectors.spark.HdxUdfRegistry.enableSummaryTables(spark)
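For example, a minimal PySpark sketch that loads the datasource and then queries a summary alias; the project, table, and alias names here are illustrative, not part of your cluster:
# Loading the Hydrolix datasource registers the hdxAgg() function
spark.sql("use hydrolix")

# Wrap the summary alias in hdxAgg() and group by a non-summary column
spark.sql("""
    SELECT hdxAgg('totalSalary'), department
    FROM hydrolix.my_project.my_summary_table
    GROUP BY department
""").show()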
(v1.0.0 only) Summary table query instructions
The v1.0.0 release of the Hydrolix Spark Connector has unique instructions for querying summary tables.
- To enable querying the summary table you created during the Create a Summary Table step, run the following line in a Spark shell or in a PySpark session:
io.hydrolix.connectors.spark.HdxUdfRegistry.registerSummaryTable(spark, "{project.summary_table_name}")
- Reference any summary table column aliases using the syntax summary_column_alias(). For example, suppose there is a summary table called my_project.my_summary_table which includes the following in its definition:
SELECT ...
sum(salary) AS totalSalary,
department,
...
This table can then be queried with:
SELECT totalSalary(), department FROM hydrolix.my_project.my_summary_table GROUP BY department
Example queries
The following examples query the hydro.logs table directly rather than via a summary table.
import time

start = time.time()
df = spark.sql("SELECT app FROM hydrolix.hydro.logs WHERE app in ('query-peer', 'query-head', 'intake-head', 'intake-peer') AND timestamp >= '2023-12-17 12:00:00' AND timestamp < '2024-02-29 12:00:00'")
df.show()
end = time.time()
print(f"HDX select app query took {end - start} seconds")
start = time.time()
df = spark.sql("SELECT COUNT(*) FROM hydrolix.hydro.logs WHERE timestamp < '2024-10-18 00:00:00'")
df.show()
end = time.time()
print(f"HDX count query took {end - start} seconds")
%sql
use hydrolix;
SELECT
DISTINCT(`kubernetes.container_name`),
`min_timestamp`
FROM
hydro.logs
ORDER BY `min_timestamp` DESC
LIMIT 100
%scala
import org.apache.spark.sql.functions.col
sql("use hydrolix");
val logs = spark.sqlContext.table("hydro.logs")
val recent = logs.filter(col("timestamp") > "2023-06-01T00:00:00.000Z")
recent.count()
Troubleshooting
Authentication Error
If you see "access denied" errors from the Hydrolix database when you are making queries, ensure the cluster username and password are correct, and make sure that user has query permissions.
User Permissions
Partitions in a table might be distributed across different storage buckets.
If the user set in your Spark Connector configuration doesn't have the required permissions for querying all the storage buckets via the ClickHouse JDBC Driver, the cluster will be unable to sign partitions from a storage bucket for the table. This will result in the query failing and returning an error response.
Error querying summary tables
If a user sees this error:
org.apache.spark.sql.AnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `hdxAgg` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`]
The fix is to load the Hydrolix datasource. You can do so in any one of the following ways:
- Run io.hydrolix.connectors.spark.HdxUdfRegistry.enableSummaryTables(spark). This performs a global setup step which creates the hdxAgg function.
- Run the use hydrolix fragment. This loads the Hydrolix datasource.
- Run any query against a non-summary table. This loads the Hydrolix datasource.
Limitations
Read-only
The Hydrolix Spark Connector is read-only. ALTER and INSERT statements aren't supported.