via Apache Spark Connector

Analyze your Hydrolix data using Apache Spark

Hydrolix can be used as a database alternative for Spark and Databricks using the open-source Hydrolix Spark Connector. The Hydrolix Spark Connector is tested and supported on Databricks Cloud. Other cloud deployment options may work without code changes, but are not officially supported.


Build the Hydrolix Spark Connector

Prerequisites: git, sbt, and a JDK installed on the build machine.

Run the following commands to download and build the Hydrolix Spark Connector:

git clone https://github.com/hydrolix/spark-connector.git hydrolix-spark-connector

cd hydrolix-spark-connector

sbt -J-Xmx4G assembly

This produces a JAR file named similarly to hydrolix-spark-connector-assembly_2.12-1.6.0-SNAPSHOT.jar in the target/scala-2.12 directory. This file is specified later as one of the inputs when starting the Spark shell.

Prepare the Hydrolix Spark Connector Parameters

The Hydrolix Spark Connector requires the following configuration parameters, which can be specified in the Databricks UI when creating a cluster. Replace the '...' prefix in the option names below with spark.sql.catalog.hydrolix.

| Option name | Value | Description |
| --- | --- | --- |
| ... | io.hydrolix.connectors.spark.SparkTableCatalog | Fully-qualified name of the Scala class to instantiate when the hydrolix catalog is selected |
| ...api_url | https://<hdx-cluster>/config/v1/ | API URL of your Hydrolix cluster; ends with "/config/v1/", including the trailing slash |
| ...jdbc_url | jdbc:clickhouse://<hdx-cluster>:8088/_local?ssl=true | JDBC URL of your Hydrolix cluster |
| ...username | <user name> | Hydrolix cluster username |
| ...password | <password> | Hydrolix cluster password |
| ...cloud_cred_1 | <Google base64 key or AWS access key ID> | First cloud credential. For AWS, this is an AWS access key ID. For Google Cloud, this is the contents of the Google Cloud service account key file, compressed with gzip and then encoded as base64. |
| ...cloud_cred_2 | <AWS secret> | Second cloud credential. Only needed for AWS, not Google Cloud. |
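Fully expanded, the cluster's Spark config might look like the following sketch for an AWS deployment (the hostname, username, and credential values here are placeholders, not real values):

```
spark.sql.catalog.hydrolix io.hydrolix.connectors.spark.SparkTableCatalog
spark.sql.catalog.hydrolix.api_url https://my-hdx-cluster.example.com/config/v1/
spark.sql.catalog.hydrolix.jdbc_url jdbc:clickhouse://my-hdx-cluster.example.com:8088/_local?ssl=true
spark.sql.catalog.hydrolix.username alice@example.com
spark.sql.catalog.hydrolix.password example-password
spark.sql.catalog.hydrolix.cloud_cred_1 AKIAXXXXXXXXXXXXXXXX
spark.sql.catalog.hydrolix.cloud_cred_2 example-secret-key
```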

Deploy on Databricks

Create or choose a Databricks workspace for the deployment.

Configure the Spark Cluster

Create a Spark cluster in your Databricks workspace with the following configuration:

  • Policy: Unrestricted
  • Access Mode: No Isolation Shared
  • Databricks Runtime Version: Version 13 or later

The next image shows how the configuration should look. Your UI may differ slightly as Databricks updates its interface.

Specify policy, Access mode, and Databricks Runtime Version when creating a cluster.

Set additional configuration parameters. In the Advanced Options section, open the Spark tab as shown in the following image.

Specify Spark config and environment variables in the Advanced Options section.

Set the following Spark configuration parameters:

  • spark.sql.catalog.hydrolix
  • spark.sql.catalog.hydrolix.api_url
  • spark.sql.catalog.hydrolix.jdbc_url
  • spark.sql.catalog.hydrolix.username
  • spark.sql.catalog.hydrolix.password
  • spark.sql.catalog.hydrolix.cloud_cred_1
  • spark.sql.catalog.hydrolix.cloud_cred_2 (only for AWS)

Set the Hydrolix Spark Connector's configuration parameters in the Advanced Options section as whitespace-delimited name-value pairs, or configure them in a notebook using spark.conf.set(<key>, <value>), which lets you use Databricks Secrets.
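As a sketch of the notebook approach, a small helper (hypothetical, not part of the connector) can expand the shorthand option names into their fully-prefixed form before passing them to spark.conf.set:

```python
CATALOG_PREFIX = "spark.sql.catalog.hydrolix"

def hydrolix_conf(api_url: str, jdbc_url: str, username: str,
                  password: str, cloud_cred_1: str,
                  cloud_cred_2: str = None) -> dict:
    """Expand the connector's option names into fully-prefixed Spark conf keys."""
    conf = {
        CATALOG_PREFIX: "io.hydrolix.connectors.spark.SparkTableCatalog",
        f"{CATALOG_PREFIX}.api_url": api_url,
        f"{CATALOG_PREFIX}.jdbc_url": jdbc_url,
        f"{CATALOG_PREFIX}.username": username,
        f"{CATALOG_PREFIX}.password": password,
        f"{CATALOG_PREFIX}.cloud_cred_1": cloud_cred_1,
    }
    if cloud_cred_2 is not None:  # second credential is only needed for AWS
        conf[f"{CATALOG_PREFIX}.cloud_cred_2"] = cloud_cred_2
    return conf

# In a Databricks notebook, where spark is the active session:
# for key, value in hydrolix_conf(...).items():
#     spark.conf.set(key, value)
```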

Credentials With Google Cloud Platform

If you’re using Google Cloud with Databricks, set the spark.sql.catalog.hydrolix.cloud_cred_1 parameter to the base64-encoded, gzipped key value. If you don't have this key, create and download one from the Google Cloud Console: navigate to IAM & Admin -> Service Accounts, click your service account, then open the Keys tab.

Store this file somewhere secure, then create the gzipped and base64-encoded version. For example, with a key file called <keyfilename>.json, use this command on a Linux or macOS machine:

cat <keyfilename>.json | gzip | base64

Next, include the entire resulting string as one line in the spark.sql.catalog.hydrolix.cloud_cred_1 parameter:

spark.sql.catalog.hydrolix.cloud_cred_1 <gcpKeyBase64>

You do not need to specify spark.sql.catalog.hydrolix.cloud_cred_2 with Google Cloud Storage.
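The shell pipeline above can also be expressed in Python, which may be convenient when preparing the credential inside a notebook. This is a minimal sketch; the function name and file path are illustrative:

```python
import base64
import gzip

def encode_gcp_key(key_json: bytes) -> str:
    """Gzip, then base64-encode, a GCP service account key as cloud_cred_1 expects."""
    return base64.b64encode(gzip.compress(key_json)).decode("ascii")

# In a notebook, read the downloaded key file and set the connector option:
# with open("<keyfilename>.json", "rb") as f:
#     spark.conf.set("spark.sql.catalog.hydrolix.cloud_cred_1",
#                    encode_gcp_key(f.read()))
```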

Credentials With AWS

When using Databricks in AWS, set your cloud credentials to the following:

spark.sql.catalog.hydrolix.cloud_cred_1 <AWS_ACCESS_KEY_ID>
spark.sql.catalog.hydrolix.cloud_cred_2 <AWS_SECRET_KEY>

Set the JNAME Environment Variable

Enable the use of JDK11 by setting the JNAME environment variable to zulu11-ca-amd64, as shown in the following image.

Example Spark configuration settings in the Databricks UI.

Click the "Create Compute" button to save your work.

Upload and Install the Hydrolix Spark Connector

In the Libraries tab for the Spark cluster, select Install new as shown in the next image.

Libraries tab shows Install new button.

Make sure the DBFS and JAR options are selected as shown in the following image.

In your local file manager, navigate to the target/scala-2.12 subfolder of the hydrolix-spark-connector source tree, and drag hydrolix-spark-connector-assembly_2.12-1.6.0-SNAPSHOT.jar (or a similarly-named JAR file) into the Drop JAR here window shown in the previous image.

Don’t select anything while it’s uploading or you’ll have to upload the file again. After the upload is finished, you’ll see a green checkmark next to the file as shown in the next image.

Once the upload is finished, select Install and wait a few minutes while the cluster restarts. You can now start analyzing your Hydrolix data in Spark!

Use the Hydrolix Spark Connector With a Spark Notebook

After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.

To begin using the connector with a Spark notebook, use one of the following, depending on the cell type:

  • SQL cell: start the cell with %sql.
  • Python or Scala cell: run sql("use hydrolix").

Then use your Spark notebook to run a SQL query or use the DataFrame API. For example, in SQL:

%sql
SELECT
  user_agent,
  count(*) AS count
FROM hydrolix.hydrolix_project.hydrolix_table
WHERE
  my_timestamp_column BETWEEN '2023-05-10T00:00:00.000Z' AND '2023-05-10T00:00:01.000Z'
GROUP BY
  user_agent
ORDER BY
  count DESC

The equivalent query with the DataFrame API in Python:

from pyspark.sql.functions import col, count, desc

my_table = spark.table("hydrolix_project.hydrolix_table")
ts = col("my_timestamp_column")

sample_df = my_table.filter(ts.between('2023-05-10T00:00:00.000Z', '2023-05-10T00:00:01.000Z')) \
    .groupBy(my_table.user_agent) \
    .agg(count("*").alias("count")) \
    .orderBy(desc("count")) \
    .cache()

Note: Credentials

If you don't want to put your credentials in your Databricks configuration, you can set all configuration options at runtime instead. When running the connector in a Databricks workspace, you can retrieve and set credentials from the Databricks secret manager. For example, in Python:

spark.conf.set("spark.sql.catalog.hydrolix.password", dbutils.secrets.get("my-scope", "hdx-password"))
spark.conf.set("spark.sql.catalog.hydrolix.cloud_cred_1", dbutils.secrets.get("my-scope", "aws_access_key_id"))
spark.conf.set("spark.sql.catalog.hydrolix.cloud_cred_2", dbutils.secrets.get("my-scope", "aws_secret_access_key"))
...
sql("use hydrolix")

Troubleshooting

Timeout and access errors

  • If you see timeout errors from the Hydrolix database when making queries, make sure the IP address of your Spark cluster is listed in the ip_allowlist in your hydrolixcluster.yaml.
  • If you see "access denied" errors from the Hydrolix database when making queries, ensure the username and password are correct, and make sure that user has query permissions.
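To help tell the two failure modes apart, a quick reachability check against the cluster's config API can narrow things down: a timeout or connection error points at networking (for example, the ip_allowlist), while a 401/403 response points at credentials or permissions. The function name here is illustrative, not part of the connector:

```python
import urllib.error
import urllib.request

def check_api(api_url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET against the given URL.

    Raises on timeouts and connection errors (networking problem);
    returns 401/403 when the endpoint is reachable but rejects the request.
    """
    try:
        with urllib.request.urlopen(api_url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code

# Example (hostname is a placeholder):
# status = check_api("https://my-hdx-cluster.example.com/config/v1/")
```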