via Apache Spark Connector
Analyze your Hydrolix data using Apache Spark
Hydrolix can be used as a database alternative for Spark and Databricks using the open-source Hydrolix Spark Connector. The Hydrolix Spark Connector is tested and supported on Databricks Cloud. Other cloud deployment options may work without code changes, but they are not officially supported.
Table of Contents
- Build the Hydrolix Spark Connector
- Prepare the Hydrolix Spark Connector Parameters
- Deploy on Databricks
- Upload and Install the Hydrolix Spark Connector
- Use the Hydrolix Spark Connector With a Spark Notebook
- Note: Credentials
- Troubleshooting
Build the Hydrolix Spark Connector
Prerequisites:
- Java Development Kit (JDK) 11 or 17
- SBT
Run the following commands to download and build the Hydrolix Spark Connector:
git clone [email protected]:hydrolix/spark-connector.git hydrolix-spark-connector
cd hydrolix-spark-connector
sbt -J-Xmx4G assembly
This produces a JAR file named similarly to hydrolix-spark-connector-assembly_2.12-1.6.0-SNAPSHOT.jar in the target/scala-2.12 directory. This file is later specified as one of the inputs when starting the Spark shell.
Prepare the Hydrolix Spark Connector Parameters
The Hydrolix Spark Connector requires the following configuration parameters. These parameters can be specified in the Databricks UI when creating a cluster. The ‘...’ prefix in these option names should be replaced with spark.sql.catalog.hydrolix.
Option name | Value | Description |
---|---|---|
... | io.hydrolix.connectors.spark.SparkTableCatalog | Fully-qualified name of the Scala class to instantiate when the hydrolix catalog is selected |
...api_url | https://<hdx-cluster>/config/v1/ | API URL of your Hydrolix cluster, ends with "/config/v1/" including the trailing slash |
...jdbc_url | jdbc:clickhouse://<hdx-cluster>:8088/_local?ssl=true | JDBC URL of your Hydrolix cluster |
...username | <user name> | Hydrolix cluster username |
...password | <password> | Hydrolix cluster password |
...cloud_cred_1 | <Google base64 key or AWS access key ID> | First cloud credential. For AWS, this is an AWS access key ID. For Google Cloud, this is the contents of the Google Cloud service account key file, compressed with gzip and then encoded as base64. |
...cloud_cred_2 | <AWS secret> | This is only needed for AWS, not Google Cloud. |
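As a worked example, a complete set of these options as they might appear in a Databricks Spark config looks like the following. The hostname, username, password, and credential values here are placeholders, not real values:

```
spark.sql.catalog.hydrolix io.hydrolix.connectors.spark.SparkTableCatalog
spark.sql.catalog.hydrolix.api_url https://my-hdx-cluster.example.com/config/v1/
spark.sql.catalog.hydrolix.jdbc_url jdbc:clickhouse://my-hdx-cluster.example.com:8088/_local?ssl=true
spark.sql.catalog.hydrolix.username [email protected]
spark.sql.catalog.hydrolix.password my-password
spark.sql.catalog.hydrolix.cloud_cred_1 AKIAXXXXXXXXXXXXXXXX
spark.sql.catalog.hydrolix.cloud_cred_2 my-aws-secret-key
```

The cloud_cred_2 line applies only to AWS deployments, as described in the table above.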
Deploy on Databricks
Create or choose a Databricks workspace for the deployment.
Configure the Spark Cluster
Create a Spark cluster in your Databricks workspace with the following configuration:
- Policy: Unrestricted
- Access Mode: No Isolation Shared
- Databricks Runtime Version: 13 or later
The next image shows how the configuration should look. Your view may differ slightly as Databricks updates its UI.
Set additional configuration parameters. In the Advanced Options section, open the Spark tab as shown in the following image.
Set the following Spark configuration parameters:
spark.sql.catalog.hydrolix
spark.sql.catalog.hydrolix.api_url
spark.sql.catalog.hydrolix.jdbc_url
spark.sql.catalog.hydrolix.username
spark.sql.catalog.hydrolix.password
spark.sql.catalog.hydrolix.cloud_cred_1
spark.sql.catalog.hydrolix.cloud_cred_2 (only for AWS)
Set the Hydrolix Spark Connector’s configuration parameters in the Advanced Options section as name-value pairs delimited by whitespace, or configure them in a notebook using spark.conf.set(<key>, <value>), which allows you to use Databricks Secrets.
Credentials With Google Cloud Platform
If you’re using Google Cloud with Databricks, set the spark.sql.catalog.hydrolix.cloud_cred_1 parameter to the base64-encoded gzipped key value. If you don't have this key, create one and download it from the Google Cloud Console: navigate to IAM & Admin -> Service Accounts, click your service account, then open the Keys tab.
Store this file somewhere secure, then create the gzipped and base64-encoded version. For example, with a key file named <keyfilename>.json, use this command on a Linux or macOS machine:
cat <keyfilename>.json | gzip | base64
Next, include the entire resulting string as one line in the spark.sql.catalog.hydrolix.cloud_cred_1 parameter:
spark.sql.catalog.hydrolix.cloud_cred_1 <gcpKeyBase64>
You do not need to specify spark.sql.catalog.hydrolix.cloud_cred_2 with Google Cloud Storage.
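As an alternative to the shell pipeline above, the same encoding can be produced with a short Python sketch. This is a minimal example, not part of the connector itself, and the key file path is hypothetical:

```python
import base64
import gzip


def encode_gcp_key(keyfile_path: str) -> str:
    """Gzip a GCP service account key file and base64-encode the result,
    producing the single-line value expected by ...cloud_cred_1."""
    with open(keyfile_path, "rb") as f:
        compressed = gzip.compress(f.read())
    return base64.b64encode(compressed).decode("ascii")


# Hypothetical usage in a Databricks notebook:
# cred = encode_gcp_key("my-service-account-key.json")
# spark.conf.set("spark.sql.catalog.hydrolix.cloud_cred_1", cred)
```

Because base64 output contains no newlines here, the returned string can be pasted directly as the parameter value.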
Credentials With AWS
When using Databricks in AWS, set your cloud credentials to the following:
spark.sql.catalog.hydrolix.cloud_cred_1 <AWS_ACCESS_KEY_ID>
spark.sql.catalog.hydrolix.cloud_cred_2 <AWS_SECRET_KEY>
Set the JNAME Environment Variable
Enable the use of JDK 11 by setting the JNAME environment variable to zulu11-ca-amd64, as shown in the following image.
Click the "Create Compute" button to save your work.
Upload and Install the Hydrolix Spark Connector
In the Libraries tab for the Spark cluster, select Install new as shown in the next image.
Make sure the Upload and JAR options are selected as shown in the following image.
In your local file manager, navigate to the target/scala-2.12 subfolder of the hydrolix-spark-connector source tree, and move hydrolix-spark-connector-assembly_2.12-1.6.0-SNAPSHOT.jar (or a similarly named JAR file) into the Drop JAR here window shown in the previous image.
Don’t select anything while it’s uploading or you’ll have to upload the file again. After the upload is finished, you’ll see a green checkmark next to the file as shown in the next image.
Once the upload is finished, select Install and wait a few minutes while the cluster restarts. You can now start analyzing your Hydrolix data in Spark!
Use the Hydrolix Spark Connector With a Spark Notebook
After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.
To begin using the connector with a Spark notebook, run one of the following commands, depending on your language:
- SQL: use hydrolix
- Python or Scala: sql("use hydrolix")
Then use your Spark notebook to run a SQL query or use the DataFrame API:
SELECT
user_agent,
count(*) AS count
FROM hydrolix_project.hydrolix_table
WHERE
my_timestamp_column BETWEEN '2023-05-10T00:00:00.000Z' AND '2023-05-10T00:00:01.000Z'
GROUP BY
user_agent
ORDER BY
count DESC
from pyspark.sql.functions import col, desc, count

# Reference the Hydrolix table through the active Spark session
my_table = spark.table("hydrolix_project.hydrolix_table")
ts = col("my_timestamp_column")
sample_df = my_table.filter(ts.between('2023-05-10T00:00:00.000Z', '2023-05-10T00:00:01.000Z')) \
    .groupBy(my_table.user_agent) \
    .agg(count("*").alias("count")) \
    .orderBy(desc("count")) \
    .cache()
Note: Credentials
If you don't want to put your credentials in your Databricks configuration, you can set all configuration options at runtime instead. When running the connector in a Databricks workspace, you can retrieve and set credentials from the Databricks secret manager. For example, in Python:
spark.conf.set("spark.sql.catalog.hydrolix.password", dbutils.secrets.get("my-scope", "hdx-password"))
spark.conf.set("spark.sql.catalog.hydrolix.cloud_cred_1", dbutils.secrets.get("my-scope", "aws_access_key_id"))
spark.conf.set("spark.sql.catalog.hydrolix.cloud_cred_2", dbutils.secrets.get("my-scope", "aws_secret_access_key"))
...
sql("use hydrolix")
Troubleshooting
Timeout errors
- If you see timeout errors from the Hydrolix database when making queries, make sure the IP address of your Spark cluster is listed in the ip_allowlist in your hydrolixcluster.yaml.
Access denied errors
- If you see "access denied" errors from the Hydrolix database when making queries, ensure the username and password are correct, and make sure that user has query permissions.