Apache Spark Connector

Analyze your Hydrolix data using Apache Spark and Databricks

Hydrolix Spark Connector

Overview

The Hydrolix Spark Connector combines the cost and query efficiency of the Hydrolix platform with the rich data analysis and distributed computing power of Apache Spark. By integrating your Apache Spark ecosystem with Hydrolix, the connector gives you the cost and performance benefits of Hydrolix as the backing store while presenting the data in the notebooks and coding environments you are already familiar with.

The Apache Spark Connector JAR can be downloaded here.

Requirements

Dependencies

  • Hydrolix Cluster: A running Hydrolix cluster version 4.22.1 or higher. Deployment instructions for your preferred cloud vendor (AWS, Google, Linode, or Microsoft) can be found here.
  • Databricks Spark Cluster: A Databricks account for deploying your Spark cluster.

Runtime Requirements

  • Java Runtime: Must use Java 11 or later

Versioning

The Spark Connector version is of the format:

{Spark Connector version: major.minor.patch}-{HDX version: major.minor.patch}

For example:

v1.0.0-v4.22.1.jar

The Spark Connector is compatible with its corresponding Hydrolix version and all more recent versions. The following matrix lists each Spark Connector version and the Hydrolix cluster versions it supports:

Spark Connector Version    Hydrolix Version
v1.0.0                     v4.22.1+

Upload the C++ Init Script

When you configure a Spark cluster with the Spark Connector, you will also need to install a C++ initialization script. This script sets some environment variables and installs required C++ dependencies. Before setting up the cluster, upload the init script to your Databricks workspace.

You can obtain the C++ Init Script here. Download the contents into a file called install_libcxx.sh, then navigate to your Databricks workspace and do the following:

  • Choose or create a directory in which to store the init script
  • Select the kebab menu at the top right, then select Import
  • Select Browse
  • Find the init script file and select Open
  • Select Import
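Alternatively, if you prefer to script the upload, the following is a minimal sketch using the Databricks Python SDK. The target workspace path is hypothetical, and the sketch assumes the databricks-sdk package is installed and authenticated against your workspace; adjust both to your environment.

import base64

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat

# Sketch: upload install_libcxx.sh to the Databricks workspace.
# The target path below is hypothetical; substitute the directory you chose above.
w = WorkspaceClient()  # reads host and token from your Databricks config or environment

with open("install_libcxx.sh", "rb") as f:
    w.workspace.import_(
        path="/Users/you@example.com/init-scripts/install_libcxx.sh",
        content=base64.b64encode(f.read()).decode("utf-8"),
        format=ImportFormat.AUTO,
        overwrite=True,
    )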

The imported install_libcxx.sh file should now be accessible in your workspace directory.

Create a Summary Table of the Data You Will Be Querying

Create a summary table, including a transform, for the data you will be querying. A summary table pre-aggregates the data, reducing query time. Instructions for creating a summary table via the Hydrolix UI and API are on the Summary Tables page.

The general structure of summary transforms and their limitations are explained in this section on creating summary transforms.

Configure the Spark Cluster

Create a Spark cluster in your Databricks workspace with the following configuration:

  • Policy: Unrestricted
  • Access Mode: No Isolation Shared
  • Databricks Runtime Version: 14.3 LTS (Scala 2.12, Spark 3.5.0) or later

In the Databricks UI, the above configuration looks like this:

Configure the remaining settings under Performance, Instance Profile, and Tags to your preference, then continue with the configuration steps below before selecting Create compute.

Set the Hydrolix Spark Connector Parameters

The Hydrolix Spark Connector requires the following configuration parameters. These parameters can be specified in the Databricks UI when creating a cluster. Your Hydrolix cluster username and password, which are included in these parameters, should be stored in the Databricks Secrets Manager.

  • spark.sql.catalog.hydrolix.jdbc_url
    Value: jdbc:clickhouse:https://{hdx-cluster}.hydrolix.dev:8088?ssl=true
    Description: ClickHouse JDBC URL of your Hydrolix cluster
  • spark.sql.catalog.hydrolix.api_url
    Value: https://{hdx-cluster}.hydrolix.dev/config/v1/
    Description: API URL of your Hydrolix cluster; ends with /config/v1/
  • spark.sql.catalog.hydrolix
    Value: io.hydrolix.connectors.spark.SparkTableCatalog
    Description: Fully qualified name of the Scala class to instantiate when the hydrolix catalog is selected
  • spark.sql.catalog.hydrolix.username
    Value: {{secrets/path/to/username}}
    Description: Hydrolix cluster username
  • spark.sql.catalog.hydrolix.password
    Value: {{secrets/path/to/password}}
    Description: Hydrolix cluster password
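The username and password values reference Databricks secrets rather than plaintext credentials. If you have not created those secrets yet, the following is a minimal sketch using the Databricks Python SDK; the scope name hydrolix and the key names username and password are illustrative and should match the secrets/... paths you reference in the Spark config.

from databricks.sdk import WorkspaceClient

# Sketch: store the Hydrolix credentials as Databricks secrets.
# The scope and key names are illustrative; reuse whatever names you choose
# in the secrets/... references in the Spark config below.
w = WorkspaceClient()

w.secrets.create_scope(scope="hydrolix")
w.secrets.put_secret(scope="hydrolix", key="username", string_value="<your Hydrolix username>")
w.secrets.put_secret(scope="hydrolix", key="password", string_value="<your Hydrolix password>")

With that scope, the corresponding configuration references would be {{secrets/hydrolix/username}} and {{secrets/hydrolix/password}}.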

To set these parameters, expand the Advanced Options heading, open the Spark tab, and enter the key/value pairs into the Spark config. Each key should be separated from its value by a single space, as in the following (replace the placeholder values with your Hydrolix cluster's values):

spark.sql.catalog.hydrolix.jdbc_url jdbc:clickhouse:https://{hdx-cluster}.hydrolix.dev:8088?ssl=true
spark.sql.catalog.hydrolix.api_url https://{hdx-cluster}.hydrolix.dev/config/v1/
spark.sql.catalog.hydrolix io.hydrolix.connectors.spark.SparkTableCatalog
spark.sql.catalog.hydrolix.username {{secrets/path/to/username}}
spark.sql.catalog.hydrolix.password {{secrets/path/to/password}}

In the Databricks UI, the above configuration looks like this:


🚧

Required User Permissions

Querying your Hydrolix bucket via Spark requires the same permissions as querying via your cluster (seen here under SQL), plus the additional permission catalog_urls_table scoped to the table being queried. The latter permission ensures the supplied user can use the catalog_urls endpoint, which signs the partitions queried in a table.

Set the JNAME Environment Variable

Enable JDK 11 by setting the JNAME environment variable to zulu11-ca-amd64, as shown in the following image. In the Databricks UI, enter JNAME=zulu11-ca-amd64 in the Environment variables field under the Advanced options > Spark tab. Other JVM implementations may work with the Spark Connector as long as they provide Java 11 or later.

Set the C++ Init Script

In the Init Scripts tab under Advanced options:

  • Set Source to Workspace
  • Select the file icon, navigate to the C++ init script file you uploaded in an earlier step, and select Add

The install_libcxx.sh file should now be visible in the Init Scripts tab, as seen here:

Click the "Create Compute" button to create your Spark cluster.

Upload and Install the Hydrolix Spark Connector

You can obtain the Spark Connector JAR here.

In your Spark Cluster's UI, navigate to the Libraries tab and select Install new. Select the DBFS and JAR options as shown in the following image:

Select Drop JAR here. In the local file manager window that opens, locate the Spark Connector JAR file you downloaded earlier, select it, then select Open.

The file should begin uploading. Wait for the progress bar to complete, the "File upload is in progress" text to disappear, and a green check mark to appear before proceeding.

Select Install. Once installation is complete, restart your cluster. You can now start analyzing your Hydrolix data in Spark.
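As a quick check that the connector and the hydrolix catalog are wired up, you can run something like the following in a notebook cell. This sketch assumes a project named hydro, which is also used in the query examples later in this document; substitute one of your own projects.

# Sketch: confirm the hydrolix catalog is available and can list tables.
# "hydro" is the project used in the examples below; substitute your own project name.
spark.sql("use hydrolix")
spark.sql("show tables in hydro").show()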

(Google Cloud Platform only) Set Hydrolix Cluster Storage Bucket Credential

If you have set up your Hydrolix cluster with a default GCP storage bucket and you would like to query your default bucket, follow the GCP setup instructions to configure a credential with your storage bucket before proceeding with querying.

Querying

After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.

To begin using the connector in a Spark notebook, run one of the following commands, depending on the language of the notebook cell:

  • Python or Scala cell: sql("use hydrolix")
  • SQL cell: use hydrolix;

Alternatively, you can prefix each table you want to query from your Hydrolix backend with hydrolix..
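For example, both of the following forms query the same table (hydro.logs, which also appears in the examples below):

# Option 1: select the hydrolix catalog once, then refer to tables as project.table.
spark.sql("use hydrolix")
spark.sql("SELECT count(*) FROM hydro.logs").show()

# Option 2: fully qualify the table with the hydrolix catalog prefix.
spark.sql("SELECT count(*) FROM hydrolix.hydro.logs").show()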

If you will be querying the summary table you created during the Create a Summary Table step above, you must first register it. To do this, run the following line in a Spark shell or in a PySpark session:

io.hydrolix.connectors.spark.HdxUdfRegistry.registerSummaryTable(spark, "{project.summary_table_name}")

The following examples query the hydro.logs table directly rather than via a summary table.

import time

# Time a filtered SELECT against the hydro.logs table.
start = time.time()
df = spark.sql("SELECT app FROM hydrolix.hydro.logs WHERE app in ('query-peer', 'query-head', 'intake-head', 'intake-peer') AND timestamp >= '2023-12-17 12:00:00' AND timestamp < '2024-02-29 12:00:00'")
df.show()
end = time.time()
print(f"HDX select app query took {end - start} seconds")

# Time a COUNT(*) against the same table.
start = time.time()
df = spark.sql("SELECT COUNT(*) FROM hydrolix.hydro.logs WHERE timestamp < '2024-10-18 00:00:00'")
df.show()
end = time.time()
print(f"HDX count query took {end - start} seconds")
%sql

use hydrolix;

SELECT
  DISTINCT(`kubernetes.container_name`),
  `min_timestamp`
FROM 
  hydro.logs
ORDER BY `min_timestamp` DESC
LIMIT 100
%scala
import org.apache.spark.sql.functions.col

sql("use hydrolix");

val logs = spark.sqlContext.table("hydro.logs")

val recent = logs.filter(col("timestamp") > "2023-06-01T00:00:00.000Z")

recent.count()

Troubleshooting

Authentication Error

If you see "access denied" errors from the Hydrolix database when you are making queries, ensure the cluster username and password are correct, and make sure that user has query permissions.

User Permissions

Partitions in a table might be distributed across different storage buckets.
If the user set in your Spark Connector configuration does not have the required permissions to query all of those storage buckets via the ClickHouse JDBC Driver (listed here under SQL), as well as the catalog_urls_table permission for the table being queried, the cluster will be unable to sign partitions from a storage bucket for that table. As a result, the query will fail and return an error response.

Limitations

Read-only

The Hydrolix Spark Connector is currently read-only. ALTER and INSERT statements are not supported.