Hydrolix Spark Connector: Microsoft Fabric Deployment

Analyze your Hydrolix data using Apache Spark and Microsoft Fabric

Overview

Microsoft Fabric is an end-to-end data platform that unifies data engineering, data science, real-time analytics, and business intelligence under one SaaS offering. With the Hydrolix Spark Connector, you can improve query speed and reduce storage costs for your Fabric-deployed Spark cluster by using Hydrolix as the backing store.

Requirements

Dependencies

  • Hydrolix Cluster: A running Hydrolix cluster version 4.22.1 or higher. Deployment instructions for your preferred cloud vendor (AWS, Google, Linode, or Microsoft) can be found here.
  • Fabric Spark Cluster: A Microsoft Fabric account with a deployed Spark cluster.

Required User Permissions

Querying your Hydrolix bucket via Spark requires the same permissions as querying via your cluster. Specifically, a user needs the following permissions at the levels indicated:

Permission name      Level
view_org             Org
view_hdxstorage      Org
catalog_urls_table   Project or table
view_table           Project or table
view_view            Project or table
view_transform       Project or table
select_sql           Project or table

If you will be querying multiple tables within the same Hydrolix project, it's easier to scope those permissions to the project level instead of granting the permissions for each table.

Setup Steps

  1. Create a Summary Table of the Data You Will Be Querying
  2. (optional) Create a new resource group in Azure
  3. Create MS Fabric capacity in Azure
  4. Create and configure a Workspace
  5. (optional) Store Hydrolix credentials in Azure vault

After completing these steps, you can start querying your data using Spark.

Create a Summary Table of the Data You Will Be Querying

Create a summary table, including a transform, for the data you will be querying. A summary table pre-aggregates the queryable data, reducing query time. Instructions for creating a summary table via the Hydrolix UI and API are on the Summary Tables page. While this step can be skipped, it is highly recommended.

The general structure of summary transforms and their limitations are explained in this section on creating summary transforms.

(optional) Create a new resource group in Azure

The primary purpose of setting up a new Resource Group is to keep related services logically organized together. You can skip this step if you've already created a Resource Group.

To create a new Resource Group in Azure using the Azure Portal, follow these steps:

  1. Log in to the Azure Portal using your credentials.
  2. Navigate to the Resource Groups section.
  3. Select Create to begin creating a new Resource Group.
  4. Enter a name for the Resource Group.
  5. Select the appropriate Subscription and Region.
  6. Select Review + create > Create to create the Resource Group.
fabric-create-resource-group.png

Create MS Fabric capacity in Azure

Capacity refers to the resource limits and reservations assigned to nodes for managing workloads in Microsoft Fabric.

To create a new MS Fabric Capacity using the Azure Portal, follow these steps:

  1. Log in to your Azure Portal.
  2. Select + Create a resource.
  3. Locate Microsoft Fabric and select Create.
  4. In the Project Details section, select the appropriate Subscription and Resource group.
  5. Provide a unique Capacity name, choose a Region, select the Size of capacity units, and designate an administrator.
  6. Select Review + Create > Create to set up the Capacity. Note that the capacity may start automatically after creation.

Create and configure a workspace

A workspace in MS Fabric is a collaborative environment where users can create, manage, and share data, reports, dashboards, and other analytics assets. The following steps walk you through creating and configuring a workspace to query Hydrolix data with the Hydrolix Spark Connector using Python notebooks.

To create a new workspace, follow these steps.

  1. Open the Capacity created in a previous step. Verify that it's running.
  2. Navigate to Microsoft Fabric.
  3. In the left-hand panel, select Workspaces > + New workspace.
  4. Fill in a name for the new workspace.
  5. Under License mode, select Fabric Capacity and choose the created capacity from the available options.
  6. Save the workspace.
  7. Open your created workspace.
  8. Select Workspace settings > Data Engineering/Science > Spark settings > Environment.
  9. Enable Set default environment then select Workspace default > New environment.
fabric-new-env.png
  10. Provide a custom name for the new environment. The new environment settings will open immediately after creation.
  11. Navigate to Custom Library and upload the Spark Connector JAR file.
fabric-jar-upload.png
  12. Navigate to Spark properties and configure the connection to your Hydrolix cluster using the following properties.
Property: spark.sql.catalog.hydrolix
Value: io.hydrolix.connectors.spark.SparkTableCatalog
Description: The fully qualified name of the class to instantiate when you ask for the hydrolix catalog.

Property: spark.sql.extensions
Value: io.hydrolix.connectors.spark.SummaryUdfExtension
Description: A comma-separated list of fully qualified SQL extension classes. Using summary tables requires including SummaryUdfExtension in this set.

Property: spark.sql.catalog.hydrolix.cluster_url
Value: https://{myhost}.hydrolix.live
Description: Hydrolix cluster URL. If this field is set, the jdbc_url and api_url fields are unnecessary and will be ignored.

Property: spark.sql.catalog.hydrolix.jdbc_protocol
Value: https
Description: Defaults to https if not provided. Used with the cluster_url and jdbc_port configs to derive the JDBC URL.

Property: spark.sql.catalog.hydrolix.jdbc_port
Value: 8088
Description: Defaults to 8088 if not provided. Used with the cluster_url and jdbc_protocol configs to derive the JDBC URL.

Property: spark.sql.catalog.hydrolix.jdbc_url
Value: jdbc:ch://{host}:{port}/{database}?ssl=true
Description: JDBC URL of the Hydrolix query head. The ClickHouse JDBC driver requires a valid database name in the URL, but the connector will read any database the user has access to. Ignored if cluster_url is provided.

Property: spark.sql.catalog.hydrolix.api_url
Value: https://{myhost}.hydrolix.live/config/v1/
Description: URL of the Hydrolix config API. Usually must end with /config/v1/, including the trailing slash. Ignored if cluster_url is provided.

Property: spark.sql.catalog.hydrolix.username
Value: {hdx-username}
Description: Username for logging in to the Hydrolix cluster.

Property: spark.sql.catalog.hydrolix.password
Value: {hdx-password}
Description: Password for logging in to the Hydrolix cluster.

Property: spark.sql.catalog.hydrolix.hdx_partitions_per_task
Value: 1
Description: Optional. Defines how many HDX partitions are read by each Spark task; the default is 1. For example, if this setting is 2 and partition planning returns 40 HDX partitions, the query launches 20 Spark tasks, each processing 2 HDX partitions. Left unset, 40 Spark tasks would be launched, each processing 1 HDX partition.

Property: spark.driver.extraJavaOptions
Value: -Dio.netty.tryReflectionSetAccessible=true
Description: Optional. Set this option to enable the MS Fabric Native execution engine for your Spark queries.

Property: spark.executor.extraJavaOptions
Value: -Dio.netty.tryReflectionSetAccessible=true
Description: Optional. Set this option to enable the MS Fabric Native execution engine for your Spark queries.
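As a quick reference, a minimal working configuration using cluster_url might look like the following (host and credentials are placeholders; the spark.sql.extensions line is only needed if you plan to query summary tables):

spark.sql.catalog.hydrolix              io.hydrolix.connectors.spark.SparkTableCatalog
spark.sql.extensions                    io.hydrolix.connectors.spark.SummaryUdfExtension
spark.sql.catalog.hydrolix.cluster_url  https://myhost.hydrolix.live
spark.sql.catalog.hydrolix.username     hdx-username
spark.sql.catalog.hydrolix.password     hdx-password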

📘

Spark configuration is dynamically reloaded for every query.

The Spark Connector reloads its configuration for every query. To change a setting inside spark-shell or a Jupyter notebook, call spark.conf.set("some_setting", "value").
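For example, in a notebook cell you might adjust the partition fan-out between queries (hdx_partitions_per_task is described in the table above):

# The connector re-reads its configuration on every query, so this change
# applies to the next query without restarting the Spark session.
spark.conf.set("spark.sql.catalog.hydrolix.hdx_partitions_per_task", "2")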

📘

Verify user permissions.

Make sure the user supplied in the Spark properties has the required permissions on the underlying Hydrolix cluster (see Required User Permissions above).

  13. Save and Publish your Spark configuration. Wait until the publishing process is complete.
  14. Return to your workspace, select the newly created environment as your default environment, and select Save.
  15. Add a new item by selecting + New item > Notebook in your workspace.
  16. Run a simple query from the Notebook to test the connection, as in the example below.
fabric-test-query.png
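For example, a minimal connection test from a PySpark notebook cell (my_project and my_table are placeholder names for your own Hydrolix project and table):

# Switch the session to the Hydrolix catalog...
spark.sql("use hydrolix")

# ...then read back a few rows from one of your tables.
spark.sql("SELECT * FROM my_project.my_table LIMIT 10").show()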

(optional) Store Hydrolix credentials in Azure vault

Azure Key Vault is a cloud service provided by Microsoft Azure that helps you securely store and manage sensitive information, such as cryptographic keys, passwords, certificates, and other secrets.

To safely store and retrieve Hydrolix credentials, set up an Azure Key Vault using the following steps.

  1. Log in to the Azure Portal.
  2. Select + Create a resource.
  3. Search for Key Vault and select Create.
  4. In the Project Details section, select the previously-created subscription and resource group from the dropdown menus.
  5. Provide a unique name for your key vault, select a Region and a Pricing tier, leave the remaining fields at their defaults, and select Review + create > Create.
fabric-create-key-vault.png
  6. Set up Role-Based Access Control (RBAC).
    1. Open the Key Vault.
    2. Navigate to Access control (IAM) in the menu, select Add, then choose Add role assignment.
    3. From the list of roles, select Key Vault Secrets Officer.
    4. Go to the Members section, select + Select Member, and add your account.
    5. Finally, select Review + assign to save the RBAC settings.
    6. To double-check that the role has been assigned to you, go to View my access.
  7. For more secure access to your secrets, set up a Private Endpoint. In Power BI, navigate to your workspace, open Workspace settings, and find Network security. Select + Create to add a new Private Endpoint.
fabric-private-endpoint-add-new.png
  8. Give the endpoint a name, then retrieve the Resource ID from your Azure Key Vault: in the Key Vault's Overview section, select JSON View and copy the Resource ID. Back in the Private Endpoint setup, set Azure Key Vault as the Target sub-resource and select Create.
  9. Go back to your Key Vault. Open Settings, then Networking, and select Manage next to Public network access. Restrict public access to selected networks only (if you want to create and view secrets from the Azure Portal, also add your client IP in the IP Addresses section) and select Save.
fabric-kv-restrict-access.png
  10. While you're still in that window, switch to the Private endpoints tab and approve the connection from the Private Endpoint you just created.
  11. At this point, you're ready to add your secrets, such as hydrolix-username and hydrolix-password. Go to Objects > Secrets, select Generate/Import, and add your username and password.
fabric-import-secrets.png
  12. To fetch and apply these credentials in your Spark configuration, run the following code:
KV_URL = "https://{key-vault-name}.vault.azure.net/"

# mssparkutils is available by default in Fabric notebooks; the secret names
# must match the secrets created in the previous step.
spark.conf.set("spark.sql.catalog.hydrolix.username", mssparkutils.credentials.getSecret(KV_URL, "hydrolix-username"))
spark.conf.set("spark.sql.catalog.hydrolix.password", mssparkutils.credentials.getSecret(KV_URL, "hydrolix-password"))
fabric-apply-creds.png
  13. Lastly, make sure to remove any plain-text credentials from the Spark properties in your Environment configuration.

Querying

After you have configured your cluster, you can use the Hydrolix Spark Connector in a Spark notebook.

To begin using the connector with a Spark notebook, run one of the following two commands, depending on your language:

  • Python or Scala fragment: sql("use hydrolix")
  • SQL fragment: use hydrolix;

Alternatively, you can prefix each table you want to query from your Hydrolix back end with hydrolix., as shown in the sketch below.
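A minimal sketch of both approaches in a PySpark notebook (my_project and my_table are placeholders):

# Option 1: switch the session's current catalog to hydrolix once...
spark.sql("use hydrolix")
spark.sql("SELECT count(*) FROM my_project.my_table").show()

# Option 2: ...or fully qualify each table with the hydrolix catalog prefix.
spark.sql("SELECT count(*) FROM hydrolix.my_project.my_table").show()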

Summary table query instructions

Summary table queries have unique requirements.

  1. Wrap summary aliases in SELECT statements with hdxAgg(). For example, you might run the following to query both an aliased summary column (summary_column_alias) and a non-summary column (non_summary_column):
SELECT hdxAgg('summary_column_alias'), non_summary_column 
FROM hydrolix.project.summary_table 
GROUP BY {time_field_alias}

You can read more about creating and querying summary tables in the Summary Tables documentation.

The hdxAgg() function resolves name conflicts between summary table alias columns and Spark function names. For example, count may be a column name in a summary table, but count() is also a built-in Spark function. Using the hdxAgg() function disambiguates the two.

The hdxAgg() function also resolves name conflicts between multiple summary tables. For example, one summary table might have count(timestamp) AS total in its definition while another might have count(distinct username) AS total in its definition. The hdxAgg() function disambiguates total() and ensures it works when querying either table.

  2. Load the Hydrolix datasource, which registers the required hdxAgg() function. The options for loading the Hydrolix datasource are listed below; a complete example follows the list:
  • Run use hydrolix
  • Run any query against a non-summary table
  • Run io.hydrolix.connectors.spark.HdxUdfRegistry.enableSummaryTables(spark)
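Putting both steps together, a minimal end-to-end sketch in a PySpark notebook (project, table, and alias names are placeholders):

# Load the Hydrolix datasource; switching catalogs registers hdxAgg().
spark.sql("use hydrolix")

# Wrap the summary alias in hdxAgg() and group by the non-summary column.
spark.sql("""
    SELECT hdxAgg('summary_column_alias'), non_summary_column
    FROM my_project.my_summary_table
    GROUP BY non_summary_column
""").show()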

(v1.0.0 only) Summary table query instructions

The v1.0.0 release of the Hydrolix Spark Connector has unique instructions for querying summary tables.

  1. To enable querying the summary table you created during the Create a Summary Table step, run the following line in a Spark shell or in a PySpark session:
io.hydrolix.connectors.spark.HdxUdfRegistry.registerSummaryTable(spark, "{project.summary_table_name}")
  2. Reference any summary table column aliases using the syntax summary_column_alias(). For example, suppose there is a summary table called my_project.my_summary_table which includes the following in its definition:
SELECT ...
sum(salary) AS totalSalary,
department,
...

This table can then be queried with:

SELECT totalSalary(), department FROM hydrolix.my_project.my_summary_table GROUP BY department

Troubleshooting

Authentication Error

If you see "access denied" errors from the Hydrolix database when making queries, verify that the cluster username and password are correct and that the user has query permissions.

User Permissions

Partitions in a table might be distributed across different storage buckets.

If the user set in your Spark Connector configuration doesn't have the required permissions to query all of those storage buckets via the ClickHouse JDBC Driver, the cluster will be unable to sign partitions from one or more buckets, and the query will fail with an error response.

Error querying summary tables

If you see this error:

org.apache.spark.sql.AnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `hdxAgg` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`]

The fix is to load the Hydrolix datasource, which you can do in any one of the following ways (see the example after this list):

  • Run io.hydrolix.connectors.spark.HdxUdfRegistry.enableSummaryTables(spark). This performs a global setup step which creates the hdxAgg function.
  • Run the use hydrolix fragment. This directly loads the Hydrolix datasource.
  • Run any query against a non-summary table. This directly loads the Hydrolix datasource.
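For example, from a PySpark notebook the simplest fix is a one-liner:

# Switching to the hydrolix catalog loads the datasource and registers
# hdxAgg(); re-run the failing summary-table query afterwards.
spark.sql("use hydrolix")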

Limitations

Read-only

The Hydrolix Spark Connector is read-only. ALTER and INSERT statements aren't supported.