Hydrolix Spark Connector: AWS EMR Deployment
Analyze your Hydrolix data using Apache Spark and AWS EMR
Overview
AWS EMR (previously Elastic MapReduce) is a cloud-based platform for processing and analyzing large datasets using open-source tools like Apache Spark, Hadoop, and Hive. It automates the provisioning, scaling, and management of compute clusters, making big data processing faster and more cost-effective.
Requirements
Verify the following dependencies, permissions, and runtime requirements for successful setup.
Dependencies
- Hydrolix cluster: A running Hydrolix cluster version 4.22.1 or higher. Find deployment instructions for your preferred cloud vendor (AWS, Google, Linode, or Microsoft) in Welcome to Hydrolix.
- An AWS account
- An AWS VPC and subnet. The EMR cluster requires access to external resources. It needs to communicate with the Hydrolix cluster's API and one or more S3 buckets. Set up VPCs, subnets, NACLs, and route tables in AWS to support the EMR cluster.
Required User Permissions
Spark requires the permissions needed to query a Hydrolix cluster. Users need the following permissions at the levels indicated:
| Permission name | Level |
|---|---|
| view_org | Org |
| view_hdxstorage | Org |
| catalog_urls_table | Project or table |
| view_table | Project or table |
| view_view | Project or table |
| view_transform | Project or table |
| select_sql | Project or table |
To query multiple tables in the same Hydrolix project, scope those permissions to the project level instead of granting the permissions for each table.
Runtime requirements
- EMR version: 7.8 (Spark 3.5.4, AL2023, Java 17).
- Docker image: A custom-built Docker image with the appropriate Java version.
Set up AWS EMR
- Configure the AWS EMR cluster
- Configure the Spark Connector and create the cluster
- (Optional) Create EMR Studio: Each Studio is a self-contained, web-based integrated development environment for Jupyter notebooks that run on AWS EMR clusters. You can read more about using this tool to run workloads on your EMR clusters at Amazon EMR Studio.
Configure the AWS EMR cluster
- Log into the AWS Console and choose your preferred region, for example `us-east-1` or `us-west-2`.
- Navigate to Amazon EMR.
- In the left nav, find EMR on EC2.
- Select Clusters, then click Create cluster.
- Set the cluster’s name then pick the latest EMR 7.8.x version and Spark Interactive application bundle. This automatically selects:
- Hadoop
- Spark
- Hive
- Livy
- Jupyter

- In Cluster configuration, choose instance types for each of the three groups of instances. EBS isn't required. Minimum configuration includes the following instance counts:
- 1 Primary
- 1 Core
- 1 Task
- In the Networking block, select a VPC and subnet. Ensure the VPC and subnet will allow access to the Hydrolix cluster's API and any necessary S3 buckets. Allow AWS to create the `ElasticMapReduce-Primary` and `ElasticMapReduce-Core` security groups.

SSH and Spark UI access require additional permissions
You may need to modify the generated security groups to allow SSH or Spark UI access.
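For example, if you do want SSH access to the primary node, you can add an inbound rule to the generated `ElasticMapReduce-Primary` security group. The following is a minimal boto3 sketch; the security group ID, region, and CIDR range are placeholders to replace with your own values.

```python
import boto3

# Hypothetical example: allow SSH (TCP 22) to the EMR primary node by adding an
# inbound rule to the generated ElasticMapReduce-Primary security group.
# The group ID, region, and CIDR below are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # ID of the ElasticMapReduce-Primary group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "SSH access"}],
        }
    ],
)
```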
Configure the Spark Connector and create the cluster
- In the Software settings block, enter the following configuration for the EMR cluster (it corresponds to EMR version 7.8):
```json
[
  {
    "Classification": "container-executor",
    "Configurations": [
      {
        "Classification": "docker",
        "Properties": {
          "docker.privileged-containers.registries": "public.ecr.aws/p1r6p5i6",
          "docker.trusted.registries": "public.ecr.aws/p1r6p5i6"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executorEnv.JAVA_HOME": "/usr/lib/jvm/java-17-amazon-corretto",
      "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "public.ecr.aws/p1r6p5i6/hdx-spark-connector:latest",
      "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
      "spark.jars": "<spark-connector-location>",
      "spark.pyspark.python": "/usr/bin/python3.9",
      "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
      "spark.pyspark.virtualenv.enabled": "true",
      "spark.pyspark.virtualenv.type": "native",
      "spark.sql.catalog.hydrolix": "io.hydrolix.connectors.spark.SparkTableCatalog",
      "spark.sql.catalog.hydrolix.api_url": "https://<hdx_cluster>.hydrolix.dev/config/v1/",
      "spark.sql.catalog.hydrolix.jdbc_url": "jdbc:clickhouse:https://<hdx_cluster>.hydrolix.dev:8088?ssl=true",
      "spark.sql.extensions": "io.hydrolix.connectors.spark.SummaryUdfExtension"
    }
  },
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.rsc.jars": "/lib/livy/rsc-jars/*,<spark-connector-location>"
    }
  }
]
```
- Replace `<spark-connector-location>` with the URL listed under Spark Connector Version in the Hydrolix Spark Connector table that's compatible with your Hydrolix cluster version. Replace `<hdx_cluster>` with your Hydrolix cluster hostname.
- Note that `<spark-connector-location>` is referenced twice: in `spark-defaults` and in `livy-conf`.
  - The `spark-defaults` block affects Spark jobs, spark-shell, and PySpark sessions run over SSH. Once the cluster is running, you can confirm these settings from a PySpark session; see the sketch after these steps.
  - The `livy-conf` block affects Spark jobs run via an AWS Notebook.
- In Security configuration and EC2 key pair, you can configure cluster access through SSH. Follow the instructions in Create a security configuration with the Amazon EMR console or with the AWS CLI to create a security configuration for the EMR cluster. To set up SSH access to the EMR cluster nodes, see Use an EC2 key pair for SSH credentials for Amazon EMR.
- In the Identity and Access Management (IAM) roles block, click Create a service role > Create an instance profile. For credential handling (or any other custom permissions), you can attach additional policies to the generated EMR service role in IAM later.

- Click Create cluster.
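Once the cluster is up, you can check that the `spark-defaults` settings were applied by opening a PySpark session on the primary node over SSH. This is a minimal sanity check, assuming SSH access is configured as described above:

```python
# Run `pyspark` on the EMR primary node; spark-defaults are applied automatically.
# These lines only confirm that the Hydrolix catalog settings from the Software
# settings block above are present in the active Spark session.
print(spark.conf.get("spark.sql.catalog.hydrolix"))
# -> io.hydrolix.connectors.spark.SparkTableCatalog
print(spark.conf.get("spark.sql.catalog.hydrolix.api_url"))
# -> https://<hdx_cluster>.hydrolix.dev/config/v1/
```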
(Optional) Create EMR Studio
EMR Serverless can’t be used with the Hydrolix Spark Connector
When creating an EMR Studio in AWS Console, the default options create the necessary Roles, S3 bucket, and an EMR Serverless config along with its Runtime Role. EMR Serverless can’t be used with the HDX Spark Connector. To avoid creating unnecessary resources, you can create an EMR Studio Service role and a bucket manually and then create a Studio in Custom mode.
Follow the instructions in Create an EMR Studio to create an EMR studio instance. Note the following requirements to ensure EMR Studio works with your previously-created EMR Cluster:
- While creating the EMR Studio instance: In the Networking and security block, set the same VPC and subnet as your EMR cluster and choose Default security group. After doing so, you can Create Studio.
- After creating the EMR Studio instance: In the Workspaces (notebooks) submenu, select your workspace, then Attach cluster. Choose your EMR cluster; the security groups should appear automatically. Press Attach cluster and launch.
After creation, the Workspace starts automatically (Status: `Ready`). When you try to attach a cluster, the list may not show any available clusters. In that case, perform Actions > Stop on the workspace. At this point, Attach cluster should show your newly created EMR cluster.
If Jupyter notebook fails, check browser settings
After starting the EMR Studio instance, a Jupyter notebook should automatically open in a new tab. If the notebook doesn't open, check your browser settings; a pop-up blocker may be preventing it.
You should now be able to use Spark or PySpark kernels to run jobs on EMR Spark.
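As a quick sanity check that the notebook kernel can submit work to the cluster, you can run a trivial job from a PySpark notebook cell; this assumes nothing beyond a working Spark session:

```python
# Trivial PySpark job to confirm the notebook kernel can run work on the EMR cluster.
df = spark.range(1000)
print(df.selectExpr("sum(id) AS total").collect())
```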
Secrets management
There are multiple options for credentials management in AWS. This example uses the AWS Systems Manager Parameter Store; other methods of storing and retrieving secrets are described in the AWS Secrets Manager User Guide.
Create credentials
To store your Hydrolix cluster username and password as parameters using the AWS console, AWS CLI, or Tools for Windows PowerShell, see Creating Parameter Store parameters in Systems Manager.
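If you prefer to script this, a minimal boto3 sketch might look like the following; the parameter names, region, and values are placeholders to replace with your own:

```python
import boto3

# Hypothetical example: store the Hydrolix username and password as encrypted
# SecureString parameters. Names, region, and values are placeholders.
ssm = boto3.client("ssm", region_name="us-east-1")
ssm.put_parameter(Name="/hydrolix/username", Value="spark-user@example.com", Type="SecureString")
ssm.put_parameter(Name="/hydrolix/password", Value="example-password", Type="SecureString")
```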
Set PySpark notebook credentials
- Open the EMR Studio attached to your EMR cluster.
- Use the PySpark notebook kernel to install the `boto3` library using `venv`:
```
%%configure -f
{
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.python": "/usr/bin/python3"
  }
}
```
- Then invoke `sc.install_pypi_package("boto3")`. This creates a Spark session.
- To get the credentials and set them in a Spark context, use this Python snippet:
```python
import boto3

# Read the stored credentials from the Parameter Store (decrypting SecureString values)
ssm = boto3.client("ssm", region_name="{region}")
username = ssm.get_parameter(Name="{/path/to/username}", WithDecryption=True)["Parameter"]["Value"]
password = ssm.get_parameter(Name="{/path/to/password}", WithDecryption=True)["Parameter"]["Value"]

# Pass the credentials to the Hydrolix Spark catalog
spark.conf.set("spark.sql.catalog.hydrolix.username", username)
spark.conf.set("spark.sql.catalog.hydrolix.password", password)
```
Verification
Run `spark.sql("use hydrolix").show()` to verify that you can successfully log in to your Hydrolix cluster.
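If that statement succeeds, you can go on to query your data. The project and table names below are placeholders, and this assumes the connector exposes tables as `hydrolix.<project>.<table>`:

```python
# Placeholder project and table names; substitute your own.
spark.sql("SELECT count(*) FROM hydrolix.my_project.my_table").show()
```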