Hydrolix Connector for Apache Spark: AWS EMR Deployment

Analyze your Hydrolix data using Apache Spark and AWS EMR

Overview

AWS EMR (previously Elastic MapReduce) is a cloud-based platform for processing and analyzing large datasets using open-source tools like Apache Spark, Hadoop, and Hive. It automates the provisioning, scaling, and management of compute clusters, making big data processing faster and more cost-effective.

Requirements

Verify the following dependencies, permissions, and runtime requirements for successful setup.

Dependencies

  • Hydrolix cluster: A running Hydrolix cluster, version 4.22.1 or higher. Find deployment instructions for your preferred cloud vendor (AWS, Google, Linode, or Microsoft) in Welcome to Hydrolix.
  • An AWS account
  • An AWS VPC and subnet. The EMR cluster needs to reach external resources: the Hydrolix cluster's API and one or more S3 buckets. Set up VPCs, subnets, NACLs, and route tables in AWS to support this access (see the connectivity check sketched below).
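
A quick way to confirm that hosts in the chosen subnet can reach both dependencies is a short Python check. This is a hedged sketch, not part of the official setup: {myhost} and {bucket} are placeholders for your Hydrolix hostname and one of the S3 buckets the cluster will use.

import boto3
import requests

# The Hydrolix config API endpoint; the same URL is used later in spark-defaults
resp = requests.get("https://{myhost}.hydrolix.live/config/v1/")
print("Hydrolix API reachable, HTTP status:", resp.status_code)

# Confirm S3 is reachable from this subnet (via a gateway endpoint or NAT)
boto3.client("s3").head_bucket(Bucket="{bucket}")
print("S3 bucket reachable")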

Required user permissions

Spark requires specific access to query a Hydrolix cluster. Assign the following account permissions at the levels indicated:

Permission name     Level
view_org            Org
view_hdxstorage     Org
catalog_urls_table  Project or table
view_table          Project or table
view_view           Project or table
view_transform      Project or table
select_sql          Project or table
To query multiple tables in the same Hydrolix project, scope those permissions to the project level instead of granting the permissions for each table.

Runtime requirements

  • EMR version: 7.8 (Spark 3.5.4, AL2023, Java 17)

Set up AWS EMR

  1. Configure the AWS EMR cluster
  2. Configure the Spark Connector and create the cluster
  3. (Optional) Create EMR Studio: Each Studio is a self-contained, web-based integrated development environment for Jupyter notebooks that run on AWS EMR clusters. You can read more about using this tool to run workloads on your EMR clusters at Amazon EMR Studio.

Configure the AWS EMR cluster

  1. Log in to the AWS Console and choose your preferred region (for example, us-east-1 or us-west-2).
  2. Navigate to Amazon EMR.
  3. In the left nav, find EMR on EC2.
  4. Select Clusters, then click Create cluster.
  5. Set the cluster's name, then pick the latest EMR 7.8.x release and the Spark Interactive application bundle. This automatically selects:
  • Hadoop
  • Spark
  • Hive
  • Livy
  • Jupyter
  6. In Cluster configuration, choose instance types for each of the three instance groups. EBS isn't required. The minimum configuration includes the following instance counts:
  • 1 Primary
  • 1 Core
  • 1 Task
  7. In the Networking block, select a VPC and subnet. Ensure the VPC and subnet allow access to the Hydrolix cluster's API and any necessary S3 buckets. Allow AWS to create the ElasticMapReduce-Primary and ElasticMapReduce-Core security groups.

📘

SSH and Spark UI access require additional permissions

You may need to modify the generated security groups to allow SSH or Spark UI access.
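
For example, to allow SSH to the primary node, you can add an ingress rule to the generated ElasticMapReduce-Primary group. A minimal boto3 sketch; the group ID and trusted CIDR are placeholders for your own values:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Open port 22 on the primary node's security group to a trusted CIDR.
# sg-0123456789abcdef0 and 203.0.113.0/24 are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "SSH access"}],
    }],
)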

Configure the Spark Connector and create the cluster

  1. In the Software settings block, enter the following configuration for the EMR cluster (it matches the EMR version listed under Runtime requirements):
[{
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars": "{/path/to/spark-connector.jar}",
      "spark.pyspark.python": "/usr/bin/python3.9",
      "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
      "spark.pyspark.virtualenv.enabled": "true",
      "spark.pyspark.virtualenv.type": "native",
      "spark.sql.catalog.hydrolix": "io.hydrolix.connectors.spark.SparkTableCatalog",
      "spark.sql.catalog.hydrolix.api_url": "https://{myhost}.hydrolix.live/config/v1/",
      "spark.sql.catalog.hydrolix.jdbc_url": "jdbc:clickhouse:https://{myhost}.hydrolix.live:8088?ssl=true",
      "spark.sql.extensions": "io.hydrolix.connectors.spark.SummaryUdfExtension"
    }
  },
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.rsc.jars": "/lib/livy/rsc-jars/*,{/path/to/spark-connector.jar}"
    }
  }
]
  2. Replace {/path/to/spark-connector.jar} with the URL under Spark Connector Version in the Hydrolix Connector for Apache Spark table that's compatible with your Hydrolix cluster version. Replace {myhost}.hydrolix.live with the Hydrolix cluster URL.
  • Note that {/path/to/spark-connector.jar} is referenced twice: in spark-defaults and livy-conf.
    • The spark-defaults block affects running Spark jobs, spark-shell, or PySpark over SSH.
    • The livy-conf block affects running Spark jobs via an AWS Notebook.
  3. In Security configuration and EC2 key pair, configure cluster access through SSH. Follow the instructions in Create a security configuration with the Amazon EMR console or with the AWS CLI to create a security configuration for the EMR cluster.
    To set up SSH access to the EMR cluster nodes, see Use an EC2 key pair for SSH credentials for Amazon EMR.

  4. In the Identity and Access Management (IAM) roles block, click Create a service role > Create an instance profile. For credential handling (or any other custom permissions), you can attach additional policies to the generated EMR service role in IAM at a later time.

  5. Click Create cluster. (For a scripted alternative, see the sketch below.)
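
The console steps above can also be scripted with boto3. The following is a hedged sketch rather than a definitive recipe: the subnet ID, instance types, and IAM role names are placeholder assumptions, and the two configuration classifications mirror the JSON shown earlier.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Minimal scripted equivalent of the console steps; all IDs are placeholders.
emr.run_job_flow(
    Name="hydrolix-spark",
    ReleaseLabel="emr-7.8.0",
    Applications=[{"Name": n} for n in ["Hadoop", "Spark", "Hive", "Livy", "JupyterHub"]],
    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.jars": "{/path/to/spark-connector.jar}",
                "spark.sql.catalog.hydrolix": "io.hydrolix.connectors.spark.SparkTableCatalog",
                "spark.sql.catalog.hydrolix.api_url": "https://{myhost}.hydrolix.live/config/v1/",
                "spark.sql.catalog.hydrolix.jdbc_url": "jdbc:clickhouse:https://{myhost}.hydrolix.live:8088?ssl=true",
                "spark.sql.extensions": "io.hydrolix.connectors.spark.SummaryUdfExtension",
            },
        },
        {
            "Classification": "livy-conf",
            "Properties": {"livy.rsc.jars": "/lib/livy/rsc-jars/*,{/path/to/spark-connector.jar}"},
        },
    ],
    Instances={
        "Ec2SubnetId": "subnet-0123456789abcdef0",
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
    },
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)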

(Optional) Create EMR Studio

Follow the instructions in Create an EMR Studio to create an EMR Studio instance. Note the following requirements to ensure EMR Studio works with your previously created EMR cluster:

  1. While creating the EMR Studio instance: In the Networking and security block, set the same VPC and subnet as your EMR cluster and choose Default security group. After doing so, you can Create Studio.
  2. After creating the EMR Studio instance: In the Workspaces (notebooks) submenu, select your workspace, then Attach cluster. Choose your EMR cluster; the security groups should appear automatically. Press Attach cluster and launch.

After creation, the workspace starts automatically (Status: Ready). If Attach cluster doesn't list any available clusters, perform Actions > Stop on the workspace; Attach cluster should then show your newly created EMR cluster.

📘

If Jupyter notebook fails, check browser settings

After starting the EMR Studio instance, a Jupyter notebook should automatically open in a new tab. If it doesn't, check your browser settings; a pop-up blocker may be preventing the tab from opening.

You should now be able to use Spark or PySpark kernels to run jobs on EMR Spark.

Verification

Run

spark.sql("use hydrolix").show()

to verify that you can successfully log in to your Hydrolix cluster.
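
If the command succeeds, you can go a step further and list what the connector can see. A short hedged follow-up, assuming the catalog exposes Hydrolix projects as schemas; {project} is a placeholder for one of your project names:

spark.sql("use hydrolix").show()
spark.sql("show schemas").show()              # Hydrolix projects
spark.sql("show tables in {project}").show()  # tables within one project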

Secrets management

There are multiple options for credentials management in AWS. This example uses the AWS Systems Manager Parameter Store. The AWS Secrets Manager User Guide describes other methods of storing and retrieving secrets.

Create credentials

To store your Hydrolix cluster username and password as parameters using the AWS Console, AWS CLI, or Tools for Windows PowerShell, see Creating Parameter Store parameters in Systems Manager.
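
For example, a hedged boto3 sketch that stores both values as encrypted SecureString parameters; the parameter names follow the {/path/to/...} placeholder convention used below:

import boto3

ssm = boto3.client("ssm", region_name="{region}")

# Store the Hydrolix credentials as encrypted SecureString parameters
ssm.put_parameter(Name="{/path/to/username}", Value="{hydrolix-username}",
                  Type="SecureString", Overwrite=True)
ssm.put_parameter(Name="{/path/to/password}", Value="{hydrolix-password}",
                  Type="SecureString", Overwrite=True)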

Set PySpark notebook credentials

  1. Open the EMR Studio attached to your EMR cluster.
  2. Use the PySpark notebook kernel to install the boto3 library using virtualenv:
%%configure -f
{
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.python": "/usr/bin/python3"
  }
}
  3. Then invoke
sc.install_pypi_package("boto3")

This creates a Spark session.
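
To confirm the installation, EMR notebooks also provide a companion helper; this assumes your notebook kernel exposes it:

sc.list_packages()  # boto3 should appear in the list of installed packages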

  4. To get the credentials and set them in a Spark context, use this Python snippet:
import boto3
ssm = boto3.client("ssm", region_name="{region}")
username = ssm.get_parameter(Name="{/path/to/username}", WithDecryption=True)["Parameter"]["Value"]
password = ssm.get_parameter(Name="{/path/to/password}", WithDecryption=True)["Parameter"]["Value"]
spark.conf.set("spark.sql.catalog.hydrolix.username", username)
spark.conf.set("spark.sql.catalog.hydrolix.password", password)
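
With the credentials set on the catalog, queries should authenticate against the Hydrolix cluster. A minimal hedged smoke test; {project} and {table} are placeholders for names your user can read:

spark.sql("use hydrolix")
spark.sql("select count(*) from {project}.{table}").show()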

Create an EMR Serverless cluster

Using EMR Serverless requires EMR Studio. See (Optional) Create EMR Studio for instructions on setting up EMR Studio.

  1. Log in to the AWS Console and select the appropriate region (for example, us-east-1).
  2. Open Amazon EMR > EMR Serverless.
  3. Select Manage applications > Create application.
  4. Set the application name. For example, HDX-Spark-Connector.
  5. Under Application setup options select Use custom settings.
  6. In the Pre-initialized capacity section, set the driver and executors configuration including:
    • number
    • vCPUs
    • memory
    • disk size
  7. Under Interactive endpoints, select both options:
    • Enable endpoint for EMR studio
    • Enable Apache Livy endpoint - new
  8. (Optional) Under Network connections, set your VPC and Subnet if you plan on using EMR Serverless with other AWS services such as SageMaker Studio.
  9. In the Application configuration, set the following Hydrolix configuration:
{
  "runtimeConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.jars": "{spark-connector-location}",
        "spark.sql.extensions": "io.hydrolix.connectors.spark.SummaryUdfExtension",
        "spark.sql.catalog.hydrolix": "io.hydrolix.connectors.spark.SparkTableCatalog",
        "spark.sql.catalog.hydrolix.api_url": "https://{myhost}.hydrolix.live/config/v1/",
        "spark.sql.catalog.hydrolix.jdbc_url": "jdbc:clickhouse:https://{myhost}.hydrolix.live:8088?ssl=true"
      }
    }
  ]
}
  10. Choose Create application. (For a scripted alternative, see the sketch below.)
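
As with the cluster, the application can be created programmatically. A hedged boto3 sketch of the console steps above; the release label is an assumption, and the placeholders match the JSON configuration:

import boto3

emrs = boto3.client("emr-serverless", region_name="us-east-1")

# Hedged scripted equivalent of the console steps above
emrs.create_application(
    name="HDX-Spark-Connector",
    releaseLabel="emr-7.8.0",
    type="SPARK",
    runtimeConfiguration=[{
        "classification": "spark-defaults",
        "properties": {
            "spark.jars": "{spark-connector-location}",
            "spark.sql.extensions": "io.hydrolix.connectors.spark.SummaryUdfExtension",
            "spark.sql.catalog.hydrolix": "io.hydrolix.connectors.spark.SparkTableCatalog",
            "spark.sql.catalog.hydrolix.api_url": "https://{myhost}.hydrolix.live/config/v1/",
            "spark.sql.catalog.hydrolix.jdbc_url": "jdbc:clickhouse:https://{myhost}.hydrolix.live:8088?ssl=true",
        },
    }],
    # Enable both interactive endpoints, as in step 7
    interactiveConfiguration={
        "studioEnabled": True,
        "livyEndpointEnabled": True,
    },
)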

EMR Studio Runtime Role

A runtime role is not necessary to create a studio or cluster.

A runtime role is, however, necessary to attach EMR Studio or SageMaker Studio notebooks to an EMR Serverless cluster. Configure it using the EMR Serverless cluster setup wizard with the following trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ServerlessTrustPolicy",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

You can attach additional policies to grant access to S3 buckets, AWS Glue, and other services, as sketched below.
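
As a hedged example, the runtime role could be created and granted S3 read access with boto3; the role name is a placeholder, and AmazonS3ReadOnlyAccess stands in for whatever policies your workloads actually need:

import json

import boto3

iam = boto3.client("iam")

# The trust policy shown above, allowing EMR Serverless to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "ServerlessTrustPolicy",
        "Effect": "Allow",
        "Principal": {"Service": "emr-serverless.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="emr-serverless-runtime",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach an AWS-managed policy for S3 read access as an example grant
iam.attach_role_policy(
    RoleName="emr-serverless-runtime",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)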

Integrate AWS SageMaker

Once AWS EMR and Hydrolix are working together, they can serve as the data source and execution engine for training and publishing AI models from AWS SageMaker.

AWS infrastructure allows multiple ways for SageMaker notebooks to control EMR instances. In all cases, you must connect SageMaker to the EMR cluster.

To integrate SageMaker Studio with EMR Serverless, complete the EMR Studio and EMR Serverless setup steps first.

SageMaker AI Notebook and EMR

One method of integrating AWS SageMaker is to configure a SageMaker AI Notebook pointed at a running EMR instance. You can do so using the instructions in Control an Amazon EMR Spark Instance Using a Notebook.

📘

Ensure network settings allow SageMaker access to the EMR cluster

Ensure SageMaker and the EMR cluster are configured within the same VPC and subnet. Do not disable direct internet access.

Once integrated, you can use the PySpark kernel in the notebook to import AI model libraries, train models on data queried from Hydrolix, and publish those models to S3 for deployment with SageMaker, as sketched below.
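
As a hedged illustration of that workflow, the sketch below queries a Hydrolix table into pandas, trains a small scikit-learn model, and uploads the artifact to S3. The project, table, column, and bucket names are placeholders, and it assumes scikit-learn is available in the kernel:

import io
import pickle

import boto3
from sklearn.linear_model import LinearRegression

# Sample training data out of Hydrolix through the Spark connector;
# feature_a, feature_b, and target are placeholder column names
pdf = spark.sql("""
    select feature_a, feature_b, target
    from hydrolix.{project}.{table}
    limit 100000
""").toPandas()

# Train a simple model on the sampled data
model = LinearRegression().fit(pdf[["feature_a", "feature_b"]], pdf["target"])

# Publish the trained model artifact to S3 for deployment with SageMaker
buf = io.BytesIO(pickle.dumps(model))
boto3.client("s3").upload_fileobj(buf, "{bucket}", "models/hydrolix-demo.pkl")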

SageMaker Studio and EMR Serverless

SageMaker Studio is a more complex product that also allows integration with an EMR Serverless cluster. You can find instructions for integrating SageMaker Studio with a serverless cluster in Prepare data using EMR Serverless.

Troubleshooting

  • Verify the VPC, subnet, and security group configurations allow EMR Studio access to EMR Serverless.
  • The Livy endpoint must be enabled on the EMR Serverless cluster. The SageMaker execution role must allow AccessLivyEndpoints, and the network configuration must not block access to the Livy port. See the steps in Create an EMR Serverless cluster for instructions on enabling the Livy endpoint.
  • A single-account setup requires that the iam:PassRole permission be added to the SageMaker execution role.