Hydrolix Spark Connector: AWS EMR Deployment

Analyze your Hydrolix data using Apache Spark and AWS EMR

Overview

AWS EMR (previously Elastic MapReduce) is a cloud-based platform for processing and analyzing large datasets using open-source tools like Apache Spark, Hadoop, and Hive. It automates the provisioning, scaling, and management of compute clusters, making big data processing faster and more cost-effective.

Requirements

Verify the following dependencies, permissions, and runtime requirements for successful setup.

Dependencies

  • Hydrolix cluster: A running Hydrolix cluster version 4.22.1 or higher. Find deployment instructions for your preferred cloud vendor (AWS, Google, Linode, or Microsoft) in Welcome to Hydrolix.
  • An AWS account
  • An AWS VPC and subnet. The EMR cluster requires access to external resources. It needs to communicate with the Hydrolix cluster's API and one or more S3 buckets. Set up VPCs, subnets, NACLs, and route tables in AWS to support the EMR cluster.
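
To sanity-check that networking is in place, a minimal reachability sketch like the following can be run from a host in the target VPC and subnet; the cluster hostname and bucket name are hypothetical placeholders.

import urllib.error
import urllib.request

import boto3

HDX_API = "https://my-hdx-cluster.hydrolix.dev/config/v1/"  # hypothetical Hydrolix API URL
BUCKET = "my-hydrolix-storage-bucket"                       # hypothetical S3 bucket name

# Any HTTP response, even 401 Unauthorized, proves the API endpoint is reachable.
try:
    status = urllib.request.urlopen(HDX_API, timeout=10).status
except urllib.error.HTTPError as err:
    status = err.code
print("Hydrolix API reachable, HTTP status:", status)

# head_bucket raises an exception if the bucket is unreachable or access is denied.
boto3.client("s3").head_bucket(Bucket=BUCKET)
print("S3 bucket reachable:", BUCKET)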

Required User Permissions

The Spark Connector queries your Hydrolix cluster on behalf of a user, so that user needs the following permissions at the levels indicated:

Permission name       Level
view_org              Org
view_hdxstorage       Org
catalog_urls_table    Project or table
view_table            Project or table
view_view             Project or table
view_transform        Project or table
select_sql            Project or table

To query multiple tables in the same Hydrolix project, scope those permissions to the project level instead of granting the permissions for each table.

Runtime requirements

The configuration in this guide expects Java 17 (Amazon Corretto) and Python 3.9 on the EMR nodes, as referenced in the spark-defaults settings later in this document; both are provided by the EMR 7.8.x release selected below.

Set up AWS EMR

  1. Configure the AWS EMR cluster
  2. Configure the Spark Connector and create the cluster
  3. (Optional) Create EMR Studio: Each Studio is a self-contained, web-based integrated development environment for Jupyter notebooks that run on AWS EMR clusters. You can read more about using this tool to run workloads on your EMR clusters at Amazon EMR Studio.

Configure the AWS EMR cluster

  1. Log into AWS Console and choose the preferred region. For example: us-east-1, us-west-2.
  2. Navigate to Amazon EMR.
  3. In the left nav, find EMR on EC2.
  4. Select Clusters, then click Create cluster.
  5. Set the cluster's name, then pick the latest EMR 7.8.x release and the Spark Interactive application bundle. This automatically selects:
  • Hadoop
  • Spark
  • Hive
  • Livy
  • Jupyter
  6. In Cluster configuration, choose instance types for each of the three groups of instances. EBS isn't required. Minimum configuration includes the following instance counts:
  • 1 Primary
  • 1 Core
  • 1 Task
  7. In the Networking block, select a VPC and subnet. Ensure the VPC and subnet allow access to the Hydrolix cluster's API and any necessary S3 buckets. Allow AWS to create the ElasticMapReduce-Primary and ElasticMapReduce-Core security groups.

📘

SSH and Spark UI access require additional permissions

If you need SSH or Spark UI access, modify the generated security groups to allow the required inbound traffic.
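
For example, a hedged boto3 sketch like the one below adds an SSH ingress rule to the generated primary-node security group; the region, security group ID, and CIDR are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

# Allow SSH (port 22) from a single trusted address to the ElasticMapReduce-Primary group.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical security group ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.10/32", "Description": "workstation SSH access"}],
    }],
)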

Configure the Spark Connector and create the cluster

  1. In the Software settings block, enter the following configuration, which corresponds to the EMR 7.8.x release selected earlier:
[
  {
    "Classification": "container-executor",
    "Configurations": [
      {
        "Classification": "docker",
        "Properties": {
          "docker.privileged-containers.registries": "public.ecr.aws/p1r6p5i6",
          "docker.trusted.registries": "public.ecr.aws/p1r6p5i6"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executorEnv.JAVA_HOME": "/usr/lib/jvm/java-17-amazon-corretto",
      "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "public.ecr.aws/p1r6p5i6/hdx-spark-connector:latest",
      "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
      "spark.jars": "<spark-connector-location>",
      "spark.pyspark.python": "/usr/bin/python3.9",
      "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
      "spark.pyspark.virtualenv.enabled": "true",
      "spark.pyspark.virtualenv.type": "native",
      "spark.sql.catalog.hydrolix": "io.hydrolix.connectors.spark.SparkTableCatalog",
      "spark.sql.catalog.hydrolix.api_url": "https://<hdx_cluster>.hydrolix.dev/config/v1/",
      "spark.sql.catalog.hydrolix.jdbc_url": "jdbc:clickhouse:https://<hdx_cluster>.hydrolix.dev:8088?ssl=true",
      "spark.sql.extensions": "io.hydrolix.connectors.spark.SummaryUdfExtension"
    }
  },
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.rsc.jars": "/lib/livy/rsc-jars/*,<spark-connector-location>"
    }
  }
]
  2. Replace <spark-connector-location> with the URL under Spark Connector Version in the Hydrolix Spark Connector table that's compatible with your Hydrolix cluster version. Replace <hdx_cluster>.hydrolix.dev with your Hydrolix cluster hostname.
  • Note that <spark-connector-location> is referenced twice: in spark-defaults and livy-conf.
    • The spark-defaults block affects running Spark jobs, spark-shell, or pySpark over SSH.
    • The livy-conf block affects running Spark jobs via an AWS Notebook.
  3. In the Security configuration and EC2 key pair block, you can configure cluster access through SSH. Follow the instructions in Create a security configuration with the Amazon EMR console or with the AWS CLI to create a security configuration for the EMR cluster.
    To set up SSH access to the EMR cluster nodes, see Use an EC2 key pair for SSH credentials for Amazon EMR.

  4. In the Identity and Access Management (IAM) roles block, click Create a service role > Create an instance profile. For credential handling (or any other custom permissions), you can attach additional policies to the generated EMR service role in IAM at a later time.

  5. Click Create cluster.
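
If you prefer to script cluster creation rather than use the console, a minimal boto3 sketch along the following lines creates an equivalent cluster; the region, cluster name, instance types, subnet ID, configuration file path, and IAM role names are assumptions to adapt to your environment.

import json

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # example region

# The Software settings JSON shown above, saved to a local file.
with open("hdx-spark-configurations.json") as f:
    configurations = json.load(f)

response = emr.run_job_flow(
    Name="hdx-spark-emr",       # hypothetical cluster name
    ReleaseLabel="emr-7.8.0",   # use the latest EMR 7.8.x release
    # Application set mirroring the Spark Interactive bundle selected in the console.
    Applications=[{"Name": n} for n in ("Hadoop", "Spark", "Hive", "Livy", "JupyterEnterpriseGateway")],
    Configurations=configurations,
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # hypothetical subnet ID
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # instance profile created in the IAM step
    ServiceRole="EMR_DefaultRole",      # service role created in the IAM step
)
print("ClusterId:", response["JobFlowId"])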

(Optional) Create EMR Studio

📘

EMR Serverless can’t be used with the Hydrolix Spark Connector

When creating an EMR Studio in the AWS Console, the default options create the necessary roles, an S3 bucket, and an EMR Serverless configuration along with its runtime role. EMR Serverless can't be used with the Hydrolix Spark Connector. To avoid creating unnecessary resources, create an EMR Studio service role and a bucket manually, then create the Studio in Custom mode.

Follow the instructions in Create an EMR Studio to create an EMR Studio instance. Note the following requirements to ensure EMR Studio works with your previously created EMR cluster:

  1. While creating the EMR Studio instance: In the Networking and security block, set the same VPC and subnet as your EMR cluster and choose Default security group. After doing so, you can Create Studio.
  2. After creating the EMR Studio instance: In the Workspaces (notebooks) submenu, select your workspace, then Attach cluster. Choose your EMR cluster; its security groups should appear automatically. Press Attach cluster and launch.

After creation, the Workspace starts automatically (Status: Ready). Attach cluster may not show any available clusters at first; in that case, perform Actions > Stop on the Workspace, after which Attach cluster should show your newly created EMR cluster.

📘

If Jupyter notebook fails, check browser settings

After starting the EMR Studio instance, a Jupyter notebook should automatically open in a new tab. If the notebook doesn't start, check your browser settings, as the browser may be blocking the pop-up.

You should now be able to use Spark or PySpark kernels to run jobs on EMR Spark.
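
As a quick smoke test before wiring up credentials, a trivial PySpark cell such as the following confirms that the kernel can start a Spark session and run a distributed job on the cluster:

# Runs a small distributed job; a correct count confirms the session is healthy.
df = spark.range(1000)
print(df.count())  # expected output: 1000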

Secrets management

There are multiple options for credentials management in AWS. This example uses the AWS Systems Manager Parameter Store; the AWS Secrets Manager User Guide describes other methods of storing and retrieving secrets.

Create credentials

To store your Hydrolix cluster username and password as parameters using the AWS console, AWS CLI, or Tools for Windows PowerShell, see Creating Parameter Store parameters in Systems Manager.
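
For example, a hedged boto3 sketch like the following stores the two values as encrypted SecureString parameters; the region, parameter names, and values are hypothetical placeholders. Note that the EMR instance profile role must be allowed to read these parameters (for example, via an attached policy granting ssm:GetParameter), which you can add in the IAM step described earlier.

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")  # example region

# Store the Hydrolix credentials as encrypted SecureString parameters.
ssm.put_parameter(Name="/hydrolix/username", Value="spark-user@example.com",
                  Type="SecureString", Overwrite=True)
ssm.put_parameter(Name="/hydrolix/password", Value="example-password",
                  Type="SecureString", Overwrite=True)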

Set PySpark notebook credentials

  1. Open the EMR Studio attached to your EMR cluster.
  2. Using the PySpark notebook kernel, configure a Python virtual environment so you can install the boto3 library:
%%configure -f
{
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.python": "/usr/bin/python3"
  }
}
  3. Then install boto3 into the session's virtual environment:
sc.install_pypi_package("boto3")

Running this cell starts the Spark session and installs boto3 into the notebook's virtual environment.

  4. To get the credentials and set them in the Spark session configuration, use this Python snippet:
import boto3

# Read the stored credentials from AWS Systems Manager Parameter Store.
ssm = boto3.client("ssm", region_name="{region}")
username = ssm.get_parameter(Name="{/path/to/username}", WithDecryption=True)["Parameter"]["Value"]
password = ssm.get_parameter(Name="{/path/to/password}", WithDecryption=True)["Parameter"]["Value"]

# Pass the credentials to the Hydrolix Spark catalog for this session.
spark.conf.set("spark.sql.catalog.hydrolix.username", username)
spark.conf.set("spark.sql.catalog.hydrolix.password", password)

Verification

Run

spark.sql("use hydrolix").show()

to verify that you can successfully log in to your Hydrolix cluster.
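
Once the hydrolix catalog is selected, you can explore and query your data; the project and table names below are hypothetical placeholders:

# List the projects (namespaces) and tables exposed through the Hydrolix catalog.
spark.sql("show databases").show()
spark.sql("show tables in my_project").show()  # hypothetical project name

# Run a sample aggregation against a hypothetical table.
spark.sql("select count(*) from my_project.my_table").show()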