AWS S3 Notifications

Auto-ingest continually ingests new files from AWS S3 into Hydrolix. To use auto-ingest, create an SQS queue, configure the Hydrolix table settings, and finally enable event notifications on the S3 bucket.

Create an SQS queue

Amazon provides documentation on creating an SQS queue.

Create/Update a table and transform

When creating or updating a table, configure the autoingest component (autoingest: "enabled": true, "pattern": "some regex", "source": "sqs://my-queue").
Then create a default transform ("is_default": true).

Hydrolix is now waiting for S3 event notifications that indicate which files to ingest automatically. The next step is to configure AWS S3 to generate those events and send them to auto-ingest. Navigate to the root of the S3 bucket containing your files and create an event notification from the Properties tab. You will specify a name, prefix, event type, and destination (the same queue as the table's source property).

Note the Autoingest service only loads files from the

Note when using CloudFormation:

Make sure the S3 bucket has been added to --bucket-allowlist by your Hydrolix administrator.

Note when using EKS:

If you are using auto-ingest with EKS, you need to update the policy to access SQS and the source S3 bucket.

Example Table Request with auto-ingest enabled:

{
    "name": "example",
    "description": "autoingest example",
    "settings": {
        "autoingest": {
            "enabled": true,
            "source": "sqs://my-sqs-queue",
            "pattern": "^s3://mybucket/level1/2020/\\d{2}/.*/log_.*.gz",
            "max_rows_per_partition": 1000000,
            "max_minutes_per_partition": 5,
            "input_aggregation": 4
        }
    }
}
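The request body above can also be assembled programmatically. The following is a minimal sketch in Python using only the standard library. The helper name `build_table_request` is hypothetical and not part of any Hydrolix client; it only builds the JSON document and does not send it anywhere. It also shows that `json.dumps` takes care of escaping the backslashes in the pattern.

```python
import json


def build_table_request(name, description, queue_name, pattern, **limits):
    """Return a table request body (dict) with auto-ingest enabled.

    Hypothetical helper for illustration; it only assembles the JSON
    document shown above and does not call any Hydrolix API.
    """
    autoingest = {
        "enabled": True,
        # The source must carry the sqs:// prefix.
        "source": f"sqs://{queue_name}",
        "pattern": pattern,
    }
    # Optional tuning keys such as max_rows_per_partition.
    autoingest.update(limits)
    return {
        "name": name,
        "description": description,
        "settings": {"autoingest": autoingest},
    }


body = build_table_request(
    "example",
    "autoingest example",
    "my-sqs-queue",
    # Raw string: single backslashes here become \\ in the JSON text.
    r"^s3://mybucket/level1/2020/\d{2}/.*/log_.*.gz",
    max_rows_per_partition=1000000,
    max_minutes_per_partition=5,
    input_aggregation=4,
)
payload = json.dumps(body, indent=4)
print(payload)
```

Writing the pattern as a raw string keeps it readable in code, while the serialized payload contains the doubled backslashes that JSON requires.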

Set up policy (EKS only)

By default, the policy in Prepare EKS Cluster only allows access to the S3 bucket where Hydrolix stores its data. To set up auto-ingest, update the policy to allow access to the SQS queue and to the source S3 bucket you will load data from.

The following is an example policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:GetQueueUrl",
                "sqs:ChangeMessageVisibility",
                "sqs:ReceiveMessage",
                "sqs:SendMessage",
                "sqs:GetQueueAttributes",
                "s3:ListBucket",
                "sqs:ListQueueTags",
                "sqs:ListDeadLetterSourceQueues",
                "sqs:PurgeQueue",
                "sqs:DeleteQueue",
                "sqs:CreateQueue",
                "sqs:SetQueueAttributes"
            ],
            "Resource": [
                "arn:aws:sqs:$REGION:$ACCOUNTID:$SQS_QUEUE_NAME",
                "arn:aws:s3:::$CLIENT_ID",
                "arn:aws:s3:::hdx-public",
                "arn:aws:s3:::$SOURCE_BUCKET"
            ]
        },
        {
            "Sid": "2",
            "Effect": "Allow",
            "Action": "s3:*Object",
            "Resource": [
                "arn:aws:s3:::$CLIENT_ID/*",
                "arn:aws:s3:::hdx-public/*",
                "arn:aws:s3:::$SOURCE_BUCKET/*"
            ]
        },
        {
            "Sid": "3",
            "Effect": "Allow",
            "Action": "sqs:ListQueues",
            "Resource": "*"
        }
    ]
}
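The `$REGION`, `$ACCOUNTID`, `$SQS_QUEUE_NAME`, `$CLIENT_ID`, and `$SOURCE_BUCKET` placeholders must be replaced with real values before the policy is applied. One possible way to do this is sketched below with Python's `string.Template`. The template is a shortened fragment of the policy above, the function name `render_policy` is hypothetical, and all values passed in are made up for illustration.

```python
import json
from string import Template

# Shortened fragment of the policy above, for illustration only;
# in practice the full policy document would be used as the template.
POLICY_TEMPLATE = Template("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage", "s3:ListBucket"],
            "Resource": [
                "arn:aws:sqs:$REGION:$ACCOUNTID:$SQS_QUEUE_NAME",
                "arn:aws:s3:::$CLIENT_ID",
                "arn:aws:s3:::$SOURCE_BUCKET"
            ]
        }
    ]
}
""")


def render_policy(**values):
    """Substitute the $PLACEHOLDER names and parse the result as JSON.

    Template.substitute raises KeyError if a placeholder is missing,
    and json.loads raises if the rendered text is not valid JSON.
    """
    return json.loads(POLICY_TEMPLATE.substitute(**values))


# All values below are made-up examples.
policy = render_policy(
    REGION="us-east-1",
    ACCOUNTID="123456789012",
    SQS_QUEUE_NAME="my-sqs-queue",
    CLIENT_ID="hdxcli-example",
    SOURCE_BUCKET="mybucket",
)
print(json.dumps(policy, indent=4))
```

Using `substitute` rather than `safe_substitute` means a forgotten placeholder fails loudly instead of leaving a literal `$NAME` in the policy.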

Table Pattern

It is highly recommended to provide a specific regex pattern when setting an auto-ingest pattern on table requests. The auto-ingest service may be handling many tables enabled for auto-ingest, and it dispatches each ingest request to the first table whose pattern matches.
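To illustrate why a broad pattern is risky, here is a small sketch of first-match dispatch. It is not the Hydrolix implementation: the table names are hypothetical, the evaluation order is assumed to be simple list order, and Python's `re` module stands in for the service's regex engine.

```python
import re

# Hypothetical table names and patterns, checked in order.
TABLE_PATTERNS = [
    ("project.broad", r"^.*\.gz$"),
    ("project.app1", r"^s3://mybucket/level1/2020/\d{2}/app1/.*\.gz$"),
]


def dispatch(s3_path, table_patterns):
    """Return the name of the first table whose pattern matches, else None."""
    for table, pattern in table_patterns:
        if re.search(pattern, s3_path):
            return table
    return None


path = "s3://mybucket/level1/2020/01/app1/pattern_xyz.gz"
# The broad pattern is listed first, so it captures the file
# even though a more specific pattern exists.
print(dispatch(path, TABLE_PATTERNS))
# With only specific patterns, the intended table is selected.
print(dispatch(path, TABLE_PATTERNS[1:]))
```

Under these assumptions, a table with a catch-all pattern can receive files that were meant for another table.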

S3 event notifications contain the full S3 path, so the regex match starts from ^s3://. Consider the following example S3 path:
s3://mybucket/level1/2020/01/app1/pattern_xyz.gz

Possible patterns could be:

  • ^.*\\.gz$ is not recommended - too wide a match
  • ^s3://mybucket/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_\\w{3}.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_.*.gz

Note that the pattern is submitted in a JSON document, and JSON requires \ characters to be escaped, hence the \\ in the examples above. An online re2 regex tester can help validate patterns.
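The relationship between the regex itself and its JSON-escaped form can be checked with a short script. This is a sketch only: it uses Python's `re` module rather than re2, which behaves the same for the simple syntax used in these patterns, and it tests the recommended patterns from the list against the example path.

```python
import json
import re

path = "s3://mybucket/level1/2020/01/app1/pattern_xyz.gz"

# Patterns as the regex engine sees them (single backslashes).
patterns = [
    r"^s3://mybucket/level1/2020/\d{2}/app1/.*.gz",
    r"^.*/level1/2020/\d{2}/app1/.*.gz",
    r"^.*/level1/2020/\d{2}/.*/pattern_\w{3}.gz",
    r"^.*/level1/2020/\d{2}/.*/pattern_.*.gz",
]

for pattern in patterns:
    # json.dumps shows the escaped form to place in the request body.
    print(json.dumps(pattern), bool(re.search(pattern, path)))
```

Each printed line shows the pattern as it would appear inside the JSON request, followed by whether it matches the example path.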

Table Auto-ingest Attributes

Auto-ingest is defined in the settings.autoingest object within the table JSON request.

Element | Purpose
--- | ---
enabled | Default is false.
source | SQS queue name containing the S3 notifications. The name must be prefixed with sqs://. Default is an empty string.
pattern | The S3 event notification regex pattern. Default is an empty string.
max_rows_per_partition | The maximum row count per partition. Default is 33554432.
max_minutes_per_partition | The maximum number of minutes to hold in a partition. Default is 15.
max_active_partitions | The maximum number of active partitions. Default is 576.
input_aggregation | Controls how much data should be considered a single unit of work, which ultimately drives the size of the partition. Default is 1536000000.
dry_run | Default is false.
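As an illustration of how these attributes and defaults combine, the sketch below merges a partial settings.autoingest object over the documented defaults and checks the sqs:// prefix. It is not Hydrolix code: the function name `resolve_autoingest` is hypothetical, the validation rules are assumptions, and the key `enabled` is spelled as in the example table request.

```python
# Defaults as listed in the attribute table.
AUTOINGEST_DEFAULTS = {
    "enabled": False,
    "source": "",
    "pattern": "",
    "max_rows_per_partition": 33554432,
    "max_minutes_per_partition": 15,
    "max_active_partitions": 576,
    "input_aggregation": 1536000000,
    "dry_run": False,
}


def resolve_autoingest(settings):
    """Merge user settings over the documented defaults and sanity-check them."""
    unknown = set(settings) - set(AUTOINGEST_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown autoingest keys: {sorted(unknown)}")
    resolved = {**AUTOINGEST_DEFAULTS, **settings}
    # Assumed rule: an enabled table needs a queue with the sqs:// prefix.
    if resolved["enabled"] and not resolved["source"].startswith("sqs://"):
        raise ValueError("source must be prefixed with sqs://")
    return resolved


resolved = resolve_autoingest({
    "enabled": True,
    "source": "sqs://my-sqs-queue",
    "pattern": r"^s3://mybucket/level1/2020/\d{2}/.*/log_.*.gz",
    "max_minutes_per_partition": 5,
})
print(resolved["max_rows_per_partition"])
```

Keys that are omitted fall back to the defaults, so only the values that differ need to appear in the table request.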

Enabling Event Notifications

Enabling and configuring event notifications for an S3 bucket is detailed in the AWS documentation. The notifications are delivered to an SQS queue, which is then read by the batch-peers.

Fields required:

  • Event name. We recommend including the destination project and table.
  • Prefix. We recommend narrowing the prefix to a folder close to the data, or to the point where the pattern changes in more complex nested bucket structures.
  • Event type. All object create events.
  • Destination. SQS; search for client_id-autoingest or whichever queue you created.
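The same four fields can be expressed as an S3 notification configuration document when the setup is scripted instead of done in the console. The sketch below only builds that document in Python; the helper name `queue_notification`, the event name, the prefix, and the queue ARN are all made-up examples. With boto3, a dict of this shape is what `put_bucket_notification_configuration` accepts as `NotificationConfiguration`, though that call is not made here.

```python
import json


def queue_notification(event_name, queue_arn, prefix):
    """Build an S3 notification configuration that targets one SQS queue."""
    return {
        "QueueConfigurations": [
            {
                "Id": event_name,
                "QueueArn": queue_arn,
                # "All object create events" in the console.
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": prefix}]
                    }
                },
            }
        ]
    }


# Made-up names; the queue ARN must refer to the queue named in the table source.
config = queue_notification(
    "myproject-example",
    "arn:aws:sqs:us-east-1:123456789012:my-sqs-queue",
    "level1/2020/",
)
print(json.dumps(config, indent=4))
```

The prefix here is relative to the bucket, whereas the table pattern is matched against the full s3:// path.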
