AWS S3 Notifications

To enable autoingest of files using AWS S3 notifications, the following setup is required:

  • Create an SQS queue in the same region as your deployment. See AWS documentation
  • Grant SQS destination permission to S3 notification. See AWS documentation
  • Enable autoingest on the table and provide configuration details
  • Add SQS and destination S3 permissions to the cluster bucket policy
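
The second step above (granting S3 permission to deliver notifications to the queue) is done with an access policy attached to the SQS queue. The following is a minimal sketch of such a policy, built in Python so the placeholders are explicit. The function name, the statement Sid, and the sample region and account id are illustrative assumptions, and the sketch assumes the bucket and the queue belong to the same AWS account. Check the AWS documentation for the policy that fits your setup.

```python
import json


def s3_to_sqs_queue_policy(region, account_id, queue_name, bucket_name):
    """Build an SQS access policy that lets one S3 bucket send event notifications."""
    queue_arn = f"arn:aws:sqs:{region}:{account_id}:{queue_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowS3Notifications",  # hypothetical statement id
                "Effect": "Allow",
                # S3 delivers notifications as the s3.amazonaws.com service principal.
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "SQS:SendMessage",
                "Resource": queue_arn,
                "Condition": {
                    # Only accept messages that originate from this bucket...
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{bucket_name}"},
                    # ...owned by this account (assumed to be the queue's account).
                    "StringEquals": {"aws:SourceAccount": account_id},
                },
            }
        ],
    }


# Placeholder values; replace them with your own region, account id and names.
policy = s3_to_sqs_queue_policy(
    "us-east-1", "123456789012", "your-queue-name", "your-destination-bucket"
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the access policy of the queue in the AWS console.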

Enable autoingest on the table

To enable autoingest on a table, provide the following information in an API request, either when the table is created (POST) or when it is updated (PATCH).

{
  "name": "my-table",
  "settings": {
    "autoingest": [ {
      "enabled": true,
      "source": "sqs://your-queue-name",
      "pattern": "^s3://your-destination-bucket/level1/2020/\\d{2}/.*/log_.*.gz",  // <- optional regex expression to filter on
      "max_rows_per_partition": 12000000,
      "max_minutes_per_partition": 60,
      "max_active_partitions": 50
    }]
  }
}
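
As a rough illustration, the request body above could be assembled and attached to a PATCH request as follows. This is only a sketch: TABLE_URL is a hypothetical placeholder (the real table endpoint depends on your deployment), and the bearer-token Authorization header is an assumption about how the API is authenticated. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholder: the real table endpoint depends on your deployment.
TABLE_URL = "https://your-host.example/your-table-endpoint"


def build_autoingest_request(url, token, method="PATCH"):
    """Return an unsent HTTP request carrying the autoingest settings as JSON."""
    body = {
        "name": "my-table",
        "settings": {
            "autoingest": [
                {
                    "enabled": True,
                    "source": "sqs://your-queue-name",
                    # Optional regex filter. A raw string keeps single backslashes
                    # here; json.dumps doubles them in the serialized JSON text.
                    "pattern": r"^s3://your-destination-bucket/level1/2020/\d{2}/.*/log_.*.gz",
                    "max_rows_per_partition": 12000000,
                    "max_minutes_per_partition": 60,
                    "max_active_partitions": 50,
                }
            ]
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed authentication scheme
        },
    )


req = build_autoingest_request(TABLE_URL, token="your-api-token")
print(req.get_method())
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```

Serializing with json.dumps also avoids hand-written JSON mistakes such as trailing commas or inline comments, which strict JSON parsers reject.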

Add SQS and destination S3 permissions

You can update the cluster bucket policy directly in the AWS console. Log in to the console, select IAM > Policies, search for your-cluster-name-bucket, and choose Edit. Add the following statement so that the cluster has permission to subscribe to the queue.

  {
     "Sid": "SQSAccess",
     "Effect": "Allow",
     "Action": [
       "SQS:*"
     ],
    "Resource": "arn:aws:sqs:your-region:your-account-id:your-queue-name"
  }

The last step is to add the destination bucket to the list of S3 buckets the cluster has access to. If the bucket is in the same account as the cluster, append it to the existing statements:

...
      {
            "Action": "s3:ListBucket",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::your-cluster-name",
                "arn:aws:s3:::hdx-public",
                "arn:aws:s3:::your-destination-bucket" // <- add
            ],
            "Sid": "ListObjectsInBucket"
        },
        {
            "Action": "s3:*Object",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::your-cluster-name/*",
                "arn:aws:s3:::hdx-public/*",
                "arn:aws:s3:::your-destination-bucket/*" // <- add
            ],
            "Sid": "AllObjectActions"
        },
...
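
If you prefer to prepare the edited policy outside the console, the same change can be sketched in Python. The helper below is hypothetical and assumes the policy document has already been exported as a Python dictionary whose statements use the Sid values shown above. It only edits the document locally and does not call AWS.

```python
import copy


def add_bucket_to_policy(policy, bucket):
    """Return a copy of the policy with the bucket added to both S3 statements."""
    updated = copy.deepcopy(policy)  # leave the caller's document untouched
    additions = {
        "ListObjectsInBucket": f"arn:aws:s3:::{bucket}",   # the bucket itself
        "AllObjectActions": f"arn:aws:s3:::{bucket}/*",    # objects in the bucket
    }
    for statement in updated["Statement"]:
        arn = additions.get(statement.get("Sid"))
        if arn is None:
            continue  # not one of the two statements we need to extend
        resources = statement["Resource"]
        if isinstance(resources, str):
            resources = [resources]  # IAM allows a single string or a list
        if arn not in resources:
            resources = resources + [arn]
        statement["Resource"] = resources
    return updated


# Sample document mirroring the two statements shown above.
policy = {
    "Statement": [
        {"Sid": "ListObjectsInBucket", "Action": "s3:ListBucket", "Effect": "Allow",
         "Resource": ["arn:aws:s3:::your-cluster-name", "arn:aws:s3:::hdx-public"]},
        {"Sid": "AllObjectActions", "Action": "s3:*Object", "Effect": "Allow",
         "Resource": ["arn:aws:s3:::your-cluster-name/*", "arn:aws:s3:::hdx-public/*"]},
    ]
}
updated = add_bucket_to_policy(policy, "your-destination-bucket")
print(updated["Statement"][0]["Resource"])
print(updated["Statement"][1]["Resource"])
```

Running the helper twice with the same bucket leaves the policy unchanged, so it is safe to re-apply.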

Table autoingest pattern

It is highly recommended to provide a specific regex pattern when enabling autoingest on a table. The autoingest service may be handling many tables with autoingest enabled, and it dispatches each ingest request to the first table whose pattern matches.
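
To see why a broad pattern is risky, consider this toy model of first-match dispatch. It is not the service's implementation: the table names and the evaluation order are assumptions made up for illustration, and Python's re module stands in for the service's regex engine.

```python
import re


def dispatch(s3_path, table_patterns):
    """Return the first table whose pattern matches the path, or None."""
    for table, pattern in table_patterns:
        if re.search(pattern, s3_path):
            return table
    return None


# Hypothetical tables, listed in the order they are assumed to be evaluated.
tables = [
    ("all_gz", r"^.*\.gz$"),  # too broad: matches every .gz object
    ("app1_logs", r"^s3://mybucket/level1/2020/\d{2}/app1/.*\.gz"),
]

path = "s3://mybucket/level1/2020/01/app1/pattern_xyz.gz"
print(dispatch(path, tables))  # all_gz
```

In this model the file lands in all_gz even though app1_logs describes it more precisely, because the broad pattern is checked first.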

S3 event notifications contain the full S3 path, so the regex match starts from ^s3://. Consider the following example S3 path:
s3://mybucket/level1/2020/01/app1/pattern_xyz.gz

Possible patterns could be:

  • ^.*\\.gz$ is not recommended - too wide a match
  • ^s3://mybucket/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_\\w{3}.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_.*.gz

Note that the pattern is submitted in a JSON document. JSON requires \ characters to be escaped, hence the \\ in the examples above. An online re2 regex tester can be used to check a pattern.
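
The escaping can be checked locally with a short script. This sketch uses Python's json and re modules. Python's regex engine is not re2, but the constructs used in these examples behave the same way in both. The variable names are illustrative.

```python
import json
import re

EXAMPLE_PATH = "s3://mybucket/level1/2020/01/app1/pattern_xyz.gz"

# Raw JSON text, exactly as it would appear in the request body.
raw_json = r'{"pattern": "^.*/level1/2020/\\d{2}/.*/pattern_\\w{3}.gz"}'

# Decoding the JSON turns each \\ into a single \, which is what the regex needs.
pattern = json.loads(raw_json)["pattern"]
print(pattern)                                   # ^.*/level1/2020/\d{2}/.*/pattern_\w{3}.gz
print(bool(re.search(pattern, EXAMPLE_PATH)))    # True

# Going the other way: json.dumps adds the escaping for you.
print(json.dumps({"pattern": pattern}))
```

Writing the pattern with single backslashes and letting a JSON library serialize it avoids counting backslashes by hand.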

Table autoingest attributes

Autoingest is defined in the settings.autoingest object within the table JSON request.

Element                     Purpose
enabled                     Default is false.
source                      SQS queue name containing the S3 notifications. The name must be prefixed with sqs://. Default is an empty string.
pattern                     The S3 event notification regex pattern. Default is an empty string.
max_rows_per_partition      The maximum number of rows per partition. Default is 33554432.
max_minutes_per_partition   The maximum number of minutes to hold in a partition. Default is 15.
max_active_partitions       The maximum number of active partitions. Default is 576.
input_aggregation           Controls how much data should be considered a single unit of work, which ultimately drives the size of the partition. Default is 1536000000.
dry_run                     Default is false.