AWS S3 Notify (aka Auto-Ingest)

Auto-ingest continually ingests new files into Hydrolix from AWS S3. To use auto-ingest, first configure Hydrolix, then enable event notifications in the AWS S3 console. The Hydrolix configuration has two steps:

  1. Create a table with auto-ingest settings ("enabled": true and "pattern": "some regex").
  2. Create a transform ("is_default": true).

Hydrolix is now awaiting S3 event notifications indicating files to automatically ingest. The next step is to configure AWS S3 to generate those events and send them to auto-ingest.

  1. Navigate to the root S3 bucket containing your files, and create an event notification from the Properties tab. You will specify a name, prefix, event type, and destination.

Note: Make sure the S3 bucket has been added to --bucket-allowlist by your Hydrolix administrator.

Example Table Request with auto-ingest enabled:

{
    "name": "example",
    "project": "{{project_uuid}}",
    "description": "autoingest example",
    "settings": {
        "autoingest": {
            "enabled": true,
            "pattern": "^s3://mybucket/level1/2020/\\d{2}/.*/log_.*.gz",
            "max_rows_per_partition": 1000000,
            "max_minutes_per_partition": 5,
            "input_aggregation": 4
        }
    }
}

Table Pattern


We highly recommend making the auto-ingest pattern on the table request as specific as possible. The auto-ingest service may be handling many tables enabled for auto-ingest, and it dispatches each ingest request to the first table whose pattern matches.
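The effect of first-match dispatch can be illustrated with a short Python sketch. This is not the auto-ingest service's code: the table names, the order in which tables are checked, and the `dispatch` helper are all hypothetical, and Python's `re` module stands in for the service's regex engine.

```python
import re

# Hypothetical tables and patterns, in the (assumed) order they are checked.
TABLE_PATTERNS = [
    ("logs.broad", r"^s3://mybucket/.*\.gz$"),
    ("logs.app1", r"^s3://mybucket/level1/2020/\d{2}/app1/.*\.gz$"),
]

def dispatch(s3_path, table_patterns=TABLE_PATTERNS):
    """Return the first table whose pattern matches the path, or None."""
    for table, pattern in table_patterns:
        if re.search(pattern, s3_path):
            return table
    return None

path = "s3://mybucket/level1/2020/01/app1/pattern_xyz.gz"
# The broad pattern is checked first, so it captures the file
# even though a more specific table also matches.
print(dispatch(path))
```

In this sketch the broad pattern takes a file that was probably intended for the more specific table, which is why overlapping patterns are best avoided.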

S3 event notifications contain the full S3 path, so the regex match starts from ^s3://. Consider the following example S3 path:
s3://mybucket/level1/2020/01/app1/pattern_xyz.gz

Possible patterns could be:

  • ^.*\\.gz$ is not recommended - too wide a match
  • ^s3://mybucket/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_\\w{3}.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_.*.gz
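One way to check candidate patterns before putting them in a table request is to run them against sample paths. The sketch below does this with Python's `re` module, which is only an approximation of the RE2 engine mentioned on this page (the constructs used here behave the same in both). The patterns are written with single backslashes, as the regex engine sees them, and the second sample path is a made-up example.

```python
import re

EXAMPLE_PATH = "s3://mybucket/level1/2020/01/app1/pattern_xyz.gz"

# The patterns listed above, unescaped (single backslashes).
CANDIDATES = [
    r"^.*\.gz$",
    r"^s3://mybucket/level1/2020/\d{2}/app1/.*.gz",
    r"^.*/level1/2020/\d{2}/app1/.*.gz",
    r"^.*/level1/2020/\d{2}/.*/pattern_\w{3}.gz",
    r"^.*/level1/2020/\d{2}/.*/pattern_.*.gz",
]

def matching_patterns(path, patterns=CANDIDATES):
    """Return the patterns that match the given S3 path."""
    return [p for p in patterns if re.search(p, path)]

# Every candidate matches the example path.
for p in matching_patterns(EXAMPLE_PATH):
    print("matches example:", p)

# Only the broad pattern matches an unrelated (hypothetical) file.
for p in matching_patterns("s3://otherbucket/tmp/archive.gz"):
    print("matches unrelated file:", p)
```

The second loop shows why the first pattern is too wide: it accepts any .gz object from any bucket.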

Note that the pattern is submitted in a JSON document, and JSON requires \ characters to be escaped, hence the \\ in the examples above. An online RE2 regex tester can be used to check a pattern.
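The escaping can be left to a JSON library instead of being done by hand. The following Python sketch, using only the standard library, shows the round trip for the pattern from the example table request; the sample file path is a made-up example.

```python
import json
import re

# The pattern as the regex engine should see it (single backslash).
pattern = r"^s3://mybucket/level1/2020/\d{2}/.*/log_.*.gz"

# Serialising to JSON doubles each backslash automatically.
body = json.dumps({"pattern": pattern})
print(body)

# Parsing the JSON document restores the single-backslash pattern.
restored = json.loads(body)["pattern"]
print(restored == pattern)

# The restored pattern matches a hypothetical file under the expected layout.
print(bool(re.search(restored, "s3://mybucket/level1/2020/01/app1/log_a.gz")))
```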

Table Auto-ingest Attributes

Auto-ingest is defined in the settings.autoingest object within the table JSON request.

Element                     Purpose
enabled                     Default is false
pattern                     The S3 event notification regex pattern. Default is an empty string
max_rows_per_partition      The maximum row count per partition. Default 33554432
max_minutes_per_partition   The maximum number of minutes to hold in a partition. Default 15
max_active_partitions       Maximum number of active partitions. Default 576
input_aggregation           Controls how much data should be considered a single unit of work, which ultimately drives the size of the partition. Default 1536000000
dry_run                     Default is false
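As an illustration of how these attributes combine, the Python sketch below overlays user-supplied values on the documented defaults to show the effective settings. The `autoingest_settings` helper is hypothetical, the key `enabled` follows the example table request, and nothing here implies that the API requires omitted attributes to be filled in by the client.

```python
import json

# Defaults as documented in the attribute list above.
AUTOINGEST_DEFAULTS = {
    "enabled": False,
    "pattern": "",
    "max_rows_per_partition": 33554432,
    "max_minutes_per_partition": 15,
    "max_active_partitions": 576,
    "input_aggregation": 1536000000,
    "dry_run": False,
}

def autoingest_settings(**overrides):
    """Overlay overrides on the defaults, rejecting unknown attribute names."""
    unknown = set(overrides) - set(AUTOINGEST_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown autoingest attributes: {sorted(unknown)}")
    return {**AUTOINGEST_DEFAULTS, **overrides}

settings = autoingest_settings(
    enabled=True,
    pattern=r"^s3://mybucket/level1/2020/\d{2}/.*/log_.*.gz",
    max_minutes_per_partition=5,
)
# json.dumps handles the backslash escaping in the pattern.
print(json.dumps({"settings": {"autoingest": settings}}, indent=4))
```

Rejecting unknown names catches misspelled attributes before a request is sent.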

Enabling Event Notifications

Enabling and configuring event notifications for an S3 bucket is described in the AWS S3 documentation. The notifications are sent to an SQS queue, which is then read by batch-peers.

Fields required:

  • Event name. We recommend including the destination project and table.
  • Prefix. We recommend narrowing this to a path close to the data, or to the point where the pattern changes in more complex nested bucket structures.
  • Event type. All object create events.
  • Destination. SQS queue; search for client_id-autoingest.
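The same four fields can also be expressed as an S3 notification configuration for scripted setups. The Python sketch below only builds the configuration dictionary in the shape used by the S3 PutBucketNotificationConfiguration API. The event name, queue ARN (including region and account ID), and prefix are placeholder assumptions, and the `queue_notification_config` helper is hypothetical.

```python
def queue_notification_config(event_name, queue_arn, prefix):
    """Build an S3 notification configuration with one SQS destination."""
    return {
        "QueueConfigurations": [
            {
                "Id": event_name,                  # Event name
                "QueueArn": queue_arn,             # Destination (SQS queue)
                "Events": ["s3:ObjectCreated:*"],  # All object create events
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": prefix}]
                    }
                },
            }
        ]
    }

# Placeholder values; substitute your own project, table, queue, and prefix.
config = queue_notification_config(
    event_name="myproject-example",
    queue_arn="arn:aws:sqs:us-east-1:123456789012:client_id-autoingest",
    prefix="level1/2020/",
)
print(config)

# With boto3 installed and AWS credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="mybucket", NotificationConfiguration=config)
```

Applying a notification configuration through the API replaces the bucket's existing notification configuration, so any notifications already defined on the bucket would need to be included in the same call.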
