Batch Ingestion - Jobs

Batch ingestion is used to ingest multiple files at once into Hydrolix. To use batch ingestion, you must:

  1. Create a table
  2. Create a transform
  3. Submit a batch job request via the UI or API

You can submit multiple batch jobs for the same table as required.

Note: make sure your Hydrolix administrator has added the S3 bucket to --bucket-allowlist.

Example Job Request

{
	"type": "batch_import",
	"name": "job sample trips",
	"description": "sample_trips batchjob",
	"settings": {
		"max_rows_per_partition": 10000000,
		"max_minutes_per_partition": 14400,
		"source": {
			"table": "sample.trips",
			"type": "batch",
			"subtype": "aws s3",
			"transform": "transform_trips",
			"settings": {
				"url": "s3://mydatatoingest"
			}
		}
	}
}
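
As a minimal sketch, the request above could be submitted with Python and the requests library. The endpoint path below is an assumption based on the v1 config API pattern, and the host, organization ID, and token are placeholders; verify the exact route against your deployment's API documentation.

import requests

# Sketch of submitting the example job request above.
# HOST, ORG_ID, and TOKEN are placeholders; the endpoint path is an
# assumption based on the v1 config API pattern.
HOST = "https://my-hydrolix.example.com"
ORG_ID = "my-org-uuid"
TOKEN = "my-api-token"

job = {
    "type": "batch_import",
    "name": "job sample trips",
    "description": "sample_trips batchjob",
    "settings": {
        "max_rows_per_partition": 10000000,
        "max_minutes_per_partition": 14400,
        "source": {
            "table": "sample.trips",
            "type": "batch",
            "subtype": "aws s3",
            "transform": "transform_trips",
            "settings": {"url": "s3://mydatatoingest"},
        },
    },
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG_ID}/jobs/batch/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job,
)
resp.raise_for_status()
print(resp.json())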

Example ingest use cases based on the url path

  • A single file, e.g. "s3://mybucket/another/file.gz"
  • All files in a single bucket, e.g. "s3://mybucket/another/"
  • All files matching a regex pattern, e.g. "s3://mybucket/" along with "settings.regex_filter": "^s3://mybucket/.*/.*.gz" (see the sketch after this list)
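
For illustration, the corresponding source.settings payloads for these three cases might look like the following; the bucket names and paths are placeholders.

# Illustrative source.settings payloads; bucket names and paths are placeholders.

single_file = {"url": "s3://mybucket/another/file.gz"}

all_files_under_prefix = {"url": "s3://mybucket/another/"}

# regex_filter narrows the listing under the url prefix; inside a JSON job
# body, backslashes in the pattern must be escaped (see Regex Filter below).
regex_filtered = {
    "url": "s3://mybucket/",
    "regex_filter": "^s3://mybucket/.*/.*\\.gz",
}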

Job Attributes

A transform describes the shape of individual pieces of data. A job describes how to treat the data set as a whole as it is being ingested.

| Element | Purpose |
|---|---|
| name | A unique name for this job within the organization |
| description | An optional description |
| type | Only accepts the value batch_import |
| settings | The settings to use for this particular ingestion job |

settings

Example Settings

{
	...
	"settings": {
		"max_rows_per_partition": 10000000,
		"max_minutes_per_partition": 14400,
		"source": {
			...
		}
	}
}

Some data sets consist of many small files; others consist of fewer, larger files. Hydrolix ultimately writes data into “partitions”, and the number and size of those partitions influence query performance.

What is best for each data set is an “it depends” answer; however, consider the following (a sizing sketch follows this list):

  1. A partition is a single unit to be processed. This means that queries over large partitions cannot be parallelized as much as queries over smaller ones.
  2. Smaller partitions allow more parallelization, but also mean less efficient use of resources.
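
As a rough illustration of the trade-off, here is some back-of-the-envelope arithmetic; all numbers are hypothetical.

# Hypothetical sizing arithmetic: how max_rows_per_partition shapes the
# number of partitions produced from a fixed data set.
total_rows = 1_000_000_000  # rows in the batch (hypothetical)

for max_rows in (10_000_000, 100_000_000):
    partitions = -(-total_rows // max_rows)  # ceiling division
    print(f"max_rows_per_partition={max_rows:,} -> ~{partitions} partitions")

# More partitions allow more query parallelism, but each partition carries
# per-partition overhead; fewer, larger partitions amortize that overhead
# at the cost of parallelism.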

Regex Filter

If data is stored in a complex bucket structure on AWS S3 that cannot be expressed with a simple S3 path, regex_filter lets you express the structure pattern to search. It is used in conjunction with settings.url, which narrows down the scope of the search.

Given the example S3 source path s3://mybucket/level1/2020/01/app1/pattern_xyz.gz and the setting "url": "s3://mybucket/", possible regex_filter patterns include:

  • ^.*\\.gz$
  • ^s3://mybucket/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/app1/.*.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_\\w{3}.gz
  • ^.*/level1/2020/\\d{2}/.*/pattern_.*.gz

| Element | Description |
|---|---|
| regex_filter | Filters the files to ingest using a regex match. Note that backslashes (‘\’) must be escaped within the regex string. The pattern is matched starting from s3:// |
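
As a local sketch of how such a filter selects objects (this is ordinary Python regex matching, not the Hydrolix matcher itself): the pattern is applied to the full s3:// path, and each backslash must be doubled when written inside a JSON string.

import re

# One of the example patterns above; written as a JSON string it would be
# "^.*/level1/2020/\\d{2}/.*/pattern_\\w{3}\\.gz".
pattern = re.compile(r"^.*/level1/2020/\d{2}/.*/pattern_\w{3}\.gz")

paths = [
    "s3://mybucket/level1/2020/01/app1/pattern_xyz.gz",   # matches
    "s3://mybucket/level1/2019/01/app1/pattern_xyz.gz",   # wrong year
    "s3://mybucket/level1/2020/01/app1/other.gz",         # wrong file name
]

for p in paths:
    print(p, "->", "ingest" if pattern.match(p) else "skip")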

Tuning Input Jobs

To tune partition size based on the input data files, the following parameters are provided (an illustrative combination follows the table):

| Element | Description | Default |
|---|---|---|
| dry_run | Whether or not the job is a dry run. If true, all indexing work is done but no results are uploaded; the resulting HDX partitions are effectively thrown away. | |
| input_aggregation | Controls how much data is considered a single unit of work, which ultimately drives the size of the partition. Files larger than input_aggregation are processed as a single unit of work. | 1536000000 |
| input_concurrency | Restricts the number of batch peer processes run on a single instance. This should be kept at 1; contact Hydrolix if you wish to change it. | 1 |
| max_active_partitions | Maximum number of active partitions. | 576 |
| max_minutes_per_partition | The maximum number of minutes of data to hold in a partition. For dense data sets, five minutes of data may be massive; in other data sets, two weeks of data may be required for the same volume. The velocity of your data influences this value. | 15 |
| max_rows_per_partition | Caps the number of rows per partition; based on the width of your data, this controls the total data size of the partition. | 33554432 |
| max_files | Number of files to dispatch to peers. Limiting is typically only used for testing; in general this should not be set, so that the entire bucket is processed. | 0 (disabled) |
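
For instance, a settings block combining several of these parameters might look like the following; the values are illustrative, not recommendations.

# Illustrative tuning values; size them against your own data set and
# instance memory rather than copying them verbatim.
settings = {
    "max_rows_per_partition": 33554432,   # cap partition size by row count
    "max_minutes_per_partition": 1440,    # at most one day of data per partition
    "max_active_partitions": 576,         # partitions in flight per batch peer
    "input_aggregation": 1536000000,      # bytes grouped into one unit of work
    "dry_run": True,                      # index everything, upload nothing
}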

Ingest Parallelization

Batch ingest is performed on compute instances. Batch performance can be improved by:

  1. Adding more batch instances
  2. Adding larger batch instances with more parallelism

Each scenario has the potential to be different. The type and number of instances can be adjusted via Hydrolix configuration. max_active_partitions tells Hydrolix how many partitions it should work on in parallel at one time.

| Element | Description |
|---|---|
| max_active_partitions | Total number of partitions that should be processing on a single batch peer at a time; this is a balance of speed and memory. |

source

Example source

{
	...
	"source": {
		"table": "sample.trips",
		"type": "batch",
		"subtype": "aws s3",
		"transform": "transform_trips",
		"settings": {
			"url": "s3://mydatatoingest"
		}
	}
	...
}

The source element specifies information about the data itself, where it is, and how it should be treated.

| Element | Purpose |
|---|---|
| table | The table where the data should go. The format is <project_name>.<table_name> |
| type | Only accepts the value batch |
| subtype | Currently only accepts aws s3 |
| transform | The name of an existing transform to use for this job |
| settings.url | The S3 path of the files to be ingested. The given location is analyzed and all files within it are ingested. |

Cancel Jobs

Use the cancel jobs endpoint to cancel a batch ingest job and the tasks associated with its job ID. The cancellation is reflected in the status output.
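
As a hedged sketch, a cancellation request might look like the following in Python; the endpoint path is an assumption based on the v1 config API pattern, and the host, IDs, and token are placeholders.

import requests

# HOST, ORG_ID, JOB_ID, and TOKEN are placeholders; the cancel path is an
# assumption, so verify it against your deployment's API reference.
HOST = "https://my-hydrolix.example.com"
ORG_ID = "my-org-uuid"
JOB_ID = "my-job-uuid"
TOKEN = "my-api-token"

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG_ID}/jobs/batch/{JOB_ID}/cancel",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()  # 404: job not found, 405: not a POST, 500: internal error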

Response codes

| Code | Description |
|---|---|
| 200 | Success |
| 404 | Job not found |
| 405 | Request was not a POST |
| 500 | Internal error |

Jobs Status

Get the status of a job and its tasks. This endpoint is suitable for polling for job completion (see the polling sketch at the end of this section).

Response codes

| Code | Description |
|---|---|
| 200 | Success |
| 404 | Job not found |
| 405 | Request was not a GET |
| 500 | Internal error |

Response body on success

{
  "status": "RUNNING",
  "status_detail": {
    "tasks": {
      "INDEX": {
        "READY": 5
      },
      "LIST": {
        "DONE": 1
      }
    },
    "percent_complete": 0.16666667,
    "estimated": false
  }
}

| Key | Description | Optional |
|---|---|---|
| .status | Status of the job. One of READY, RUNNING, DONE, or CANCELED | No |
| .status_detail | In-depth task status information, present if tasks exist | Yes |
| .status_detail.tasks | Aggregations of task types and states | No |
| .status_detail.percent_complete | Job progress percentage as a float | No |
| .status_detail.estimated | Whether or not the progress is estimated. Once all listing tasks are complete, the progress percentage is no longer estimated | No |
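
Putting the fields above together, a polling loop might look like the following sketch; as before, the status endpoint path is an assumption based on the v1 config API pattern, and the host, IDs, and token are placeholders.

import time

import requests

# Placeholders; verify the status route against your API reference.
HOST = "https://my-hydrolix.example.com"
ORG_ID = "my-org-uuid"
JOB_ID = "my-job-uuid"
TOKEN = "my-api-token"

while True:
    resp = requests.get(
        f"{HOST}/config/v1/orgs/{ORG_ID}/jobs/batch/{JOB_ID}/status",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    body = resp.json()
    detail = body.get("status_detail", {})
    print(body["status"], detail.get("percent_complete"))
    if body["status"] in ("DONE", "CANCELED"):
        break
    time.sleep(10)  # poll at a modest interval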