Run the batch ingest job API

A batch job is a one-off S3 ingest task. The source can be a single file, a directory of many files, or a regular expression that matches many directories or files. Batch notification (continuous S3 loading as new files arrive) is a separate feature and is configured on the table.
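
To get a feel for how a pattern selects files, the following Python sketch previews matches locally. It is not part of the Hydrolix API: the helper name preview_matches, the sample object paths, and the assumption that the pattern is searched against the full object path are all illustrative. Check how your Hydrolix deployment applies the pattern before relying on it.

```python
import re


def preview_matches(pattern: str, keys: list[str]) -> list[str]:
    """Return the keys in which the regular expression matches anywhere."""
    compiled = re.compile(pattern)
    return [key for key in keys if compiled.search(key)]


# Hypothetical object paths, for illustration only.
sample_keys = [
    "s3://root-bucket/folder1/2021/06/events-01.json.gz",
    "s3://root-bucket/folder1/2021/06/events-02.json.gz",
    "s3://root-bucket/folder1/2021/07/readme.txt",
    "s3://root-bucket/folder2/2021/06/events-01.json.gz",
]

# Matches gzipped event files in any two-digit month directory under folder1/2021.
pattern = r"folder1/2021/\d{2}/events-\d+\.json\.gz$"
matched = preview_matches(pattern, sample_keys)
```

With the sample data above, only the two events files under folder1 are selected.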

You can submit a simple batch job with the batch job API.

curl -X POST 'https://my-domain.hydrolix.live/config/v1/orgs/{{org_uuid}}/jobs/batch' \
-H 'Authorization: Bearer thebearertoken1234567890abcdefghijklmnopqrstuvwxyz' \
-H 'Content-Type: application/json' \
-d '{
    "name": "My Batch job",
    "description": "A Description of my job",
    "type": "batch_import",
    "settings": {
        "source": {
            "table": "website.events",
            "type": "batch",
            "subtype": "aws s3",
            "transform": "mytransformname",
            "settings": {
                "url": "s3://root-bucket/folder1/"
            }
        }
    }
}'

The API responds with the stored job, including its uuid, its default settings, and its status. The response below is illustrative, so values such as name and url differ from the request above.

{
      "name": "myjob",
      "description": "my job",
      "uuid": "888ba890-3ece-403d-9753-4edd754bef61",
      "created": "2021-06-03T12:28:41.967285Z",
      "modified": "2021-06-03T12:28:42.260107Z",
      "settings": {
         "max_active_partitions": 576,
         "max_rows_per_partition": 33554432,
         "max_minutes_per_partition": 15,
         "input_concurrency": 1,
         "input_aggregation": 1536000000,
         "max_files": 0,
         "dry_run": false,
         "regex_filter": "",
         "source": {
            "type": "batch",
            "subtype": "aws s3",
            "transform": "mytransform",
            "table": "website.events",
            "settings": {
               "url": "job10"
            }
         }
      },
      "status": "ready",
      "type": "batch_import",
      "org": "d1234567-1234-1234-abcd-defgh123456",
      "details": {
         "errors": [],
         "job_id": "jobid-1234-abcdejhijklm",
         "duration_ms": 115,
         "status_detail": {
            "tasks": {
               "LIST": {
                  "READY": 1
               }
            },
            "estimated": true,
            "percent_complete": 0
         }
      }
}

The status field in the response should show that the job is ready.
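
If you submit jobs from a script, the request body can be assembled programmatically. The sketch below mirrors the fields of the curl example and checks a decoded response for the ready status. The function names build_batch_job and job_is_ready are hypothetical helpers, not part of any Hydrolix client library, and sending the request is left to whichever HTTP client you use.

```python
import json


def build_batch_job(name, description, table, transform, url, subtype="aws s3"):
    """Build the JSON body for a batch job, mirroring the curl example."""
    return {
        "name": name,
        "description": description,
        "type": "batch_import",
        "settings": {
            "source": {
                "table": table,
                "type": "batch",
                "subtype": subtype,
                "transform": transform,
                "settings": {"url": url},
            }
        },
    }


def job_is_ready(response_body: dict) -> bool:
    """Return True when the submit response reports a status of "ready"."""
    return response_body.get("status") == "ready"


payload = build_batch_job(
    name="My Batch job",
    description="A Description of my job",
    table="website.events",
    transform="mytransformname",
    url="s3://root-bucket/folder1/",
)
body = json.dumps(payload)  # string to send as the POST body
```

Keep the uuid from the response, because the status request needs it.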

❗️

Did it fail? Two gotchas:

S3 bucket access: Make sure your Hydrolix administrator has added the S3 bucket to --bucket-allowlist. Otherwise the batch nodes can't retrieve the files.

Batch peers not scaled up: Did you scale the batch peers? If not, go back to the scaling documentation for your cloud environment (AWS or Kubernetes GKE) to find out how.

Getting the status of your job

You can use the job-status API to check if your job is finished. Hydrolix regularly updates this endpoint with information about running jobs.

curl -X POST 'https://{{hostname}}.hydrolix.live/config/v1/orgs/{{org_uuid}}/jobs/batch/{{job_uuid}}/status' \
-H 'Authorization: Bearer thebearertoken1234567890abcdefghijklmnopqrstuvwxyz'
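
Because the endpoint is updated while the job runs, a script can repeat the request above until the status is done. The following is a minimal polling sketch under a few assumptions: wait_for_job is a hypothetical helper, fetch_status is a function you supply that performs the status request and returns the decoded JSON, and the interval and timeout defaults are arbitrary. It only recognizes the done status described on this page, so any other final state ends in a timeout.

```python
import time


def wait_for_job(fetch_status, interval_s=10.0, timeout_s=3600.0, sleep=time.sleep):
    """Call fetch_status() until the job reports "done" or the timeout passes.

    fetch_status is caller-supplied: it performs the status request and
    returns the decoded JSON. The response may be a job object or a
    one-element list wrapping it, so both shapes are accepted.
    """
    waited = 0.0
    while True:
        job = fetch_status()
        if isinstance(job, list):
            job = job[0]  # unwrap a list-wrapped response
        if job.get("status") == "done":
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job not done after {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s
```

Passing the fetch function in keeps the loop independent of the HTTP client and makes it easy to exercise with canned responses.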

When the job is complete, you will get a response like the following example with a status of done.

[
{
      "name": "myjob",
      "description": "my job",
      "uuid": "888ba890-3ece-403d-9753-4edd754bef61",
      "created": "2021-07-28T15:16:10.663363Z",
      "modified": "2021-07-28T15:42:56.511741Z",
      "settings": {
         "max_active_partitions": 576,
         "max_rows_per_partition": 20000000,
         "max_minutes_per_partition": 15,
         "input_concurrency": 1,
         "input_aggregation": 1536000000,
         "max_files": 0,
         "dry_run": false,
         "regex_filter": "",
         "source": {
            "type": "batch",
            "subtype": "aws s3",
            "transform": "mytransform",
            "table": "myproject.mytable",
            "settings": {
               "url": "s3://mys3/path/goes/here/"
            }
         }
      },
      "status": "done",
      "type": "batch_import",
      "org": "d1234567-1234-1234-abcd-defgh123456",
      "details": {
         "errors": [],
         "job_id": "jobid-1234-abcdejhijklm",
         "duration_ms": 7194,
         "status_detail": {
            "tasks": {
               "LIST": {
                  "DONE": 1
               },
               "INDEX": {
                  "DONE": 30
               }
            },
            "estimated": false,
            "percent_complete": 1
         }
      }
   }
]
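
The details.status_detail object carries the per-task counts and the completion figure. As a rough sketch, the hypothetical helper below (summarize_job is not a Hydrolix function) condenses a response like the one above into a single line. It accepts either a bare job object or the list-wrapped form shown in the example.

```python
def summarize_job(response):
    """Return a one-line summary of a job-status response."""
    job = response[0] if isinstance(response, list) else response
    detail = job.get("details", {}).get("status_detail", {})
    parts = []
    for task, states in detail.get("tasks", {}).items():
        counts = ", ".join(f"{n} {state}" for state, n in states.items())
        parts.append(f"{task}: {counts}")
    percent = detail.get("percent_complete")
    suffix = " (estimated)" if detail.get("estimated") else ""
    return f"{job.get('status')} | {'; '.join(parts)} | percent_complete={percent}{suffix}"


# Trimmed copy of the example response above.
example = [{
    "status": "done",
    "details": {"status_detail": {
        "tasks": {"LIST": {"DONE": 1}, "INDEX": {"DONE": 30}},
        "estimated": False,
        "percent_complete": 1,
    }},
}]
summary = summarize_job(example)
# summary == "done | LIST: 1 DONE; INDEX: 30 DONE | percent_complete=1"
```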

🚧

Canceling a batch job

To cancel a batch job, call the cancel job endpoint.

Now it's time to query the data!

📘

Need support?

If you're stuck, reach out to support at [email protected] or via your Slack channel.