Run the batch ingest job API
A batch job is a one-off S3 ingest task. It can target a single file, a directory of many files, or a regular expression matching many directories/files; once the listed files are ingested, the job is done. Batch notification (continuous S3 loading as new files arrive) is configured on the table instead.
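For illustration, the three shapes a job's source can take look like this (the bucket and paths below are made-up examples; regex_filter sits at the top level of the job settings, as in the responses further down):

A single file:
  "settings": { "url": "s3://root-bucket/folder1/file1.json.gz" }

Everything under a prefix:
  "settings": { "url": "s3://root-bucket/folder1/" }

Everything under a prefix that matches a pattern:
  "settings": { "url": "s3://root-bucket/" }
combined with a top-level
  "regex_filter": ".*\\.json\\.gz$"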
Submitting a simple batch job using the batch job API:
curl -X POST 'https://my-domain.hydrolix.live/config/v1/orgs/{{org_uuid}}/jobs/batch' \
-H 'Authorization: Bearer thebearertoken1234567890abcdefghijklmnopqrstuvwxyz' \
-H 'Content-Type: application/json' \
-d '{
  "name": "My Batch job",
  "description": "A Description of my job",
  "type": "batch_import",
  "settings": {
    "source": {
      "table": "website.events",
      "type": "batch",
      "subtype": "aws s3",
      "transform": "mytransformname",
      "settings": {
        "url": "s3://root-bucket/folder1/"
      }
    }
  }
}'
A successful request returns the job definition with the system defaults filled in:
{
  "name": "My Batch job",
  "description": "A Description of my job",
  "uuid": "888ba890-3ece-403d-9753-4edd754bef61",
  "created": "2021-06-03T12:28:41.967285Z",
  "modified": "2021-06-03T12:28:42.260107Z",
  "settings": {
    "max_active_partitions": 576,
    "max_rows_per_partition": 33554432,
    "max_minutes_per_partition": 15,
    "input_concurrency": 1,
    "input_aggregation": 1536000000,
    "max_files": 0,
    "dry_run": false,
    "regex_filter": "",
    "source": {
      "type": "batch",
      "subtype": "aws s3",
      "transform": "mytransformname",
      "table": "website.events",
      "settings": {
        "url": "s3://root-bucket/folder1/"
      }
    }
  },
  "status": "ready",
  "type": "batch_import",
  "org": "d1234567-1234-1234-abcd-defgh123456",
  "details": {
    "errors": [],
    "job_id": "jobid-1234-abcdejhijklm",
    "duration_ms": 115,
    "status_detail": {
      "tasks": {
        "LIST": {
          "READY": 1
        }
      },
      "estimated": true,
      "percent_complete": 0
    }
  }
}
Under status you should see the job is ready. Take note of the uuid in the response; you'll need it as {{job_uuid}} for the status and cancel calls below.
Did it fail? Two gotchas:

S3 bucket access - Make sure your Hydrolix administrator has added the S3 bucket to --bucket-allowlist. Otherwise the batch peers won't be able to retrieve the files.

Batch peers not scaled up - Did you scale the peers? If not, pop back to the scale documentation for your cloud environment (AWS or Kubernetes GKE) to find out how.
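If you run on AWS via hdxctl, fixing both might look like the sketch below. --bucket-allowlist comes from this guide, but the scale flag and the GKE deployment name are assumptions, so verify them against hdxctl --help or your cluster before running.

# AWS (hdxctl): allowlist the bucket and bring up batch peers.
# NOTE: --batch-peer-count is an assumed flag name; check hdxctl --help.
hdxctl update {{client_id}} {{cluster_id}} \
  --bucket-allowlist s3://root-bucket \
  --batch-peer-count 2

# Kubernetes (GKE): the deployment name batch-peer is an assumption.
kubectl scale deployment/batch-peer --replicas=2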
Getting status of your job.
To know when the job has finished, use the job-status API. The system regularly updates this endpoint with the state of the job you query.
curl 'https://{{hostname}}.hydrolix.live/config/v1/orgs/{{org_uuid}}/jobs/batch/{{job_uuid}}/status' \
-H 'Authorization: Bearer thebearertoken1234567890abcdefghijklmnopqrstuvwxyz'
When the job is complete you will see something similar to the following response, with "done" in the status field.
[
  {
    "name": "myjob",
    "description": "my job",
    "uuid": "888ba890-3ece-403d-9753-4edd754bef61",
    "created": "2021-07-28T15:16:10.663363Z",
    "modified": "2021-07-28T15:42:56.511741Z",
    "settings": {
      "max_active_partitions": 576,
      "max_rows_per_partition": 20000000,
      "max_minutes_per_partition": 15,
      "input_concurrency": 1,
      "input_aggregation": 1536000000,
      "max_files": 0,
      "dry_run": false,
      "regex_filter": "",
      "source": {
        "type": "batch",
        "subtype": "aws s3",
        "transform": "mytransform",
        "table": "myproject.mytable",
        "settings": {
          "url": "s3://mys3/path/goes/here/"
        }
      }
    },
    "status": "done",
    "type": "batch_import",
    "org": "d1234567-1234-1234-abcd-defgh123456",
    "details": {
      "errors": [],
      "job_id": "jobid-1234-abcdejhijklm",
      "duration_ms": 7194,
      "status_detail": {
        "tasks": {
          "LIST": {
            "DONE": 1
          },
          "INDEX": {
            "DONE": 30
          }
        },
        "estimated": false,
        "percent_complete": 1
      }
    }
  }
]
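If you're scripting, a minimal polling sketch that waits for the job to finish (assuming jq is installed; the endpoint and token are the same as above):

while true; do
  status=$(curl -s 'https://{{hostname}}.hydrolix.live/config/v1/orgs/{{org_uuid}}/jobs/batch/{{job_uuid}}/status' \
    -H 'Authorization: Bearer thebearertoken1234567890abcdefghijklmnopqrstuvwxyz' \
    | jq -r '.[0].status')
  echo "job status: $status"
  [ "$status" = "done" ] && break
  sleep 10
done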
HELP, I need to cancel it!
Cancelling a job is just as easy as creating one: take your job UUID and call the cancel-job endpoint. It stops the job so you can fix the problem and try again.
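A sketch of the call, assuming the cancel endpoint follows the same pattern as the status endpoint (the /cancel path here is an assumption; check the API reference for your version):

curl -X POST 'https://{{hostname}}.hydrolix.live/config/v1/orgs/{{org_uuid}}/jobs/batch/{{job_uuid}}/cancel' \
-H 'Authorization: Bearer thebearertoken1234567890abcdefghijklmnopqrstuvwxyz'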
Now that the data is in, we can query it!
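For example, a row count over the table we just loaded. This is a sketch: it assumes your cluster exposes its SQL interface at /query on the same hostname and accepts the same bearer token, so adjust it to match your deployment's query documentation.

curl -s -G 'https://my-domain.hydrolix.live/query' \
-H 'Authorization: Bearer thebearertoken1234567890abcdefghijklmnopqrstuvwxyz' \
--data-urlencode 'query=SELECT count(*) FROM website.events'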
Stuck?
If you're stuck and the data is just not playing ball with Hydrolix, please give our support a shout. We'll be happy to help you - [email protected] or via your Slack channel.