Ingest
Data can be ingested into the Hydrolix platform in a number of ways. Each supported ingest method is deployed as an architecturally independent set of components:
- Batch - Load data from a storage bucket
- Stream API (HTTP) - Stream messages for ingest to an HTTP interface
- Kafka - Ingest from an existing Kafka infrastructure
- Kinesis - Ingest from AWS Kinesis infrastructure
Batch
Batch: Configure it!
To configure Batch for your platform, information can be found here - Loading data with Batch
Batch contains two methods for ingest.
- Batch Load
- Batch Load with Autoingest
In both cases the batch mechanism pulls data from a supplied storage bucket. The load process can be started either through the API and Portal interfaces or via the Autoingest mechanism, where the bucket notifies the platform that it has objects to load.
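As an illustrative sketch, a batch load job could be submitted to the Intake-API over HTTP with Python. The endpoint path, payload field names, and identifiers below are assumptions for illustration only; see Loading data with Batch for the authoritative request format.

```python
import requests

# All values below are placeholders -- substitute your cluster host,
# organisation ID, bucket, table, transform, and API token.
HOST = "https://myhost.hydrolix.live"
ORG_ID = "my-org-uuid"

job = {
    "name": "load-2024-01-01",
    "settings": {
        # Assumed field names: the source bucket to pull from and the
        # target table/transform the Batch-Peers should apply.
        "source": {"url": "s3://my-source-bucket/events/2024/01/01/"},
        "hdx_table": "my_project.my_table",
        "transform": "my_transform",
    },
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG_ID}/jobs/batch/",
    json=job,
    headers={"Authorization": "Bearer my-api-token"},
)
resp.raise_for_status()
print(resp.json())  # job ID and initial status, tracked in Batch-Job State
```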
The basic Batch infrastructure uses the following components:
Component | Description | Scale to 0 |
---|---|---|
Intake-API | A deployed service that receives batch jobs to be run. These are placed on a queue and handed off to the Batch-Head for scheduling; the Batch-Head organises the job and its components, which are then placed on a listing queue. | Yes |
Queues | A listing queue of jobs to be worked on. Jobs expire after a period of time if they are not accessed. | No |
Batch-Head | The Batch-Head takes jobs from the queue and creates tasks for individual files to be worked on by the Batch-Peers. | Yes |
Batch-Peer Pool | A group of worker pods that retrieve the source files from the customer's bucket, apply the Hydrolix transform, and write indexed database partitions to the HDX DB storage bucket. In addition, they report the partitions they have created, along with other job metadata, to the Catalog database. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Batch-Job State | A database that contains the state of batch jobs. Used to "replay" jobs that may have failed and to track the progress of a job and its individual tasks. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Batch Autoingest
Batch Autoingest uses some additional components to receive notifications from the bucket the customer uploads data to. Batch jobs are then handed off to the infrastructure above.
Component | Description | Scale to 0 |
---|---|---|
AutoIngest | An additional service used with Autoingest. This service pulls job notifications from the bucket notification queue. | Yes |
Bucket Notification Queue | An additional queue, provided by the customer, that is notified of changes to the customer's bucket (an example configuration is sketched below). | No |
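On AWS, for example, the bucket notification queue is typically an SQS queue that the source bucket publishes object-created events to. A minimal sketch with boto3, assuming the queue already exists and its access policy permits the bucket to publish (bucket name and queue ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Notify the queue whenever a new object lands in the customer's source
# bucket; AutoIngest can then pull these notifications and create batch jobs.
s3.put_bucket_notification_configuration(
    Bucket="my-source-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:hdx-autoingest",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```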
Stream API (HTTP)
Stream: Configure it!
To configure Stream for your platform, information can be found here - Streaming API (HTTP)
Streaming Ingest uses an HTTP interface that can receive data as individual or batched messages. Supported formats include CSV, TSV, and other character-separated formats, as well as JSON. Schema definitions can either be pre-defined or supplied with the message. Data enrichment is primarily applied at the Stream-Peer layer.
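As a sketch, a client might stream a small batch of JSON events with Python. The host, table, and transform names are placeholders, and the x-hdx-* headers shown for selecting the target table and a pre-defined schema are assumptions; see Streaming API (HTTP) for the current request format.

```python
import requests

HOST = "https://myhost.hydrolix.live"  # placeholder cluster host

# A small batch of JSON events; CSV/TSV bodies are also accepted.
events = [
    {"timestamp": "2024-01-01T00:00:00Z", "status": 200, "bytes": 512},
    {"timestamp": "2024-01-01T00:00:01Z", "status": 404, "bytes": 0},
]

resp = requests.post(
    f"{HOST}/ingest/event",
    json=events,
    headers={
        "x-hdx-table": "my_project.my_table",  # assumed header: target table
        "x-hdx-transform": "my_transform",     # assumed header: pre-defined schema
    },
)
# The Stream Head returns an HTTP 400-series error if validation fails.
resp.raise_for_status()
```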
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
App. Load-Balancer | Receives messages via HTTP and routes requests for the host path http://myhost.hydrolix.live/ingest/ to the Stream Head. | Yes |
Stream Head | The Stream Heads check that messages conform to basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client. Successful messages are sent to a listing queue. | Yes |
Queue | An event queue for messages to be worked on by the Stream Peers. | No |
Stream Peer | The Stream Peer servers monitor the queue for jobs, taking incoming messages and encoding them into database partitions. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Stream Summary
The Stream Summary service is specific to Streaming Ingest and is used to generate Summary tables. Summary tables lend themselves well to aggregations of metrics such as min, max, count, and sum, and in some cases percentiles, unique counts, or other statistical aggregations. The basic principle of a Summary table is that at ingest time a SQL statement is run against the incoming data, with the output written directly to a separate target table.
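For instance, a summary SQL statement might roll raw HTTP log events up into per-minute aggregates at ingest time. The statement below is only a sketch: table and column names are invented, and the exact dialect and the mechanism for referencing the incoming stream depend on your configuration.

```python
# Illustrative summary SQL, stored here as a Python constant. "raw_events"
# stands in for the incoming data stream; all column names are placeholders.
SUMMARY_SQL = """
SELECT
    toStartOfMinute(timestamp) AS minute,
    status,
    count()    AS requests,
    sum(bytes) AS total_bytes,
    min(bytes) AS min_bytes,
    max(bytes) AS max_bytes
FROM raw_events
GROUP BY minute, status
"""
```

Each incoming batch would then produce a handful of aggregate rows written to the summary target table, rather than the raw events themselves.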
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
App. Load-Balancer | Receives messages via HTTP and routes requests for the host path http://myhost.hydrolix.live/ingest/ to the Stream Head. | Yes |
Stream Head | The Stream Heads check that messages conform to basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client. Successful messages are sent to a listing queue. | Yes |
Queue | An event queue for messages to be worked on by the Summary-Peer. | No |
Summary-Peer | The Summary-Peers monitor the queue for jobs, taking incoming messages and encoding them into database partitions. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Kafka
Kafka: Configure it!
To configure Kafka ingestion for your platform, information can be found here - Kafka Streaming Ingestion
Kafka Ingest connects to your existing Kafka infrastructure and consumes messages from it. The Kafka integration supports mutual authentication, multiple topics, compressed messages, and consuming from either the start or the end of the offset.
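As a sketch, a Kafka pull source could be registered through the configuration API so the Kafka Peers know which brokers and topics to consume. The endpoint path and payload fields below are assumptions for illustration; see Kafka Streaming Ingestion for the authoritative format.

```python
import requests

HOST = "https://myhost.hydrolix.live"  # placeholder
ORG, PROJECT, TABLE = "my-org-uuid", "my-project-uuid", "my-table-uuid"

# Assumed payload shape: where to connect, what to consume, and which
# transform the Kafka Peers should apply when indexing messages.
source = {
    "name": "my-kafka-source",
    "settings": {
        "bootstrap_servers": ["broker1:9092", "broker2:9092"],
        "topics": ["web_logs", "app_logs"],
    },
    "transform": "my_transform",
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG}/projects/{PROJECT}/tables/{TABLE}/sources/kafka/",
    json=source,
    headers={"Authorization": "Bearer my-api-token"},
)
resp.raise_for_status()
```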
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
Kafka Peers | Servers within the Kafka Peer pool consume messages from the Kafka brokers; the messages are indexed and then stored in the HDX DB storage bucket. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database and partitions. Forms part of the core infrastructure. | No |
Customer's Kafka | The customer's Kafka infrastructure, used as a source. | - |
AWS Kinesis
AWS Kinesis: Configure it!
To configure Kinesis ingestion for your platform, information can be found here - Kinesis Streaming Ingestion
Kinesis Ingest connects to your AWS Kinesis infrastructure and consumes messages from it. The Kinesis integration supports multiple streams, compressed messages, and consuming from either the start or the end of the stream.
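The shape mirrors the Kafka sketch above: a Kinesis pull source registered through the configuration API, telling the Kinesis Peers which stream and region to consume from. All identifiers and field names are placeholders; see Kinesis Streaming Ingestion for the authoritative format.

```python
import requests

HOST = "https://myhost.hydrolix.live"  # placeholder
ORG, PROJECT, TABLE = "my-org-uuid", "my-project-uuid", "my-table-uuid"

# Assumed payload shape: the stream to consume and the transform to apply.
source = {
    "name": "my-kinesis-source",
    "settings": {
        "stream_name": "my-kinesis-stream",
        "region": "us-east-1",
    },
    "transform": "my_transform",
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG}/projects/{PROJECT}/tables/{TABLE}/sources/kinesis/",
    json=source,
    headers={"Authorization": "Bearer my-api-token"},
)
resp.raise_for_status()
```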
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
Kinesis Peers | Servers within the Kinesis Peer pool consume messages from the Kinesis streams; the messages are indexed and then stored in the HDX DB storage bucket. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Customer's AWS Kinesis | The customer's AWS Kinesis infrastructure, used as a source. | - |
Kinesis State | Key-value store that manages the state of consumption from the Kinesis stream. | - |