Ingest
Data can be ingested into the Hydrolix platform in a number of ways. Each supported ingest method is deployed as an architecturally independent part of the platform.
- Batch - Load from a storage bucket
- Stream API (HTTP) - Stream messages for ingest to an HTTP interface
- Kafka - Ingest from an existing Kafka infrastructure
Batch
Batch: Configure it!
To configure Batch for your platform, see Loading data with Batch.
Batch offers two methods of ingest:
- Batch Load
- Batch Load with Autoingest
In both cases the batch mechanism pulls data in from a supplied storage bucket. The load process can be started either through the API or Portal interfaces, or through the Autoingest mechanism, where the bucket notifies the platform that it has objects to load.
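For illustration, a batch job can be submitted directly to the configuration API. The sketch below is an assumption-laden example: the endpoint path, org/project identifiers and payload field names are placeholders, and the exact API, payload shape and authentication flow are documented in Loading data with Batch.

```python
# Minimal sketch of kicking off a Batch load through the API.
# The endpoint path, payload shape and field names below are assumptions for
# illustration only -- see "Loading data with Batch" for the exact API.
import requests

HDX_HOST = "https://myhost.hydrolix.live"      # your cluster hostname
TOKEN = "..."                                  # bearer token from the config API
ORG_ID = "my-org-uuid"                         # hypothetical org identifier
PROJECT_ID = "my-project-uuid"                 # hypothetical project identifier

job = {
    "name": "load-2023-01-logs",               # a label for this batch job
    "settings": {
        "source": {
            "table": "my_project.my_table",                   # target project.table
            "url": "gs://my-source-bucket/logs/2023/01/",     # bucket prefix to load from
        }
    },
}

resp = requests.post(
    f"{HDX_HOST}/config/v1/orgs/{ORG_ID}/projects/{PROJECT_ID}/jobs/batch/",
    json=job,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # job id and initial state, tracked in the Batch-Job State database
```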
The basic Batch infrastructure uses the following components:
Component | Description | Scale to 0 |
---|---|---|
Intake-Misc | A deployed server running a number of micro-services. The API micro-service receives batch jobs to be run; these are handed off to the Batch-Head for scheduling. The Batch-Head organises the job and its component tasks, which are then placed on a listing queue. | Yes |
Queues | Two queues are operated: a listing queue for jobs to be worked on and a dead-letter queue, to which jobs expire after a period of time if they are not accessed. | No |
Batch Peer Server Pool | A group of worker servers that take the source files from the customer's bucket, apply the Hydrolix transform (a sketch of a transform's shape follows this table) and output indexed database partitions to the HDX DB storage bucket. They also report the partitions they have created, along with other job metadata, to the Catalog database. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Batch-Job State | A database that contains the state of batch jobs. Used to "replay" jobs that may have failed and to track the progress of a job and its individual tasks. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
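The transform applied by the Batch Peers describes how the source files map onto table columns. The snippet below is a minimal sketch of a transform's general shape; the field names, datatypes and structure are assumptions for illustration, and the authoritative schema is in the transform documentation.

```python
# Illustrative shape of a Hydrolix transform the Batch Peers would apply to
# each source file. Field names and datatypes here are assumptions for this
# sketch -- consult the transform documentation for the authoritative schema.
transform = {
    "name": "web_logs_csv",
    "type": "csv",                            # source format of the files in the bucket
    "settings": {
        "output_columns": [
            {"name": "timestamp", "datatype": {"type": "datetime", "primary": True}},
            {"name": "status",    "datatype": {"type": "uint32"}},
            {"name": "url",       "datatype": {"type": "string"}},
        ]
    },
}
```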
Batch Autoingest
Batch Autoingest uses some additional components to receive notifications from the bucket the customer uploads data to. Batch jobs are then handed off to the infrastructure described above. A sketch of wiring up these bucket notifications follows the component table below.
Component | Description | Scale to 0 |
---|---|---|
Intake-Misc | An additional micro-service is used with Autoingest. This service pulls job notifications from the bucket notification queue. | Yes |
Queue | Two additional queues: an Autoingest notification queue that is notified of changes to the customer's bucket, and a dead-letter queue for any messages that expire from the Autoingest queue. | No |
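As an example of that notification wiring, the sketch below assumes an AWS S3 source bucket publishing object-created events to an SQS Autoingest notification queue; the queue ARN, bucket name and prefix are placeholders for your own deployment, and other cloud providers have equivalent mechanisms.

```python
# Minimal sketch of pointing an AWS S3 source bucket at the Autoingest
# notification queue. Assumes an S3/SQS deployment; the queue ARN, bucket
# name and prefix below are examples only.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-source-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:hdx-autoingest",  # example ARN
                "Events": ["s3:ObjectCreated:*"],   # notify on every new object
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "logs/"}]}
                },
            }
        ]
    },
)
```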
Stream API (HTTP)
Stream: Configure it!
To configure Stream for your platform, see Streaming API (HTTP).
Streaming ingest uses an HTTP interface that can receive data as individual or batched messages. Supported file formats include CSV, TSV and other "character"-separated formats, as well as JSON. In addition, schema definitions can either be pre-defined or supplied with the message.
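As a quick illustration, the sketch below posts a single JSON event to the ingest path using a pre-defined transform. The header names and exact path are assumptions from memory and should be confirmed against Streaming API (HTTP) for your deployment.

```python
# Minimal sketch of streaming one JSON event to the HTTP ingest endpoint
# using a pre-defined transform. The path and x-hdx-* header names are
# assumptions -- confirm them against "Streaming API (HTTP)".
import json
import requests

resp = requests.post(
    "https://myhost.hydrolix.live/ingest/event",
    headers={
        "content-type": "application/json",
        "x-hdx-table": "my_project.my_table",     # target project.table
        "x-hdx-transform": "web_logs_json",       # pre-defined transform to apply
    },
    data=json.dumps({"timestamp": "2023-01-01 00:00:00", "status": 200, "url": "/"}),
)
resp.raise_for_status()   # a 400-series response means the Stream Head rejected the message
```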
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
Web Server | Messages are received via the host path http://myhost.hydrolix.live/ingest/ and routed by the Application Load-balancer service on the Web server. | Yes |
Stream Head | The Stream Head servers check that messages conform to basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client. Successful messages are sent to the event queue. | Yes |
Queue | Two queues are operated: an event queue for messages to be worked on and a dead-letter queue, to which messages expire after a period of time if they are not processed. | No |
Stream Peer | The Stream Peer Pool servers monitor the event queue for work and encode incoming messages into database partitions. | Yes |
Zookeeper | Zookeeper servers are used for cluster management of the Stream Peer Server Pool. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Kafka
Kafka: Configure it!
To configure Kafka ingestion for your platform, see Kafka Streaming Ingestion.
Kafka ingest connects to your existing Kafka infrastructure and consumes messages from it. It supports mutual authentication, multiple topics, compressed messages, and consuming messages from either the earliest or the latest offset.
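From the producer side, messages simply need to land on the topics the Kafka Peers are configured to consume. The sketch below assumes the kafka-python client, example broker addresses and certificate paths, and a message shape matching the target table's transform.

```python
# Minimal sketch of the producer side: publishing JSON messages to a topic
# the Kafka Peers are configured to consume. Broker addresses, topic name and
# certificate paths are examples; message fields must match the table's transform.
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["broker1:9093", "broker2:9093"],
    security_protocol="SSL",            # mutual TLS, if your brokers require it
    ssl_cafile="ca.pem",
    ssl_certfile="client.pem",
    ssl_keyfile="client.key",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("web_logs", {"timestamp": "2023-01-01 00:00:00", "status": 200, "url": "/"})
producer.flush()   # the Kafka Peers index these messages into HDX partitions
```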
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
Kafka Peers | Servers in the Kafka Peer pool consume messages from your Kafka broker infrastructure; the messages are indexed and then stored in the HDX DB storage bucket. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |