Data can be ingested into the Hydrolix platform in a number of ways, and each supported ingest method is deployed with its own architecturally independent set of components:

  • Batch - Load from a storage bucket
  • Stream API (HTTP) - Stream messages for ingest to an HTTP interface
  • Kafka - Ingest from an existing Kafka infrastructure

Batch

πŸ‘

Batch: Configure it!

To configure Batch for your platform, see Loading data with Batch.

Batch provides two methods of ingest:

  • Batch Load
  • Batch Load with Autoingest

In both cases, the batch mechanism pulls data from a supplied storage bucket. The load process can be activated either through the API or Portal interfaces, or through the Autoingest mechanism, where the bucket notifies the platform that it has objects to load.
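For example, a batch load can be activated programmatically through the API. The sketch below is a minimal illustration, not the authoritative request format: the endpoint path, organisation name, token, and payload fields are all assumptions, and the exact request shape is documented in Loading data with Batch.

```python
import requests

# Minimal sketch of activating a batch load through the API.
# Host, path, token, and payload fields are illustrative assumptions;
# see "Loading data with Batch" for the authoritative request format.
HOST = "https://myhost.hydrolix.live"
API_TOKEN = "YOUR_API_TOKEN"  # placeholder credential

job = {
    "name": "example-load",  # a label for this batch job
    "settings": {
        "source": {"url": "s3://customer-bucket/data/"},  # bucket to pull from
        "hdx_table": "myproject.mytable",                 # target table
    },
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/my-org/jobs/batch/",  # assumed endpoint path
    json=job,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # the created job, which the Batch-Head then schedules
```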

The following components are used within Batch:

| Component | Description | Scale to 0 |
|---|---|---|
| Intake-Misc | A deployed server running a number of micro-services. The API micro-service receives batch jobs to be run; these are handed off to the Batch-Head for scheduling. The Batch-Head organises the job into its component tasks, which are then placed on a listing queue. | Yes |
| Queues | Two queues are operated: a listing queue holding the jobs to be worked on, and a dead-letter queue to which jobs expire if they are not accessed within a set period. | No |
| Batch Peer Server Pool | A group of worker servers that take the source files from the customer's bucket, apply the Hydrolix transform, and output indexed database partitions to the HDX DB Storage Bucket. They also report the partitions they have created, along with other metadata about the job, to the Catalog database. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Batch-Job State | A database that contains the state of batch jobs. Used to "replay" jobs that may have failed and to track the progress of a job and its individual tasks. | No |
| Catalog | Contains metadata on the database, partitions, and job tasks. Forms part of the core infrastructure. | No |

Batch Autoingest

Batch Autoingest uses some additional components to receive notifications from the bucket the customer uploads data to; batch jobs are then handed off to the infrastructure described above. A sketch of the notification wiring follows the table below.

| Component | Description | Scale to 0 |
|---|---|---|
| Intake-Misc | An additional micro-service is used with Autoingest. This service pulls job notifications from the bucket notification queue. | Yes |
| Queues | Two additional queues: an Autoingest notification queue, which receives change notifications from the customer's bucket, and a dead-letter queue for any messages that expire from the Autoingest queue. | No |
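
As an illustration of that notification wiring, the sketch below uses AWS as an example: it configures an S3 bucket to publish object-created events to a queue of the kind the Autoingest service polls. The bucket name and queue ARN are placeholders, and the cloud-specific setup steps are covered in Loading data with Batch.

```python
import boto3

# Illustrative only: wire a customer bucket to a notification queue so
# that newly uploaded objects generate the messages Autoingest consumes.
# The bucket name and queue ARN are placeholders, not Hydrolix defaults.
s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="customer-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:hdx-autoingest",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```

With this in place, every object landing in the bucket produces a notification, and no manual job submission is needed.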

Stream API (HTTP)

πŸ‘

Stream: Configure it!

To configure Stream for your platform, see Streaming API (HTTP).

Streaming Ingest uses an HTTP interface that can receive data as individual or batch messages. Supported formats include CSV, TSV, and other character-separated formats, as well as JSON. Schema definitions can either be pre-defined or supplied with the message.
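For example, a single JSON event can be streamed with a plain HTTP POST. The sketch below assumes a pre-defined transform and uses hypothetical header names for selecting the target table and transform; the exact headers and paths are documented in Streaming API (HTTP).

```python
import requests

# Minimal sketch: stream one JSON event to the HTTP ingest interface.
# The /ingest/event path and the x-hdx-* header names are assumptions;
# see "Streaming API (HTTP)" for the authoritative interface.
event = {"timestamp": "2023-01-01T00:00:00Z", "status": 200, "path": "/"}

resp = requests.post(
    "https://myhost.hydrolix.live/ingest/event",
    json=event,
    headers={
        "x-hdx-table": "myproject.mytable",  # assumed target-table header
        "x-hdx-transform": "mytransform",    # assumed pre-defined schema name
    },
)
resp.raise_for_status()  # a 400-series status means the Stream Head rejected the message
```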

Components for the architecture are as follows:

| Component | Description | Scale to 0 |
|---|---|---|
| Web Server | Messages are received via the host path http://myhost.hydrolix.live/ingest/ and routed by the Application Load-balancer service on the Web Server. | Yes |
| Stream Head | The Stream Head servers check that messages conform to some basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client. Successful messages are placed on an event queue. | Yes |
| Queue | Two queues are operated: an event queue for messages to be worked on, and a dead-letter queue to which messages expire if they are not processed within a set period. | No |
| Stream Peer | The Stream Peer pool servers monitor the event queue and encode incoming messages into database partitions. | Yes |
| Zookeeper | Zookeeper servers are used for cluster management of the Stream Peer pool. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Catalog | Contains metadata on the database, partitions, and job tasks. Forms part of the core infrastructure. | No |

Kafka

πŸ‘

Kafka: Configure it!

To configure Kafka ingestion for your platform, see Kafka Streaming Ingestion.

Kafka Ingest connects to your existing Kafka infrastructure and consumes messages from it. It supports mutual authentication, multiple topics, compressed messages, and consuming from either the earliest or the latest offset.
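Because the Kafka peers act as ordinary consumers of your brokers, any producer that writes records to the configured topic feeds the platform. The sketch below uses the kafka-python library; the broker address and topic name are placeholders, and the topic must match the one configured for ingestion.

```python
import json
from kafka import KafkaProducer

# Illustrative producer: Hydrolix Kafka peers consume from your existing
# brokers, so a data source only needs to publish to the configured topic.
# The broker address and topic name below are placeholders.
producer = KafkaProducer(
    bootstrap_servers=["broker1.example.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("hdx-ingest-topic", {"timestamp": "2023-01-01T00:00:00Z", "status": 200})
producer.flush()  # block until the broker acknowledges the message
```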

Components for the architecture are as follows:

| Component | Description | Scale to 0 |
|---|---|---|
| Kafka Peers | Servers within the Kafka Peers pool download messages from a Kafka broker infrastructure; the messages are indexed and then stored in the HDX DB Storage Bucket. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Catalog | Contains metadata on the database, partitions, and job tasks. Forms part of the core infrastructure. | No |
