Data can be ingested into the Hydrolix platform in a number of ways. Each supported ingest method is deployed as an architecturally independent set of components.

  • Batch - Load from a storage bucket.
  • Stream API (HTTP) - Stream messages for ingest to an HTTP interface.
  • Kafka - Ingest from an existing Kafka infrastructure.
  • Kinesis - Ingest from AWS Kinesis infrastructure.

Batch

👍

Batch: Configure it!

To configure batch ingest for your platform, see Loading data with Batch.

Batch provides two methods of ingest:

  • Batch Load
  • Batch Load with Autoingest

In both cases the batch mechanism pulls data from a supplied storage bucket. The load process can be triggered either through the API or Portal interfaces, or via the Autoingest mechanism, where the bucket notifies the platform that it has objects to load.
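As a rough illustration of triggering a load through the API, the sketch below builds a batch-job description. The endpoint is not shown, and the field names and values here are illustrative assumptions, not the exact Hydrolix API schema; see Loading data with Batch for the real contract.

```python
import json

def build_batch_job(table: str, source_url: str, transform: str) -> str:
    """Serialise a hypothetical batch-job description to JSON.

    All field names below are placeholders to show the general shape of
    a job request, not the documented Hydrolix schema.
    """
    job = {
        "name": f"load-{table}",
        "type": "batch_import",
        "settings": {
            "source": {"table": table, "url": source_url},
            "transform": transform,
        },
    }
    return json.dumps(job)

# Example: describe a load of log files from a bucket prefix.
payload = build_batch_job("sample.logs", "s3://my-bucket/logs/", "logs_transform")
print(payload)
```

The resulting JSON body would be POSTed to the Intake-API, which queues the job for the Batch-Head to schedule.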

The basic infrastructure for Batch is as follows:


The following components are used within Batch:

| Component | Description | Scale to 0 |
|---|---|---|
| Intake-API | A deployed service that receives batch jobs to be run. Jobs are placed on a queue and handed off to the Batch-Head for scheduling; the Batch-Head organises the job and its components, which are then placed on a listing queue. | Yes |
| Queues | A listing queue of jobs to be worked. Jobs expire after a period of time if they are not accessed. | No |
| Batch-Head | Takes jobs from the queue and creates tasks for individual files to be worked on by the Batch-Peers. | Yes |
| Batch-Peer Pool | A group of pods acting as workers. They take the source files from the customer's bucket, apply the Hydrolix transform, and output indexed database partitions to the HDX DB storage bucket. In addition, they report the partitions they have created (along with other job metadata) to the Catalog database. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Batch-Job State | A database that contains the state of batch jobs. Used to "replay" jobs that may have failed and to track the progress of a job and its individual tasks. | No |
| Catalog | Contains metadata on the database, partitions, and job tasks. Forms part of the core infrastructure. | No |

Batch Autoingest

Batch Autoingest uses some additional components, which receive notifications from the bucket the customer uploads data to. Batch jobs are then handed off to the infrastructure described above.

| Component | Description | Scale to 0 |
|---|---|---|
| AutoIngest | An additional service used with Autoingest. It pulls job notifications from the bucket notification queue. | Yes |
| Bucket Notification Queue | An additional queue, provided by the customer, that is notified of changes to the customer's bucket. | No |

Stream API (HTTP)

👍

Stream: Configure it!

To configure streaming ingest for your platform, see Streaming API (HTTP).

Streaming ingest uses an HTTP interface that can receive data as individual or batched messages. Supported formats include CSV, TSV, and other character-separated formats, as well as JSON. Schema definitions can either be pre-defined or supplied with the message. Data enrichment is primarily applied at the Stream-Peer layer.
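To illustrate the shape of a streaming request, the sketch below builds (but does not send) a single-event POST. The host is the doc's example host; the `/ingest/event` path and the `x-hdx-*` headers are assumptions used for illustration, so check the Streaming API (HTTP) reference for the exact contract.

```python
import json
import urllib.request

# An example event; field names are arbitrary and would be mapped by a
# pre-defined transform on the Hydrolix side.
event = {"timestamp": "2023-01-01T00:00:00Z", "status": 200, "path": "/index.html"}

req = urllib.request.Request(
    "https://myhost.hydrolix.live/ingest/event",  # assumed endpoint path
    data=json.dumps(event).encode("utf-8"),
    headers={
        "content-type": "application/json",
        "x-hdx-table": "sample.logs",         # assumed: target project.table
        "x-hdx-transform": "logs_transform",  # assumed: pre-defined transform
    },
    method="POST",
)

# urllib.request.urlopen(req) would send it; an HTTP 400-series response
# means the Stream-Head rejected the message's syntax or structure.
print(req.get_method(), req.full_url)
```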


Components for the architecture are as follows:

| Component | Description | Scale to 0 |
|---|---|---|
| App. Load-Balancer | Receives messages via HTTP and routes the host path http://myhost.hydrolix.live/ingest/ to the Stream-Head. | Yes |
| Stream-Head | Checks that messages conform to basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client; successful messages are sent to a listing queue. | Yes |
| Queue | An event queue of messages to be worked on by the Stream-Peers. | No |
| Stream-Peer | Stream-Peer servers monitor the queue and encode incoming messages into database partitions. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Catalog | Contains metadata on the database, partitions, and job tasks. Forms part of the core infrastructure. | No |

Stream Summary

The Stream Summary service is specific to streaming ingest and is used to generate summary tables. Summary tables lend themselves well to aggregations of metrics, such as min, max, count, and sum, and in some cases percentiles, unique counts, or other statistical aggregations. The basic principle of a summary table is that, at ingest time, an SQL statement is run against the incoming data, with the output written directly to a new target table.
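The summary principle can be sketched in plain Python: aggregate incoming rows at ingest time and write only the aggregates to the target table. Field names here are illustrative, and in Hydrolix the aggregation is expressed as SQL rather than Python; this only demonstrates the idea.

```python
from collections import defaultdict

# Raw incoming events (illustrative field names).
raw_events = [
    {"minute": "12:00", "latency_ms": 30},
    {"minute": "12:00", "latency_ms": 50},
    {"minute": "12:01", "latency_ms": 20},
]

# Roll events up per minute -- the kind of min/max/count/sum aggregation
# a summary SQL statement would produce for the target table.
summary = defaultdict(
    lambda: {"count": 0, "sum": 0, "min": float("inf"), "max": float("-inf")}
)
for e in raw_events:
    s = summary[e["minute"]]
    s["count"] += 1
    s["sum"] += e["latency_ms"]
    s["min"] = min(s["min"], e["latency_ms"])
    s["max"] = max(s["max"], e["latency_ms"])

print(dict(summary))
```

The summary table ends up with one row per minute instead of one row per event, which is why summaries are cheap to query.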


Components for the architecture are as follows:

| Component | Description | Scale to 0 |
|---|---|---|
| App. Load-Balancer | Receives messages via HTTP and routes the host path http://myhost.hydrolix.live/ingest/ to the Stream-Head. | Yes |
| Stream-Head | Checks that messages conform to basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client; successful messages are sent to a listing queue. | Yes |
| Queue | An event queue of messages to be worked on by the Summary-Peer. | No |
| Summary-Peer | Summary-Peer servers monitor the queue and encode incoming messages into database partitions. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Catalog | Contains metadata on the database, partitions, and job tasks. Forms part of the core infrastructure. | No |

Kafka

👍

Kafka: Configure it!

To configure Kafka ingestion for your platform, see Kafka Streaming Ingestion.

Kafka ingest connects to your existing Kafka infrastructure and consumes messages from it. It supports mutual authentication, multiple topics, compressed messages, and consuming messages from either the start or the end of the offset.
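The "start or end of the offset" choice corresponds to Kafka's standard `auto.offset.reset` consumer setting. The sketch below shows the two consumer configurations a peer could use; the broker and group names are placeholders, and the keyword style follows the kafka-python client rather than Hydrolix's own configuration format.

```python
# Consume from the oldest retained message onwards.
consume_from_start = {
    "bootstrap_servers": ["broker1:9092", "broker2:9092"],  # placeholder brokers
    "group_id": "hydrolix-peers",                           # placeholder group
    "auto_offset_reset": "earliest",
}

# Same configuration, but only new messages from now on.
consume_from_end = {**consume_from_start, "auto_offset_reset": "latest"}

# With kafka-python this would become, e.g.:
#   KafkaConsumer("my_topic", **consume_from_start)
print(consume_from_start["auto_offset_reset"], consume_from_end["auto_offset_reset"])
```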


Components for the architecture are as follows:

| Component | Description | Scale to 0 |
|---|---|---|
| Kafka Peers | Servers within the Kafka-Peer pool download messages from a Kafka broker infrastructure; the messages are indexed and then stored in the HDX DB storage bucket. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Catalog | Contains metadata on the database and partitions. Forms part of the core infrastructure. | No |
| Customer's Kafka | The customer's Kafka infrastructure, used as a source. | - |

AWS Kinesis

👍

AWS Kinesis: Configure it!

To configure Kinesis ingestion for your platform, see Kinesis Streaming Ingestion.

Kinesis ingest connects to your AWS Kinesis infrastructure and consumes messages from it. It supports multiple streams, compressed messages, and consuming messages from either the start or the end of the stream.
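In Kinesis, the start-or-end choice maps to the shard-iterator type: `TRIM_HORIZON` reads from the oldest retained record, while `LATEST` reads only new records. The stream and shard names below are placeholders; the parameter names match the standard AWS `GetShardIterator` call, not a Hydrolix-specific API.

```python
def shard_iterator_params(stream: str, shard_id: str, from_start: bool) -> dict:
    """Build parameters for a Kinesis GetShardIterator request.

    from_start=True reads from the oldest retained record (TRIM_HORIZON);
    from_start=False reads only records arriving from now on (LATEST).
    """
    return {
        "StreamName": stream,
        "ShardId": shard_id,
        "ShardIteratorType": "TRIM_HORIZON" if from_start else "LATEST",
    }

# With boto3 this would be used as:
#   kinesis.get_shard_iterator(**shard_iterator_params(...))
print(shard_iterator_params("my-stream", "shardId-000000000000", True))
```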


Components for the architecture are as follows:

| Component | Description | Scale to 0 |
|---|---|---|
| Kinesis Peers | Servers within the Kinesis-Peer pool download messages from a Kinesis infrastructure; the messages are indexed and then stored in the HDX DB storage bucket. | Yes |
| HDX DB Storage Bucket | Contains the database (including partitions), configuration, and other state information concerning the platform. Forms part of the core infrastructure. | No |
| Catalog | Contains metadata on the database, partitions, and job tasks. Forms part of the core infrastructure. | No |
| Customer's AWS Kinesis | The customer's AWS Kinesis infrastructure, used as a source. | - |
| Kinesis State | A key-value store used to manage the state of consumption from the Kinesis stream. | - |