Ingestion is the process of moving data into Hydrolix. You can ingest data in several ways:

  • Batch - Load bulk data from a storage bucket.
  • Stream API - Stream data to an HTTP interface.
  • Kafka - Read messages from Kafka.
  • Kinesis - Read messages from AWS Kinesis.

Batch

Batch ingestion pulls in data from a supplied storage bucket. You can ingest data with batch in two ways:

  • Batch Load: start the load process manually via the API or UI
  • Batch Load with Autoingest: start the load process automatically when new objects are available
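A batch load job is described to the cluster as a JSON document naming the target table and the source bucket. The sketch below builds such a payload; the endpoint is not shown, and the field names and bucket URL are illustrative assumptions rather than the exact Hydrolix API schema, so check your cluster's API reference for the real contract.

```python
import json

# Hypothetical batch load job request. Field names ("name", "settings",
# "source", "url") and the bucket URL are assumptions for illustration.
job = {
    "name": "example_batch_load",  # assumed: a label for the job
    "settings": {
        "source": {
            "table": "sample_project.sample_table",  # target project.table
            "type": "batch",
            "settings": {
                # storage bucket containing the files to load
                "url": "gs://example-source-bucket/logs/",
            },
        },
    },
}

payload = json.dumps(job)
print(payload)
```

Submitting this payload via the API (or filling in the same fields in the UI) starts the load manually; Autoingest instead triggers the same machinery when new objects appear.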

Batch ingestion uses the following architecture:

| Component | Description | Scale to 0 |
| --- | --- | --- |
| Intake-API | Receives batch jobs and places them on a queue to be processed by the Batch Head. | Yes |
| Batch Head | Takes batch jobs from the queue and creates tasks for individual files to be worked on by Batch Peers. | Yes |
| Batch-Peer Pool | A group of pods that read the source files from the supplied storage bucket, apply the Hydrolix transform, and output indexed database partitions to the Hydrolix Database bucket. Also reports created partition metadata to the Catalog. | Yes |
| Queues | A listing queue of jobs to be worked by a Batch Peer. Expires jobs after a period of time if they are not accessed. | No |
| Batch-Job State | Stores the state of batch jobs. Used to "replay" failed jobs and to track job progress. | No |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |
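The job fan-out above (Intake-API enqueues a job, the Batch Head expands it into one task per source file, Batch Peers drain the task queue) can be sketched as a toy pipeline. File names here are made up for illustration, and real peers would transform the files rather than just record them.

```python
from queue import Queue

job_queue: Queue = Queue()
task_queue: Queue = Queue()

# Intake-API: accept a job and queue it for the Batch Head.
job_queue.put({"job_id": 1, "files": ["a.json", "b.json", "c.json"]})

# Batch Head: expand each queued job into one task per source file.
while not job_queue.empty():
    job = job_queue.get()
    for f in job["files"]:
        task_queue.put({"job_id": job["job_id"], "file": f})

# Batch Peer: work each task (here we only record the file name).
processed = []
while not task_queue.empty():
    processed.append(task_queue.get()["file"])

print(processed)
```

The per-file task split is what lets the Batch-Peer pool scale out: each file can be indexed by a different pod in parallel.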

Autoingest-Specific Components

Batch Load with Autoingest uses the following additional components:

| Component | Description | Scale to 0 |
| --- | --- | --- |
| Bucket Notification Queue | An external queue containing events that describe changes to the customer's bucket. | No |
| AutoIngest | Takes job notifications from the bucket notification queue. | Yes |
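The events on the bucket notification queue look like the cloud provider's object-change notifications. The sketch below parses an AWS S3-style event (the record layout follows S3's event notification format; other providers use different but analogous shapes, and the bucket and key names are made up).

```python
# Example bucket notification event, S3-style: one record per changed object.
event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-source-bucket"},
                "object": {"key": "logs/2024/01/01/part-0000.json"},
            },
        }
    ]
}

# Keep only object-creation events; these are the new objects to ingest.
new_objects = [
    (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
    for r in event["Records"]
    if r["eventName"].startswith("ObjectCreated")
]
print(new_objects)
```

AutoIngest turns each such new-object notification into a batch load job, so ingestion starts without any manual API or UI action.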

🛠️

Try Batch Ingestion

To configure Batch ingestion in your Hydrolix cluster, see Ingest with Batch.

Stream

Stream ingestion receives data as individual or batched messages via an HTTP interface. It supports data formats including JSON, CSV, TSV, and other character-separated-value formats. You can either pre-define schemas or supply a schema with each message.
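A streaming ingest call is an HTTP POST of one or more messages. The sketch below only builds the request without sending it; the `/ingest/event` path and the `x-hdx-table` header name are assumptions made for illustration and may differ on your cluster, so verify them against your own docs.

```python
import json
from urllib.request import Request

# One JSON message; field names are made-up example data.
message = json.dumps({"timestamp": "2024-01-01T00:00:00Z", "status": 200})

req = Request(
    url="https://myhost.hydrolix.live/ingest/event",  # assumed ingest path
    data=message.encode("utf-8"),
    headers={
        "content-type": "application/json",
        "x-hdx-table": "sample_project.sample_table",  # assumed header name
    },
    method="POST",
)
# urllib.request.urlopen(req) would actually send it; omitted here.
print(req.get_method(), req.full_url)
```

A 400 response to such a request means the Stream Head rejected the message for failing its syntax or structure checks, as described in the architecture table.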

Stream ingestion uses the following architecture:

| Component | Description | Scale to 0 |
| --- | --- | --- |
| Load-Balancer | Receives messages via HTTP and routes the host path http://myhost.hydrolix.live/ingest/ to the Stream Head. | Yes |
| Stream Head | Checks that messages conform to basic syntax and message structure rules. When a message passes these checks, sends the message to a listing queue. When a message fails these checks, returns an HTTP 400 response to the client. | Yes |
| Queue | A listing queue of jobs to be worked by a Stream Peer. | No |
| Stream-Peer Pool | A group of pods that take jobs from the queue, apply the Hydrolix transform, and output indexed database partitions to the Hydrolix Database bucket. Also reports created partition metadata to the Catalog. | Yes |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |

Summary

Stream ingestion uses the Summary service to generate summary tables. Summary tables store aggregations such as min, max, count, and sum; in some cases they can also store percentiles, unique counts, or other statistical aggregations. During stream ingestion, the Summary service runs a SQL statement on the incoming data and writes the output to a summary table. For more information about summary tables, see Aggregation with Summary Tables.
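To make the idea concrete, the sketch below computes by hand the aggregates a summary table would hold for one time bucket, roughly what a summary SQL statement like `SELECT min(...), max(...), count(), sum(...)` produces. The column name `latency_ms` and the minute bucket are made-up examples.

```python
# Incoming rows for one minute bucket (example data).
rows = [
    {"minute": "2024-01-01T00:00", "latency_ms": 12},
    {"minute": "2024-01-01T00:00", "latency_ms": 48},
    {"minute": "2024-01-01T00:00", "latency_ms": 30},
]

values = [r["latency_ms"] for r in rows]

# The summary table stores one aggregate row instead of the raw rows.
summary_row = {
    "minute": rows[0]["minute"],
    "min_latency": min(values),  # 12
    "max_latency": max(values),  # 48
    "count": len(values),        # 3
    "sum_latency": sum(values),  # 90
}
print(summary_row)
```

Storing min, max, count, and sum per bucket keeps averages computable later (sum divided by count) while the summary table stays far smaller than the raw stream.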

🛠️

Try Stream Ingestion

To configure Stream for your Hydrolix cluster, see Streaming API.

Kafka

Kafka ingestion reads data from a Kafka stream. Kafka ingestion supports:

  • mutual authentication
  • multiple topics
  • compressed messages
  • message reads from either the earliest or the latest offset
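These capabilities map onto standard Kafka consumer settings. The sketch below uses conventional Kafka consumer property names to illustrate them; how Hydrolix actually surfaces these options in its Kafka source configuration is not shown here, and the hosts, certificate paths, and topic names are made-up examples.

```python
# Illustrative consumer settings for the capabilities listed above.
consumer_config = {
    "bootstrap.servers": "kafka1.example.com:9093",
    # Mutual authentication: both client and broker present TLS certificates.
    "security.protocol": "SSL",
    "ssl.certificate.location": "/etc/certs/client.pem",
    "ssl.key.location": "/etc/certs/client.key",
    "ssl.ca.location": "/etc/certs/ca.pem",
    # Read from the start ("earliest") or the end ("latest") of the offset.
    "auto.offset.reset": "earliest",
}

# Multiple topics: a single source can subscribe to more than one topic.
topics = ["web_logs", "app_logs"]
print(consumer_config["auto.offset.reset"], topics)
```

Compressed messages need no extra configuration in this sketch: Kafka records carry their compression codec, and consumers decompress transparently.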

Kafka ingestion uses the following infrastructure:

| Component | Description | Scale to 0 |
| --- | --- | --- |
| Kafka-Peer | A group of pods that read jobs from Kafka, apply the Hydrolix transform, and output indexed database partitions to the Hydrolix Database bucket. Also reports created partition metadata to the Catalog. | Yes |
| Customer's Kafka | External Kafka infrastructure used as a data source. | - |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |

🛠️

Try Kafka Ingestion

To configure Kafka ingestion for your Hydrolix cluster, see Ingest via Kafka.

AWS Kinesis

Kinesis ingestion reads data from AWS Kinesis.

Kinesis ingestion supports:

  • multiple streams
  • compressed messages
  • message reads from either the oldest or the newest records in the stream
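The choice of reading from the beginning or the end of the stream maps, in AWS Kinesis terms, to the shard iterator type: `TRIM_HORIZON` starts at the oldest available record, while `LATEST` reads only records that arrive after the iterator is created (these are standard Kinesis iterator types). The stream name and region below are made-up examples.

```python
def iterator_type(read_from_start: bool) -> str:
    """Pick the Kinesis shard iterator type for the desired starting point."""
    # TRIM_HORIZON: oldest available record; LATEST: only new records.
    return "TRIM_HORIZON" if read_from_start else "LATEST"

# Illustrative source description; field names are assumptions, not the
# actual Hydrolix Kinesis source schema.
kinesis_source = {
    "stream_name": "example-stream",
    "region": "us-east-1",
    "shard_iterator_type": iterator_type(read_from_start=True),
}
print(kinesis_source["shard_iterator_type"])  # TRIM_HORIZON
```

The consumption position itself is tracked externally in DynamoDB (the Kinesis State component in the table below), so peers can resume where they left off after a restart.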

Kinesis ingestion uses the following infrastructure:

| Component | Description | Scale to 0 |
| --- | --- | --- |
| Kinesis Peers | A group of pods that read jobs from Kinesis, apply the Hydrolix transform, and output indexed database partitions to the Hydrolix Database bucket. Also reports created partition metadata to the Catalog. | Yes |
| Kinesis State | External DynamoDB instance that stores the state of consumption from Kinesis. | - |
| Customer's AWS Kinesis | External AWS Kinesis infrastructure used as a data source. | - |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |

🛠️

Try AWS Kinesis Ingestion

To configure Kinesis ingestion for your Hydrolix cluster, see Ingest via Kinesis.
