Ingest
Data can be ingested into the Hydrolix platform in a number of ways. Each supported ingest method is deployed as an architecturally independent set of components:
- Batch - Load data from a storage bucket
- Stream API (HTTP) - Stream messages for ingest to an HTTP interface
- Kafka - Ingest from an existing Kafka infrastructure
- Kinesis - Ingest from AWS Kinesis infrastructure
Batch
Batch: Configure it!
To configure Batch for your platform, information can be found here - Loading data with Batch
Batch contains two methods for ingest.
- Batch Load
- Batch Load with Autoingest
In both cases the batch mechanism pulls data from a supplied storage bucket. The load process can be started either through the API and Portal interfaces or via the Autoingest mechanism, where the bucket notifies the platform that it has objects to load.
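As an illustrative sketch, a batch load job could be submitted to the Intake-API over HTTP with Python. The endpoint path, payload field names, and identifiers below are assumptions for illustration only; see Loading data with Batch for the authoritative request format.

```python
import requests

# All values below are placeholders -- substitute your cluster host,
# organisation ID, bucket, table, transform, and API token.
HOST = "https://myhost.hydrolix.live"
ORG_ID = "my-org-uuid"

job = {
    "name": "load-2024-01-01",
    "settings": {
        # Assumed field names: the source bucket to pull from and the
        # target table/transform the Batch-Peers should apply.
        "source": {"url": "s3://my-source-bucket/events/2024/01/01/"},
        "hdx_table": "my_project.my_table",
        "transform": "my_transform",
    },
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG_ID}/jobs/batch/",
    json=job,
    headers={"Authorization": "Bearer my-api-token"},
)
resp.raise_for_status()
print(resp.json())  # job ID and initial status, tracked in Batch-Job State
```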
The basic Batch infrastructure uses the following components:
Component | Description | Scale to 0 |
---|---|---|
Intake-API | A deployed service that receives batch jobs to be run. These are placed on a queue and handed off to the Batch-Head for scheduling; the Batch-Head organises the job and its components, which are then placed on a listing queue. | Yes |
Queues | A listing queue of jobs to be worked on. Jobs expire after a period of time if they are not accessed. | No |
Batch-Head | The Batch-Head takes jobs from the queue and creates tasks for individual files to be worked on by the Batch-Peers. | Yes |
Batch-Peer Pool | A group of worker pods that retrieve the source files from the customer's bucket, apply the Hydrolix transform, and write indexed database partitions to the HDX DB storage bucket. In addition, they report the partitions they have created, along with other job metadata, to the Catalog database. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Batch-Job State | A database that contains the state of batch jobs. Used to "replay" jobs that may have failed and to track the progress of a job and its individual tasks. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Batch Autoingest
Batch Autoingest uses some additional components to receive notifications from the bucket the customer uploads data to. Batch jobs are then handed off to the infrastructure above.
Component | Description | Scale to 0 |
---|---|---|
AutoIngest | An additional service used with Autoingest. This service pulls job notifications from the bucket notification queue. | Yes |
Bucket Notification Queue | An additional queue, provided by the customer, that is notified of changes to the customer's bucket (an example configuration is sketched below). | No |
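On AWS, for example, the bucket notification queue is typically an SQS queue that the source bucket publishes object-created events to. A minimal sketch with boto3, assuming the queue already exists and its access policy permits the bucket to publish (bucket name and queue ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Notify the queue whenever a new object lands in the customer's source
# bucket; AutoIngest can then pull these notifications and create batch jobs.
s3.put_bucket_notification_configuration(
    Bucket="my-source-bucket",  # placeholder bucket name
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:hdx-autoingest",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```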
Stream API (HTTP)
Stream: Configure it!
To configure Stream for your platform, information can be found here - Streaming API (HTTP)
Streaming Ingest uses an HTTP interface that can receive data as individual or batched messages. Supported formats include CSV, TSV, and other character-separated formats, as well as JSON. Schema definitions can either be pre-defined or supplied with the message. Data enrichment is primarily applied at the Stream-Peer layer.
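As a sketch, a client might stream a small batch of JSON events with Python. The host, table, and transform names are placeholders, and the x-hdx-* headers shown for selecting the target table and a pre-defined schema are assumptions; see Streaming API (HTTP) for the current request format.

```python
import requests

HOST = "https://myhost.hydrolix.live"  # placeholder cluster host

# A small batch of JSON events; CSV/TSV bodies are also accepted.
events = [
    {"timestamp": "2024-01-01T00:00:00Z", "status": 200, "bytes": 512},
    {"timestamp": "2024-01-01T00:00:01Z", "status": 404, "bytes": 0},
]

resp = requests.post(
    f"{HOST}/ingest/event",
    json=events,
    headers={
        "x-hdx-table": "my_project.my_table",  # assumed header: target table
        "x-hdx-transform": "my_transform",     # assumed header: pre-defined schema
    },
)
# The Stream Head returns an HTTP 400-series error if validation fails.
resp.raise_for_status()
```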
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
App. Load-Balancer | Receives messages via HTTP and routes requests for the host path http://myhost.hydrolix.live/ingest/ to the Stream Head. | Yes |
Stream Head | The Stream Heads check that messages conform to basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client. Successful messages are sent to a listing queue. | Yes |
Queue | An event queue for messages to be worked on by the Stream Peers. | No |
Stream Peer | The Stream Peer servers monitor the queue for jobs, taking incoming messages and encoding them into database partitions. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Stream Summary
The Stream Summary service is specific to Streaming Ingest and is used to generate Summary tables. Summary tables lend themselves well to aggregations of metrics such as min, max, count, and sum, and in some cases percentiles, unique counts, or other statistical aggregations. The basic principle of a Summary table is that at ingest time a SQL statement is run against the incoming data, with the output written directly to a separate target table.
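For instance, a summary SQL statement might roll raw HTTP log events up into per-minute aggregates at ingest time. The statement below is only a sketch: table and column names are invented, and the exact dialect and the mechanism for referencing the incoming stream depend on your configuration.

```python
# Illustrative summary SQL, stored here as a Python constant. "raw_events"
# stands in for the incoming data stream; all column names are placeholders.
SUMMARY_SQL = """
SELECT
    toStartOfMinute(timestamp) AS minute,
    status,
    count()    AS requests,
    sum(bytes) AS total_bytes,
    min(bytes) AS min_bytes,
    max(bytes) AS max_bytes
FROM raw_events
GROUP BY minute, status
"""
```

Each incoming batch would then produce a handful of aggregate rows written to the summary target table, rather than the raw events themselves.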
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
App. Load-Balancer | Receives messages via HTTP and routes requests for the host path http://myhost.hydrolix.live/ingest/ to the Stream Head. | Yes |
Stream Head | The Stream Heads check that messages conform to basic syntax and message-structure rules. Any message failing these checks causes an HTTP 400-series error to be returned to the sending client. Successful messages are sent to a listing queue. | Yes |
Queue | An event queue for messages to be worked on by the Summary-Peer. | No |
Summary-Peer | The Summary-Peers monitor the queue for jobs, taking incoming messages and encoding them into database partitions. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Kafka
Kafka: Configure it!
To configure Kafka ingestion for your platform, information can be found here - Kafka Streaming Ingestion
Kafka Ingest connects to your existing Kafka infrastructure and consumes messages from it. The Kafka integration supports mutual authentication, multiple topics, compressed messages, and consuming from either the start or the end of the offset.
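As a sketch, a Kafka pull source could be registered through the configuration API so the Kafka Peers know which brokers and topics to consume. The endpoint path and payload fields below are assumptions for illustration; see Kafka Streaming Ingestion for the authoritative format.

```python
import requests

HOST = "https://myhost.hydrolix.live"  # placeholder
ORG, PROJECT, TABLE = "my-org-uuid", "my-project-uuid", "my-table-uuid"

# Assumed payload shape: where to connect, what to consume, and which
# transform the Kafka Peers should apply when indexing messages.
source = {
    "name": "my-kafka-source",
    "settings": {
        "bootstrap_servers": ["broker1:9092", "broker2:9092"],
        "topics": ["web_logs", "app_logs"],
    },
    "transform": "my_transform",
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG}/projects/{PROJECT}/tables/{TABLE}/sources/kafka/",
    json=source,
    headers={"Authorization": "Bearer my-api-token"},
)
resp.raise_for_status()
```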
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
Kafka Peers | Servers within the Kafka Peer pool consume messages from the Kafka brokers; the messages are indexed and then stored in the HDX DB storage bucket. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database and partitions. Forms part of the core infrastructure. | No |
Customer's Kafka | The customer's Kafka infrastructure, used as a source. | - |
AWS Kinesis
AWS Kinesis: Configure it!
To configure Kinesis ingestion for your platform, information can be found here - Kinesis Streaming Ingestion
Kinesis Ingest connects to your AWS Kinesis infrastructure and consumes messages from it. The Kinesis integration supports multiple streams, compressed messages, and consuming from either the start or the end of the stream.
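The shape mirrors the Kafka sketch above: a Kinesis pull source registered through the configuration API, telling the Kinesis Peers which stream and region to consume from. All identifiers and field names are placeholders; see Kinesis Streaming Ingestion for the authoritative format.

```python
import requests

HOST = "https://myhost.hydrolix.live"  # placeholder
ORG, PROJECT, TABLE = "my-org-uuid", "my-project-uuid", "my-table-uuid"

# Assumed payload shape: the stream to consume and the transform to apply.
source = {
    "name": "my-kinesis-source",
    "settings": {
        "stream_name": "my-kinesis-stream",
        "region": "us-east-1",
    },
    "transform": "my_transform",
}

resp = requests.post(
    f"{HOST}/config/v1/orgs/{ORG}/projects/{PROJECT}/tables/{TABLE}/sources/kinesis/",
    json=source,
    headers={"Authorization": "Bearer my-api-token"},
)
resp.raise_for_status()
```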
Components for the architecture are as follows:
Component | Description | Scale to 0 |
---|---|---|
Kinesis Peers | Servers within the Kinesis Peer pool consume messages from the Kinesis streams; the messages are indexed and then stored in the HDX DB storage bucket. | Yes |
HDX DB Storage Bucket | Contains the database (including partitions), configuration and other state information concerning the platform. Forms part of the core infrastructure. | No |
Catalog | Contains metadata on the database, partitions and job tasks. Forms part of the core infrastructure. | No |
Customer's AWS Kinesis | The customer's AWS Kinesis infrastructure, used as a source. | - |
Kinesis State | Key-value store that manages the state of consumption from the Kinesis stream. | - |