Ingest
Ingestion is the process of moving data into Hydrolix. You can ingest data in several ways:
- Batch - Load bulk data from a storage bucket.
- HTTP Stream API - Stream data to an HTTP interface.
- Kafka - Read messages from Kafka.
- Kinesis - Read messages from AWS Kinesis.
Batch
Batch ingestion pulls in data from a supplied storage bucket. You can ingest data with batch in two ways:
- Batch Load: start the load process manually through the API or UI
- Batch Load with Autoingest: start the load process automatically when new objects are available
Batch ingestion uses the following architecture:
| Component | Description | Scale to 0 |
|---|---|---|
| Intake-API | Receives batch jobs and places them on a queue to be processed by the Batch Head. | Yes |
| Batch Head | Takes batch jobs from the queue and creates tasks for individual files to be worked on by Batch Peers. | Yes |
| Batch-Peer Pool | A group of pods that read the source files from the supplied storage bucket, apply the Hydrolix transform, and output indexed database partitions to the Hydrolix Database bucket. Also report created partition metadata to the Catalog. | Yes |
| Queues | A listing queue of jobs to be worked by a Batch Peer. Expires jobs after a period of time if they're not accessed. | No |
| Batch-Job State | Stores the state of batch jobs. Used to "replay" failed jobs and to track job progress. | No |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |
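For illustration, the sketch below submits a manual Batch Load job to the Intake-API with Python's requests library. The endpoint path, organization ID, token, bucket URL, and settings fields are placeholder assumptions, not the documented request contract; see Batch Ingest for the exact shape.

```python
# Hypothetical sketch: submit a batch load job to the Intake-API.
# Endpoint path, org ID, token, and settings fields are placeholders.
import requests

HDX_HOST = "https://myhost.hydrolix.live"    # assumed cluster hostname
API_TOKEN = "<bearer-token>"                 # obtained via the config API login

job = {
    "name": "example-batch-load",
    "settings": {
        # Storage bucket and prefix holding the source files (placeholder)
        "source": {"url": "gs://example-source-bucket/logs/2024/"},
        # Target table and transform are assumed to already exist
        "table": "sample_project.sample_table",
        "transform": "sample_transform",
    },
}

resp = requests.post(
    f"{HDX_HOST}/config/v1/orgs/<org-id>/jobs/batch/",   # assumed path
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=job,
    timeout=30,
)
resp.raise_for_status()
print("Batch job accepted:", resp.json())
```

Once a job is accepted, the Batch Head splits it into per-file tasks and the Batch-Peer pool writes the resulting partitions to the database bucket.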
Autoingest-Specific Components
Batch Load with Autoingest uses the following additional components:
| Component | Description | Scale to 0 |
|---|---|---|
| Bucket Notification Queue | An external queue containing events that describe changes to the customer's bucket. | No |
| Autoingest | Takes job notifications from the bucket notification queue. | Yes |
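To make the trigger concrete, the snippet below parses the kind of bucket notification the Autoingest service consumes, assuming the standard AWS S3 event format; the actual event schema depends on your cloud provider and queue configuration, and the bucket and object names are placeholders.

```python
import json

# Example new-object notification in the standard AWS S3 event format.
# Autoingest reads events like this from the bucket notification queue
# and turns each one into a batch task. Names below are placeholders.
notification = json.loads("""
{
  "Records": [
    {
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": {"name": "example-source-bucket"},
        "object": {"key": "logs/2024/05/01/part-0001.json.gz"}
      }
    }
  ]
}
""")

for record in notification["Records"]:
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    print(f"New object to ingest: s3://{bucket}/{key}")
```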
Note: To configure batch ingestion in your Hydrolix cluster, see Batch Ingest.
HTTP Stream API
Stream ingestion receives data on an HTTP endpoint, delivered as a single message or multiple messages in a payload. Stream ingestion supports formats including JSON, CSV, TSV, and other character-separated value formats. Clients can either supply a write schema along with each incoming payload or define a write schema in the cluster as a transform before sending data.
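For example, a minimal streaming request might look like the sketch below, which posts one JSON event with Python's requests library. The /ingest/event path and the x-hdx-table and x-hdx-transform headers are assumed from the HTTP Stream API documentation; the hostname, table, and transform names are placeholders, and the request shape differs if you supply a write schema inline instead of referencing a pre-defined transform.

```python
# Hedged sketch: stream one JSON event to the HTTP Stream API.
# Hostname, table, and transform names are placeholders; the path and
# x-hdx-* headers are assumptions based on the HTTP Stream API docs.
import requests

event = {
    "timestamp": "2024-05-01T12:00:00Z",
    "status": 200,
    "bytes": 512,
}

resp = requests.post(
    "https://myhost.hydrolix.live/ingest/event",
    headers={
        "content-type": "application/json",
        "x-hdx-table": "sample_project.sample_table",   # target project.table
        "x-hdx-transform": "sample_transform",          # pre-defined write schema
    },
    json=event,
    timeout=10,
)
resp.raise_for_status()
```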
Stream ingestion provides two architectures:
- the simpler intake head architecture (default)
- the older stream head and stream peer pattern
Intake head
The intake head architecture is the default and more efficient option for streaming ingest.
If your cluster uses the older stream head and stream peer pattern, use our step-by-step instructions to switch to intake head for the HTTP Stream API. You can run both architectures in parallel to assess functionality and stability.
| Component | Description | Scale to 0 |
|---|---|---|
| Load-Balancer | Receives messages via HTTP and routes the host path http://myhost.hydrolix.live/ingest/ to the Intake-Head. | Yes |
| Intake-Head | Checks that messages conform to basic syntax and message structure rules. When a message passes these checks, it sends the message to a listing queue. When a message fails these checks, it returns an HTTP 400 response to the client. It applies the Hydrolix transform and outputs indexed database partitions to the Hydrolix Database bucket. It also reports created partition metadata to the Catalog. | Yes |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |
An optional intake spill feature insulates streaming ingest clients from traffic spikes and catalog sluggishness, which can improve HTTP Stream API responsiveness and availability.
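Because the Intake-Head rejects malformed payloads with HTTP 400 while transient pressure can surface as server-side errors, clients usually treat the two differently. The sketch below shows one illustrative approach in Python; the endpoint, headers, and retry policy are assumptions, not prescribed client behavior.

```python
import time
import requests

INGEST_URL = "https://myhost.hydrolix.live/ingest/event"   # placeholder host
HEADERS = {
    "content-type": "application/json",
    "x-hdx-table": "sample_project.sample_table",           # assumed headers
    "x-hdx-transform": "sample_transform",
}

def send_event(event: dict, retries: int = 3) -> None:
    """Send one event; fail fast on 400, retry briefly on 5xx."""
    for attempt in range(retries):
        resp = requests.post(INGEST_URL, headers=HEADERS, json=event, timeout=10)
        if resp.status_code == 400:
            # The Intake-Head rejected the message structure; retrying won't help.
            raise ValueError(f"Rejected payload: {resp.text}")
        if resp.status_code < 500:
            resp.raise_for_status()   # raises on other 4xx; returns on success
            return
        time.sleep(2 ** attempt)      # back off on transient server-side errors
    resp.raise_for_status()
```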
Stream Head and Stream Peer
| Component | Description | Scale to 0 |
|---|---|---|
| Load-Balancer | Receives messages via HTTP and routes the host path http://myhost.hydrolix.live/ingest/ to the Stream Head. | Yes |
| Stream-Head | Checks that messages conform to basic syntax and message structure rules. When a message passes these checks, it sends the message to a listing queue. When a message fails these checks, it returns an HTTP 400 response to the client. | Yes |
| Redpanda | A pub/sub-enabled queue of data to be written by the stream-peer pods. | No |
| Stream-Peer Pool | A group of pods that take jobs from the queue, apply the Hydrolix transform and output indexed database partitions to the Hydrolix Database bucket. Also report created partition metadata to the Catalog. | Yes |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |
Note: To configure stream ingestion for your Hydrolix cluster, see HTTP Stream API.
The stream head and stream peer architecture has been replaced by the much more efficient intake head architecture. It's not recommended for new installations.
Kafka
Kafka ingestion reads data from a Kafka stream. Kafka ingestion supports:
- mutual authentication
- multiple topics
- compressed messages
- message reads from either the earliest or the latest offset
Kafka ingestion uses the following infrastructure:
| Component | Description | Scale to 0 |
|---|---|---|
| Kafka-Peer | A group of pods that read jobs from Kafka, apply the Hydrolix transform and output indexed database partitions to the Hydrolix Database bucket. Also report created partition metadata to the Catalog. | Yes |
| Customer's Kafka | External Kafka infrastructure used as a data source. | - |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |
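One quick way to verify the pipeline end to end is to publish a test message to a topic the Kafka-Peer pool reads, as sketched below with the confluent-kafka client. The broker address and topic name are placeholders and must match the Kafka source configured in your cluster; clusters using mutual authentication also need the appropriate TLS settings.

```python
import json
from confluent_kafka import Producer

# Placeholder broker and topic; these must match the Kafka source
# configured in your Hydrolix cluster.
producer = Producer({"bootstrap.servers": "broker-1.example.com:9092"})

message = {"timestamp": "2024-05-01T12:00:00Z", "status": 200, "bytes": 512}

# The Kafka-Peer pool consumes this topic, applies the transform, and
# writes the indexed partitions to the database bucket.
producer.produce("sample-topic", value=json.dumps(message).encode("utf-8"))
producer.flush()
```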
Note: To configure Kafka ingestion for your Hydrolix cluster, see Kafka.
AWS Kinesis
Kinesis ingestion reads data from AWS Kinesis.
Kinesis ingestion supports:
- multiple streams
- compressed messages
- message reads from either the earliest or the latest offset
Kinesis ingestion uses the following infrastructure:
| Component | Description | Scale to 0 |
|---|---|---|
| Kinesis Peers | A group of pods that read jobs from Kinesis, apply the Hydrolix transform and output indexed database partitions to the Hydrolix Database bucket. Also report created partition metadata to the Catalog. | Yes |
| Kinesis State | External DynamoDB instance that stores the state of consumption from Kinesis. | - |
| Customer's AWS Kinesis | External AWS Kinesis infrastructure used as a data source. | - |
| Hydrolix Database Storage Bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
| Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |
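Similarly, you can push a test record into the source stream with boto3 and confirm that the Kinesis Peers pick it up; the stream name, region, and record contents below are placeholders.

```python
import json
import boto3

# Placeholder stream name and region; these must match the Kinesis source
# configured in your Hydrolix cluster.
kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"timestamp": "2024-05-01T12:00:00Z", "status": 200, "bytes": 512}

kinesis.put_record(
    StreamName="sample-stream",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey="sample-partition-key",
)
```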
Note: To configure Kinesis ingestion for your Hydrolix cluster, see Kinesis.
Summary tables
Stream ingestion can also produce summary tables. Summary tables store aggregations including min, max, counts, and sum. In some cases, summary tables can store quantiles, unique counts, or other statistical aggregations. During stream ingestion, peers create summary table partitions via SQL statements and write the output to a summary table. For more information about summary tables, see Aggregation with Summary Tables.
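To give a sense of what such an aggregation looks like, the snippet below sketches the kind of SQL a summary transform might compute during ingest, written with ClickHouse-style functions; the table and column names are placeholders, and the configuration that carries this SQL is described in Aggregation with Summary Tables.

```python
# Illustrative only: the shape of an aggregation a summary transform might
# compute during stream ingest. Table and column names are placeholders.
summary_sql = """
SELECT
    toStartOfMinute(timestamp) AS minute,   -- time bucket for the rollup
    count()       AS request_count,
    sum(bytes)    AS total_bytes,
    min(latency)  AS min_latency,
    max(latency)  AS max_latency
FROM sample_project.sample_table
GROUP BY minute
"""
# A summary transform would typically carry SQL like this in its
# configuration; see Aggregation with Summary Tables for the supported form.
print(summary_sql)
```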