Merge
Overview
Merge is a data lifecycle service that organizes Hydrolix data into an optimal state. Merge runs periodically in Hydrolix clusters.
Hydrolix can ingest data out of order. Because Hydrolix makes data available quickly, out-of-order ingestion can initially create sub-optimal partitions. These partitions can compress inefficiently and slow down queries. Merge combines them into larger partitions to improve query performance and storage costs.
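As a rough illustration of why out-of-order ingestion matters, the following toy model flushes events into partitions in arrival order, then combines and re-sorts them. It is a sketch only: the `Partition` class, the `ingest`, `overlapping_pairs`, and `merge` functions, and the batch sizes are all hypothetical and don't reflect how Hydrolix actually lays out or selects partitions.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Toy partition: only the event timestamps it holds (hypothetical model)."""
    timestamps: list

    @property
    def min_ts(self) -> int:
        return min(self.timestamps)

    @property
    def max_ts(self) -> int:
        return max(self.timestamps)


def ingest(events: list, batch_size: int) -> list:
    """Flush every `batch_size` events into a partition, in arrival order."""
    return [Partition(events[i:i + batch_size])
            for i in range(0, len(events), batch_size)]


def overlapping_pairs(partitions: list) -> int:
    """Count pairs of partitions whose time ranges overlap."""
    count = 0
    for i, a in enumerate(partitions):
        for b in partitions[i + 1:]:
            if a.min_ts <= b.max_ts and b.min_ts <= a.max_ts:
                count += 1
    return count


def merge(partitions: list, target_size: int) -> list:
    """Combine all events, sort them by time, and re-split into larger partitions."""
    all_ts = sorted(ts for p in partitions for ts in p.timestamps)
    return ingest(all_ts, target_size)


# Events arrive out of order (the values are timestamps).
arrivals = [5, 1, 9, 2, 8, 3, 7, 4, 6]
before = ingest(arrivals, batch_size=2)   # many small partitions
after = merge(before, target_size=5)      # fewer, larger partitions

print(len(before), overlapping_pairs(before))
print(len(after), overlapping_pairs(after))
```

In this toy model, the small partitions created at ingest cover overlapping time ranges, while the merged partitions are fewer and cover disjoint ranges.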
Architecture
Merge uses the following architecture:
Component | Description | Scale to 0 |
---|---|---|
Merge head | Uses the Catalog to determine which partitions to combine. Sends messages describing these combine tasks to a queue. | Yes |
Queue (RabbitMQ) | Contains a list of partition combine tasks to be worked on. | No |
Merge peer | A group of workers that take partition combine tasks from the queue. Each worker reads partitions from storage, combines them to create new partitions, and writes the new combined partitions to the Hydrolix database. Finally, it updates the catalog with the new partition and removes the old partitions. | Yes |
Hydrolix database storage bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |
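The flow between these components can be sketched in a few lines. This is a minimal in-memory model, not the Hydrolix implementation: a Python `queue.Queue` stands in for RabbitMQ, plain dictionaries stand in for the catalog and the storage bucket, and the selection rule (`max_rows`) and all function names are assumptions made for illustration.

```python
import queue

# Hypothetical stand-ins for the real components.
catalog = {"p1": 2, "p2": 1, "p3": 3}                     # partition id -> row count
storage = {"p1": [1, 4], "p2": [2], "p3": [3, 5, 6]}      # partition id -> rows
tasks = queue.Queue()                                     # stand-in for the RabbitMQ queue


def merge_head(max_rows: int) -> None:
    """Use the catalog to pick small partitions and enqueue one combine task."""
    candidates = sorted(pid for pid, rows in catalog.items() if rows < max_rows)
    if len(candidates) >= 2:
        tasks.put(candidates)


def merge_peer() -> None:
    """Take tasks from the queue, combine partitions, write the result, update the catalog."""
    while not tasks.empty():
        sources = tasks.get()
        combined = sorted(row for pid in sources for row in storage[pid])
        new_id = "+".join(sources)
        storage[new_id] = combined          # write the new combined partition
        catalog[new_id] = len(combined)     # register the new partition in the catalog
        for pid in sources:
            del catalog[pid]                # remove the old partitions from the catalog
        tasks.task_done()


merge_head(max_rows=3)
merge_peer()
print(catalog)
```

The head never touches partition data and the peers never decide what to combine; the queue is the only link between them. Cleanup of old partition files in storage is left out of this sketch.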
Configure Merge
To configure Merge in your Hydrolix cluster, see Merge.
Merge controller (v5.3.0+)
See Enable merge controller to enable the merge controller.
Merge controller is a recommended drop-in replacement for merge head. Architecturally, merge controller and merge head only differ in how they communicate with merge peers.
merge-head uses RabbitMQ as an intermediate communication channel with the merge-peer pools. Each pool has its own queue, and merge-head dispatches messages to a queue for a merge-peer to consume.
merge-controller doesn't use RabbitMQ. Instead, each merge-peer connects directly to the merge-controller via gRPC. This direct line of communication allows merge-controller greater visibility and control over the system as a whole, eliminating some of the largest inefficiencies in the merge system.
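One way to picture the difference is a controller that holds a direct handle to every connected peer, so it can see which peers are idle and which tasks are in flight. The sketch below uses plain Python objects in place of gRPC connections; the `MergeController` and `MergePeer` classes, their methods, and the dispatch rule are hypothetical and only illustrate the idea of direct dispatch, not the actual merge-controller protocol.

```python
from collections import deque
from typing import Optional


class MergePeer:
    """Hypothetical peer that works on at most one task at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.current_task: Optional[str] = None

    def handle(self, task: str) -> None:
        self.current_task = task

    def finish(self) -> Optional[str]:
        done, self.current_task = self.current_task, None
        return done


class MergeController:
    """Tracks every connected peer, so it knows who is idle and what is in flight."""

    def __init__(self) -> None:
        self.peers = {}          # peer name -> MergePeer
        self.pending = deque()   # tasks not yet assigned

    def connect(self, peer: MergePeer) -> None:
        self.peers[peer.name] = peer

    def submit(self, task: str) -> None:
        self.pending.append(task)

    def dispatch(self) -> None:
        """Hand pending tasks only to peers that are currently idle."""
        for peer in self.peers.values():
            if not self.pending:
                break
            if peer.current_task is None:
                peer.handle(self.pending.popleft())

    def in_flight(self) -> dict:
        return {name: p.current_task
                for name, p in self.peers.items() if p.current_task}


controller = MergeController()
peer_a, peer_b = MergePeer("peer-a"), MergePeer("peer-b")
controller.connect(peer_a)
controller.connect(peer_b)
for task in ("task-1", "task-2", "task-3"):
    controller.submit(task)

controller.dispatch()
first_round = controller.in_flight()   # both peers busy, one task still pending

peer_a.finish()
controller.dispatch()
second_round = controller.in_flight()  # peer-a picks up the remaining task
```

With a queue in between, the dispatcher only sees queue depth. Here the controller can answer "who is working on what" at any time, which is the kind of visibility the paragraph above describes.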
Architecture
Component | Description | Scale to 0 |
---|---|---|
Merge controller | A drop-in replacement for merge head. Determines which partitions should be combined to improve query performance and storage costs. Communicates merge tasks to merge peers using gRPC channels. | Yes |
gRPC channel | Intermediate component between the merge controller, which issues merge tasks, and the merge peers, which combine smaller partitions into larger partitions. | Yes |
Merge peer | A group of workers that receive partition combine tasks from the merge controller over gRPC. Each worker reads partitions from storage, combines them to create new partitions, and writes the new combined partitions to the Hydrolix database. Finally, it updates the catalog with the new partition and removes the old partitions. | Yes |
Hydrolix database storage bucket | Contains the partitions that comprise the database. Part of the core infrastructure. | No |
Catalog | Contains metadata regarding data stored in Hydrolix. Part of the core infrastructure. | No |