Compact & Optimize Data (Merge)

As part of its data lifecycle tools, Hydrolix provides an automated compaction and optimization service called Merge. This service optimizes ingested data for query performance and storage footprint. Put simply, the Merge service combines small partitions into larger, more efficient ones.


Enable Merge on Tables

By default, all tables have merge enabled. You can disable merge with the tables API endpoint. Disabling merge immediately stops new merge jobs from being queued, but jobs already in the queue still run.

{
    "name": "<table name>",
    "settings": {
        "merge": {
            "enabled": false
        }
    }
}
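As a rough sketch, the body above could be wrapped in a request using Python's standard library. The URL layout and the PATCH method are borrowed from the tables endpoint request shown later on this page. The host, IDs, and token are placeholders, and `build_disable_merge_request` is a hypothetical helper, not part of any Hydrolix client.

```python
import json
import urllib.request


def build_disable_merge_request(base_url, org, project, table, table_name, token):
    """Build (but do not send) a PATCH request that turns merge off for a table."""
    # Path layout assumed from the tables endpoint example on this page.
    url = f"{base_url}/orgs/{org}/projects/{project}/tables/{table}/"
    body = {"name": table_name, "settings": {"merge": {"enabled": False}}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
req = build_disable_merge_request(
    "https://host.example", "org-uuid", "project-uuid", "table-uuid",
    "my-table", "access-token",
)
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately makes it easy to inspect the URL and body first.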

🚧

Disabling merge is not recommended, and may result in performance degradation.

Settings

| Name | Default | Description |
| --- | --- | --- |
| enabled | true | Whether the merge service queues new merge jobs. |

Merge Resources

Hydrolix clusters automatically create merge components in 3 pools: small, medium and large. Each targets different partition types.

The following two tables show the criteria used to assign partitions to merge pools.

| Partition created \ Primary timestamp | < 1 Hr Ago | 1-24 Hrs Ago | > 1 Day Ago |
| --- | --- | --- | --- |
| < 1 Hr Ago | Small (I) | Medium (II) | Medium (II) |
| > 1 Hr Ago and < 90 days ago | | Medium (II) | Large (III) |

For example, consider a partition created two days ago whose primary timestamp is older than one day. Hydrolix uses the Large (III) pool to merge this partition.
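The lookup described by the table can be sketched as a small function. This is an illustration under stated assumptions, not Hydrolix's actual scheduler: it reads the rows as partition creation age and the columns as primary timestamp age (which matches the worked example above), it guesses how exact boundaries are handled, and the `pool_for` name is hypothetical.

```python
from datetime import timedelta

HOUR = timedelta(hours=1)
DAY = timedelta(days=1)
MAX_AGE = timedelta(days=90)


def pool_for(partition_age, primary_ts_age):
    """Return the pool label for a partition, or None where the table has no entry.

    partition_age: time elapsed since the partition was created.
    primary_ts_age: time elapsed since the partition's primary timestamp.
    Behaviour at exactly 1 hour, 1 day, or 90 days is an assumption;
    the table does not specify it.
    """
    if partition_age < HOUR:
        return "Small (I)" if primary_ts_age < HOUR else "Medium (II)"
    if partition_age < MAX_AGE:
        if primary_ts_age < HOUR:
            return None  # this cell is not given in the table
        return "Medium (II)" if primary_ts_age <= DAY else "Large (III)"
    return None  # partitions 90 days or older are not covered by the table


# The worked example: created two days ago, primary timestamp older than one day.
print(pool_for(timedelta(days=2), timedelta(days=3)))
```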

📘

Primary Timestamp

For more information on primary timestamps, see Timestamp Data Types.

In addition to time, Hydrolix also uses size and maximum aggregate time window to sort merge jobs into pools:

| Peer Type | Max Aggregate Size | Max Aggregate Time Window |
| --- | --- | --- |
| Small (merge) | 1 GB | 5 Minutes |
| Medium (merge-II) | 2 GB | 1 Hour |
| Large (merge-III) | 4 GB | 1 Day |

Hydrolix uses three sizes to efficiently tackle different partition aggregation levels. This ensures that your cluster achieves optimal sizing and spreads merge workloads across old and new data.
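One way to read the size and time limits is as a pair of upper bounds per pool. The sketch below picks the smallest peer type whose limits cover a proposed merge result. It only illustrates the limits in the table and is not a description of how Hydrolix schedules jobs. The `smallest_fitting_pool` name, the decimal definition of a gigabyte, and the inclusive comparisons are all assumptions.

```python
from datetime import timedelta

GB = 10**9  # assumption: decimal gigabytes; the table does not say which unit is meant

# (peer type, max aggregate size in bytes, max aggregate time window)
POOL_LIMITS = [
    ("Small (merge)", 1 * GB, timedelta(minutes=5)),
    ("Medium (merge-II)", 2 * GB, timedelta(hours=1)),
    ("Large (merge-III)", 4 * GB, timedelta(days=1)),
]


def smallest_fitting_pool(total_bytes, time_window):
    """Return the first peer type whose limits cover the merged result, else None."""
    for peer_type, max_bytes, max_window in POOL_LIMITS:
        if total_bytes <= max_bytes and time_window <= max_window:
            return peer_type
    return None  # the result would exceed every pool's limits


# A 500 MB result spanning 30 minutes is too wide for the small pool's window.
print(smallest_fitting_pool(500_000_000, timedelta(minutes=30)))
```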

Custom Merge Pools

It can sometimes be helpful to create additional merge pools targeted at specific tables. For example, you might create a special merge pool to handle merge within a Summary Table.

You can create custom merge pools with the pools API endpoint and assign them to tables with the tables API endpoint.

The following HTTP request creates a custom pool with the pools API endpoint:

POST {{base_url}}pools/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
     "settings": {
          "is_default": false,
          "k8s_deployment": {
               "service": "merge-peer",
               "scale_profile": "II"
          }
     },
     "name": "my-pool-name-II"
}

You can use the following settings to configure your pool:

| Setting | Description | Value/Example |
| --- | --- | --- |
| service | The service workload the pool uses. For merge, this is merge-peer. | merge-peer |
| scale_profile | The scale profile to use for this pool. See Scale Profiles. | Optional. Using the default values is recommended. One of I, II, or III. |
| name | The name used to identify your pool. | Example: my-pool-large |
| cpu | The amount of CPU provided to pods. | A numeric value. Defaults are specified in Scale Profiles. Example: 2 |
| memory | The amount of memory provided to pods. | A string value. Defaults are specified in Scale Profiles. Example: 10Gi |
| replicas | The number of pods to run in the pool. | A numeric value. Defaults are specified in Scale Profiles. Example: 3 |
| storage | The amount of ephemeral storage provided to pods. | A string value. Defaults are specified in Scale Profiles. Example: 5Gi |
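The pool-creation body shown earlier can be assembled programmatically. The following is a minimal sketch that mirrors only the fields present in that example request. `pool_payload` is a hypothetical helper. The `cpu`, `memory`, `replicas`, and `storage` settings are left out because this page does not show where they sit in the body.

```python
import json

SCALE_PROFILES = {"I", "II", "III"}


def pool_payload(name, scale_profile="II", service="merge-peer", is_default=False):
    """Assemble a pool-creation body shaped like the example request."""
    if scale_profile not in SCALE_PROFILES:
        raise ValueError(f"scale_profile must be one of {sorted(SCALE_PROFILES)}")
    return {
        "name": name,
        "settings": {
            "is_default": is_default,
            "k8s_deployment": {
                "service": service,
                "scale_profile": scale_profile,
            },
        },
    }


print(json.dumps(pool_payload("my-pool-name-II"), indent=2))
```

Validating `scale_profile` locally catches typos before the request is sent. The default of "II" here simply matches the example and is not a documented API default.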

The following HTTP request assigns a set of custom pools to a table with the tables API endpoint:

PATCH {{base_url}}/orgs/{{org_uuid}}/projects/{{project_uuid}}/tables/{{table_uuid}}/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
    "name": "my-table",
    "settings": {
        "merge": {
            "enabled": true,
            "pools": {
                "large": "my-pool-name-III",
                "medium": "my-pool-name-II",
                "small": "my-pool-name-I"
            }
        }
    }
}

📘

For optimal merge performance, provide a large, medium, and small pool.
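The recommendation above can be enforced when building the table settings. This sketch produces a body shaped like the PATCH example and refuses to build one unless all three pool sizes are supplied. That strictness is a design choice of the sketch; the API itself may accept a partial set. `merge_pools_patch` is a hypothetical helper.

```python
POOL_SIZES = ("small", "medium", "large")


def merge_pools_patch(table_name, pools):
    """Build a table PATCH body assigning custom merge pools for all three sizes."""
    unknown = sorted(set(pools) - set(POOL_SIZES))
    if unknown:
        raise ValueError(f"unknown pool sizes: {unknown}")
    missing = [size for size in POOL_SIZES if size not in pools]
    if missing:
        raise ValueError(f"missing pool sizes: {missing}")
    return {
        "name": table_name,
        "settings": {
            "merge": {
                "enabled": True,
                "pools": {size: pools[size] for size in POOL_SIZES},
            }
        },
    }


body = merge_pools_patch(
    "my-table",
    {
        "small": "my-pool-name-I",
        "medium": "my-pool-name-II",
        "large": "my-pool-name-III",
    },
)
print(body["settings"]["merge"]["pools"])
```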