Compacting & Optimizing Data (Merge)

As part of its data lifecycle tools, Hydrolix provides an automated compaction and optimization service called Merge. Merge is a key service: it ensures that data ingested into the platform is optimized for good query performance and an efficient storage footprint. Put simply, the Merge service takes small partitions and combines them into larger, more efficient ones.


Enabling Merge on Tables

By default, merge is enabled on all tables, and turning it off is not recommended. If you do need to turn it off, use the table API endpoint (Tables). Setting enabled to false stops any new merge jobs from starting. Note that merge jobs that are already queued will still be completed.

{
    "name": "my-table",
    "settings": {
        "merge": {
            "enabled": false
        }
    }
}

Settings

| Name | Description | Default |
|---|---|---|
| enabled | Activates the merge service | true |

Merge Resources

Within a deployment, default merge components are created automatically. They are created in three pools (small, medium and large), each targeting partitions of a different type and age for optimisation.

The following two tables (combined) show the criteria used to assign partitions to a merge pool.

| Primary Timestamp / Partition Created (below) | < 1 Hr Ago | 1-24 Hrs Ago | > 1 Day Ago |
|---|---|---|---|
| < 1 Hr Ago | Small (I) | Medium (II) | Medium (II) |
| > 1 Hr Ago and < 90 days ago | | Medium (II) | Large (III) |

For example, if a partition was created two days ago and its primary timestamp is more than one day before now(), then the Large (III) pool is used to merge it.
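The assignment rules can be sketched in Python as a small lookup function. This is only an illustration of the criteria described here, not Hydrolix's implementation: the function name, the pool labels, the handling of exact boundaries (such as a partition created exactly one hour ago), and the choice of Medium (II) for older partitions whose primary timestamp is between 1 and 24 hours old are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HOUR = timedelta(hours=1)
DAY = timedelta(days=1)
MAX_CREATED_AGE = timedelta(days=90)


def select_merge_pool(created_at: datetime, primary_ts: datetime,
                      now: datetime) -> Optional[str]:
    """Return 'small', 'medium' or 'large', or None if no pool is listed."""
    created_age = now - created_at  # how long ago the partition was created
    data_age = now - primary_ts     # how old the primary timestamp is

    if created_age < HOUR:
        # Newly created partitions: Small (I) for fresh data, Medium (II) otherwise.
        return "small" if data_age < HOUR else "medium"
    if created_age < MAX_CREATED_AGE:
        if data_age > DAY:
            return "large"   # matches the worked example in the text
        if data_age >= HOUR:
            return "medium"  # assumption: Medium (II) covers the 1-24 hour range
    # Combination not covered by the criteria as interpreted here.
    return None
```

For instance, a partition created two days ago whose primary timestamp is three days old is routed to the large pool under these assumptions.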

📘 Primary Timestamp

For more information on primary timestamps, see Timestamp Data Types.

In addition to time, there are also qualifying criteria for the size of a partition and the maximum aggregated time window it contains.

| Peer Type | Max Aggregate Size | Max Aggregate Time |
|---|---|---|
| Small (merge) | 1 GB | 5 Minutes |
| Medium (merge-II) | 2 GB | 1 Hour |
| Large (merge-III) | 4 GB | 1 Day |
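As a rough illustration of how these limits might be applied, the sketch below checks whether a set of candidate partitions stays within a pool's limits. It assumes that "Max Aggregate Size" refers to the combined size of the partitions being merged, that "Max Aggregate Time" refers to the time window the merged data spans, and that 1 GB means 10^9 bytes; the names POOL_LIMITS and within_pool_limits are hypothetical.

```python
from datetime import timedelta
from typing import Iterable

GB = 10**9  # assumption: decimal gigabytes

# Limits taken from the table above: (max aggregate size, max aggregate time).
POOL_LIMITS = {
    "small": (1 * GB, timedelta(minutes=5)),
    "medium": (2 * GB, timedelta(hours=1)),
    "large": (4 * GB, timedelta(days=1)),
}


def within_pool_limits(pool: str, sizes_bytes: Iterable[int],
                       window: timedelta) -> bool:
    """Check a candidate merge against the size and time limits of a pool."""
    max_size, max_window = POOL_LIMITS[pool]
    return sum(sizes_bytes) <= max_size and window <= max_window
```

Under these assumptions, two partitions of 400 MB and 500 MB that together span four minutes would qualify for the small pool, while the same partitions spanning ten minutes would not.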

The reason for having three stages, or sizes, is to handle partitions at different aggregation levels efficiently, so that an optimised partition size is reached while merge workloads are spread suitably across old and new data. The default pools will attempt to merge all tables that have merge enabled (enabled set to true) in the table settings.

Custom Merge Pools

On occasion it may be beneficial to create additional merge pools targeted at specific tables, for example if a Summary Table is to be used. To do this, create the required merge pools using the pools API endpoint (/v1/pools/) and then update the table through the table API (Tables).

For example, to create a pool:

POST {{base_url}}pools/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
    "settings": {
        "is_default": false,
        "k8s_deployment": {
            "service": "merge-peer",
            "scale_profile": "II"
        }
    },
    "name": "my-pool-name-II"
}
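The same request can be assembled in Python with the standard library. The sketch below only builds the request object and does not send it. The function name is hypothetical, and base_url and access_token are placeholders that must be supplied for your own deployment. The body mirrors the JSON shown above and nothing more.

```python
import json
import urllib.request

VALID_SCALE_PROFILES = ("I", "II", "III")


def build_create_pool_request(base_url: str, access_token: str,
                              name: str, scale_profile: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a merge pool."""
    if scale_profile not in VALID_SCALE_PROFILES:
        raise ValueError(f"scale_profile must be one of {VALID_SCALE_PROFILES}")
    body = {
        "settings": {
            "is_default": False,
            "k8s_deployment": {
                "service": "merge-peer",
                "scale_profile": scale_profile,
            },
        },
        "name": name,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/pools/",  # tolerate a trailing slash in base_url
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request; that step is left out here so the sketch can be inspected without a live deployment.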

When configuring your pool, use the following settings:

| Object | Description | Value/Example |
|---|---|---|
| service | The service workload the pool will utilise. For merge this is merge-peer. | merge-peer |
| scale_profile | The scale profile that should be used for this pool. See Scale Profiles. | Optional. Using the default values is suggested. One of I, II or III. |
| name | The name used to identify your pool. | Example: my-pool-large |
| cpu | The amount of CPU utilised by the pod. | A numeric value; defaults are specified in Scale Profiles. Example: 2 |
| memory | The amount of memory the pods will have. | A string value; defaults are specified in Scale Profiles. Example: 10Gi |
| replicas | The number of pods to run in the pool. | A numeric value; defaults are specified in Scale Profiles. Example: 3 |
| storage | The amount of ephemeral storage the pod will have access to. | A string value; defaults are specified in Scale Profiles. Example: 5Gi |

📘 Note

It is suggested to have a pool for each of the types (Small, Medium, Large) to ensure optimisations are completed as expected.

The table settings (Tables) are then updated as follows:

PATCH {{base_url}}/orgs/{{org_uuid}}/projects/{{project_uuid}}/tables/{{table_uuid}}/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
    "name": "my-table",
    "settings": {
        "merge": {
            "enabled": true,
            "pools": {
                "large": "my-pool-name-III",
                "medium": "my-pool-name-II",
                "small": "my-pool-name-I"
            }
        }
    }
}
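A minimal Python sketch of the same update is shown below. It builds the PATCH request without sending it. The function name is hypothetical, and base_url, access_token and the three UUIDs are placeholders. The check on pool keys is an added safeguard rather than something the API is known to require.

```python
import json
import urllib.request
from typing import Dict

POOL_SIZES = {"small", "medium", "large"}


def build_table_merge_patch(base_url: str, access_token: str, org_uuid: str,
                            project_uuid: str, table_uuid: str, table_name: str,
                            pools: Dict[str, str]) -> urllib.request.Request:
    """Build (but do not send) a PATCH request assigning merge pools to a table."""
    unknown = set(pools) - POOL_SIZES
    if unknown:
        raise ValueError(f"unknown pool sizes: {sorted(unknown)}")
    body = {
        "name": table_name,
        "settings": {"merge": {"enabled": True, "pools": pools}},
    }
    url = (f"{base_url.rstrip('/')}/orgs/{org_uuid}"
           f"/projects/{project_uuid}/tables/{table_uuid}/")
    return urllib.request.Request(
        url=url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

As with pool creation, the request could then be sent with urllib.request.urlopen once real values are substituted for the placeholders.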

The pools are assigned to the small, medium or large workloads, which correspond to the workload criteria specified in the section above.