Compact & Optimize Data (Merge)
As part of its data lifecycle tools, Hydrolix provides an automated compaction and optimization service called Merge. Merge optimizes ingested data for good query performance and an efficient storage footprint. Put simply, the Merge service combines small partitions into larger, more efficient ones.
Enable Merge on Tables
By default, all tables have merge enabled. You can disable merge with the table API endpoint. Disabling merge will immediately stop new merge jobs from running, but any already queued merge jobs will still run.
{
  "name": "<table name>",
  "settings": {
    "merge": {
      "enabled": false
    }
  }
}
Disabling merge is not recommended, and may result in performance degradation.
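As a minimal sketch, the request body above could be sent from Python using only the standard library. The URL path is an assumption: it reuses the tables endpoint path shown later on this page for assigning pools. The function name, the example base URL, and the UUID and token values are hypothetical placeholders.

```python
import json
import urllib.request


def build_disable_merge_request(base_url, org_uuid, project_uuid,
                                table_uuid, table_name, access_token):
    """Build (but do not send) a PATCH request that disables merge on one table."""
    # Assumption: same tables endpoint path as the pool-assignment request.
    url = (f"{base_url.rstrip('/')}/orgs/{org_uuid}"
           f"/projects/{project_uuid}/tables/{table_uuid}/")
    payload = {
        "name": table_name,
        "settings": {"merge": {"enabled": False}},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace them with your cluster's details.
request = build_disable_merge_request(
    "https://example.com/base", "org-uuid", "project-uuid",
    "table-uuid", "my-table", "ACCESS_TOKEN",
)
print(request.get_method(), request.full_url)
# Passing `request` to urllib.request.urlopen() would send it.
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching a live cluster.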
Settings
Name | Default | Description |
---|---|---|
enabled | true | Whether or not the merge service queues new merge jobs. |
Merge Resources
Hydrolix clusters automatically create merge components in 3 pools: small, medium and large. Each targets different partition types.
The following two tables show the criteria used to assign partitions to merge pools.
Partition Created (rows) / Primary Timestamp (columns) | < 1 Hr Ago | 1-24 Hrs Ago | > 1 Day Ago |
---|---|---|---|
< 1 Hr Ago | Small (I) | Medium (II) | Medium (II) |
> 1 Hr Ago and < 90 days ago | Medium (II) | Large (III) |
For example, consider a partition created two days ago with a primary timestamp older than one day. Hydrolix uses the Large (III) pool to merge this partition.
Primary Timestamp
For more information on primary timestamps, see Timestamp Data Types.
In addition to time, Hydrolix also uses size and maximum aggregate time window to sort merge jobs into pools:
Peer Type | Max Aggregate Size | Max Aggregate Time Window |
---|---|---|
Small (merge) | 1 GB | 5 Minutes |
Medium (merge-II) | 2 GB | 1 Hour |
Large (merge-III) | 4 GB | 1 Day |
Hydrolix uses three sizes to efficiently tackle different partition aggregation levels. This ensures that your cluster achieves optimal sizing and spreads merge workloads across old and new data.
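The size and time-window limits in the table above can be written down as data. The sketch below picks the smallest pool whose limits both cover a candidate merge. That selection rule is an assumption made for illustration, not Hydrolix's documented scheduling logic, and it ignores the timestamp criteria from the first table. Treating "GB" as decimal gigabytes is also an assumption, and the names `PoolLimit` and `smallest_fitting_pool` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

GB = 10**9  # Assumption: decimal gigabytes; the table only says "GB".


@dataclass(frozen=True)
class PoolLimit:
    name: str
    max_bytes: int
    max_window: timedelta


# Limits copied from the table above, ordered from smallest to largest pool.
POOL_LIMITS = (
    PoolLimit("merge", 1 * GB, timedelta(minutes=5)),
    PoolLimit("merge-II", 2 * GB, timedelta(hours=1)),
    PoolLimit("merge-III", 4 * GB, timedelta(days=1)),
)


def smallest_fitting_pool(total_bytes: int, window: timedelta) -> Optional[str]:
    """Return the first pool whose size and window limits both fit, else None."""
    for limit in POOL_LIMITS:
        if total_bytes <= limit.max_bytes and window <= limit.max_window:
            return limit.name
    return None


# A 500 MB aggregate spanning 30 minutes exceeds the small pool's 5-minute window.
print(smallest_fitting_pool(500_000_000, timedelta(minutes=30)))
```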
Custom Merge Pools
It can sometimes be helpful to create additional merge pools targeted at specific tables. For example, you might create a special merge pool to handle merge within a Summary Table.
You can create custom merge pools with the pools API endpoint, then assign them to tables with the tables API endpoint.
The following HTTP request creates a custom pool with the pools API endpoint:
POST {{base_url}}pools/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
  "settings": {
    "is_default": false,
    "k8s_deployment": {
      "service": "merge-peer",
      "scale_profile": "II"
    }
  },
  "name": "my-pool-name-II"
}
You can use the following settings to configure your pool:
Object | Description | Value/Example |
---|---|---|
service | The service workload the pool uses. For merge, this is merge-peer. | merge-peer |
scale_profile | The scale profile used for this pool. See Scale Profiles. | Optional; the default values are recommended. One of I, II, or III |
name | The name used to identify your pool. | Example: my-pool-large |
cpu | The amount of CPU provided to pods. | A numeric value, defaults are specified in Scale Profiles. Example: 2 |
memory | The amount of memory provided to pods. | A string value, defaults are specified in Scale Profiles. Example: 10Gi |
replicas | The number of pods to run in the pool. | A numeric value, defaults are specified in Scale Profiles. Example: 3 |
storage | The amount of ephemeral storage provided to pods. | A string value, defaults are specified in Scale Profiles. Example: 5Gi |
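The pool-creation request shown earlier can be sketched in Python with the standard library. Only the fields that appear in that request are included. Where `cpu`, `memory`, `replicas`, and `storage` sit in the payload is not shown on this page, so the sketch leaves them out. The function name and the example base URL and token are hypothetical placeholders.

```python
import json
import urllib.request

VALID_SCALE_PROFILES = {"I", "II", "III"}


def build_create_pool_request(base_url, access_token, name, scale_profile="II"):
    """Build (but do not send) a POST request that creates a merge pool."""
    if scale_profile not in VALID_SCALE_PROFILES:
        raise ValueError(f"scale_profile must be one of {sorted(VALID_SCALE_PROFILES)}")
    payload = {
        "settings": {
            "is_default": False,
            "k8s_deployment": {
                "service": "merge-peer",  # Service workload for merge pools.
                "scale_profile": scale_profile,
            },
        },
        "name": name,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/pools/",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; replace them with your cluster's details.
request = build_create_pool_request(
    "https://example.com/base", "ACCESS_TOKEN", "my-pool-name-II", "II"
)
print(request.get_method(), request.full_url)
# Passing `request` to urllib.request.urlopen() would send it.
```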
The following HTTP request assigns a set of custom pools to a table with the tables API endpoint:
PATCH {{base_url}}/orgs/{{org_uuid}}/projects/{{project_uuid}}/tables/{{table_uuid}}/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
  "name": "my-table",
  "settings": {
    "merge": {
      "enabled": true,
      "pools": {
        "large": "my-pool-name-III",
        "medium": "my-pool-name-II",
        "small": "my-pool-name-I"
      }
    }
  }
}
For optimal merge performance, provide a large, medium, and small pool.
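As a rough sketch, the body of the table request above can be generated from a mapping of pool sizes to pool names, with a check for sizes that were left out. The helper names `merge_pools_payload` and `missing_pool_sizes` are hypothetical, and the check simply mirrors the recommendation to provide all three sizes. It is not an API requirement stated on this page.

```python
import json

POOL_SIZES = ("small", "medium", "large")


def merge_pools_payload(table_name: str, pools: dict) -> dict:
    """Build the table settings body that assigns custom merge pools."""
    unknown = sorted(set(pools) - set(POOL_SIZES))
    if unknown:
        raise ValueError(f"unknown pool sizes: {unknown}")
    return {
        "name": table_name,
        "settings": {"merge": {"enabled": True, "pools": dict(pools)}},
    }


def missing_pool_sizes(pools: dict) -> list:
    """List the recommended pool sizes that the mapping does not cover."""
    return [size for size in POOL_SIZES if size not in pools]


pools = {
    "large": "my-pool-name-III",
    "medium": "my-pool-name-II",
    "small": "my-pool-name-I",
}
print(json.dumps(merge_pools_payload("my-table", pools), indent=2))
print("missing sizes:", missing_pool_sizes(pools))
```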