Compact & Optimize Data (Merge)
Hydrolix includes Merge, an automated compaction and optimization service, as part of its data lifecycle. The Merge service combines small partitions into larger, more efficient ones, which improves query performance and reduces the storage footprint.
Below is a diagram of data partitions in storage. During the merge process, three smaller partitions on the left are combined to create one larger partition on the right with a longer time interval.
For more information about how the Merge service fits into the rest of Hydrolix, see the Merge page in our platform documentation.
Enable Merge on Tables
All tables have merge enabled by default. You can enable and disable merge with the PATCH table API endpoint. Disabling merge will immediately stop new merge jobs from running, but any merge jobs already in the queue will still run.
Disable merge only under special circumstances
Disabling merge is not recommended, and may result in performance degradation.
For example, this API request will enable merge for a given table:
PATCH {{base_url}}orgs/{{org_id}}/projects/{{project_id}}/tables/{{table_id}}
Authorization: Bearer {{access_token}}
Content-Type: application/json
Accept: application/json

{
  "settings": {
    "merge": {
      "enabled": true
    }
  }
}
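The same request can also be built from a script. The following is a minimal Python sketch using only the standard library. The helper name `build_merge_toggle_request` is hypothetical, the URL and credentials are placeholders, and the sketch assumes `base_url` ends with a slash, as the request above suggests. It builds the request without sending it.

```python
import json
import urllib.request


def build_merge_toggle_request(base_url, org_id, project_id, table_id,
                               access_token, enabled):
    """Build (but do not send) the PATCH request that toggles merge on a table."""
    # Mirrors the URL shape of the request above; base_url is assumed to end in "/".
    url = f"{base_url}orgs/{org_id}/projects/{project_id}/tables/{table_id}"
    body = json.dumps({"settings": {"merge": {"enabled": enabled}}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values only; substitute your own cluster URL, IDs, and token.
    req = build_merge_toggle_request(
        "https://example.invalid/", "ORG_ID", "PROJECT_ID", "TABLE_ID",
        "ACCESS_TOKEN", enabled=True,
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Passing `enabled=False` produces the body that disables merge instead.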
You can also disable merge for a given table through the web UI. Navigate to "Data," select the table you want, then find "Merge settings" under "Advanced options" and click the three dots on the right side of that row. Then select the "Disable Merge" checkbox from the menu.
Merge Pools
Hydrolix clusters create merge components in three pools: small, medium, and large. Each pool handles partitions that match a different set of criteria. This helps your cluster reach optimal partition sizes and spreads merge workloads across old and new data.
The following table shows the criteria used to assign partitions to merge pools, as of version 4.17.0.
If the max Primary Timestamp is: | ...and the partition width is within: | Resulting Merge Pool | Target Partition Size |
---|---|---|---|
Less than 10 minutes old | 1 hour | small (merge-i) | 1 GB |
From 10 to 70 minutes old | 1 hour | medium (merge-ii) | 2 GB |
Greater than 70 minutes old | 1 hour | large (merge-iii) | 4 GB |
For example, reading across the table above from left to right: if a partition's last timestamp was 15 minutes ago, and it was 513 MB in size and 37 minutes in width, it would be sent to the medium pool.
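The table lookup can be expressed as a small function. This is an illustrative Python sketch of the criteria in the table, not the Merge service's actual code. The function name is hypothetical, and two details are assumptions: the boundaries at exactly 10 and 70 minutes are treated as belonging to the medium pool, and "within 1 hour" is treated as at most 60 minutes.

```python
from typing import Optional

# Values copied from the table above (as of version 4.17.0).
MAX_WIDTH_MINUTES = 60
TARGET_SIZE_GB = {"small": 1, "medium": 2, "large": 4}


def assign_merge_pool(age_minutes: float, width_minutes: float) -> Optional[str]:
    """Return "small", "medium", or "large" for a partition.

    age_minutes is how long ago the partition's max primary timestamp was.
    Returns None when the width exceeds 1 hour, a case the table does not cover.
    """
    if width_minutes > MAX_WIDTH_MINUTES:
        return None
    if age_minutes < 10:
        return "small"
    if age_minutes <= 70:
        return "medium"
    return "large"


if __name__ == "__main__":
    # The worked example from the text: 15 minutes old, 37 minutes wide.
    pool = assign_merge_pool(age_minutes=15, width_minutes=37)
    print(pool, TARGET_SIZE_GB[pool])  # prints: medium 2
```

In this sketch the partition's current size (513 MB in the example) plays no part in the lookup, since the table's size column gives the target size for the pool rather than a selection criterion.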
Primary timestamp
For more information on primary timestamps, see Timestamp Data Types.
Custom Merge Pools
All tables have merge enabled by default, but sometimes you might want to create additional merge pools targeted at specific tables to separate merge workloads and avoid “noisy neighbor” effects. For example, you might create a special merge pool to handle merge within a Summary Table, distancing that workload from the main merge process.
Create custom merge pools with the pools API endpoint, then apply those pools to tables with the tables API endpoint.
Creating Pools
The following Config API request creates a custom pool using the pools API endpoint:
POST {{base_url}}pools/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
  "settings": {
    "is_default": false,
    "k8s_deployment": {
      "service": "merge-peer",
      "scale_profile": "II"
    }
  },
  "name": "my-pool-name-II"
}
You can also do this in the UI by opening the "Add new" menu in the upper right, then selecting "Resource pool."
Use the following settings to configure your pool:
Object | Description | Value/Example |
---|---|---|
service | The service workload the pool will utilize. For merge, this is merge-peer. | merge-peer |
scale_profile | The merge pool size, corresponding to small, medium, or large. | I, II, or III |
name | The name used to identify your pool. | Example: my-pool-name-II |
cpu | The amount of CPU provided to pods. | A numeric value. Defaults are specified in Scale Profiles. Example: 2 |
memory | The amount of memory provided to pods. | A string value. Defaults are specified in Scale Profiles. Default units are Gi. Example: 10Gi |
replicas | The number of pods to run in the pool. | A numeric value or hyphenated range. Defaults are specified in Scale Profiles. Examples: 3 and 1-5 |
storage | The amount of ephemeral storage provided to pods. | A string value. Defaults are specified in Scale Profiles. Default units are Gi. Example: 5Gi |
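To show how these settings could fit together, here is a minimal Python sketch that builds the request body for the pool creation call. The helper name `build_pool_payload` is hypothetical. The placement of the optional `cpu`, `memory`, `replicas`, and `storage` values inside `k8s_deployment` is an assumption, since the example request above shows only `service` and `scale_profile`; check the pools API reference before relying on it.

```python
import json

VALID_SCALE_PROFILES = ("I", "II", "III")
OPTIONAL_KEYS = ("cpu", "memory", "replicas", "storage")


def build_pool_payload(name, scale_profile, **overrides):
    """Return a JSON-serializable body for creating a custom merge pool."""
    if scale_profile not in VALID_SCALE_PROFILES:
        raise ValueError(f"scale_profile must be one of {VALID_SCALE_PROFILES}")
    unknown = sorted(set(overrides) - set(OPTIONAL_KEYS))
    if unknown:
        raise ValueError(f"unsupported settings: {unknown}")
    deployment = {"service": "merge-peer", "scale_profile": scale_profile}
    # ASSUMPTION: optional overrides sit next to "service" in k8s_deployment.
    deployment.update(overrides)
    return {
        "settings": {"is_default": False, "k8s_deployment": deployment},
        "name": name,
    }


if __name__ == "__main__":
    # With no overrides, this reproduces the body of the example request.
    print(json.dumps(build_pool_payload("my-pool-name-II", "II"), indent=2))
    # Overrides use the value formats listed in the settings table.
    print(json.dumps(
        build_pool_payload("my-pool-name-I", "I", replicas="1-5", memory="10Gi"),
        indent=2,
    ))
```

Omitting the optional values leaves them at the Scale Profile defaults described in the table.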
Assigning Pools to Tables
The following API request assigns a set of custom pools to a table with the tables API endpoint:
PATCH {{base_url}}orgs/{{org_uuid}}/projects/{{project_uuid}}/tables/{{table_uuid}}/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
  "name": "my-table",
  "settings": {
    "merge": {
      "enabled": true,
      "pools": {
        "large": "my-pool-name-III",
        "medium": "my-pool-name-II",
        "small": "my-pool-name-I"
      }
    }
  }
}
You can also configure this in the UI. Navigate to "Data", select the table to which you want to assign new pools, then find "Merge settings" under "Advanced options." You'll see this menu:
Use all three pools
For optimal merge performance, provide a large, medium, and small pool.
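As a sketch of how a script might assemble the pool assignment body and check that all three sizes are covered, consider the following Python example. Both function names are hypothetical, and the check only reports missing sizes rather than rejecting them, since the guidance above is a recommendation.

```python
import json

POOL_SIZES = ("small", "medium", "large")


def build_pool_assignment(table_name, pools):
    """Return the PATCH body that assigns custom merge pools to a table."""
    unknown = sorted(set(pools) - set(POOL_SIZES))
    if unknown:
        raise ValueError(f"unknown pool sizes: {unknown}")
    return {
        "name": table_name,
        "settings": {"merge": {"enabled": True, "pools": dict(pools)}},
    }


def missing_pool_sizes(pools):
    """List the pool sizes that have no custom pool assigned."""
    return [size for size in POOL_SIZES if size not in pools]


if __name__ == "__main__":
    pools = {
        "small": "my-pool-name-I",
        "medium": "my-pool-name-II",
        "large": "my-pool-name-III",
    }
    missing = missing_pool_sizes(pools)
    if missing:
        print("Warning: no custom pool for", ", ".join(missing))
    print(json.dumps(build_pool_assignment("my-table", pools), indent=2))
```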