Compact & Optimize Data (Merge)
As part of its data lifecycle tooling, Hydrolix provides an automated compaction and optimization service called Merge. It is a key service for ensuring that data ingested into the platform is optimized for good query performance and an efficient storage footprint. Simply put, the Merge service takes small partitions and combines them into larger, more efficient ones.
Enabling Merge on Tables
By default all tables have merge enabled, and it is recommended not to turn it off. Should you need to turn it off, this can be done using the table API end-point - Tables. Setting enabled to false will stop any new merge jobs from being scheduled. Note that any merge jobs that are already queued will still be completed.
{
  "name": "my-table",
  "settings": {
    "merge": {
      "enabled": false
    }
  }
}
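If you prefer to script the change, the same setting can be applied over HTTP. The sketch below is a minimal example, assuming the table PATCH end-point shown later on this page; the base URL, access token, and UUIDs are placeholders for your own deployment.

```python
import requests

# Placeholders - substitute values for your own deployment.
BASE_URL = "https://my-hydrolix.example.com/config/v1"  # placeholder; use the same value as {{base_url}} in the examples below
ACCESS_TOKEN = "my-access-token"
ORG_UUID = "my-org-uuid"
PROJECT_UUID = "my-project-uuid"
TABLE_UUID = "my-table-uuid"

# Disable merge on a single table via the Tables API.
resp = requests.patch(
    f"{BASE_URL}/orgs/{ORG_UUID}/projects/{PROJECT_UUID}/tables/{TABLE_UUID}/",
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"name": "my-table", "settings": {"merge": {"enabled": False}}},
)
resp.raise_for_status()
print(resp.json()["settings"]["merge"])
```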
Settings
Name | Description | Default |
---|---|---|
enabled | Enables or disables the merge service for the table | true |
Merge Resources
Within a deployment, default merge components are created automatically. They are created in three pools - small, medium and large - with each targeting different types and ages of partition for optimization.
The following two tables (combined) show the criteria used to assign partitions to a merge pool.
Primary Timestamp / Partition Created (below) | < 1 Hr Ago | 1-24 Hrs Ago | > 1 Day Ago |
---|---|---|---|
< 1 Hr Ago | Small (I) | Medium (II) | Medium (II) |
> 1 Hr Ago and < 90 days ago | Medium (II) | Large (III) | Large (III) |
For example, if a partition was created two days ago and its primary timestamp is more than one day before now(), then the Large (III) pool will be used to merge it.
Primary Timestamp
More information on primary timestamps can be found here - Timestamp Data Types
In addition to time, there are also qualifying criteria for the maximum aggregate size and the maximum aggregated time window contained within a partition.
Peer Type | Max Aggregate Size | Max Aggregate Time |
---|---|---|
Small (merge) | Max 1 GB | 5 Minutes |
Medium (merge-II) | Max 2 GB | 1 Hour |
Large (merge-III) | Max 4 GB | 1 Day |
The reasoning behind having three stages or sizes is to efficiently handle partitions at different levels of aggregation, ensuring optimized sizing is reached while spreading merge workloads suitably across old and new data. The default pool will attempt to merge all tables that have merge set to true in their table settings.
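As a concrete illustration, the criteria in the two tables above can be restated as a simple lookup. The sketch below is not Hydrolix code; it simply re-expresses the documented matrix (Small when both the partition's creation and its primary timestamp are under an hour old, Large when both are older than an hour, Medium otherwise) along with the per-pool aggregation caps.

```python
from datetime import timedelta

# Per-pool aggregation caps, copied from the table above.
POOL_LIMITS = {
    "small":  {"max_aggregate_size_gb": 1, "max_aggregate_time": timedelta(minutes=5)},
    "medium": {"max_aggregate_size_gb": 2, "max_aggregate_time": timedelta(hours=1)},
    "large":  {"max_aggregate_size_gb": 4, "max_aggregate_time": timedelta(days=1)},
}

def merge_pool(primary_ts_age: timedelta, partition_created_age: timedelta) -> str:
    """Illustrative restatement of the assignment matrix above (ignores the 90-day bound)."""
    fresh_data = primary_ts_age < timedelta(hours=1)
    fresh_partition = partition_created_age < timedelta(hours=1)
    if fresh_data and fresh_partition:
        return "small"   # Small (I)
    if fresh_data or fresh_partition:
        return "medium"  # Medium (II)
    return "large"       # Large (III)

# The worked example above: created two days ago, primary timestamp more than one day behind now().
assert merge_pool(timedelta(days=1, hours=1), timedelta(days=2)) == "large"
```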
Custom Merge Pools
On occasion it may be beneficial to create further merge pools targeted at specific tables, for example if a Summary Table is to be used. This can be achieved by creating the necessary merge pools using the pools API end-point - /v1/pools/ - and then updating the table API - Tables.
For example, to create a pool:
POST {{base_url}}pools/
Authorization: Bearer {{access_token}}
Content-Type: application/json
{
  "settings": {
    "is_default": false,
    "k8s_deployment": {
      "service": "merge-peer",
      "scale_profile": "II"
    }
  },
  "name": "my-pool-name-II"
}
When configuring your pool, the following settings should be used:
Object | Description | Value/Example |
---|---|---|
service | The service workload the pool will utilize. For merge this is merge-peer. | merge-peer |
scale_profile | The scale profile that should be used for this pool. Scale Profiles are shown here. | Optional; the default values are suggested. Either I, II or III |
name | The name used to identify your pool. | Example: my-pool-large |
cpu | The amount of CPU utilized by the pod. | A numeric value; defaults are specified in Scale Profiles. Example: 2 |
memory | The amount of memory the pods will have. | A string value; defaults are specified in Scale Profiles. Example: 10Gi |
replicas | The number of pods to run in the pool. | A numeric value; defaults are specified in Scale Profiles. Example: 3 |
storage | The amount of ephemeral storage the pod will have access to. | A string value; defaults are specified in Scale Profiles. Example: 5Gi |
Note
It is suggested to have a pool for each of the types (Small, Medium, Large) to ensure optimizations are completed as expected.
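Following that suggestion, the three pools can be created in one pass. The sketch below is a minimal example using the pools end-point and payload shown above; the base URL, access token, and pool names are placeholders.

```python
import requests

BASE_URL = "https://my-hydrolix.example.com/config/v1"  # placeholder; use the same value as {{base_url}} above
ACCESS_TOKEN = "my-access-token"  # placeholder

headers = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "Content-Type": "application/json",
}

# Create one merge pool per scale profile, mirroring the POST example above.
for profile in ("I", "II", "III"):
    payload = {
        "name": f"my-pool-name-{profile}",
        "settings": {
            "is_default": False,
            "k8s_deployment": {
                "service": "merge-peer",
                "scale_profile": profile,
            },
        },
    }
    resp = requests.post(f"{BASE_URL}/pools/", headers=headers, json=payload)
    resp.raise_for_status()
    print(payload["name"], resp.status_code)
```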
The table settings (Tables) are updated as follows:
PATCH {{base_url}}/orgs/{{org_uuid}}/projects/{{project_uuid}}/tables/{{table_uuid}}/
Authorization: Bearer {{access_token}}
Content-Type: application/json
{
  "name": "my-table",
  "settings": {
    "merge": {
      "enabled": true,
      "pools": {
        "large": "my-pool-name-III",
        "medium": "my-pool-name-II",
        "small": "my-pool-name-I"
      }
    }
  }
}
The pools are assigned to either the small, medium or large workloads, which correspond to the workload criteria specified in the section above.
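To confirm the assignment took effect, the table can be read back. The sketch below assumes the same table end-point also supports GET (not shown on this page, but a standard pattern for this API) and uses placeholder values.

```python
import requests

BASE_URL = "https://my-hydrolix.example.com/config/v1"  # placeholder; use the same value as {{base_url}} above
ACCESS_TOKEN = "my-access-token"
ORG_UUID, PROJECT_UUID, TABLE_UUID = "my-org-uuid", "my-project-uuid", "my-table-uuid"

resp = requests.get(
    f"{BASE_URL}/orgs/{ORG_UUID}/projects/{PROJECT_UUID}/tables/{TABLE_UUID}/",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
resp.raise_for_status()

merge_settings = resp.json()["settings"]["merge"]
print(merge_settings.get("enabled"), merge_settings.get("pools"))
```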