
Overview

Hydrolix includes an automated compaction and optimization Merge service as part of its data lifecycle. This service is enabled by default for all tables.

The Merge service combines small partitions into larger ones, improving compression efficiency and decreasing partition count. This results in better-performing queries and a smaller storage footprint for the same data.

This diagram shows data partitions in storage. During the merge process, three smaller partitions are combined into one larger partition with a longer time interval.

Multiple data partitions in one time window are merged into a single partition

For more information about how the Merge service fits into the Hydrolix architecture, see the Merge platform overview.

Merge controller⚓︎

merge-controller was introduced in Hydrolix version 5.3.0 and became the default merge service in v5.10, replacing merge-head. It adds operational, observability, and performance improvements over merge-head.

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'background': 'transparent',
    'fontSize': '18px',
    'edgeLabelBackground': 'transparent',
    'primaryColor': 'transparent',
    'primaryBorderColor': '#003D66',
    'primaryTextColor': '#424D57',
    'lineColor': '#003D66'
  },
  'flowchart': { 'padding': 20, 'nodeSpacing': 60, 'rankSpacing': 60, 'curve': 'basis' }
}}%%
flowchart LR
    CAT[(Catalog)]
    MC[merge-controller]
    P[merge-peer pools]
    S[(Storage)]
    CAT -->|partition stream| MC
    MC -->|candidates over gRPC| P
    P -->|R/W| S
    P -->|completion report| MC
    style CAT fill:#F4F6F8,stroke:#003D66,stroke-width:2px,color:#003D66
    style MC fill:none,stroke:#003D66,stroke-width:2px,color:#424D57
    style P fill:none,stroke:#00A99D,stroke-width:2px,color:#035F60
    style S fill:#F4F6F8,stroke:#003D66,stroke-width:2px,color:#003D66

merge-head is deprecated

merge-head is deprecated as of v5.10 and will be removed in v5.12. If your configuration explicitly uses merge-head, migrate to merge-controller before upgrading to v5.12.

Apply these changes to your hydrolixcluster.yaml:

spec:
  merge_controller_enabled: true
  scale:
    merge-controller:
      replicas: 1
    merge-head:
      replicas: 0

When you apply these changes in a single update:

  • The merge-head pod shuts down
  • A merge-controller pod starts up
  • All merge-peer pods restart in all pools

Migration has minimal impact on a live cluster. Merge operations are temporarily delayed while the merge peers restart.

Singleton enforcement and resilience⚓︎

merge-controller runs as a singleton. On startup, it checks these conditions and refuses to bootstrap if either is true:

  • One or more merge-head pods are currently running
  • One or more other merge-controller pods are currently running

Running more than one merge service at once won't damage the cluster or your data, but it's inefficient and may cause pods to enter a CrashLoopBackOff state, breaking merge functionality. merge-controller ensures only one merge service runs at a time.

merge-controller is resilient to normal process terminations, such as manual restarts and scale changes, and to abnormal terminations, such as pod evictions and OOM kills. merge-peer pods continue any merge operations in progress when merge-controller terminates, but don't receive new work until merge-controller returns. merge-peer pods repeatedly try to reconnect to merge-controller. After reconnecting, merge peers report their current status so merge-controller can reconstruct its state.
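The reconnect behavior described above can be sketched as a simple retry loop. This is an illustrative sketch, not Hydrolix code; the connect callable, delays, and jitter strategy are assumptions.

```python
import random
import time

def reconnect_with_backoff(connect, base_delay=0.5, max_delay=30.0):
    """Retry `connect` until it succeeds, sleeping with exponential
    backoff plus jitter between attempts. Returns the connection."""
    delay = base_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            # Sleep, then grow the delay, capped at max_delay, so a
            # long controller outage doesn't produce a reconnect storm.
            time.sleep(delay + random.uniform(0, delay / 2))
            delay = min(delay * 2, max_delay)
```

After a successful reconnect, the real merge peer additionally reports its current status so merge-controller can reconstruct its in-memory state.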

Memory and performance⚓︎

merge-controller uses more memory than merge-head because it maintains an up-to-date, in-memory view of the entire merge subsystem. In exchange, the cluster's overall CPU and storage access demands decrease.

Merging partitions is a bin-packing exercise: combine sets of smaller partitions of varying sizes into the smallest number of larger partitions without exceeding fixed size limits. The bin-packing algorithm in merge-controller, which works in memory using a first-fit or best-fit strategy, is significantly more efficient than merge-head's. Removing the intermediate queue and using direct gRPC connections between merge peers and merge-controller also improve performance.
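To make the bin-packing idea concrete, here is a minimal first-fit-decreasing sketch in Python. It is illustrative only: the real merge-controller algorithm, its data structures, and its size limits differ.

```python
def first_fit(sizes, capacity):
    """Pack partition sizes into the fewest bins (merge candidates)
    of fixed capacity: place each partition into the first candidate
    with enough room left, opening a new one only when none fits."""
    bins = []  # each bin is [remaining_capacity, [sizes...]]
    for size in sorted(sizes, reverse=True):  # first-fit decreasing
        for b in bins:
            if b[0] >= size:
                b[0] -= size
                b[1].append(size)
                break
        else:
            bins.append([capacity - size, [size]])
    return [b[1] for b in bins]

# Pack partition sizes (MB) into candidates of at most 1024 MB each.
candidates = first_fit([700, 400, 300, 200, 150, 100], 1024)
# → [[700, 300], [400, 200, 150, 100]]
```

Sorting sizes in descending order before packing (first-fit decreasing) typically produces fewer, fuller bins than packing in arrival order.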

In general, merge-controller uses more memory when the cluster has:

  • A large number of active merges
  • Many tables with high ingest volume. Higher ingest creates more partitions, which increases merge counts.

The impact of these variables is typically small but may be detectable.

Performance tunables⚓︎

The merge controller determines which existing partitions to combine into new, better-organized partitions. These lists of partitions are called candidates. Use these tunables to control resources for building candidates.

merge_max_candidates limits the number of candidates awaiting dispatch to merge peers. If set too low relative to the number of merge peers in a pool, throughput may drop. Higher values can improve performance but increase memory usage. Hydrolix recommends setting this to 500.

merge_max_partitions_per_candidate limits the number of partitions merged together in a single operation. Higher values allow more partitions per candidate, but may impact turbine's merge capacity.

Increasing these tunables can improve merge throughput at the cost of higher memory usage.

Example of increased tunable limits for merge-controller
spec:
  merge_max_candidates: 500
  merge_max_partitions_per_candidate: 1024

Since higher values increase memory usage, monitor pod restarts for OOM kills and watch for usage approaching pod limits.
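One way to watch for OOM kills in Prometheus is a query along these lines. This assumes kube-state-metrics is deployed in the cluster; the metric name comes from kube-state-metrics, not Hydrolix.

```
max by (pod, container) (
  kube_pod_container_status_last_terminated_reason{reason="OOMKilled", pod=~"merge-peer.*"}
)
```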

Metric candidates and autoscaling⚓︎

The merge controller exposes a Prometheus gauge named candidates that counts the number of partition groups waiting to be dispatched to merge peers. The metric is labeled per project, table, pool, and target. The target label maps to the merge peer era, such as I, II, or III.

Construction limit⚓︎

The merge controller limits memory usage by pausing construction of new candidates once its buffer reaches merge_max_candidates. For this reason, the candidates metric plateaus at the limit of merge_max_candidates. If the HDX Autoscaler scales merge peers based on this metric, it can't see demand beyond the plateau and won't provision additional pods. To set an alert that checks the candidates metric against merge_max_candidates, review Configure Alerts.
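An alerting rule for the plateau condition might look like the following sketch. The rule name, duration, and severity label are assumptions, and the 500 threshold must match your configured merge_max_candidates.

```yaml
groups:
  - name: merge-controller
    rules:
      - alert: MergeCandidatesSaturated
        # Fires when the candidate buffer sits at the configured
        # merge_max_candidates limit (500 here) for 15 minutes.
        expr: sum(candidates{app="merge-controller"}) >= 500
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: merge-controller candidate buffer is saturated
```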

Hydrolix recommends setting merge_max_candidates to 500.

Spec Snippet for merge_max_candidates
spec:
  merge_max_candidates: 500

Higher candidate count increases merge controller memory usage.

Monitor for OOM kills after increasing this value.

If the candidates metric is persistently at merge_max_candidates, check the current replica count for the affected pool and apply the appropriate fix:

  • Replicas are below the configured maximum. Reduce target_value in the pool's hdxscalers configuration. A lower target value causes the autoscaler to request more replicas for the same metric value. For example, lowering target_value from 50 to 25 doubles the targeted replica count.
  • Replicas are already at the configured maximum. Increase max in the pool's hdxscalers configuration to give the autoscaler room to add pods.

Once the metric drops below the plateau, return target_value and replica limits to their normal range.
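The effect of target_value on replica count can be approximated with the usual Kubernetes-style sizing formula. This is a simplification, not the exact HDX Autoscaler implementation.

```python
import math

def desired_replicas(metric_value, target_value, min_r, max_r):
    """Approximate replica count for an average-value target:
    enough pods that metric_value / replicas <= target_value,
    clamped to the [min_r, max_r] range."""
    wanted = math.ceil(metric_value / target_value)
    return max(min_r, min(max_r, wanted))

# 400 pending candidates with target_value=50 wants 8 replicas;
# lowering target_value to 25 doubles the demand, here clamped at max.
assert desired_replicas(400, 50, 1, 10) == 8
assert desired_replicas(400, 25, 1, 10) == 10  # clamped at max=10
```

The clamping is why raising max can be necessary even after lowering target_value: the autoscaler never requests more than the configured maximum.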

Tune the autoscaler aggregation⚓︎

The HDX Autoscaler can scale merge-peer replicas based on the candidates metric. Configure this in the hdxscalers section of a service or pool. Set the op parameter to control how multiple samples across tables and targets are aggregated. Supported values are sum, avg, min, and max. If op isn't set, the autoscaler uses only the first sample it encounters, which may undercount demand. For clusters with multiple high-volume tables, sum or max is best.
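The effect of op can be illustrated with plain Python over a set of per-table samples. The sample values here are made up for illustration.

```python
# Hypothetical per-table samples of the candidates metric for one pool.
samples = {"weblogs": 120, "metrics": 80, "traces": 10}

# With no op set, only the first sample encountered is used,
# which may badly undercount total demand across tables.
first_only = next(iter(samples.values()))

# op controls how the samples are combined into one scaling signal.
aggregated = {
    "sum": sum(samples.values()),                 # total demand
    "avg": sum(samples.values()) / len(samples),  # mean per table
    "min": min(samples.values()),                 # quietest table
    "max": max(samples.values()),                 # busiest table
}
```

With these numbers, sum sees 210 pending candidates while the first-sample fallback sees only 120, so an unset op can leave the pool under-scaled.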

This example scales the merge-peer-iii pool between 1 and 10 replicas, targeting 50 candidates across all pods, with sum aggregating across all tables and targets.

Example Autoscaler Config for Merge Peer
spec:
  pools:
    merge-peer-iii:
      cpu: 1
      hdxscalers:
      - app: merge-controller
        metric: candidates
        metric_labels:
          pool: merge-peer-iii
        op: sum
        per_pod: false
        port: 27182
        min: 1
        max: 10
        target_value: 50

Use this query to monitor for saturation.

sum by (target, table_name) (candidates{app="merge-controller"})

OOM recovery and self-tuning⚓︎

Merge peer pods can be OOM-killed when a merge candidate requires more memory than the container limit allows. When this happens, the partitions from the failed merge become eligible again and are included in future candidates. merge-controller also tracks actual memory usage per merge and adjusts future estimates using an exponentially weighted moving average. Over time this reduces the likelihood of OOM kills for similar workloads. After a cluster first encounters OOM events for a given table, allow several merge cycles before intervening.
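The estimate adjustment described above behaves like an exponentially weighted moving average. A minimal sketch follows; the smoothing factor is an assumption, not a documented Hydrolix value.

```python
def ewma_update(estimate, observed, alpha=0.3):
    """Blend a newly observed memory usage into the running estimate.
    Higher alpha weights recent observations more heavily."""
    return alpha * observed + (1 - alpha) * estimate

# Start from an 8 GiB estimate and fold in three observed merges.
estimate = 8.0
for observed_gib in [10.0, 11.0, 10.5]:
    estimate = ewma_update(estimate, observed_gib)
# The estimate drifts toward the observed ~10-11 GiB range.
```

This is why several merge cycles are needed after the first OOM events: each cycle moves the estimate only part of the way toward the true requirement.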

If a merge peer pool shows persistent OOM restarts that don't resolve after several cycles, the container memory limit likely needs to be increased. See Troubleshoot merge peer OOM for steps to identify which container is affected and how to apply the fix.

Scale horizontally or vertically⚓︎

When merge is falling behind, the right response depends on the symptom.

Scale horizontally by adding merge peer replicas when the candidates metric is consistently high or saturated. This means there's enough work for more pods to process in parallel.

Scale vertically by increasing memory per pod when individual merge peers are OOM-killed persistently. This means the partitions being merged are too large for the current memory limit. See Troubleshoot merge-peer OOM for steps to diagnose and resolve this.

Disable merge on tables⚓︎

All tables have merge enabled by default. Disable and re-enable merge with the PATCH table API endpoint. Disabling merge stops new merge jobs immediately, but jobs already in the queue complete.

Disable merge only under special circumstances

Disabling merge isn't recommended, and may result in performance degradation.

For example, this API request enables merge for a given table:

PATCH {{base_url}}orgs/{{org_id}}/projects/{{project_id}}/tables/{{table_id}}
Authorization: Bearer {{access_token}}
Content-Type: application/json
Accept: application/json

{
    "settings": {
        "merge": {
            "enabled": true
        }
    }
}
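The disable counterpart uses the same endpoint and headers with the flag flipped:

```
PATCH {{base_url}}orgs/{{org_id}}/projects/{{project_id}}/tables/{{table_id}}
Authorization: Bearer {{access_token}}
Content-Type: application/json
Accept: application/json

{
    "settings": {
        "merge": {
            "enabled": false
        }
    }
}
```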

To disable merge in the Hydrolix UI, navigate to Data, select the table, find Merge settings under Advanced options, select the three dots on the right of that row, and select Disable Merge.

Displaying a table's merge settings in Data > table name > Advanced Settings in Hydrolix UI

Merge pools⚓︎

Hydrolix clusters create merge components in three pools: small, medium, and large. Each pool handles partitions that meet different criteria for age, size, and time width. This ensures optimal partition sizing and spreads merge workloads across old and new data.

This table shows the criteria used to assign partitions to merge pools:

| If the max primary timestamp is... | ...and the size is within... | ...and the time width is within... | Resulting merge pool |
| --- | --- | --- | --- |
| Under 10 minutes old | 1 GB | 1 hour | small (merge-i) |
| Between 10 minutes and 1 hour old | 2 GB | 1 hour | medium (merge-ii) |
| Between 1 hour and 90 days old | 4 GB | 1 hour | large (merge-iii) |

For example, a partition with a last timestamp 15 minutes ago, a size of 513 MB, and a width of 37 minutes goes to the medium pool.

A 2.5 GB partition isn't eligible for merge until 1 hour after its last timestamp, and goes to the large pool only if other eligible partitions smaller than 1.5 GB exist to merge with.
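The pool-assignment rules can be sketched as a function. This is illustrative only: the thresholds come from the table above, and the exact boundary handling inside Hydrolix may differ.

```python
def assign_pool(age_minutes, size_gb, width_hours):
    """Pick a merge pool from a partition's max-timestamp age,
    size, and time width, following the criteria table."""
    if width_hours > 1:
        return None  # too wide to merge in any pool
    if age_minutes < 10:
        return "small (merge-i)" if size_gb <= 1 else None
    if age_minutes < 60:
        return "medium (merge-ii)" if size_gb <= 2 else None
    if age_minutes <= 90 * 24 * 60:
        return "large (merge-iii)" if size_gb <= 4 else None
    return None  # older than the 90-day lookback

# The 15-minute-old, 513 MB, 37-minute-wide partition from the text:
assert assign_pool(15, 0.513, 37 / 60) == "medium (merge-ii)"
# The 2.5 GB partition is ineligible until it ages into the large band:
assert assign_pool(15, 2.5, 0.5) is None
assert assign_pool(120, 2.5, 0.5) == "large (merge-iii)"
```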

Partitions older than 90 days aren't considered by default

The merge system looks back only 90 days for partitions eligible for compaction. This limit is configurable through merge_target_overrides.

📘 Primary timestamp: For more information on primary timestamps, see Timestamp Data Types.

Custom merge pools⚓︎

To separate merge workloads and avoid “noisy neighbor” effects, create additional merge pools targeted at specific tables. For example, create a dedicated merge pool for a Summary Table to separate that workload from the main merge process.

Create custom merge pools with the pools API endpoint, then apply those pools to tables with the tables API endpoint.

Create pools⚓︎

This Config API command creates a custom pool using the pools API endpoint:

POST {{base_url}}pools/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
     "settings": {
          "is_default": false,
          "k8s_deployment": {
               "service": "merge-peer",
               "scale_profile": "II"
          }
     },
     "name": "my-pool-name-II"
}

In the Hydrolix UI, select Add new from the upper right-hand menu, then select Resource pool.

Showing new Resource pool dialog

Use these settings to configure your pool:

| Object | Description | Value/Example |
| --- | --- | --- |
| service | The service workload the pool uses. For merge, this is merge-peer. | merge-peer |
| scale_profile | The merge pool size, corresponding to small, medium, or large. | I, II, or III |
| name | The name used to identify your pool. | my-pool-name-II |
| cpu | The amount of CPU provided to pods. A numeric value; defaults are specified in Scale Profiles. | 2 |
| memory | The amount of memory provided to pods. A string value; defaults are specified in Scale Profiles. Default units are Gi. | 10Gi |
| replicas | The number of pods to run in the pool. A numeric value or hyphenated range; defaults are specified in Scale Profiles. | 3 or 1-5 |
| storage | The amount of ephemeral storage provided to pods. A string value; defaults are specified in Scale Profiles. Default units are Gi. | 5Gi |
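Putting the optional resource settings together, a pool creation body might look like the following sketch. The name and resource values are examples only, and placing cpu, memory, replicas, and storage under k8s_deployment is an assumption based on the request shape shown above; verify field placement against the pools API reference.

```json
{
    "name": "my-pool-name-II",
    "settings": {
        "is_default": false,
        "k8s_deployment": {
            "service": "merge-peer",
            "scale_profile": "II",
            "cpu": 2,
            "memory": "10Gi",
            "replicas": 3,
            "storage": "5Gi"
        }
    }
}
```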

Assign pools to tables⚓︎

This API request assigns custom pools to a table using the tables API endpoint:

PATCH {{base_url}}/orgs/{{org_uuid}}/projects/{{project_uuid}}/tables/{{table_uuid}}/
Authorization: Bearer {{access_token}}
Content-Type: application/json

{
    "name": "my-table",
    "settings": {
        "merge": {
            "enabled": true,
            "pools": {
                "large": "my-pool-name-III",
                "medium": "my-pool-name-II",
                "small": "my-pool-name-I"
            }
        }
    }
}

To configure this in the Hydrolix UI, navigate to Data, select the table, find Merge settings under Advanced options, and select the pool assignment menu:

Assigning pools in merge settings in Data > table name > Advanced Settings in Hydrolix UI

Use all three pools

For optimal merge performance, provide a large, medium, and small pool.

Troubleshoot merge-peer OOM⚓︎

As described in OOM recovery and self-tuning, sporadic OOM kills don't require action. Follow these steps when a merge peer pool shows persistent OOM restarts that don't resolve on their own.

Each merge-peer pod runs two main containers: the primary merge-peer container and a secondary merge-indexer container, which is the turbine sidecar that builds indexes for merged partitions. Each container has its own memory limit and they're configured independently.

Identify which container is being OOM-killed⚓︎

When a merge peer pod is OOM-killed, first determine which of the two containers hit its memory limit:

kubectl describe pod <pod-name> | grep -A 3 "Last State"

The output shows the terminated container name and reason. If turbine appears with Reason: OOMKilled, the merge indexer sidecar is the problem, not the primary merge-peer container.

You can also check for recent OOM events across all merge peer pods:

kubectl get events --field-selector reason=OOMKilling \
  | grep merge-peer

If the OOM-killed container is turbine, increase the merge indexer memory using spec.scale.profile.

If the OOM-killed container isn't turbine and matches the pool name such as merge-peer-iii, increase memory in the pool definition's memory field instead.

Default merge indexer memory by era⚓︎

| Generation | Profile | Default merge-indexer memory | Default CPU |
| --- | --- | --- | --- |
| Era I | I | 4 Gi | 2 |
| Era II | II | 6 Gi | 2 |
| Era III | III | 12 Gi | 2 |

Fix: increase memory on the merge indexer container⚓︎

Override the merge indexer memory through spec.scale.profile. Don't use the pool definition's top-level memory field as that controls the primary merge peer container, not the indexer. The two containers resolve their resources independently.

Override Era III Merge-Indexer Memory in HydrolixCluster
scale:
  profile:
    III:
      merge-indexer:
        memory: 16Gi
        cpu: 2

Apply the change:

kubectl apply -f hydrolixcluster.yaml

The operator detects the spec change, updates the deployment, and Kubernetes rolls the affected merge-peer pods with the new resources. No manual kubectl rollout restart is required.

See Custom Scale Profiles for an example of creating a named profile and attaching it to a pool.

Troubleshoot: useful queries⚓︎

Duration of merge (without upload to storage)⚓︎

max(merge_sdk_duration_summary{app=~"merge-peer.*", quantile="0.9"})

Merge controller latency in communicating with query catalog⚓︎

histogram_quantile(0.99, sum by(le, method) (rate(query_latency_bucket{app="merge-controller"}[$Resolution])))

Count of partitions tracked in memory⚓︎

sum by (instance) (tracked{app="merge-controller"})

Count of currently active merge operations⚓︎

sum by (target) (active_merges{app="merge-controller"})

Count of known partition segments⚓︎

sum by (target) (segments{app="merge-controller"})

Count of constructed candidates ready to be merged⚓︎

sum by (target) (candidates{app="merge-controller"})

Count of fetched partitions awaiting segmentation⚓︎

sum by (target) (partitions{app="merge-controller"})

Count of partitions sourced that are already tracked⚓︎

sum by(pool_id) (rate(duplicate_partitions{app="merge-controller"}[$Resolution]))

Count of connected clients⚓︎

sum by(pool_id) (connected_clients{app="merge-controller"})

Percentage of time merge-peers are performing work⚓︎

sum by (pool) (merge_duty_cycle{quantile="1"})

Merge peers upload duration⚓︎

max(upload_duration{quantile="0.5", service="merge-peer", app="merge-peer"})

Race lost counter⚓︎

SELECT count(*)
FROM "hydro"."logs"
WHERE ( app LIKE '%merge-peer%' or app LIKE '%merge-controller%')
AND error LIKE '%race lost%' and message like '%failed%'
AND ( timestamp >= $__fromTime AND timestamp <= $__toTime );

Merges completed⚓︎

SELECT count(*) as "Merges" FROM hydro.logs
WHERE ( timestamp between $__fromTime AND $__toTime )
AND app = 'merge-peer'
AND query_phase = 'end'
AND pool = 'merge-peer'
AND error IS NULL
AND exception IS NULL

Merges completed with failures⚓︎

SELECT count(*) as "Merges" FROM hydro.logs
WHERE ( timestamp between $__fromTime AND $__toTime )
AND app = 'merge-peer'
AND query_phase = 'end'
AND pool = 'merge-peer'
AND (error IS NOT NULL OR exception IS NOT NULL)

Actual partition count by project and table⚓︎

sum by (target, project_name, table_name) (actual_partition_count{project_name="$Project", table_name="$Table"}) 

Ideal partition count by project and table⚓︎

sum by (target, project_name, table_name) (ideal_partition_count{project_name="$Project", table_name="$Table"}) 

Merge efficiency by project and table⚓︎

sum by (target, project_name, table_name) (efficiency{project_name="$Project", table_name="$Table"})

Partition memory size by project and table, histogram⚓︎

histogram_quantile(0.99, sum by(le) (partition_distribution_bucket{project_name="$Project", table_name="$Table"}))

Age of buckets in milliseconds, histogram⚓︎

histogram_quantile(0.99, sum by(le, target, basis) (bucket_duration_bucket{app="merge-controller"}))

Buckets closed per second⚓︎

sum by (target, basis) (rate(bucket_duration_count[$Resolution]))