OOMKill Detector and Data Splitter

Detect and mitigate out-of-memory scenarios in intake services.

Overview

Hydrolix intake services can sometimes exceed the memory limits of the indexer, causing two detectable out-of-memory scenarios:

  1. The indexer exits with status 137, signaling an OOMKill status within Kubernetes.
  2. The circuit breaker inside the indexer itself produces a specific Code: 241 error.
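For illustration, the two signals above could be distinguished roughly as follows. This is a sketch only; the function and parameter names are hypothetical and not Hydrolix internals:

```python
def is_oom(exit_code, response_code=None):
    """Classify a failed indexing attempt as one of the two OOM scenarios.

    exit_code 137 is 128 + SIGKILL(9), the signature of a Kubernetes
    OOMKill. response_code 241 stands for the indexer's internal
    circuit breaker reporting memory pressure (Code: 241).
    """
    K8S_OOM_EXIT = 137
    CIRCUIT_BREAK_OOM = 241
    return exit_code == K8S_OOM_EXIT or response_code == CIRCUIT_BREAK_OOM
```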

Hydrolix aids customers who are seeing OOMKill statuses in the indexer: it provides optional functionality to detect these scenarios and retry indexing with multiple smaller batches of data, reducing the indexing service's memory usage.

The OOMKill Detector and Data Splitter is disabled by default, and is available for all Hydrolix real time ingestion services, including intake-head, kinesis-peer, kafka-peer, and akamai-siem-peer.

Configuration

Use the oom_detection and related tunables in your hydrolixcluster.yaml to configure this behavior. Detection is enabled per service, and the two scenario types can be enabled or disabled individually via four nested settings. For example:

oom_detection:
  intake-head:
    k8s_oom_kill_detection_enabled: true
    k8s_oom_kill_detection_max_attempts: 5
    circuit_break_oom_detection_enabled: true
    preemptive_splitting_enabled: true
  kinesis-peer:
    [same keys supported as above]

Here are definitions of each of the four settings:

Setting | Default Value | Definition
--- | --- | ---
k8s_oom_kill_detection_enabled | false | Set to true to enable detection of Kubernetes OOMKills.
k8s_oom_kill_detection_max_attempts | 5 | Sets the maximum number of iterations to poll the Kubernetes API for failed containers with OOMKill status.
circuit_break_oom_detection_enabled | false | Set to true to enable detection of the circuit breaker OOM response from turbine.
preemptive_splitting_enabled | false | Set to true to enable preemptively splitting data before calling the indexer, based on previous OOM detections.

Regarding the k8s_oom_kill_detection_max_attempts setting: increase it if you see OOMKills that aren't detected by the intake service. Decrease it if you'd rather not have the detector spend as much time waiting for the Kubernetes API to report OOMKills. The intake services listed above issue the following log line when they detect an OOM and retry with smaller amounts of data:

turbine oom detected attempting split
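The retry-with-split behavior described above can be sketched as follows. This is a simplified illustration, not the actual Hydrolix implementation; index_with_split and index_fn are hypothetical names, and a MemoryError stands in for either OOM signal (exit 137 or Code: 241):

```python
def index_with_split(records, index_fn, max_splits=3):
    """Try to index a batch; on an OOM-like failure, split it in half
    and retry each half with the remaining split budget."""
    try:
        index_fn(records)
    except MemoryError:
        if max_splits == 0 or len(records) <= 1:
            raise  # can't split further; surface the failure
        print("turbine oom detected attempting split")
        mid = len(records) // 2
        index_with_split(records[:mid], index_fn, max_splits - 1)
        index_with_split(records[mid:], index_fn, max_splits - 1)
```

Each split halves the batch, so memory pressure on the indexer drops with every retry until either the batch fits or the split budget is exhausted.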


Upgrading from v4.20 or earlier?

A new Kubernetes service account called ingest is provided to access the Kubernetes API. If you're running Hydrolix in a cloud that has IAM (for example, GCP or AWS), this ingest Kubernetes service account needs to be bound to the cloud service account used for the cluster. See the v4.21 release notes for steps to follow.
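As an illustration only (the v4.21 release notes are the authoritative steps), on GKE with Workload Identity this kind of binding is typically expressed by annotating the Kubernetes service account with the cloud service account; all names below are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ingest
  namespace: <your-hydrolix-namespace>   # placeholder
  annotations:
    # placeholder: replace with the cluster's GCP service account
    iam.gke.io/gcp-service-account: <gcp-sa-name>@<project-id>.iam.gserviceaccount.com
```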

Limitations

Hydrolix won't split and retry data when the indexing request loads data from the raw data spill feature.