OOMKill Detector and Data Splitter
Detect and mitigate out-of-memory scenarios in intake services.
Overview
Hydrolix intake services can sometimes exceed the memory limits of the indexer, causing two detectable out-of-memory scenarios:
- The indexer exits with status 137, signaling an OOMKill status within Kubernetes.
- The circuit breaker inside the indexer itself produces a specific Code: 241 error.
To help customers who are seeing OOMKill statuses in the indexer, Hydrolix provides optional functionality that detects these scenarios and retries indexing with multiple smaller amounts of data, reducing the indexing service's memory usage.
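The detect-and-retry idea can be sketched in a few lines of Python. This is a minimal illustration only, not Hydrolix's implementation: the IndexerOOM exception, the index_with_splitting function, the halving strategy, and the fake_index stand-in are all hypothetical names and choices made for this example. Only the exit status 137 and the Code: 241 error come from the description above.

```python
from typing import Callable, List, Sequence

OOM_EXIT_CODE = 137          # Kubernetes OOMKill exit status (from the text above)
CIRCUIT_BREAKER_CODE = 241   # indexer circuit breaker error code (from the text above)


class IndexerOOM(Exception):
    """Hypothetical error raised when either OOM scenario is detected."""


def index_with_splitting(
    rows: Sequence[str],
    index: Callable[[Sequence[str]], None],
    min_rows: int = 1,
) -> List[Sequence[str]]:
    """Index rows; on a detected OOM, halve the data and retry each half.

    Returns the batches that were indexed successfully, in order.
    Halving is an assumption of this sketch, not a documented strategy.
    """
    try:
        index(rows)
        return [rows]
    except IndexerOOM:
        # A single row cannot be split further, so give up on it.
        if len(rows) <= max(min_rows, 1):
            raise
        mid = len(rows) // 2
        return (
            index_with_splitting(rows[:mid], index, min_rows)
            + index_with_splitting(rows[mid:], index, min_rows)
        )


def fake_index(batch: Sequence[str]) -> None:
    """Stand-in indexer that 'runs out of memory' above two rows."""
    if len(batch) > 2:
        raise IndexerOOM(f"Code: {CIRCUIT_BREAKER_CODE}")


batches = index_with_splitting(list("abcdefgh"), fake_index)
print([len(b) for b in batches])  # [2, 2, 2, 2]
```

Here eight rows fail as one request, fail again as two requests of four, and succeed as four requests of two, which is the "multiple, smaller amounts of data" behavior in miniature.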
The OOMKill Detector and Data Splitter is disabled by default and is available for all Hydrolix real-time ingestion services, including intake-head, kinesis-peer, kafka-peer, and akamai-siem-peer.
Configuration
Use oom_detection and its related tunables in your hydrolixcluster.yaml to configure this behavior. Detection is enabled for each service individually, and the two types of scenarios can be enabled or disabled individually via four nested settings. For example:
oom_detection:
  intake-head:
    k8s_oom_kill_detection_enabled: true
    k8s_oom_kill_detection_max_attempts: 5
    circuit_break_oom_detection_enabled: true
    preemptive_splitting_enabled: true
  kinesis-peer:
    [same keys supported as above]
Here are definitions of each of the four settings:
Setting | Default Value | Definition |
---|---|---|
k8s_oom_kill_detection_enabled | false | Set to true to enable detection of Kubernetes OOMKills. |
k8s_oom_kill_detection_max_attempts | 5 | Sets the maximum number of times to poll the Kubernetes API for failed containers with OOMKill status. |
circuit_break_oom_detection_enabled | false | Set to true to enable detection of the circuit breaker OOM response from turbine. |
preemptive_splitting_enabled | false | Set to true to enable preemptively splitting data before calling the indexer, based on previous OOM detections. |
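As a way to read the table, the following Python sketch resolves the effective settings for one service by layering its oom_detection entries over the defaults listed above. It assumes that omitted keys and unlisted services fall back to those defaults. The effective_settings helper and the sample values are hypothetical and are not part of Hydrolix.

```python
from typing import Any, Dict

# Defaults exactly as listed in the table above.
DEFAULTS: Dict[str, Any] = {
    "k8s_oom_kill_detection_enabled": False,
    "k8s_oom_kill_detection_max_attempts": 5,
    "circuit_break_oom_detection_enabled": False,
    "preemptive_splitting_enabled": False,
}


def effective_settings(
    oom_detection: Dict[str, Dict[str, Any]], service: str
) -> Dict[str, Any]:
    """Return one service's settings, filling gaps with the defaults."""
    overrides = oom_detection.get(service, {})
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Catch misspelled keys early instead of silently ignoring them.
        raise KeyError(f"unknown oom_detection setting(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Parsed form of an oom_detection block (illustrative values only).
oom_detection = {
    "intake-head": {
        "k8s_oom_kill_detection_enabled": True,
        "circuit_break_oom_detection_enabled": True,
    },
}

print(effective_settings(oom_detection, "intake-head"))
print(effective_settings(oom_detection, "kafka-peer"))  # all defaults
```

In this sketch, intake-head has both detection types enabled with the default of 5 attempts, while kafka-peer, which has no entry, stays fully disabled.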
Regarding the k8s_oom_kill_detection_max_attempts setting: increase it if you see OOMKills that aren't detected by the intake service. Decrease it if you'd rather not have the detector spend as much time waiting for the Kubernetes API to report OOMKills.
The intake services listed above issue this log line when they detect an OOM and retry with smaller amounts of data:
turbine oom detected attempting split
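The trade-off behind k8s_oom_kill_detection_max_attempts can be illustrated with a bounded polling loop. This is a sketch under stated assumptions, not the detector's actual code: wait_for_oom_kill, the check_oom_killed callback, and poll_interval_s are hypothetical, and the real polling interval is not documented here.

```python
import time
from typing import Callable


def wait_for_oom_kill(
    check_oom_killed: Callable[[], bool],
    max_attempts: int = 5,
    poll_interval_s: float = 1.0,  # hypothetical; not a Hydrolix setting
) -> bool:
    """Poll up to max_attempts times; return True once an OOMKill is seen."""
    for attempt in range(max_attempts):
        if check_oom_killed():
            return True
        if attempt < max_attempts - 1:
            time.sleep(poll_interval_s)  # wait before the next poll
    return False


# Simulate an API that only reports the OOMKill on the third poll.
polls = iter([False, False, True])
print(wait_for_oom_kill(lambda: next(polls), max_attempts=5, poll_interval_s=0))  # True

polls = iter([False, False, True])
print(wait_for_oom_kill(lambda: next(polls), max_attempts=2, poll_interval_s=0))  # False
```

With too few attempts, the loop gives up before the status appears and the OOMKill goes undetected. With more attempts it is caught, at the cost of a longer worst-case wait when no OOMKill occurred.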
Upgrading from v4.20 or earlier?
A new Kubernetes service account called ingest is provided to access the Kubernetes API. If you're running Hydrolix in a cloud that has IAM (for example, GCP or AWS), this ingest Kubernetes service account needs to be bound to the cloud service account used for the cluster. See the v4.21 release notes for steps to follow.
Limitations
If the indexing request is loading data from the raw data spill feature, Hydrolix won't split and retry the data.