Cluster Health Status
The Hydrolix Operator publishes cluster health information to the HydrolixCluster custom resource. Cluster health uses the following indicators:
- The cluster status reflects the overall health of the Hydrolix deployment.
- The cluster issues are a list of unhealthy Kubernetes resources managed by the Operator, each with a brief description of the problem.
Cluster status
The cluster status reflects the overall health of the Hydrolix deployment. See the status using kubectl get:
To retrieve the status field only:
kubectl reads this value from the `status.clusterStatus` field of the HydrolixCluster custom resource.
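As a minimal sketch in Python, assuming the HydrolixCluster resource has already been exported as JSON (for example with kubectl's `-o json` output option), the same field can be read from the parsed document. The function name and the trimmed sample resource are hypothetical and only illustrate the field path.

```python
import json
from typing import Optional


def read_cluster_status(resource_json: str) -> Optional[str]:
    """Return status.clusterStatus from a JSON-encoded resource, or None if absent."""
    resource = json.loads(resource_json)
    return resource.get("status", {}).get("clusterStatus")


# Hypothetical, heavily trimmed resource used only for illustration.
sample = json.dumps({"kind": "HydrolixCluster", "status": {"clusterStatus": "Ready"}})
print(read_cluster_status(sample))
```

Returning `None` when the field is missing lets callers distinguish a resource without a published status from one that reports a status value.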
Status values
The cluster status can be one of the following values:
| Status | Description |
|---|---|
| Ready | No critical issues are present. |
| Not Ready | At least one critical issue is present. |
| Upgrading | The cluster is deploying a different version. |
| Scaled Off | The cluster has been scaled off (scale_off: true). |
Cluster issues
Cluster issues represent unhealthy Kubernetes resources managed by the Hydrolix Operator. The Operator determines resource health based on factors specific to each resource type.
Issue types
The Operator reports two types of issues:
- Critical: Issues that prevent the cluster from reaching `Ready` status. These are associated with resources that the Operator considers essential.
- Non-critical: Issues associated with resources that the Operator ignores during health evaluation. Non-critical issues don't prevent the cluster from reaching `Ready` status.
View issues
Check the .status.issues field to see both critical and non-critical issues:
The relationship between issues and cluster status follows this logic:
- If one or more critical issues exist, `clusterStatus` is `Not Ready`.
- If only non-critical issues exist, `clusterStatus` is `Ready`.
- If no issues exist, `clusterStatus` is `Ready`.
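The rules above can be sketched in Python as follows. This is an illustration, not the Operator's implementation: the `Issue` type and its `critical` flag are hypothetical, and the sketch deliberately leaves out the `Upgrading` and `Scaled Off` states.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Issue:
    resource: str   # illustrative identifier, e.g. "Deployment/query-head"
    critical: bool  # hypothetical flag; the real issue schema may differ


def derive_cluster_status(issues: Iterable[Issue]) -> str:
    """Return "Not Ready" if any issue is critical, otherwise "Ready"."""
    return "Not Ready" if any(issue.critical for issue in issues) else "Ready"


# Only a non-critical issue is present, so the cluster still counts as Ready.
print(derive_cluster_status([Issue("Job/load-sample-project", critical=False)]))
```

Because only critical issues influence the result, the "no issues" and "only non-critical issues" cases collapse into the same branch.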
Conditions
The Operator derives issues from conditions. Kubernetes resources report their conditions through the status.conditions field. Each condition includes:
- A `type` (such as `Available` or `Progressing`)
- A `status` (`True`, `False`, or `Unknown`)
- A human-readable `message` explaining the current state
The Operator inspects these conditions to assess health. For example, a deployment is considered unhealthy when its Available condition has a status of False.
The following example shows a healthy Deployment with both conditions reporting True:
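In Python terms, such a Deployment and the check described above might look like the sketch below. It assumes the conditions are available as plain dictionaries shaped like the entries of `status.conditions`. The function name and the message strings are illustrative, and treating a missing `Available` condition as healthy is an assumption of the sketch rather than documented Operator behavior.

```python
def is_deployment_healthy(conditions):
    """A Deployment is unhealthy when its Available condition has status "False"."""
    for condition in conditions:
        if condition.get("type") == "Available" and condition.get("status") == "False":
            return False
    return True


# Both conditions report "True" (Kubernetes encodes condition status as strings).
healthy_conditions = [
    {"type": "Available", "status": "True",
     "message": "Deployment has minimum availability."},
    {"type": "Progressing", "status": "True",
     "message": "ReplicaSet has successfully progressed."},
]
print(is_deployment_healthy(healthy_conditions))
```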
Configure ignored resources
Resources can be excluded from the cluster health evaluation by marking them as ignored. Ignored resources are treated as non-critical, meaning their issues won't prevent the cluster from reaching Ready status. This is useful for optional components or maintenance tasks that shouldn't affect overall cluster health.
Tunables
Three tunables control which resources the Operator ignores during health checks.
| Name | Type | Default | Description |
|---|---|---|---|
| `health_check_default_ignored_resources` | list | `["Job/load-sample-project.*"]` | The default list of resource patterns to ignore during health checks. These are the resources that the Operator considers non-critical by default. Patterns support regular expression matching. |
| `health_check_ignored_resources` | list | `[]` | Additional resource patterns to ignore. By default, these patterns are merged with the default ignored resources. |
| `health_check_override_default_ignored_resources` | bool | `false` | Controls whether user-specified patterns replace or extend the defaults. When set to `false` (the default), user patterns are combined with defaults. When set to `true`, user patterns replace defaults entirely. |
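The interaction between the three tunables can be sketched in Python. This illustration rests on several assumptions: the function names are hypothetical, resources are assumed to be identified by `Kind/name` strings (following the shape of the default pattern), the `Deployment/query-head` pattern is made up for the example, and full-string matching with `re.fullmatch` is a guess at how the Operator applies patterns.

```python
import re

# Default taken from the table above.
DEFAULT_IGNORED = ["Job/load-sample-project.*"]


def effective_ignored_patterns(user_patterns, override_defaults=False,
                               defaults=DEFAULT_IGNORED):
    """Extend the defaults with user patterns, or replace them when overriding."""
    if override_defaults:
        return list(user_patterns)
    return list(defaults) + list(user_patterns)


def is_ignored(resource, patterns):
    """Return True if the resource identifier matches any ignored pattern."""
    return any(re.fullmatch(pattern, resource) for pattern in patterns)


merged = effective_ignored_patterns(["Deployment/query-head"])
replaced = effective_ignored_patterns(["Deployment/query-head"], override_defaults=True)
nothing_ignored = effective_ignored_patterns([], override_defaults=True)

print(is_ignored("Job/load-sample-project", merged))    # default pattern kept
print(is_ignored("Job/load-sample-project", replaced))  # default pattern dropped
```

The three calls correspond to extending the defaults, replacing them, and clearing the ignore list entirely with an empty override.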
Examples
Add custom non-critical resources
To add custom patterns while keeping the defaults, specify health_check_ignored_resources:
[Listing: Add Custom Non-Critical Resources]
Issues with both the load-sample-project job and query-head deployment are categorized as non-critical. The issues output reflects this:
[Listing: Example Issues With Custom Non-Critical Resources]
Override defaults completely
To ignore only specific resources and discard the defaults, set health_check_override_default_ignored_resources to true and list ignored resources with health_check_ignored_resources:
[Listing: Override Default Ignored Resources]
Only the specified patterns are ignored. Resources matching the default `Job/load-sample-project.*` pattern are no longer ignored, so their issues are treated as critical.
Remove all ignored resources
To treat all unhealthy resources as critical, override the defaults with an empty list:
[Listing: Remove All Ignored Resources]
All unhealthy resources are critical and prevent Ready status.