
Cluster Health Status

The Hydrolix Operator publishes cluster health information to the HydrolixCluster custom resource. Cluster health uses the following indicators:

  • The cluster status reflects the overall health of the Hydrolix deployment.
  • The cluster issues are a list of unhealthy Kubernetes resources managed by the Operator, each with a brief description of the problem.

Cluster status

The cluster status reflects the overall health of the Hydrolix deployment. See the status using kubectl get:

kubectl get hdx
NAME   STATUS   VERSION   URL
hdx    Ready    v5.11.0   https://hostname.hydrolix.live

To retrieve the status field only:

kubectl get hdx hdx -ojson | jq .status.clusterStatus
"Ready"

kubectl reads this value from the status.clusterStatus field of the HydrolixCluster custom resource.

Status values

The cluster status can be one of the following values:

| Status     | Description                                         |
|------------|-----------------------------------------------------|
| Ready      | No critical issues are present.                     |
| Not Ready  | At least one critical issue is present.             |
| Upgrading  | The cluster is deploying a different version.       |
| Scaled Off | The cluster has been scaled off (scale_off: true).  |

Cluster issues

Cluster issues represent unhealthy Kubernetes resources managed by the Hydrolix Operator. The Operator determines resource health based on factors specific to each resource type.

Issue types

The Operator reports two types of issues:

  • Critical: Issues that prevent the cluster from reaching Ready status. These are associated with resources that the Operator considers essential.
  • Non-critical: Issues associated with resources that the Operator ignores during health evaluation. Non-critical issues don't prevent the cluster from reaching Ready status.

View issues

Check the .status.issues field to see both critical and non-critical issues:

kubectl get hdx hdx -ojson | jq .status.issues
{
  "critical": [
    "Deployment/turbine-api: Deployment does not have minimum availability",
    "Service/traefik: missing load balancer"
  ],
  "nonCritical": [
    "Job/load-sample-project-2049: Job has 0 available replicas"
  ]
}
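The sample output above suggests that each entry has the shape `Kind/name: message`. The following is a minimal sketch for splitting entries into their parts when post-processing the JSON in a script. The `Issue` type and `parse_issue` helper are hypothetical names, and the entry format is inferred from the example rather than from a documented contract.

```python
from typing import NamedTuple


class Issue(NamedTuple):
    kind: str     # Kubernetes resource kind, e.g. "Deployment"
    name: str     # resource name, e.g. "turbine-api"
    message: str  # description of the problem


def parse_issue(entry: str) -> Issue:
    """Split 'Kind/name: message' into its parts.

    Assumption: the format matches the sample output shown in this page.
    """
    resource, _, message = entry.partition(": ")
    kind, _, name = resource.partition("/")
    return Issue(kind, name, message)


# Sample data copied from the .status.issues output above.
issues = {
    "critical": [
        "Deployment/turbine-api: Deployment does not have minimum availability",
        "Service/traefik: missing load balancer",
    ],
    "nonCritical": [
        "Job/load-sample-project-2049: Job has 0 available replicas",
    ],
}

for severity, entries in issues.items():
    for entry in entries:
        issue = parse_issue(entry)
        print(f"{severity}: {issue.kind} {issue.name} -> {issue.message}")
```

In practice the `issues` dictionary would come from parsing the output of the kubectl command shown above instead of being written inline.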

The relationship between issues and cluster status follows this logic:

  • If one or more critical issues exist, clusterStatus is Not Ready.
  • If only non-critical issues exist, clusterStatus is Ready.
  • If no issues exist, clusterStatus is Ready.
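The three rules above reduce to a single check on the critical list. The sketch below models that logic for illustration only. The `cluster_status` function is a hypothetical name, not Operator code, and it deliberately leaves out the Upgrading and Scaled Off states, which depend on information other than issues.

```python
def cluster_status(issues: dict) -> str:
    """Model the issue-to-status rules: only critical issues block Ready."""
    critical = issues.get("critical", [])
    return "Not Ready" if critical else "Ready"


# One or more critical issues: Not Ready.
print(cluster_status({
    "critical": ["Service/traefik: missing load balancer"],
    "nonCritical": [],
}))

# Only non-critical issues: Ready.
print(cluster_status({
    "critical": [],
    "nonCritical": ["Job/load-sample-project-2049: Job has 0 available replicas"],
}))

# No issues at all: Ready.
print(cluster_status({"critical": [], "nonCritical": []}))
```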

Conditions

The Operator derives issues from conditions. Kubernetes resources report their conditions through the status.conditions field. Each condition includes:

  • A type (such as Available or Progressing)
  • A status (True, False, or Unknown)
  • A human-readable message explaining the current state

The Operator inspects these conditions to assess health. For example, a deployment is considered unhealthy when its Available condition has a status of False.

The following example shows a healthy Deployment with both conditions reporting True:

kubectl get deploy intake-head -oyaml | yq .status.conditions
- lastTransitionTime: "2025-11-05T17:01:52Z"
  lastUpdateTime: "2025-11-05T17:01:52Z"
  message: Deployment has minimum availability.
  reason: MinimumReplicasAvailable
  status: "True"
  type: Available
- lastTransitionTime: "2025-10-24T19:06:17Z"
  lastUpdateTime: "2025-11-05T17:01:56Z"
  message: ReplicaSet "intake-head-66599f7c6c" has successfully progressed.
  reason: NewReplicaSetAvailable
  status: "True"
  type: Progressing
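As an illustration of the Deployment rule described above, the following sketch flags a Deployment as unhealthy only when its Available condition reports False. The `is_deployment_unhealthy` function is a hypothetical name. How the Operator treats an Unknown status or a missing Available condition isn't stated here, so the sketch reports those cases as not unhealthy purely as an assumption.

```python
from typing import Dict, List


def is_deployment_unhealthy(conditions: List[Dict[str, str]]) -> bool:
    """Return True when the Available condition has status "False".

    Assumption: any other situation (status "True" or "Unknown", or no
    Available condition at all) is not flagged by this sketch.
    """
    for condition in conditions:
        if condition.get("type") == "Available":
            return condition.get("status") == "False"
    return False


# Conditions reduced to the fields that matter, taken from the example above.
healthy = [
    {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
    {"type": "Progressing", "status": "True", "reason": "NewReplicaSetAvailable"},
]

# Hypothetical unhealthy case: the Available condition flipped to False.
degraded = [
    {"type": "Available", "status": "False"},
    {"type": "Progressing", "status": "True"},
]

print(is_deployment_unhealthy(healthy))
print(is_deployment_unhealthy(degraded))
```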

Configure ignored resources

Resources can be excluded from the cluster health evaluation by marking them as ignored. Ignored resources are treated as non-critical, meaning their issues won't prevent the cluster from reaching Ready status. This is useful for optional components or maintenance tasks that shouldn't affect overall cluster health.

Tunables

Three tunables control which resources the Operator ignores during health checks.

| Name | Type | Default | Description |
|------|------|---------|-------------|
| health_check_default_ignored_resources | list | ["Job/load-sample-project.*"] | The default list of resource patterns to ignore during health checks. These are the resources that the Operator considers non-critical by default. Patterns support regular expression matching. |
| health_check_ignored_resources | list | [] | Additional resource patterns to ignore. By default, these patterns are merged with the default ignored resources. |
| health_check_override_default_ignored_resources | bool | false | Controls whether user-specified patterns replace or extend the defaults. When set to false (the default), user patterns are combined with the defaults. When set to true, user patterns replace the defaults entirely. |
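The interaction between the three tunables can be sketched as follows. This is a model for reasoning about a configuration, not the Operator's implementation. The function names are hypothetical, and two details are assumptions: that a pattern must match the whole `Kind/name` string (`re.fullmatch`), and that defaults come before user patterns when merged.

```python
import re
from typing import List

# Default value of health_check_default_ignored_resources, per the table above.
DEFAULT_IGNORED = ["Job/load-sample-project.*"]


def effective_ignored(user_patterns: List[str],
                      override_defaults: bool = False,
                      defaults: List[str] = DEFAULT_IGNORED) -> List[str]:
    """Combine the tunables into the list of patterns actually applied."""
    if override_defaults:
        # User patterns replace the defaults entirely.
        return list(user_patterns)
    # User patterns extend the defaults (merge order is an assumption).
    return list(defaults) + list(user_patterns)


def is_ignored(resource: str, patterns: List[str]) -> bool:
    """Return True if the resource matches any pattern.

    Assumption: a pattern must match the full "Kind/name" string.
    """
    return any(re.fullmatch(pattern, resource) for pattern in patterns)


# Extend the defaults with a custom pattern.
merged = effective_ignored(["Deployment/query-.*"])
print(merged)
print(is_ignored("Deployment/query-head", merged))

# Replace the defaults: the sample-project job is no longer ignored.
replaced = effective_ignored(["Deployment/intake-.*"], override_defaults=True)
print(is_ignored("Job/load-sample-project-2049", replaced))
```

A resource for which `is_ignored` returns True corresponds to one whose issues would appear under nonCritical. Everything else would appear under critical.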

Examples

Add custom non-critical resources

To add custom patterns while keeping the defaults, specify health_check_ignored_resources:

Add Custom Non-Critical Resources
apiVersion: hydrolix.io/v1
kind: HydrolixCluster
spec:
  health_check_ignored_resources:
    - "Deployment/query-.*"

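With this configuration, issues with the load-sample-project job (matched by the default pattern) and with any Deployment whose name starts with query-, such as query-head, are categorized as non-critical. The issues output reflects this: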

Example Issues With Custom Non-Critical Resources
{
  "critical": [],
  "nonCritical": [
    "Job/load-sample-project-2049: Job has 0 available replicas",
    "Deployment/query-head: Deployment does not have minimum availability"
  ]
}

Override defaults completely

To ignore only specific resources and discard the defaults, set health_check_override_default_ignored_resources to true and list ignored resources with health_check_ignored_resources:

Override Default Ignored Resources
apiVersion: hydrolix.io/v1
kind: HydrolixCluster
spec:
  health_check_override_default_ignored_resources: true
  health_check_ignored_resources:
    - "Deployment/intake-.*"

Only resources matching the specified patterns are ignored. Because the defaults are discarded, issues with resources matching the usual default pattern, Job/load-sample-project.*, are now reported as critical.

Remove all ignored resources

To treat all unhealthy resources as critical, override the defaults with an empty list:

Remove All Ignored Resources
apiVersion: hydrolix.io/v1
kind: HydrolixCluster
spec:
  health_check_override_default_ignored_resources: true
  health_check_ignored_resources: []

With no ignored patterns, every unhealthy resource is reported as critical and prevents the cluster from reaching Ready status.