Cluster Health Status

The Hydrolix Operator publishes cluster health information to the HydrolixCluster custom resource. Cluster health uses the following indicators:

Cluster status
Cluster issues

The cluster issues are a list of unhealthy Kubernetes resources managed by the Operator and a brief description of each problem.

Cluster status

The cluster status reflects the overall health of the Hydrolix deployment. See the status using kubectl get:

View Cluster Status

kubectl get hdx

Example Full Status Output

NAME   STATUS   VERSION   URL
hdx    Ready    v5.11.0   https://hostname.hydrolix.live

To retrieve the status field only:

Retrieve Cluster Status Field

kubectl get hdx hdx -ojson | jq .status.clusterStatus

Example Status Field Output

"Ready"

This value comes from the status.clusterStatus field of the HydrolixCluster custom resource.
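For scripts that need the same value without jq, the field can be read from the JSON that kubectl prints. The following is a minimal Python sketch under stated assumptions: the helper name cluster_status_from_json is hypothetical, and the input is assumed to be the text printed by kubectl get hdx hdx -ojson.

```python
import json
from typing import Optional


def cluster_status_from_json(resource_json: str) -> Optional[str]:
    """Return status.clusterStatus from a HydrolixCluster JSON document.

    Hypothetical helper, not part of any Hydrolix tooling. Returns None
    when the status object or the clusterStatus field is absent.
    """
    resource = json.loads(resource_json)
    # Treat a missing or null status object as empty.
    status = resource.get("status") or {}
    return status.get("clusterStatus")


# Trimmed sample document, for illustration only.
sample = '{"kind": "HydrolixCluster", "status": {"clusterStatus": "Ready"}}'
print(cluster_status_from_json(sample))  # Ready
```

Returning None instead of raising keeps the caller in control of how to handle a resource that has no status yet.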

Status values

The cluster status can be one of the following values:

Status	Description
Ready	No critical issues are present.
Not Ready	At least one critical issue is present.
Upgrading	The cluster is deploying a different version.
Scaled Off	The cluster has been scaled off (`scale_off: true`).

Cluster issues

Cluster issues represent unhealthy Kubernetes resources managed by the Hydrolix Operator. The Operator determines resource health based on factors specific to each resource type.

Issue types

The Operator reports two types of issues:

Critical: Issues that prevent the cluster from reaching Ready status. These are associated with resources that the Operator considers essential.
Non-critical: Issues associated with resources that the Operator ignores during health evaluation. Non-critical issues don't prevent the cluster from reaching Ready status.

View issues

Check the .status.issues field to see both critical and non-critical issues:

View Cluster Issues

kubectl get hdx hdx -ojson | jq .status.issues

Example Issues Output

{
  "critical": [
    "Deployment/turbine-api: Deployment does not have minimum availability",
    "Service/traefik: missing load balancer"
  ],
  "nonCritical": [
    "Job/load-sample-project-2049: Job has 0 available replicas"
  ]
}

The relationship between issues and cluster status follows this logic:

If one or more critical issues exist, clusterStatus is Not Ready.
If only non-critical issues exist, clusterStatus is Ready.
If no issues exist, clusterStatus is Ready.
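The three rules above reduce to a single check on the critical list. The following Python sketch illustrates that reduction. The function name status_from_issues is hypothetical, and the sketch covers only the issue-based rules: it does not model the Upgrading or Scaled Off statuses, and it is not the Operator's actual implementation.

```python
from typing import Dict, List


def status_from_issues(issues: Dict[str, List[str]]) -> str:
    """Map a .status.issues object to "Ready" or "Not Ready".

    Illustrative only: any critical issue yields "Not Ready";
    non-critical issues never change the result.
    """
    return "Not Ready" if issues.get("critical") else "Ready"


with_critical = {
    "critical": ["Service/traefik: missing load balancer"],
    "nonCritical": ["Job/load-sample-project-2049: Job has 0 available replicas"],
}
only_non_critical = {
    "critical": [],
    "nonCritical": ["Job/load-sample-project-2049: Job has 0 available replicas"],
}

print(status_from_issues(with_critical))      # Not Ready
print(status_from_issues(only_non_critical))  # Ready
print(status_from_issues({"critical": [], "nonCritical": []}))  # Ready
```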

Conditions

The Operator derives issues from conditions. Kubernetes resources report their conditions through the status.conditions field. Each condition includes:

A type (such as Available or Progressing)
A status (True, False, or Unknown)
A human-readable message explaining the current state

The Operator inspects these conditions to assess health. For example, it considers a Deployment unhealthy when the Deployment's Available condition has a status of False.

The following example shows a healthy Deployment with both conditions reporting True:

Get Conditions for a Healthy intake-head Deployment

kubectl get deploy intake-head -oyaml | yq .status.conditions

Example Condition Output

- lastTransitionTime: "2025-11-05T17:01:52Z"
  lastUpdateTime: "2025-11-05T17:01:52Z"
  message: Deployment has minimum availability.
  reason: MinimumReplicasAvailable
  status: "True"
  type: Available
- lastTransitionTime: "2025-10-24T19:06:17Z"
  lastUpdateTime: "2025-11-05T17:01:56Z"
  message: ReplicaSet "intake-head-66599f7c6c" has successfully progressed.
  reason: NewReplicaSetAvailable
  status: "True"
  type: Progressing
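The Deployment rule described in this section can be sketched as a check over the conditions list. This Python example is an illustration under assumptions: the function name deployment_is_unhealthy is hypothetical, and it encodes only the one rule stated here (Available with status False). How the Operator treats a missing or Unknown condition is not covered by this page, so the sketch does not flag those cases.

```python
from typing import Dict, List


def deployment_is_unhealthy(conditions: List[Dict[str, str]]) -> bool:
    """Return True when an Available condition reports status "False".

    Illustrative only. A missing Available condition, or one with
    status "Unknown", is not flagged by this sketch.
    """
    return any(
        condition.get("type") == "Available" and condition.get("status") == "False"
        for condition in conditions
    )


# Conditions reduced to the two fields the check reads.
healthy = [
    {"type": "Available", "status": "True"},
    {"type": "Progressing", "status": "True"},
]
unavailable = [
    {"type": "Available", "status": "False"},
    {"type": "Progressing", "status": "True"},
]

print(deployment_is_unhealthy(healthy))      # False
print(deployment_is_unhealthy(unavailable))  # True
```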

Configure ignored resources

Resources can be excluded from the cluster health evaluation by marking them as ignored. Ignored resources are treated as non-critical, meaning their issues won't prevent the cluster from reaching Ready status. This is useful for optional components or maintenance tasks that shouldn't affect overall cluster health.

Tunables

Three tunables control which resources the Operator ignores during health checks.

Name	Type	Default	Description
`health_check_default_ignored_resources`	list	`["Job/load-sample-project.*"]`	The default list of resource patterns to ignore during health checks. The Operator considers resources matching these patterns non-critical by default. Patterns are regular expressions.
`health_check_ignored_resources`	list	`[]`	Additional resource patterns to ignore. By default, these patterns are merged with the default ignored resources.
`health_check_override_default_ignored_resources`	bool	`false`	Controls whether user-specified patterns replace or extend the defaults. When set to `false` (the default), user patterns are combined with defaults. When set to `true`, user patterns replace defaults entirely.
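The interaction between the three tunables can be sketched as follows. This Python example is a model of the documented behavior, not the Operator's code: the function name effective_ignored_patterns is hypothetical, and the order in which defaults and user patterns are combined is an assumption.

```python
from typing import List, Sequence

# Documented default value of health_check_default_ignored_resources.
DEFAULT_IGNORED = ["Job/load-sample-project.*"]


def effective_ignored_patterns(
    user_patterns: Sequence[str],
    override_defaults: bool = False,
    defaults: Sequence[str] = tuple(DEFAULT_IGNORED),
) -> List[str]:
    """Resolve the list of patterns ignored during health checks.

    user_patterns     models health_check_ignored_resources
    override_defaults models health_check_override_default_ignored_resources
    defaults          models health_check_default_ignored_resources
    """
    if override_defaults:
        # User patterns replace the defaults entirely.
        return list(user_patterns)
    # User patterns extend the defaults (merge order is an assumption).
    return list(defaults) + list(user_patterns)


print(effective_ignored_patterns(["Deployment/query-.*"]))
print(effective_ignored_patterns(["Deployment/intake-.*"], override_defaults=True))
print(effective_ignored_patterns([], override_defaults=True))
```

The three calls correspond to extending the defaults, replacing them, and removing every ignored pattern.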

Examples

Add custom non-critical resources

To add custom patterns while keeping the defaults, specify health_check_ignored_resources:

Add Custom Non-Critical Resources
apiVersion: hydrolix.io/v1
kind: HydrolixCluster
spec:
  health_check_ignored_resources:
    - "Deployment/query-.*"

With this configuration, issues with both the load-sample-project Job and the query-head Deployment are categorized as non-critical. The issues output reflects this:

Example Issues With Custom Non-Critical Resources
{
  "critical": [],
  "nonCritical": [
    "Job/load-sample-project-2049: Job has 0 available replicas",
    "Deployment/query-head: Deployment does not have minimum availability"
  ]
}
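To see why both entries land in nonCritical, the classification can be modeled as a regular expression match on the resource identifier that precedes the colon in each issue string. This Python sketch is illustrative only: the function name classify_issues is hypothetical, and matching the whole Kind/name identifier with re.fullmatch is an assumption, since this page does not state how the Operator anchors patterns.

```python
import re
from typing import Dict, List


def classify_issues(issues: List[str], ignored_patterns: List[str]) -> Dict[str, List[str]]:
    """Split issue strings into critical and nonCritical lists.

    Each issue is expected to look like "Kind/name: message". An issue is
    non-critical when its "Kind/name" part fully matches an ignored pattern
    (full-match anchoring is an assumption of this sketch).
    """
    result: Dict[str, List[str]] = {"critical": [], "nonCritical": []}
    for issue in issues:
        resource = issue.split(":", 1)[0]
        ignored = any(re.fullmatch(pattern, resource) for pattern in ignored_patterns)
        result["nonCritical" if ignored else "critical"].append(issue)
    return result


patterns = ["Job/load-sample-project.*", "Deployment/query-.*"]
issues = [
    "Job/load-sample-project-2049: Job has 0 available replicas",
    "Deployment/query-head: Deployment does not have minimum availability",
]
print(classify_issues(issues, patterns))
```

An issue for a resource that matches no pattern, such as Deployment/turbine-api, would be placed in the critical list by the same function.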

Override defaults completely

To ignore only specific resources and discard the defaults, set health_check_override_default_ignored_resources to true and list ignored resources with health_check_ignored_resources:

Override Default Ignored Resources
apiVersion: hydrolix.io/v1
kind: HydrolixCluster
spec:
  health_check_override_default_ignored_resources: true
  health_check_ignored_resources:
    - "Deployment/intake-.*"

Only the specified patterns are ignored. Resources matching the default Job/load-sample-project.* pattern are no longer ignored, so their issues are treated as critical.

Remove all ignored resources

To treat all unhealthy resources as critical, override the defaults with an empty list:

Remove All Ignored Resources
apiVersion: hydrolix.io/v1
kind: HydrolixCluster
spec:
  health_check_override_default_ignored_resources: true
  health_check_ignored_resources: []

All unhealthy resources are critical and prevent Ready status.