Vacuum

Hydrolix provides automated cleanup services to delete unneeded and redundant data from your cluster, freeing up storage space. These services run on scheduled intervals to maintain optimal cluster performance.

  • Partition Cleaner: removes partitions that are no longer referenced in the Hydrolix catalog.
  • Periodic Service: consolidates multiple cleanup operations including log deletion, reject file cleanup, and Keycloak backup management.
  • Decay and Reaper: the decay service selects old data for the reaper service to delete, based on configured retention policies. See data lifecycle for details on these services.

The partition cleaner deletes unused partitions and runs weekly by default. To enable the partition cleaner, set the partition-cleaner replica to 1.

The default partition cleaning schedule changed from daily to weekly in v5.6.

The periodic service manages operational data cleanup and runs three vacuums as daily scheduled jobs: the log vacuum, the rejects vacuum, and the Keycloak vacuum.

The periodic service operates only on the cluster's default storage. It doesn't support multi-bucket configurations, because logs, rejects, and Keycloak backups are all saved in the cluster's default storage.

To enable the periodic service, set the periodic-service replica to 1.

Partition cleaner

The partition cleaner deletes partitions no longer referenced by the Hydrolix catalog. Each partition corresponds to a particular period of timestamped data. When Hydrolix optimizes storage usage, it sometimes merges data from multiple partitions or time periods into a single partition. During optimization, Hydrolix doesn't automatically delete unused partitions or their data. Instead, the catalog removes references or pointers to the unused partitions.

The partition cleaner deletes partitions that have been rendered redundant for at least 24 hours. By default, partition cleaner runs weekly on Monday at 12:00 AM UTC.

Configuration

The partition cleaner runs as a scheduled job and can be configured using environment variables on the partition-cleaner deployment. To enable the partition cleaner, set the partition-cleaner replica to 1.

Schedule configuration:

  • Environment variable: PARTITION_CLEANER_SCHEDULE
  • Format: UNIX cron notation
  • Default: weekly on Monday at 12:00 AM UTC
  • Example: 0 0 * * 1 (default)

Dry run:

  • Environment variable: PARTITION_CLEANER_DRY_RUN
  • Values: true (default) or false

Grace period:

  • Environment variable: PARTITION_CLEANER_GRACE_PERIOD
  • Format: Duration string (24h)
  • Default: 24 hours
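
As a rough illustration of the settings above, the following shell sketch prints the kubectl commands that would enable the partition cleaner and set its three environment variables. It assumes the deployment is literally named partition-cleaner, that the namespace is supplied by the caller, and that changes applied directly with kubectl are not overwritten by whatever manages your cluster's deployments. None of these assumptions is confirmed by this page, and the function name is hypothetical.

```shell
# Sketch only: prints commands for review instead of running them.
# Assumes a deployment named "partition-cleaner"; adjust to your cluster.
partition_cleaner_commands() {
  ns=$1 schedule=$2 grace=$3 dry_run=$4
  printf 'kubectl scale deployment/partition-cleaner -n %s --replicas=1\n' "$ns"
  printf "kubectl set env deployment/partition-cleaner -n %s PARTITION_CLEANER_SCHEDULE='%s' PARTITION_CLEANER_GRACE_PERIOD=%s PARTITION_CLEANER_DRY_RUN=%s\n" \
    "$ns" "$schedule" "$grace" "$dry_run"
}

# Documented defaults: weekly on Monday at 12:00 AM UTC, 24h grace period, dry run on.
# "my-namespace" is a placeholder.
partition_cleaner_commands "my-namespace" "0 0 * * 1" "24h" "true"
```

Printing the commands first leaves room to check the schedule and the dry-run setting before anything changes in the cluster.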

Metrics

The partition-cleaner exports metrics to monitor partition cleaner behavior:

| Metric Name | Purpose | Example |
| --- | --- | --- |
| bulk_delete_bytes | How much data was deleted | bulk_delete_bytes{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 1.520361e+06 |
| bulk_delete_duration | How long deletes take (percentiles) | bulk_delete_duration{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary",quantile="0.99"} 466 |
| bulk_delete_duration_sum | Total time spent on deletes | bulk_delete_duration_sum{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 603 |
| bulk_delete_duration_count | Number of delete operations measured | bulk_delete_duration_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 24 |
| bulk_delete_success_count | Number of deletes that worked | bulk_delete_success_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 33 |
| bulk_delete_failure_count | Number of deletes that failed | bulk_delete_failure_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 0 |
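
These counters can be combined into two simple health figures: the mean delete duration (sum divided by count) and the share of deletes that failed. The sketch below does this with awk over the example values listed above. The function name is hypothetical, the input is assumed to hold a single label set per metric, and the duration is reported in whatever unit the metric itself uses, which this page doesn't state.

```shell
# Sample input copied from the example values above.
metrics='bulk_delete_duration_sum{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 603
bulk_delete_duration_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 24
bulk_delete_success_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 33
bulk_delete_failure_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 0'

# Reads metrics text on stdin; the value is the last field of each line.
# Assumes one label set per metric (a later line would overwrite an earlier one).
summarize_deletes() {
  awk '
    /^bulk_delete_duration_sum[{ ]/   { sum = $NF }
    /^bulk_delete_duration_count[{ ]/ { count = $NF }
    /^bulk_delete_success_count[{ ]/  { ok = $NF }
    /^bulk_delete_failure_count[{ ]/  { failed = $NF }
    END {
      if (count > 0)       printf "mean_duration=%.3f\n", sum / count
      if (ok + failed > 0) printf "failure_ratio=%.3f\n", failed / (ok + failed)
    }'
}

printf '%s\n' "$metrics" | summarize_deletes
```

With the sample values, the mean is 603 / 24 = 25.125 and the failure ratio is 0.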

Differences between decay and reaper and partition cleaner

Both decay/reaper and partition cleaner delete partitions from object storage, but they serve different purposes. The decay service is part of the Hydrolix data lifecycle management system and selects old data for reaper to delete based on configured retention policies. The partition cleaner performs maintenance by removing orphaned partitions that exist in object storage but aren't referenced in the catalog due to system errors like catalog write failures or failed merge operations.

  • Decay and reaper manage data retention. These services also delete data from the catalog.
  • Partition cleaner maintains storage integrity. This service deletes data solely from object storage.

Log vacuum

Hydrolix logs data regarding various components, user interactions, and events that occur within your cluster. Different clusters store different kinds of logs for different lengths of time, based on the needs of your application.

The log vacuum is part of the periodic service, which consolidates several maintenance operations into a single service. The log vacuum deletes log files from object storage that are older than the configured maximum age. By default, Hydrolix deletes all logs after 7 days.

Configuration

The log vacuum runs as a scheduled job within the periodic service and can be configured using environment variables on the periodic-service deployment. To enable the periodic service, set the periodic-service replica to 1.

Schedule configuration:

  • Environment variable: LOG_CLEAN_SCHED
  • Format: Tokio-cron notation (see tokio-cron syntax)
  • Default: Daily at 00:00 UTC
  • Example: 0 0 0 * * * (every day at midnight)
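
The periodic service schedules use six fields, while the partition cleaner's UNIX cron schedule uses five. Judging by the examples on this page (such as 0 0 2 * * * for 2 AM daily), the extra leading field is seconds. The sketch below labels each field of an expression so the two notations are harder to mix up. The function name is hypothetical, and an optional seventh field, if your scheduler accepts one, isn't handled.

```shell
# Label the fields of a 5-field (UNIX cron) or 6-field (seconds-first) expression.
describe_cron() {
  set -f        # keep "*" literal while splitting on whitespace
  set -- $1     # split the expression into positional parameters
  set +f
  case $# in
    5) names="minute hour day-of-month month day-of-week" ;;
    6) names="second minute hour day-of-month month day-of-week" ;;
    *) echo "unsupported field count: $#" >&2; return 1 ;;
  esac
  for name in $names; do
    printf '%s=%s\n' "$name" "$1"
    shift
  done
}

describe_cron "0 0 1 * * *"   # six fields: the third field is the hour
describe_cron "0 0 * * 1"     # five fields: the last field is the day of week
```

An expression with too few hour-level fields is a common mistake: in the six-field form, a `*` in the third position means every hour, not once a day.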

Enable/Disable:

  • Environment variable: LOG_CLEAN_ENABLED
  • Values: true (default) or false

Max age configuration:

  • Only available as a command-line override: --log_vacuum_max_age <DURATION>
  • Format: Duration string (168h, 7d)
  • Default: 7 days (168 hours)
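
To make the max-age rule concrete, here is a minimal sketch of how a duration string such as 168h or 7d translates into an age check. It is an illustration, not the vacuum's actual parser: the function names are hypothetical, and only single-unit strings with an s, m, h, or d suffix are handled.

```shell
# Convert a duration like "168h" or "7d" to seconds.
duration_to_seconds() {
  value=${1%[smhd]}     # strip one trailing unit letter
  unit=${1#"$value"}    # whatever was stripped
  case $value in
    ''|*[!0-9]*) echo "bad duration: $1" >&2; return 1 ;;
  esac
  case $unit in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    *) echo "bad duration: $1" >&2; return 1 ;;
  esac
}

# Succeeds when a file last modified at $1 (epoch seconds) is older than
# max age $2, evaluated at time $3 (epoch seconds).
is_older_than() {
  mtime=$1 max_age=$2 now=$3
  age_limit=$(duration_to_seconds "$max_age") || return 2
  [ $((now - mtime)) -gt "$age_limit" ]
}

duration_to_seconds 7d     # prints 604800
duration_to_seconds 168h   # prints 604800
```

Both spellings of the default resolve to the same 604800 seconds, so either form selects the same set of files.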

Manual execution

Manually trigger the log vacuum by executing commands within the periodic-service pod. By default, manual execution runs in dry-run mode and shows what would be deleted without actually deleting anything.

Manual Access to Log Vacuum
# Shell into the periodic-service pod
kubectl exec -it <periodic-service-pod> -n <namespace> -- /bin/bash

cd bin

# Usage
./periodic log-vacuum -h

# Dry-run
./periodic log-vacuum --log_vacuum_max_age "7d"

# Manually perform deletion
./periodic log-vacuum --perform-delete --log_vacuum_max_age "7d"

Available options:

  • --perform-delete - Execute deletions (without this flag, runs in dry-run mode)
  • --log_vacuum_max_age <DURATION> - Override the default max age (168h or 7d)
  • --rpc-port <PORT> - RPC port (default: 9000)
  • --rpc-host <HOST> - RPC host (default: http://localhost)
  • -h, --help - Display help information
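
Because the dry run and the real deletion differ only by the --perform-delete flag, the two steps can be wrapped in one helper that always shows the dry-run output first and asks before deleting. This is a sketch under assumptions: the helper name and the PERIODIC_BIN variable are hypothetical, and it expects to run inside the periodic-service pod from the bin directory, as in the commands above.

```shell
# Path to the binary; override PERIODIC_BIN if it lives elsewhere.
PERIODIC_BIN="${PERIODIC_BIN:-./periodic}"

# Usage: vacuum_with_review <subcommand> [options...]
vacuum_with_review() {
  subcommand=$1
  shift
  # First pass: dry run (the default when --perform-delete is absent).
  "$PERIODIC_BIN" "$subcommand" "$@" || return 1
  printf 'Delete the files listed above? [y/N] ' >&2
  read -r answer || answer=n
  case $answer in
    y|Y) "$PERIODIC_BIN" "$subcommand" --perform-delete "$@" ;;
    *)   echo "Skipped deletion." ;;
  esac
}

# Example (run inside the pod):
#   vacuum_with_review log-vacuum --log_vacuum_max_age "7d"
```

Any answer other than y or Y leaves the files in place, which keeps the helper as conservative as the default dry-run behavior.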

Metrics

The periodic service exports metrics to monitor log vacuum behavior.

| Metric Name | Purpose |
| --- | --- |
| log_deletes_total | Total number of log files deleted |
| log_visited_total | Total number of log files examined |
| log_failures_total | Number of failed deletion attempts |
| log_delete_size_total | Total size in bytes of deleted log files |

Rejects vacuum

Hydrolix can selectively ignore ingested data that meets custom criteria. Hydrolix calls these ignored rows rejects. When Hydrolix ignores an ingested row, it records that reject as a JSON object in a reject file.

The reject file includes all the ingested data, the originating project and table, and the reason for the rejection. Over time, clusters that reject large quantities of data can accumulate large volumes of reject files.

The rejects vacuum is part of the periodic service, which consolidates several maintenance operations into a single service. The rejects vacuum cleans up old, no-longer-needed reject files from object storage to prevent them from consuming an ever-growing amount of space.

Configuration

The rejects vacuum runs as a scheduled job within the periodic service and can be configured using environment variables on the periodic-service deployment. To enable the periodic service, set the periodic-service replica to 1.

Schedule configuration:

  • Environment variable: REJECTS_CLEAN_SCHED
  • Format: Tokio-cron notation (see tokio-cron syntax)
  • Default: Daily at 02:00 UTC
  • Example: 0 0 2 * * * (every day at 2 AM)

Enable/Disable:

  • Environment variable: REJECTS_CLEAN_ENABLED
  • Values: true (default) or false

Max age configuration:

  • Only available as a command-line override: --max-age-duration <DURATION>
  • Format: Duration string (168h, 7d)
  • Default: 7 days (168 hours)

Manual execution

You can manually trigger the rejects vacuum by executing commands within the periodic-service pod. By default, manual execution runs in dry-run mode and shows what would be deleted without actually deleting anything.

Manual Access to Rejects Vacuum
# Shell into the periodic-service pod
kubectl exec -it <periodic-service-pod> -n <namespace> -- /bin/bash

cd bin

# Usage
./periodic rejects-vacuum -h

# Dry-run
./periodic rejects-vacuum --max-age-duration "7d"

# Manually perform deletion
./periodic rejects-vacuum --perform-delete --max-age-duration "7d"

Available options:

  • --perform-delete - Execute deletions (without this flag, runs in dry-run mode)
  • --max-age-duration <DURATION> - Override the default max age (168h or 7d)
  • --rpc-port <PORT> - RPC port (default: 9000)
  • --rpc-host <HOST> - RPC host (default: http://localhost)
  • -h, --help - Display help information

Metrics

The periodic service exports metrics to monitor rejects vacuum behavior.

| Metric Name | Purpose |
| --- | --- |
| reject_deletes_total | Total number of reject files deleted |
| reject_visited_total | Total number of reject files examined |
| reject_failures_total | Number of failed deletion attempts |
| reject_delete_size_total | Total size in bytes of deleted reject files |

Keycloak vacuum

Hydrolix uses Keycloak for authentication and authorization services. The system automatically creates backup files of Keycloak data to ensure configuration and user data can be recovered if needed. Over time, these backup files can accumulate and consume storage space.

The Keycloak vacuum is part of the periodic service, which consolidates several maintenance operations into a single service. The Keycloak vacuum automatically deletes Keycloak backup files that are older than the configured retention period.

Configuration

The Keycloak vacuum runs as a scheduled job within the periodic service and can be configured using environment variables on the periodic-service deployment. To enable the periodic service, set the periodic-service replica to 1.

Schedule configuration:

  • Environment variable: KEYCLOAK_CLEAN_SCHED
  • Format: Tokio-cron notation (see tokio-cron syntax)
  • Default: Daily at 01:00 UTC
  • Example: 0 0 1 * * * (every day at 1 AM)

Enable/Disable:

  • Environment variable: KEYCLOAK_CLEAN_ENABLED
  • Values: true (default) or false

Max age configuration:

  • Only available as a command-line override: --max-age-duration <DURATION>
  • Format: Duration string (168h, 7d)
  • Default: 7 days (168 hours)

Manual execution

You can manually trigger the Keycloak vacuum by executing commands within the periodic-service pod. By default, manual execution runs in dry-run mode and shows what would be deleted without actually deleting anything.

Manual Access to Keycloak Vacuum
# Shell into the periodic-service pod
kubectl exec -it <periodic-service-pod> -n <namespace> -- /bin/bash

cd bin

# Usage
./periodic key-cloak-vacuum -h

# Dry-run
./periodic key-cloak-vacuum --max-age-duration "7d"

# Manually perform deletion
./periodic key-cloak-vacuum --perform-delete --max-age-duration "7d"

Available options:

  • --perform-delete - Execute deletions (without this flag, runs in dry-run mode)
  • --max-age-duration <DURATION> - Override the default max age (168h or 7d)
  • --rpc-port <PORT> - RPC port (default: 9000)
  • --rpc-host <HOST> - RPC host (default: http://localhost)
  • -h, --help - Display help information

Metrics

The periodic service exports metrics to monitor Keycloak vacuum behavior.

| Metric Name | Purpose |
| --- | --- |
| keycloak_deletes_total | Total number of Keycloak backup files deleted |
| keycloak_visited_total | Total number of Keycloak backup files examined |
| keycloak_failures_total | Number of failed deletion attempts |
| keycloak_delete_size_total | Total size in bytes of deleted Keycloak backup files |
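
Since the log, rejects, and Keycloak vacuums all expose a failure counter with the same naming pattern, one check can cover all three. The sketch below scans metrics text for any nonzero *_failures_total counter. The sample values are invented for illustration, the function name is hypothetical, and how you obtain the metrics text from your cluster is left open.

```shell
# Hypothetical sample values; real output may also carry labels in {...}.
sample_metrics='log_deletes_total 120
log_failures_total 0
reject_deletes_total 45
reject_failures_total 2
keycloak_deletes_total 7
keycloak_failures_total 0'

# Reads metrics text on stdin, prints one line per vacuum with failures,
# and exits nonzero if any were found.
report_vacuum_failures() {
  awk '
    /^(log|reject|keycloak)_failures_total[{ ]/ && $NF > 0 {
      name = $1
      sub(/_failures_total.*/, "", name)
      printf "%s vacuum: %d failed deletions\n", name, $NF
      bad = 1
    }
    END { exit bad }'
}

printf '%s\n' "$sample_metrics" | report_vacuum_failures \
  || echo "at least one vacuum reported failures"
```

The nonzero exit status makes the function easy to drop into a scheduled check or alerting script.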