Vacuum
Hydrolix provides automated cleanup services to delete unneeded and redundant data from your cluster, freeing up storage space. These services run on scheduled intervals to maintain optimal cluster performance.
- Partition Cleaner: removes partitions that are no longer referenced in the Hydrolix catalog.
- Periodic Service: consolidates multiple cleanup operations including log deletion, reject file cleanup, and Keycloak backup management.
- Decay and Reaper: decay selects old data for reaper to delete based on configured retention policies. Review data lifecycle for information on these services.
The partition cleaner deletes unused partitions and runs weekly by default. To enable the partition cleaner, set the partition-cleaner replica to 1.
The default partition cleaning schedule changed from daily to weekly in v5.6.
The periodic service manages operational data cleanup tasks and runs daily scheduled jobs for three vacuums: the log vacuum, the rejects vacuum, and the Keycloak vacuum.
The periodic service operates only on the cluster's default storage. It doesn't support multi-bucket configurations because logs, rejects, and Keycloak backups are all saved in the cluster's default storage.
To enable the periodic service, set the periodic-service replica to 1.
Partition cleaner
The partition cleaner deletes partitions no longer referenced by the Hydrolix catalog. Each partition corresponds to a particular period of timestamped data. When Hydrolix optimizes storage usage, it sometimes merges data from multiple partitions or time periods into a single partition. During optimization, Hydrolix doesn't automatically delete unused partitions or their data. Instead, the catalog removes references or pointers to the unused partitions.
The partition cleaner deletes partitions that have been rendered redundant for at least 24 hours. By default, partition cleaner runs weekly on Monday at 12:00 AM UTC.
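The selection rule described above can be sketched as a filter over object storage. This is only an illustration, not the cleaner's actual implementation: the `StoredPartition` type, its `unreferenced_since` field, and the example paths are hypothetical names introduced here, and how the real service determines when a partition became unreferenced is not stated in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredPartition:
    path: str                     # hypothetical object-storage key
    unreferenced_since: datetime  # hypothetical: when the catalog dropped its reference


def eligible_for_cleanup(stored, catalog_paths, now, grace=timedelta(hours=24)):
    """Return paths the catalog no longer references and that have been
    unreferenced for at least the grace period."""
    return [
        p.path
        for p in stored
        if p.path not in catalog_paths and now - p.unreferenced_since >= grace
    ]


now = datetime(2024, 1, 8, 0, 0, tzinfo=timezone.utc)
stored = [
    StoredPartition("db/t/p1", now - timedelta(hours=30)),      # orphaned, past grace
    StoredPartition("db/t/p2", now - timedelta(hours=2)),       # orphaned, inside grace
    StoredPartition("db/t/merged", now - timedelta(hours=30)),  # still referenced
]
print(eligible_for_cleanup(stored, {"db/t/merged"}, now))
```

Only `db/t/p1` qualifies: it is absent from the catalog and has been unreferenced for longer than 24 hours.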
Configuration
The partition cleaner runs as a scheduled job and can be configured using environment variables on the partition-cleaner deployment. To enable the partition cleaner, set the partition-cleaner replica to 1.
Schedule configuration:
- Environment variable: `PARTITION_CLEANER_SCHEDULE`
- Format: UNIX cron notation
- Default: weekly on Monday at 12:00 AM UTC
- Example: `0 0 * * 1` (default)

Dry run:
- Environment variable: `PARTITION_CLEANER_DRY_RUN`
- Values: `true` (default) or `false`

Grace period:
- Environment variable: `PARTITION_CLEANER_GRACE_PERIOD`
- Format: Duration string (`24h`)
- Default: 24 hours
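To make the duration format concrete, here is a minimal sketch of how a duration string such as `24h` could be interpreted. It assumes only the hour (`h`) and day (`d`) suffixes that appear on this page; the parser Hydrolix actually uses may accept other units, and `parse_duration` is a hypothetical helper, not part of any Hydrolix tool.

```python
import re
from datetime import timedelta

# Only the suffixes shown in this document; other units are not assumed.
_UNITS = {"h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Convert a string like '24h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


print(parse_duration("24h") == timedelta(days=1))
```

Under this reading, `168h` and `7d` describe the same span of time.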
Metrics
The partition-cleaner exports metrics to monitor partition cleaner behavior:
| Metric Name | Purpose | Example |
|---|---|---|
| `bulk_delete_bytes` | How much data was deleted | `bulk_delete_bytes{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 1.520361e+06` |
| `bulk_delete_duration` | How long deletes take (percentiles) | `bulk_delete_duration{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary",quantile="0.99"} 466` |
| `bulk_delete_duration_sum` | Total time spent on deletes | `bulk_delete_duration_sum{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 603` |
| `bulk_delete_duration_count` | Number of delete operations measured | `bulk_delete_duration_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 24` |
| `bulk_delete_success_count` | Number of deletes that worked | `bulk_delete_success_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 33` |
| `bulk_delete_failure_count` | Number of deletes that failed | `bulk_delete_failure_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 0` |
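As an illustration of how these samples can be combined, the sketch below parses two of the example lines from the table and derives the mean delete duration from the `_sum` and `_count` series. The `parse_sample` helper is hypothetical and handles only simple lines like those shown here. The result is in whatever time unit the metric reports, which this page does not state.

```python
import re

_SAMPLE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$"
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one exposition-format sample into (name, labels, value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


labels = 'app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"'
_, _, total = parse_sample(f"bulk_delete_duration_sum{{{labels}}} 603")
_, _, count = parse_sample(f"bulk_delete_duration_count{{{labels}}} 24")

# Mean duration per measured delete operation; guard against an empty count.
mean_duration = total / count if count else 0.0
print(mean_duration)  # 25.125
```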
Differences between decay and reaper and partition cleaner
Both decay/reaper and partition cleaner delete partitions from object storage, but they serve different purposes. The decay service is part of the Hydrolix data lifecycle management system and selects old data for reaper to delete based on configured retention policies. The partition cleaner performs maintenance by removing orphaned partitions that exist in object storage but aren't referenced in the catalog due to system errors like catalog write failures or failed merge operations.
- Decay and reaper manage data retention. These services also delete data from the catalog.
- Partition cleaner maintains storage integrity. This service deletes data solely from object storage.
Log vacuum
Hydrolix logs data regarding various components, user interactions, and events that occur within your cluster. Different clusters store different kinds of logs for different lengths of time, based on the needs of your application.
The log vacuum is part of the periodic service, which consolidates several maintenance operations into a single service. The log vacuum deletes log files from object storage that are older than the configured maximum age. By default, Hydrolix deletes all logs after 7 days.
Configuration
The log vacuum runs as a scheduled job within the periodic service and can be configured using environment variables on the periodic-service deployment. To enable the periodic service, set the periodic-service replica to 1.
Schedule configuration:
- Environment variable: `LOG_CLEAN_SCHED`
- Format: Tokio-cron notation (see tokio-cron syntax)
- Default: Daily at 00:00 UTC
- Example: `0 0 0 * * *` (every day at midnight)

Enable/Disable:
- Environment variable: `LOG_CLEAN_ENABLED`
- Values: `true` (default) or `false`

Max age configuration:
- Only available as a command-line override: `--log_vacuum_max_age <DURATION>`
- Format: Duration string (`168h`, `7d`)
- Default: 7 days (168 hours)
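The periodic service schedules use six fields, while the partition cleaner uses five-field UNIX cron. The sketch below shows one way to translate a simple UNIX expression, assuming the six-field layout is seconds, minutes, hours, day of month, month, day of week, as the 1 AM and 2 AM examples on this page suggest. `unix_to_six_field` is a hypothetical helper. Day-of-week numbering can differ between cron implementations, so verify any expression that sets that field.

```python
def unix_to_six_field(expr: str) -> str:
    """Prefix a five-field UNIX cron expression with a seconds field of 0."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expr!r}")
    return " ".join(["0", *fields])


# Daily at midnight in UNIX cron becomes second 0, minute 0, hour 0.
print(unix_to_six_field("0 0 * * *"))  # 0 0 0 * * *
```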
Manual execution
Manually trigger the log vacuum by executing commands within the periodic-service pod. By default, manual execution runs in dry-run mode and shows what would be deleted without actually deleting anything.
Available options:
- `--perform-delete` - Execute deletions (without this flag, runs in dry-run mode)
- `--log_vacuum_max_age <DURATION>` - Override the default max age (`168h` or `7d`)
- `--rpc-port <PORT>` - RPC port (default: 9000)
- `--rpc-host <HOST>` - RPC host (default: `http://localhost`)
- `-h, --help` - Display help information
Metrics
The periodic service exports metrics to monitor log vacuum behavior.
| Metric Name | Purpose |
|---|---|
| `log_deletes_total` | Total number of log files deleted |
| `log_visited_total` | Total number of log files examined |
| `log_failures_total` | Number of failed deletion attempts |
| `log_delete_size_total` | Total size in bytes of deleted log files |
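One way to picture how the max age, dry-run mode, and these four counters relate is the simulation below. It is a sketch under stated assumptions, not the periodic service's code: `run_vacuum`, `Counters`, and the file tuples are hypothetical, and the choice to count dry-run files as visited but not deleted is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Counters:
    visited: int = 0      # mirrors log_visited_total
    deletes: int = 0      # mirrors log_deletes_total
    failures: int = 0     # mirrors log_failures_total
    delete_size: int = 0  # mirrors log_delete_size_total


def run_vacuum(files, now, max_age, delete, perform_delete=False):
    """files: iterable of (path, modified, size_bytes) tuples."""
    counters = Counters()
    for path, modified, size in files:
        counters.visited += 1
        if now - modified <= max_age:
            continue  # not older than the maximum age, keep it
        if not perform_delete:
            print(f"dry run: would delete {path}")
            continue
        try:
            delete(path)
        except OSError:
            counters.failures += 1
        else:
            counters.deletes += 1
            counters.delete_size += size
    return counters


now = datetime(2024, 1, 8, tzinfo=timezone.utc)
files = [
    ("logs/old.log", now - timedelta(days=8), 100),
    ("logs/new.log", now - timedelta(days=1), 50),
]
removed = []
print(run_vacuum(files, now, timedelta(days=7), removed.append, perform_delete=True))
```

With a 7-day max age, both files are visited but only `logs/old.log` is deleted.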
Rejects vacuum
Hydrolix can selectively ignore ingested data that meets custom criteria. Hydrolix calls these ignored rows rejects. When Hydrolix ignores an ingested row, it records that reject as a JSON object in a reject file.
The reject file includes all the ingested data, the originating project and table, and the reason for the rejection. Over time, clusters that reject large quantities of data can accumulate large volumes of reject files.
The rejects vacuum is part of the periodic service, which consolidates several maintenance operations into a single service. The rejects vacuum cleans up old, no-longer-needed reject files from object storage to prevent them from consuming an ever-growing amount of space.
Configuration
The rejects vacuum runs as a scheduled job within the periodic service and can be configured using environment variables on the periodic-service deployment. To enable the periodic service, set the periodic-service replica to 1.
Schedule configuration:
- Environment variable: `REJECTS_CLEAN_SCHED`
- Format: Tokio-cron notation (see tokio-cron syntax)
- Default: Daily at 02:00 UTC
- Example: `0 0 2 * * *` (every day at 2 AM)

Enable/Disable:
- Environment variable: `REJECTS_CLEAN_ENABLED`
- Values: `true` (default) or `false`

Max age configuration:
- Only available as a command-line override: `--max-age-duration <DURATION>`
- Format: Duration string (`168h`, `7d`)
- Default: 7 days (168 hours)
Manual execution
You can manually trigger the rejects vacuum by executing commands within the periodic-service pod. By default, manual execution runs in dry-run mode and shows what would be deleted without actually deleting anything.
Available options:
- `--perform-delete` - Execute deletions (without this flag, runs in dry-run mode)
- `--max-age-duration <DURATION>` - Override the default max age (`168h` or `7d`)
- `--rpc-port <PORT>` - RPC port (default: 9000)
- `--rpc-host <HOST>` - RPC host (default: `http://localhost`)
- `-h, --help` - Display help information
Metrics
The periodic service exports metrics to monitor rejects vacuum behavior.
| Metric Name | Purpose |
|---|---|
| `reject_deletes_total` | Total number of reject files deleted |
| `reject_visited_total` | Total number of reject files examined |
| `reject_failures_total` | Number of failed deletion attempts |
| `reject_delete_size_total` | Total size in bytes of deleted reject files |
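When reading these totals between two scrapes, it helps to account for restarts. The sketch below assumes the `_total` metrics behave as monotonic counters that restart from zero when the service restarts, which is the usual Prometheus counter convention but is not confirmed on this page. `counter_increase` is a hypothetical helper.

```python
def counter_increase(previous: float, current: float) -> float:
    """Growth of a counter between two scrapes, tolerating one reset to zero."""
    if current >= previous:
        return current - previous
    # The counter went backwards, so assume the process restarted from zero.
    return current


# Bytes of reject files reclaimed between two readings of reject_delete_size_total.
print(counter_increase(1_000, 1_500))  # 500
print(counter_increase(1_500, 200))    # 200
```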
Keycloak vacuum
Hydrolix uses Keycloak for authentication and authorization services. The system automatically creates backup files of Keycloak data to ensure configuration and user data can be recovered if needed. Over time, these backup files can accumulate and consume storage space.
The Keycloak vacuum is part of the periodic service, which consolidates several maintenance operations into a single service. The Keycloak vacuum automatically deletes Keycloak backup files that are older than the configured retention period.
Configuration
The Keycloak vacuum runs as a scheduled job within the periodic service and can be configured using environment variables on the periodic-service deployment. To enable the periodic service, set the periodic-service replica to 1.
Schedule configuration:
- Environment variable: `KEYCLOAK_CLEAN_SCHED`
- Format: Tokio-cron notation (see tokio-cron syntax)
- Default: Daily at 01:00 UTC
- Example: `0 0 1 * * *` (every day at 1 AM)

Enable/Disable:
- Environment variable: `KEYCLOAK_CLEAN_ENABLED`
- Values: `true` (default) or `false`

Max age configuration:
- Only available as a command-line override: `--max-age-duration <DURATION>`
- Format: Duration string (`168h`, `7d`)
- Default: 7 days (168 hours)
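To check when a daily schedule like the one above would next fire, a small calculation is enough. This sketch assumes the six fields are seconds, minutes, hours, day of month, month, and day of week, and it handles only fixed daily times with `*` in the last three fields. `next_daily_run` is a hypothetical helper, not a general cron evaluator.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(expr: str, now: datetime) -> datetime:
    """Next firing time of a six-field daily schedule such as '0 0 1 * * *'."""
    sec, minute, hour, dom, month, dow = expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("only daily schedules are handled in this sketch")
    candidate = now.replace(
        hour=int(hour), minute=int(minute), second=int(sec), microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed
    return candidate


now = datetime(2024, 1, 8, 1, 30, tzinfo=timezone.utc)
print(next_daily_run("0 0 1 * * *", now))  # 2024-01-09 01:00:00+00:00
```

At 01:30 UTC the 1 AM slot has already passed, so the next run falls on the following day.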
Manual execution
You can manually trigger the Keycloak vacuum by executing commands within the periodic-service pod. By default, manual execution runs in dry-run mode and shows what would be deleted without actually deleting anything.
Available options:
- `--perform-delete` - Execute deletions (without this flag, runs in dry-run mode)
- `--max-age-duration <DURATION>` - Override the default max age (`168h` or `7d`)
- `--rpc-port <PORT>` - RPC port (default: 9000)
- `--rpc-host <HOST>` - RPC host (default: `http://localhost`)
- `-h, --help` - Display help information
Metrics
The periodic service exports metrics to monitor Keycloak vacuum behavior.
| Metric Name | Purpose |
|---|---|
| `keycloak_deletes_total` | Total number of Keycloak backup files deleted |
| `keycloak_visited_total` | Total number of Keycloak backup files examined |
| `keycloak_failures_total` | Number of failed deletion attempts |
| `keycloak_delete_size_total` | Total size in bytes of deleted Keycloak backup files |