The Hydrolix vacuums and partition cleaner delete unneeded and redundant data from your cluster to free up storage space. The partition cleaner removes partitions, and the vacuums delete logs and rejects.

The default partition cleaning schedule changed from daily to weekly in v5.6.

Partition cleaner

The partition cleaner deletes partitions no longer referenced by the Hydrolix catalog. Each partition corresponds to a particular period of timestamped data. When Hydrolix optimizes storage usage, it sometimes merges data from multiple partitions or time periods into a single partition. During optimization, Hydrolix doesn't automatically delete unused partitions or their data. Instead, the catalog removes references or pointers to the unused partitions.

The partition cleaner deletes only partitions that have been redundant for at least 24 hours. By default, the partition cleaner runs weekly on Mondays at 12:00 AM UTC, unless otherwise specified with the partition_cleaner_schedule tunable.

The partition cleaner runs as a Kubernetes CronJob resource. It scans object storage for partitions that exist there but are no longer referenced in the catalog.
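The selection rule described above (present in object storage, absent from the catalog, and redundant for at least 24 hours) can be sketched as a set difference plus an age check. This is an illustration only, not the partition cleaner's actual implementation: the function name, the partition paths, and the unreferenced_since mapping are all hypothetical, and how Hydrolix tracks the time a partition became redundant is not stated on this page.

```python
from datetime import datetime, timedelta, timezone

# Grace period taken from the text above: partitions must have been
# redundant for at least 24 hours before they are deleted.
GRACE_PERIOD = timedelta(hours=24)


def find_deletable_partitions(storage_paths, catalog_paths, unreferenced_since, now):
    """Return partition paths that are safe to delete under the stated rule.

    storage_paths:      paths found in object storage (hypothetical input)
    catalog_paths:      paths still referenced by the catalog (hypothetical input)
    unreferenced_since: path -> time the catalog dropped its reference
                        (hypothetical; the real bookkeeping is an assumption)
    """
    # Partitions that exist in storage but not in the catalog.
    orphans = set(storage_paths) - set(catalog_paths)
    # Keep only orphans that have aged past the grace period.
    return sorted(
        path
        for path in orphans
        if path in unreferenced_since
        and now - unreferenced_since[path] >= GRACE_PERIOD
    )


now = datetime(2024, 1, 8, 0, 0, tzinfo=timezone.utc)
storage = {"p/a", "p/b", "p/c"}
catalog = {"p/a"}
since = {
    "p/b": now - timedelta(hours=30),  # past the grace period
    "p/c": now - timedelta(hours=2),   # too recent, kept for now
}
deletable = find_deletable_partitions(storage, catalog, since, now)
print(deletable)
```

In this sketch, a partition that was merged away two hours ago stays in storage until a later run, which matches the 24-hour safety margin described above.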

To configure the partition cleaner in the hydrolixcluster.yaml, search the Configuration Options Reference page for settings that begin with partition_cleaner.

Log vacuum

Hydrolix logs data about the components, user interactions, and events within your cluster. Different clusters store different kinds of logs for different lengths of time, based on the needs of your application. By default, Hydrolix deletes all logs after seven days. You can change this retention period with the log_vacuum_max_age tunable.

The log vacuum deletes logs older than your cluster's configured maximum log age (in days). By default, the log vacuum runs nightly, unless otherwise specified with the log_vacuum_schedule tunable.
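The age-based rule above amounts to computing a cutoff time and selecting everything older than it. The following is a minimal sketch under assumptions: the function name, the log file names, and the timestamp mapping are hypothetical, and only the seven-day default comes from this page.

```python
from datetime import datetime, timedelta, timezone


def expired_logs(log_timestamps, max_age_days, now):
    """Return names of logs older than max_age_days.

    log_timestamps: name -> creation time (hypothetical input shape)
    max_age_days:   retention in days, mirroring log_vacuum_max_age
    """
    cutoff = now - timedelta(days=max_age_days)
    # A log is expired when its timestamp falls before the cutoff.
    return sorted(name for name, ts in log_timestamps.items() if ts < cutoff)


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
logs = {
    "app-2024-01-01.log": datetime(2024, 1, 1, tzinfo=timezone.utc),  # 9 days old
    "app-2024-01-05.log": datetime(2024, 1, 5, tzinfo=timezone.utc),  # 5 days old
}
expired = expired_logs(logs, max_age_days=7, now=now)
print(expired)
```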

Rejects vacuum

Hydrolix can selectively ignore ingested data that meets custom criteria. Hydrolix calls these ignored rows rejects. When Hydrolix ignores an ingested row, it records that reject as a JSON object in a reject file.

The reject file includes all the ingested data, the originating project and table, and the reason for the rejection. Over time, clusters that reject large quantities of data can accumulate large volumes of reject files. To keep reject files from consuming an ever-growing amount of space in your cluster, we introduced the rejects vacuum, which cleans up old, no-longer-needed reject files.
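Because each reject is recorded as a JSON object, reject files can be summarized before the vacuum removes them, for example to see which rejection reasons dominate. The sketch below is hedged heavily: the field names (project, table, reason, data), the sample values, and the one-object-per-line layout are assumptions made for illustration, not the documented reject file schema.

```python
import json
from collections import Counter

# Hypothetical reject records. The field names and the one-JSON-object-per-line
# layout are assumptions, not the documented Hydrolix format.
reject_lines = [
    '{"project": "web", "table": "requests", "reason": "missing timestamp", "data": {"path": "/a"}}',
    '{"project": "web", "table": "requests", "reason": "missing timestamp", "data": {"path": "/b"}}',
    '{"project": "web", "table": "errors", "reason": "bad datatype", "data": {"code": "x"}}',
]


def count_reasons(lines):
    """Tally rejects by (project, table, reason)."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        counts[(record["project"], record["table"], record["reason"])] += 1
    return counts


reason_counts = count_reasons(reject_lines)
for key, count in reason_counts.most_common():
    print(key, count)
```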

By default, the rejects vacuum deletes rejects older than seven days. You can change this retention period with the rejects_vacuum_max_age tunable. The rejects vacuum runs nightly, unless otherwise specified with the rejects_vacuum_schedule tunable.

Vacuum metrics

The partition cleaner now exports the following metrics for monitoring vacuum behavior:

| Metric Name | Purpose | Example |
| --- | --- | --- |
| bulk_delete_bytes | How much data was deleted | bulk_delete_bytes{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 1.520361e+06 |
| bulk_delete_duration | How long deletes take (percentiles) | bulk_delete_duration{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary",quantile="0.99"} 466 |
| bulk_delete_duration_sum | Total time spent on deletes | bulk_delete_duration_sum{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 603 |
| bulk_delete_duration_count | Number of delete operations measured | bulk_delete_duration_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 24 |
| bulk_delete_success_count | Number of deletes that worked | bulk_delete_success_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 33 |
| bulk_delete_failure_count | Number of deletes that failed | bulk_delete_failure_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 0 |
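These metrics can be combined into derived values such as the mean delete duration (sum divided by count) and the failure ratio. The following sketch parses the example values shown above from their text form using only the standard library. The parser is a simplified assumption that keys samples by metric name and ignores labels, so it suits a single bucket only; it is not an official client, and the unit of the duration values is not stated on this page.

```python
import re

# Sample lines copied from the examples above (quantile series omitted).
SAMPLE = """\
bulk_delete_bytes{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 1.520361e+06
bulk_delete_duration_sum{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 603
bulk_delete_duration_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 24
bulk_delete_success_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 33
bulk_delete_failure_count{app="partition-cleaner",bucket="hdx-test-test",storage="hdx_primary"} 0
"""

# name, optional {labels}, whitespace, value
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)


def parse_metrics(text):
    """Map metric name -> float value, ignoring labels and comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            values[match.group("name")] = float(match.group("value"))
    return values


metrics = parse_metrics(SAMPLE)

# Mean duration per measured delete operation.
mean_duration = metrics["bulk_delete_duration_sum"] / metrics["bulk_delete_duration_count"]

# Share of deletes that failed, guarding against division by zero.
attempts = metrics["bulk_delete_success_count"] + metrics["bulk_delete_failure_count"]
failure_ratio = metrics["bulk_delete_failure_count"] / attempts if attempts else 0.0

print(mean_duration, failure_ratio)
```

A rising failure ratio or mean duration across runs is the kind of signal these metrics are meant to surface.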