Vacuum
Hydrolix's vacuum feature deletes unneeded and redundant data from your cluster to free up storage space. Vacuum cleans up three different kinds of unneeded data: partitions, logs, and rejects.
Partition Vacuum
The partition vacuum deletes partitions no longer referenced by the Hydrolix catalog. Each partition corresponds to a particular period of timestamped data. When Hydrolix optimizes storage usage, it sometimes merges data from multiple partitions (or time periods) into a single partition. During optimization, Hydrolix doesn't automatically delete unused partitions or the data stored within. Instead, the catalog removes references (or pointers) to the unused partitions. Vacuum cleans up those unused partitions.
The partition vacuum deletes empty partitions that have been rendered redundant for at least 24 hours. By default, partition vacuum runs nightly at 1AM cluster local time, unless otherwise specified with the partition_vacuum_schedule tunable.
To be more specific, Partition Vacuum is a Kubernetes CronJob resource that cleans up the catalog as well as object storage, looking for these three forms of corrupt or malformed entries:
- Partitions that exist in the catalog, but not on object storage.
- Partitions that exist on object storage, but not in the catalog.
- Partitions that exist in both the catalog and on object storage, but do not have all of the required files (index.hdx, manifest.hdx, and data.hdx).
The first of these forms happens normally during the course of operation. The second and third are symptoms of abnormal operation that need to be cleaned up to avoid needless cloud storage costs.
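To illustrate the three forms listed above, here is a minimal Python sketch that sorts partition IDs into those categories using set operations. It is not Hydrolix's implementation: the function name, the shape of the catalog and storage inputs, and the partition IDs are all assumptions made for illustration. Only the three required file names come from the text above.

```python
# Required files per the text above; everything else here is illustrative.
REQUIRED_FILES = frozenset({"index.hdx", "manifest.hdx", "data.hdx"})


def classify_partitions(catalog, storage):
    """Sort partition IDs into the three inconsistent forms.

    catalog: set of partition IDs referenced by the catalog (hypothetical shape).
    storage: dict mapping partition ID -> set of file names on object storage.
    """
    stored = set(storage)
    return {
        # In the catalog, but not on object storage.
        "catalog_only": sorted(catalog - stored),
        # On object storage, but not in the catalog.
        "storage_only": sorted(stored - catalog),
        # In both places, but missing at least one required file.
        "incomplete": sorted(
            pid for pid in catalog & stored
            if not REQUIRED_FILES <= storage[pid]
        ),
    }


catalog = {"p1", "p2", "p4"}
storage = {
    "p2": {"index.hdx", "manifest.hdx", "data.hdx"},
    "p3": {"index.hdx", "manifest.hdx", "data.hdx"},
    "p4": {"index.hdx", "data.hdx"},  # manifest.hdx is missing
}
report = classify_partitions(catalog, storage)
print(report)
```

In this example, p1 falls into the first form, p3 into the second, and p4 into the third, while p2 is consistent and does not appear in the report.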
There are a handful of tunable options for Partition Vacuum that can be configured in hydrolixcluster.yaml. Look in the Configuration Options Reference for settings that begin with partition_vacuum.
Log Vacuum
Hydrolix logs data regarding various components, user interactions, and events that occur within your cluster. Different clusters store different kinds of logs for different lengths of time, based on the needs of your application. By default, Hydrolix deletes all logs after 7 days. You can configure this using the log_vacuum_max_age tunable.
The log vacuum deletes logs older than your cluster's configured maximum log age (in days). By default, log vacuum runs nightly at 4AM cluster local time, unless otherwise specified with the log_vacuum_schedule tunable.
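The age rule can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the actual log vacuum code: the function name and the input shape are hypothetical, and treating the cutoff as a strict "older than" comparison is also an assumption. The 7-day default mirrors the default described above.

```python
from datetime import datetime, timedelta, timezone


def expired_logs(log_times, max_age_days=7, now=None):
    """Return the names of logs older than max_age_days.

    log_times: dict mapping a log name -> timezone-aware creation time
    (a hypothetical shape chosen for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    # Strictly older than the cutoff; the real boundary behavior may differ.
    return sorted(name for name, created in log_times.items() if created < cutoff)


now = datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc)
logs = {
    "a.log": datetime(2024, 1, 1, tzinfo=timezone.utc),  # 9 days old
    "b.log": datetime(2024, 1, 9, tzinfo=timezone.utc),  # 1 day old
}
print(expired_logs(logs, now=now))
```

With the 7-day default, only a.log is selected for deletion. Raising max_age_days to 30 would keep both logs.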
Rejects Vacuum
Hydrolix can selectively ignore ingested data that meets custom criteria. Within Hydrolix, we call these ignored rows rejects. Whenever Hydrolix ignores an ingested row, it records that reject as a JSON object in a reject file. The reject file includes the full piece of ingested data, the originating project and table, and the reason for the rejection. Over time, clusters that reject large quantities of data can accumulate large volumes of reject files. To keep reject files from consuming an ever-growing amount of space in your cluster, we introduced the rejects vacuum, which cleans up old reject files that are no longer needed.
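As a rough picture of what one reject might look like, the Python sketch below builds a JSON object carrying the pieces of information named above. The field names, the sample row, and the rejection reason are hypothetical and chosen only for illustration. The actual layout of a Hydrolix reject file may differ.

```python
import json


def make_reject_record(raw_row, project, table, reason):
    """Build one reject as a JSON string (field names are hypothetical)."""
    record = {
        "data": raw_row,     # the full piece of ingested data
        "project": project,  # originating project
        "table": table,      # originating table
        "reason": reason,    # why the row was rejected
    }
    return json.dumps(record, sort_keys=True)


line = make_reject_record(
    {"timestamp": "not-a-date", "status": 200},
    project="example_project",
    table="example_table",
    reason="timestamp could not be parsed",
)
print(line)
```

Because each reject keeps the full ingested row, reject files grow in proportion to the volume of rejected data.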
By default, the rejects vacuum deletes rejects older than 7 days. You can configure this using the rejects_vacuum_max_age tunable. By default, rejects vacuum runs nightly at 12AM cluster local time, unless otherwise specified with the rejects_vacuum_schedule tunable.