Decay and Reaper

Hydrolix manages the data lifecycle in two steps, each handled by its own service:

Decay identifies partitions whose data has exceeded the configured retention period and marks them as "deactivated" in the Hydrolix catalog. Deactivated partitions are hidden from new queries but remain available to any in-flight queries until those complete. Decay runs on a schedule (daily by default) and determines partition age from the primary timestamp of the data itself.
Reaper permanently deletes partitions that have been deactivated for longer than a configured grace period, removing them from both the catalog and object storage. Other Hydrolix services, such as Merge-Cleanup, also use Reaper to delete partitions.

Configure data retention policies at the table level to control how long data remains queryable before Decay deactivates it and Reaper deletes it.

Decay

Decay is a Kubernetes CronJob that identifies partitions containing old data and marks them for deletion according to each table's data retention policy. It prevents unbounded storage growth by deactivating partitions that exceed the configured age threshold and by sending deletion tasks to the Reaper service, which physically removes them from the catalog and object storage.

What Decay does

Partition deactivation

Examines all tables with age.max_age_days configured (tables with value 0 are skipped)
Calculates cutoff time as current_time - age.max_age_days
Identifies active partitions where both max_timestamp and min_timestamp are older than the cutoff
Marks matching partitions as inactive in the Catalog, which makes them unavailable to future queries
Sets deactivation_reason to decay in partition metadata for audit purposes

Important notes:

Age is calculated from the partition's primary timestamp values (its min_timestamp and max_timestamp), not from ingestion time.
Decay only processes unlocked partitions. It skips partitions that are locked by Merge.
Deactivated partitions continue serving any in-flight queries that were active at deactivation time.
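The selection rules above can be sketched in Python as follows. This is an illustration, not Hydrolix's implementation: the `Partition` class and `deactivate_expired` function are hypothetical, and the catalog is modelled as an in-memory list. It shows that a partition is deactivated only when its whole time range falls before the cutoff and it is not locked.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Partition:
    # Hypothetical, simplified catalog entry; field names follow the text above.
    name: str
    min_timestamp: datetime
    max_timestamp: datetime
    active: bool = True
    locked: bool = False  # e.g. held by Merge
    deactivation_reason: Optional[str] = None


def deactivate_expired(partitions: List[Partition], max_age_days: int,
                       now: datetime) -> List[Partition]:
    """Mark expired partitions inactive and return the ones that changed."""
    if max_age_days == 0:
        return []  # 0 means "retain indefinitely", so the table is skipped
    cutoff = now - timedelta(days=max_age_days)
    changed = []
    for p in partitions:
        expired = p.max_timestamp < cutoff and p.min_timestamp < cutoff
        if p.active and not p.locked and expired:
            p.active = False
            p.deactivation_reason = "decay"
            changed.append(p)
    return changed


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
old = Partition("p1", now - timedelta(days=40), now - timedelta(days=35))
straddling = Partition("p2", now - timedelta(days=40), now - timedelta(days=5))
print([p.name for p in deactivate_expired([old, straddling], 30, now)])  # ['p1']
```

The partition `p2` starts before the cutoff but still holds recent data, so it stays active until its newest row also ages out.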

Reaper queue management

Examines all tables with reaper.max_age_days configured (tables with value 0 are skipped)
Calculates cutoff time as current_time - reaper.max_age_days
Identifies inactive, unlocked partitions where the modified timestamp (deactivation time) is older than the cutoff
Creates ReapEvent objects for each partition and sends them to the Reaper queue in RabbitMQ
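The queueing step follows the same pattern with a different reference timestamp. In this minimal sketch, `CatalogEntry`, `enqueue_reapable`, and the fields of `ReapEvent` are hypothetical, and a plain Python list stands in for the RabbitMQ reaper queue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class CatalogEntry:
    # Hypothetical, simplified catalog row.
    partition: str
    active: bool
    locked: bool
    modified: datetime  # for inactive partitions, the deactivation time


@dataclass(frozen=True)
class ReapEvent:
    # Hypothetical payload; the real ReapEvent schema is not described here.
    partition: str


def enqueue_reapable(entries: List[CatalogEntry], reaper_max_age_days: int,
                     now: datetime, queue: list) -> int:
    """Append a ReapEvent for each partition past its grace period; return the count."""
    if reaper_max_age_days == 0:
        return 0  # tables with reaper.max_age_days = 0 are skipped
    cutoff = now - timedelta(days=reaper_max_age_days)
    queued = 0
    for e in entries:
        if not e.active and not e.locked and e.modified < cutoff:
            queue.append(ReapEvent(e.partition))  # stands in for a RabbitMQ publish
            queued += 1
    return queued


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
queue: list = []
entries = [
    CatalogEntry("p1", active=False, locked=False, modified=now - timedelta(days=2)),
    CatalogEntry("p2", active=False, locked=False, modified=now - timedelta(hours=12)),
    CatalogEntry("p3", active=True, locked=False, modified=now - timedelta(days=9)),
]
enqueue_reapable(entries, 1, now, queue)
print(queue)  # [ReapEvent(partition='p1')]
```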

Important notes:

Reaper age is calculated from the deactivation time set by Decay (the partition's modified timestamp), not from the original data timestamp.
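Because the two settings are measured from different reference points, they add up. When both are non-zero, a partition becomes eligible for physical deletion no earlier than its max_timestamp plus age.max_age_days plus reaper.max_age_days. The hypothetical helper below does that arithmetic. It is a lower bound only: it ignores the Decay schedule and Reaper's queue processing, both of which make the actual deletion happen later.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def earliest_reap_eligibility(max_timestamp: datetime, age_max_days: int,
                              reaper_max_days: int) -> Optional[datetime]:
    """Earliest time a partition can be queued for physical deletion, or None."""
    if age_max_days == 0 or reaper_max_days == 0:
        return None  # a value of 0 means Decay skips that phase for the table
    deactivatable_after = max_timestamp + timedelta(days=age_max_days)
    return deactivatable_after + timedelta(days=reaper_max_days)


newest_row = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
print(earliest_reap_eligibility(newest_row, 30, 1))  # 2024-07-02 12:00:00+00:00
```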

Configuration

Cluster-Level Settings (Kubernetes tunables):

Parameter	Description	Default
`decay_enabled`	Enable/disable the Decay CronJob	true
`decay_schedule`	Cron schedule for the Decay CronJob (default: daily at midnight UTC)	`0 0 * * *`
`decay_batch_size`	Entries fetched per catalog request for the deactivation phase	5000
`decay_max_deactivate_iterations`	Maximum deactivation loops per table (unlimited by default)	None
`decay_reap_batch_size`	Entries fetched per catalog request for the reaping phase	5000
`decay_max_reap_iterations`	Maximum reaping loops per table (unlimited by default)	None
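Assuming each loop issues one catalog request, the batch-size and iteration tunables together cap how many entries a single run can handle per table (batch size × maximum iterations). Anything left over would presumably wait for the next scheduled run. The sketch below models that interaction with hypothetical `run_phase`, `fetch_batch`, and `process` callables; it is not Hydrolix code.

```python
from typing import Callable, List, Optional


def run_phase(fetch_batch: Callable[[int], List[str]],
              process: Callable[[List[str]], None],
              batch_size: int = 5000,
              max_iterations: Optional[int] = None) -> int:
    """Fetch and process batches until none remain or the iteration cap is hit.

    max_iterations=None models the documented default of "unlimited".
    Returns the number of iterations performed.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        batch = fetch_batch(batch_size)
        if not batch:
            break  # nothing left that matches
        process(batch)
        iterations += 1
    return iterations


# Toy catalog: processed entries stop matching, as deactivated ones would.
pending = [f"p{i}" for i in range(12)]


def fetch(n: int) -> List[str]:
    return pending[:n]


def process(batch: List[str]) -> None:
    for name in batch:
        pending.remove(name)


print(run_phase(fetch, process, batch_size=5, max_iterations=2), len(pending))  # 2 2
```

With 12 matching entries, a batch size of 5, and a cap of 2 iterations, one run handles 10 entries and leaves 2 behind.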

Environment Variables:

BATCH_SIZE - Maps to decay_batch_size
REAP_BATCH_SIZE - Maps to decay_reap_batch_size
DECAY_MAX_ITERATIONS - Maps to decay_max_deactivate_iterations
REAP_MAX_ITERATIONS - Maps to decay_max_reap_iterations
REAPER_QUEUE - Queue ARN for reaper jobs
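How the Decay process actually reads these variables is not described here. The hypothetical `load_decay_env` reader below applies the mapping listed above and the defaults from the tunables table, treating an unset iteration limit as unlimited.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class DecayEnv:
    batch_size: int                            # BATCH_SIZE -> decay_batch_size
    reap_batch_size: int                       # REAP_BATCH_SIZE -> decay_reap_batch_size
    max_deactivate_iterations: Optional[int]   # DECAY_MAX_ITERATIONS
    max_reap_iterations: Optional[int]         # REAP_MAX_ITERATIONS
    reaper_queue: Optional[str]                # REAPER_QUEUE


def _optional_int(value: Optional[str]) -> Optional[int]:
    # Unset or empty is treated as "unlimited", matching the tunable default of None.
    return int(value) if value else None


def load_decay_env(env: Mapping[str, str] = os.environ) -> DecayEnv:
    return DecayEnv(
        batch_size=int(env.get("BATCH_SIZE", "5000")),
        reap_batch_size=int(env.get("REAP_BATCH_SIZE", "5000")),
        max_deactivate_iterations=_optional_int(env.get("DECAY_MAX_ITERATIONS")),
        max_reap_iterations=_optional_int(env.get("REAP_MAX_ITERATIONS")),
        reaper_queue=env.get("REAPER_QUEUE"),
    )
```

For example, `load_decay_env({"BATCH_SIZE": "1000", "REAP_MAX_ITERATIONS": "3"})` yields a batch size of 1000, the default reap batch size of 5000, and unlimited deactivation iterations.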

Table-Level Settings:

Parameter	Description	Default
`age.max_age_days`	Days data remains active/queryable before deactivation	0 (indefinite)

Review Data Retention Policies for details on how to configure Time-to-Live (TTL) policies.

Manual execution

Run Decay manually by creating a job with kubectl.

Kubectl command to run Decay
kubectl create job --from=cronjob/decay {decay-manual-job-name} -n $HDX_KUBERNETES_NAMESPACE

In K9s, run the :cronjobs command, select the decay CronJob, and press <t> to trigger it.

The pod for the manually triggered job then appears in the pod list.
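If you trigger Decay from a script, each manual job needs its own name. This sketch builds the same kubectl command with a timestamped job name. The function `manual_decay_job_cmd`, the `decay-manual-` prefix, and the example namespace `hydrolix` are illustrative choices, not Hydrolix conventions.

```python
import shlex
from datetime import datetime, timezone
from typing import List, Optional


def manual_decay_job_cmd(namespace: str, now: Optional[datetime] = None) -> List[str]:
    """Build the kubectl argv that creates a one-off Job from the decay CronJob."""
    now = now or datetime.now(timezone.utc)
    # A timestamped name avoids clashing with earlier manual runs and stays a
    # lowercase DNS label, which Kubernetes recommends for Job names.
    job_name = f"decay-manual-{now:%Y%m%d-%H%M%S}"
    return ["kubectl", "create", "job", "--from=cronjob/decay", job_name, "-n", namespace]


cmd = manual_decay_job_cmd("hydrolix", datetime(2024, 6, 30, 12, 0, 5, tzinfo=timezone.utc))
print(shlex.join(cmd))
# kubectl create job --from=cronjob/decay decay-manual-20240630-120005 -n hydrolix
```

In practice you would pass `os.environ["HDX_KUBERNETES_NAMESPACE"]` as the namespace and run the list with `subprocess.run(cmd, check=True)`.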

Metrics

Decay does not have dedicated Prometheus metrics. Its database operations are tracked with database metrics labelled method="deactivate_entries" and db="catalog".

Metric Name	Type	Description
`query_latency_summary`	Summary	Latency summary for deactivating partitions (p50, p90, p99)
`query_latency_histo`	Histogram	Latency histogram for deactivating partitions
`query_count`	Counter	Count of deactivation queries executed
`query_failure`	Counter	Count of failed deactivation queries
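These metric names are shared with other catalog operations, so Decay's series can only be told apart by their labels. The minimal sketch below picks them out of a scrape in the Prometheus text exposition format. It assumes the exported names match the table above and uses a simplified regex parser that ignores escaped characters in label values. The sample scrape is made up. In Prometheus itself, the equivalent is a label selector such as `query_failure{method="deactivate_entries", db="catalog"}`.

```python
import re
from typing import Dict, Iterator, Tuple

SAMPLE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def decay_samples(exposition: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (metric, labels, value) for catalog samples recorded by Decay."""
    for line in exposition.splitlines():
        match = SAMPLE.match(line)
        if not match:
            continue  # "# HELP" / "# TYPE" lines and samples without labels
        labels = dict(LABEL.findall(match["labels"]))
        if labels.get("method") == "deactivate_entries" and labels.get("db") == "catalog":
            yield match["name"], labels, float(match["value"])


scrape = """\
# TYPE query_count counter
query_count{db="catalog",method="deactivate_entries"} 42
query_count{db="catalog",method="delete_deactivated_entry"} 7
query_failure{db="catalog",method="deactivate_entries"} 1
"""
print([(name, value) for name, _, value in decay_samples(scrape)])
# [('query_count', 42.0), ('query_failure', 1.0)]
```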

Reaper

The Reaper service deletes partitions that Decay has deactivated and queued for deletion. Other services, such as Merge-Cleanup, also use it to delete partitions. Reaper runs as a separate, continuous service that consumes the reaper message queue in RabbitMQ.

What Reaper does

For each ReapEvent read from RabbitMQ, the service:

Locates the partition's files in storage
Verifies the file count
Deletes all of the partition's files
Removes the partition's catalog entry
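The four steps can be sketched for a single event using in-memory stand-ins for object storage and the catalog. Several details here are assumptions. The sketch assumes the event carries an expected file count, although what Reaper actually compares the count against is not stated here. `ReapEvent`'s fields, `FileCountMismatch`, and `reap` are hypothetical. In this sketch, a failed check stops processing before anything is deleted, so both the files and the catalog entry remain in place.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ReapEvent:
    # Hypothetical fields; the real event schema is not described here.
    partition: str
    expected_files: int


class FileCountMismatch(Exception):
    """Raised when the files found do not match the expected count."""


def reap(event: ReapEvent, storage: Dict[str, List[str]], catalog: Set[str]) -> None:
    """Apply the four steps above to one event, using in-memory stand-ins."""
    files = storage.get(event.partition, [])              # 1. locate partition files
    if len(files) != event.expected_files:                # 2. verify file count
        raise FileCountMismatch(
            f"{event.partition}: found {len(files)} files, expected {event.expected_files}")
    storage.pop(event.partition, None)                    # 3. delete all files
    catalog.discard(event.partition)                      # 4. remove catalog entry


storage = {"p1": ["p1/a", "p1/b"], "p2": ["p2/a"]}
catalog = {"p1", "p2"}
reap(ReapEvent("p1", expected_files=2), storage, catalog)
print(sorted(storage), sorted(catalog))  # ['p2'] ['p2']
```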

Configuration

Table-Level Settings:

Parameter	Description	Default
`reaper.max_age_days`	Days after deactivation before physical deletion	1

Review Data Retention Policies for details on how to configure Time-to-Live (TTL) policies.

How to check the Reaper queue

Open a shell in the RabbitMQ pod and run the following command:

RabbitMQ command for queue checks
rabbitmqctl --node rabbit@rabbitmq-0.rabbit list_queues
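To check queue depth from a script, you can parse the command's output. The sketch below assumes the default `list_queues` columns (queue name and message count) separated by tabs, preceded by informational lines and a header row. The sample output and the queue names `reaper` and `merge` are illustrative only; check the actual queue name in your cluster.

```python
from typing import Dict


def queue_depths(output: str) -> Dict[str, int]:
    """Map queue name -> message count from `rabbitmqctl list_queues` output.

    Assumes the default two columns (name, messages) separated by tabs;
    informational lines and the header row are skipped.
    """
    depths: Dict[str, int] = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1].strip().isdigit():
            depths[parts[0].strip()] = int(parts[1])
    return depths


sample = (
    "Timeout: 60.0 seconds ...\n"
    "Listing queues for vhost / ...\n"
    "name\tmessages\n"
    "reaper\t12\n"
    "merge\t0\n"
)
print(queue_depths(sample))  # {'reaper': 12, 'merge': 0}
```

A reaper queue whose depth keeps growing between checks suggests Reaper is not keeping up with the events Decay sends.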

Metrics

Reaper exposes dedicated Prometheus metrics:

Metric Name	Type	Labels	Description
`reaper_reap_duration`	Summary	`storage_id`, `project_id`, `table_id`	Duration in milliseconds of a reaper's reap cycle (p50, p90, p99)
`reaper_failure`	Counter	`storage_id`, `project_id`, `table_id`	Count of reaper failures
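The quantiles show the distribution of reap-cycle durations, but not the average. Assuming `reaper_reap_duration` follows the standard Prometheus summary layout, it also exposes `_sum` and `_count` series. The mean cycle time over an interval is then the change in `_sum` divided by the change in `_count`; in PromQL this is `rate(reaper_reap_duration_sum[5m]) / rate(reaper_reap_duration_count[5m])`. The Python below applies the same arithmetic to two hypothetical scrapes keyed by `table_id`.

```python
from typing import Dict, Tuple

# Per table_id: (reaper_reap_duration_sum, reaper_reap_duration_count) at one scrape.
Snapshot = Dict[str, Tuple[float, float]]


def mean_reap_ms(earlier: Snapshot, later: Snapshot) -> Dict[str, float]:
    """Average reap-cycle duration in ms per table between two scrapes."""
    means: Dict[str, float] = {}
    for table_id, (sum_later, count_later) in later.items():
        sum_earlier, count_earlier = earlier.get(table_id, (0.0, 0.0))
        cycles = count_later - count_earlier
        if cycles > 0:  # no new cycles: no average (counter resets are not handled)
            means[table_id] = (sum_later - sum_earlier) / cycles
    return means


print(mean_reap_ms({"t1": (1000.0, 10.0)}, {"t1": (1600.0, 12.0), "t2": (50.0, 1.0)}))
# {'t1': 300.0, 't2': 50.0}
```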

Reaper's catalog deletions are tracked with database metrics labelled method="delete_deactivated_entry" and db="catalog".

Metric Name	Type	Description
`query_latency_summary`	Summary	Latency for deleting catalog entries
`query_latency_histo`	Histogram	Latency histogram for catalog deletions
`query_count`	Counter	Count of deletion queries
`query_failure`	Counter	Count of failed deletions
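Comparing the two counters shows what share of catalog deletions are failing. The sketch below works from two scrapes of `query_count` and `query_failure` for the `delete_deactivated_entry` method. It assumes `query_count` includes failed queries, which this page does not state. The function name and sample numbers are hypothetical.

```python
def failure_ratio(count_before: float, count_after: float,
                  failures_before: float, failures_after: float) -> float:
    """Fraction of catalog deletion queries that failed between two scrapes.

    Assumes query_count includes failed queries; returns 0.0 if no queries ran.
    """
    queries = count_after - count_before
    if queries <= 0:
        return 0.0
    return (failures_after - failures_before) / queries


# query_count and query_failure for method="delete_deactivated_entry", db="catalog"
print(failure_ratio(100, 300, 2, 6))  # 0.02
```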