Stream Settings
When you ingest data with a streaming method, Hydrolix provides settings that control how and when data is written into tables. The following stream ingestion methods use these settings:
- HTTP Stream API: push-based, supports optional table access tokens
- Kafka: pull-based
- AWS Kinesis: pull-based
Configure How and When Data is Written
Streaming event data contains two types of data:
- Hot data that needs to be made available as soon as possible.
- Cold data that arrives late or out of order.
A good example is CDN or appliance logs, where 95% of logs arrive within a 15-minute window and the last 5% arrive later over a 24-hour period. (Don't confuse this with "hot" storage vs. "cold" storage, which is a different concept.)
Hot and cold data can be configured independently. The settings determine how often event partitions are flushed to storage and how many partitions are processed at any one time.
Defaults
We recommend trying out the system defaults before adjusting any of these settings. The default settings meet the majority of customer use cases.
Configure Data Write Settings
Configure how data is written to the table either in the portal, under the sources menu, or through the Tables endpoints in the Config API.
To effectively manage incoming stream data, Hydrolix Streaming has configurable options for Hot and Cold data.
- Hot data is near-term data, defined as an event received within the hot_data_max_age_minutes window from now().
- Cold data is late-arriving event data beyond hot_data_max_age_minutes but before cold_data_max_age_days.
- Data beyond cold_data_max_age_days is rejected and ignored.
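The three cases above can be sketched as a small Python function. This is an illustration of the rule as described, not Hydrolix's implementation: the function name `classify_event` is hypothetical, and treating the boundaries as inclusive (and future timestamps as hot) is an assumption.

```python
from datetime import datetime, timedelta, timezone


def classify_event(event_time, now,
                   hot_data_max_age_minutes=60,
                   cold_data_max_age_days=365):
    """Return 'hot', 'cold', or 'rejected' for an event's primary timestamp.

    Hypothetical helper: boundary handling (<=) is an assumption.
    """
    age = now - event_time
    if age <= timedelta(minutes=hot_data_max_age_minutes):
        return "hot"
    if age <= timedelta(days=cold_data_max_age_days):
        return "cold"
    return "rejected"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(classify_event(now - timedelta(minutes=10), now))  # hot
print(classify_event(now - timedelta(hours=3), now))     # cold
print(classify_event(now - timedelta(days=400), now))    # rejected
```

The defaults used here (60 minutes and 365 days) are the documented defaults for hot_data_max_age_minutes and cold_data_max_age_days.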
Hot Settings
The following settings define Hot data behavior:
| Element | Description | Default |
|---|---|---|
| hot_data_max_age_minutes | The maximum age, measured from now(), at which data is still considered Hot. Each incoming event's primary (datetime) column is inspected, and events older than this are considered too old to be hot. | 60 |
| hot_data_max_active_partitions | The maximum number of partitions to keep open on the server at any one time. | 12 |
| hot_data_max_rows_per_partition | The maximum size (measured in number of rows) to allow any open partition to reach. | 1048576 |
| hot_data_max_minutes_per_partition | The maximum timespan when writing a partition. This is the maximum allowable minutes between the newest and oldest primary timestamp of rows in the partition. | 5 |
| hot_data_max_open_seconds | The maximum duration (in wall clock time) to wait for events to trickle in for a recent-data partition. | 20 |
| hot_data_max_idle_seconds | The maximum duration to wait from the last received event before automatically closing an open partition. | 10 |
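To make the interaction between the row, open-time, and idle-time limits concrete, here is a minimal Python sketch of a check that decides whether an open hot partition should be closed. It is an interpretation of the table above, not Hydrolix's actual logic: the `HotLimits` class, the `close_reason` function, the order of the checks, and the use of `>=` comparisons are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HotLimits:
    """Documented defaults for three of the hot settings (hypothetical container)."""
    max_rows_per_partition: int = 1048576
    max_open_seconds: float = 20
    max_idle_seconds: float = 10


def close_reason(rows, opened_at, last_event_at, now, limits=HotLimits()):
    """Return the name of the limit that was reached, or None to keep the partition open.

    All times are seconds on the same wall clock. Check order is an assumption.
    """
    if rows >= limits.max_rows_per_partition:
        return "max_rows"
    if now - opened_at >= limits.max_open_seconds:
        return "max_open_seconds"
    if now - last_event_at >= limits.max_idle_seconds:
        return "max_idle_seconds"
    return None


# A partition opened 12 seconds ago whose last event arrived 11 seconds ago.
print(close_reason(rows=500, opened_at=0.0, last_event_at=1.0, now=12.0))  # max_idle_seconds
```

The sketch leaves out hot_data_max_minutes_per_partition, which limits the span of primary timestamps within a partition rather than how long the partition stays open.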
Partition timespan constraint
The hot_data_max_minutes_per_partition value must be a factor of 60. Valid values are: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, or 60.
This constraint ensures partitions remain within the same hour boundary. Partitions that span multiple hours can't be merged. For example, an invalid value like 55 would cause partitions to be ignored by the merge service.
The Config API enforces this constraint and rejects invalid values.
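The list of valid values is simply the set of divisors of 60, which a short Python check can reproduce. This is a client-side sketch for catching mistakes before sending a request. The function names and the error message are hypothetical and are not what the Config API returns.

```python
def valid_partition_minutes():
    """Every whole number of minutes that divides an hour evenly."""
    return [m for m in range(1, 61) if 60 % m == 0]


def check_hot_data_max_minutes_per_partition(value):
    """Raise ValueError if value is not a factor of 60 (hypothetical pre-check)."""
    if value not in valid_partition_minutes():
        raise ValueError(
            f"{value} is not a factor of 60; valid values: {valid_partition_minutes()}"
        )
    return value


print(valid_partition_minutes())
# [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]
```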
Cold Settings
The following settings define Cold data behavior:
| Element | Description | Default |
|---|---|---|
| cold_data_max_age_days | The maximum age, measured from now(), at which data is still considered Cold. Each incoming event's primary (datetime) column is inspected, and events older than this are considered too old to index and are rejected (see Rejects). | 365 |
| cold_data_max_active_partitions | The maximum number of partitions to keep open at any one time. | 168 |
| cold_data_max_rows_per_partition | The maximum size (measured in number of rows) to allow any open partition to reach. | 1048576 |
| cold_data_max_minutes_per_partition | The maximum width in time of a partition. This is the maximum allowable minutes between the newest and oldest primary timestamp of rows in the partition. | 60 |
| cold_data_max_open_seconds | The maximum duration (in wall clock time) to wait for events to trickle in for a recent-data partition. | 60 |
| cold_data_max_idle_seconds | The maximum duration to wait for new data to appear at all before automatically closing an open partition. | 30 |
Other Settings
The following settings define all data behavior, hot and cold:
| Element | Description | Default |
|---|---|---|
| message_queue_max_rows | The maximum number of rows to pass into the internal message queue in a single message when Hydrolix receives data. The internal message queue can handle a maximum message size of 1MB. This setting helps you keep messages below that size. | 500 |
| sample_rate | The sampling rate of Hydrolix's ingest tier. This setting will ingest only a certain fraction of incoming data for this particular table. For example, a value of 0.4 keeps 40% of your data and discards the rest. A value of 0 or less will keep all data, as will values of 1 or greater. | 1 |
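The sample_rate rule has edge cases at 0 and 1 that are easy to misread, so the following Python sketch spells them out. It mirrors only the behavior described in the table. How Hydrolix chooses which rows to keep is not stated here, so the random selection in `should_keep` is an assumption, and both function names are hypothetical.

```python
import random


def keep_fraction(sample_rate):
    """Fraction of rows kept: values <= 0 or >= 1 keep everything."""
    if sample_rate <= 0 or sample_rate >= 1:
        return 1.0
    return float(sample_rate)


def should_keep(sample_rate, rng=random):
    """Randomly decide whether to keep one row (selection method is an assumption)."""
    # rng.random() is in [0, 1), so a keep fraction of 1.0 always keeps the row.
    return rng.random() < keep_fraction(sample_rate)


for rate in (0.4, 0, 1, 2.5):
    print(rate, keep_fraction(rate))
```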
Configure through the API
Configure table settings in the Config API using the Tables endpoints.
The following is an example:
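As a hedged illustration, the Python sketch below assembles the hot and cold settings from the tables above into a request body and prints it as JSON. The setting names and default values come from this page. The nesting under `settings.stream` and the helper name `build_table_settings` are assumptions, so check the Config API reference for the actual request shape. The sketch makes no HTTP call, and it leaves out message_queue_max_rows and sample_rate because their position in the request body is not assumed here.

```python
import json

# Documented defaults for the hot and cold stream settings.
STREAM_DEFAULTS = {
    "hot_data_max_age_minutes": 60,
    "hot_data_max_active_partitions": 12,
    "hot_data_max_rows_per_partition": 1048576,
    "hot_data_max_minutes_per_partition": 5,
    "hot_data_max_open_seconds": 20,
    "hot_data_max_idle_seconds": 10,
    "cold_data_max_age_days": 365,
    "cold_data_max_active_partitions": 168,
    "cold_data_max_rows_per_partition": 1048576,
    "cold_data_max_minutes_per_partition": 60,
    "cold_data_max_open_seconds": 60,
    "cold_data_max_idle_seconds": 30,
}


def build_table_settings(**overrides):
    """Merge overrides into the defaults (hypothetical helper, assumed nesting)."""
    unknown = set(overrides) - set(STREAM_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown stream settings: {sorted(unknown)}")
    return {"settings": {"stream": {**STREAM_DEFAULTS, **overrides}}}


# Example: shorten the hot window to 30 minutes and keep every other default.
print(json.dumps(build_table_settings(hot_data_max_age_minutes=30), indent=2))
```

Rejecting unknown keys locally is a design choice for the sketch: it catches misspelled setting names before any request is sent.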
Interdependence with merge
The volume of partitions written directly by the stream affects the merge service, giving it more or fewer partitions to merge later. See the merge documentation for more information.
If you are migrating a large amount of data, a higher partition count during the initial load may require more merge servers, larger merge servers, or both. When the initial ingest stream creates a higher number of partitions, make sure enough merge peers exist to merge the resulting data.
See also
- How Hydrolix Handles Late-Arriving Data - How hot and cold data settings control the acceptance window for out-of-order events
- Troubleshooting Late-Arriving Data in Hydrolix - Common causes and solutions when late-arriving data is rejected or missing